July 28, 2024

What is Database Sharding

As a DevOps engineer, database sharding affects you in two concrete ways: you are either asked to help scale a database that can no longer keep up, or you inherit a sharded system and need to operate it without fully understanding why it was designed that way. Either situation is easier to navigate if you understand what sharding actually is, what problem it solves, and what complexity it introduces.

This post covers the mechanics of sharding, the four main strategies, how to evaluate a shard key, and the operational reality of running a sharded database in production.

What Is Database Sharding?

Database sharding is the practice of splitting a large database horizontally into smaller pieces called shards. Each shard holds a unique subset of rows and lives on its own database server (called a node), but all shards share the same schema.

Key terms:

Logical shard: the partitioned subset of data.
Physical shard / node: the server storing one or more logical shards.
Shard key: the column used to determine which shard a given record belongs to.

Sharding follows a shared-nothing architecture: each node is independent and unaware of the others. A routing layer (sometimes called a proxy or query router) sits above all shards and directs queries to the correct node.

Why Sharding Comes Up

When a single database server stops being able to keep up, there are two scaling paths: vertical (bigger machine) and horizontal (more machines). Vertical scaling has a ceiling and is expensive. Sharding is the standard horizontal scaling answer for relational and some NoSQL databases.

There are three concrete reasons to shard:

1. Query performance at scale. Smaller datasets per shard mean fewer rows to scan, which translates directly to lower query latency. A table with 5 billion rows benefits significantly from being split into 10 shards of 500 million rows each.

2. Higher availability. Sharding is typically combined with replication. If one shard goes down, only the data on that shard is unavailable. The rest of the application keeps running because other shards are unaffected.

3. Horizontal scalability. Instead of buying a larger server, you add more nodes. In cloud environments this means adding more RDS instances or database cluster nodes rather than upgrading instance types.

Before committing to sharding, make sure you have exhausted simpler options: read replicas to offload read traffic, connection pooling (PgBouncer, ProxySQL), caching (Redis, Memcached), query optimization, and index tuning. Sharding is hard to undo and adds significant operational complexity. It is a last resort, not a first response to a slow database.

Sharding Strategies

Range-Based Sharding

Records are assigned to shards based on a range of the shard key value. For example, using customer_id:

Shard 1: customer_id 1 to 1,000,000
Shard 2: customer_id 1,000,001 to 2,000,000
Shard 3: customer_id 2,000,001+

Simple to implement and easy to reason about. The problem is hotspots: if most of your writes are for new customers, shard 3 absorbs all the write traffic while shards 1 and 2 sit mostly idle. Time-based range sharding has the same problem: the "current" shard gets all the writes.

Hash-Based Sharding

A hash function is applied to the shard key and the result determines which shard stores the record. For example: shard_id = hash(customer_id) % num_shards. This produces an even distribution of data across all shards regardless of the key's value distribution.

The trade-off: when you need to add more shards, the hash calculation changes, which means data needs to be redistributed. Consistent hashing mitigates this by minimising redistribution when nodes are added or removed, but it adds implementation complexity.

Directory Sharding

A lookup table maps shard key values to their target shard. Every query first hits the directory to find the right shard, then queries that shard.

This is the most flexible approach: shards can represent meaningful business segments (one shard per region, one shard per large customer), and you can migrate data between shards without changing the hashing logic. The downside is that the directory is a single point of failure and a write bottleneck. If the directory is unavailable, the entire system cannot route queries.

Geo-Sharding

Records are partitioned based on geographic location. Each region's data lives in a shard co-located with the users who access it, reducing latency. A fintech platform might shard by country to comply with data residency requirements.

Works well when access patterns are strongly location-driven, but can produce very uneven shards if your user base is not geographically balanced.

Choosing the Right Shard Key

A poor shard key is the single most common sharding mistake. The key determines data distribution, and bad distribution means hotspots, which defeats the purpose of sharding. Evaluate a candidate key on three dimensions:

Cardinality. How many distinct values does the key have? A boolean field gives you at most two shards. A boolean-derived shard key (for example, is_premium) will put all premium users in one shard. You need high cardinality for meaningful distribution.

Frequency. How evenly are values distributed? If you shard by country and 60% of your users are in one country, 60% of your data is in one shard. Query that shard enough and you have recreated the original single-database problem.

Monotonic change. Does the key tend to increase over time? Sharding by created_at or an auto-incrementing ID means all new writes go to the latest shard. This is the range-based hotspot problem again. A good key has no strong monotonic trend.

A practical example: in a multi-tenant SaaS application, tenant_id is often a good shard key. Tenants are discrete entities with roughly independent access patterns, the cardinality is high enough to distribute across many shards, and there is no strong monotonic trend.

Operational Reality

This is where sharding gets difficult, and where the decision to shard should be made very carefully.

Cross-shard queries. Once data is on different nodes, any query that spans shards requires the query router to scatter the request across multiple nodes, collect the results, and merge them in application memory. Aggregations (SUM, COUNT, AVG) across all records, joins between tables on different shards, and queries where the shard key is not in the WHERE clause all become expensive multi-node operations. Schema designs that looked clean on a single database need to be reconsidered.

Schema migrations. Running ALTER TABLE on a sharded database means running it on every shard. Coordinating a migration across 20 shards, ensuring they all complete before the application code that depends on the new schema is deployed, requires careful orchestration. Tools like gh-ost or pt-online-schema-change help with zero-downtime migrations, but they need to be run against every shard.

Monitoring. Each shard is an independent database server with its own metrics. You need Prometheus exporters, dashboards, and alerts on every shard, and you need aggregated views that can surface imbalances across shards (one shard with 3x the query rate of the others is a sign of a hotspot).

Rebalancing. When shards become uneven over time, you need to move data between shards without downtime. This is one of the most operationally expensive tasks in a sharded system and is hard to automate safely.

Managed services reduce but do not eliminate complexity. MongoDB Atlas handles sharding transparently. CockroachDB and PlanetScale are designed to shard automatically. These are significantly easier to operate than hand-rolled sharding, but you still need to understand the shard key selection problem, cross-shard query costs, and the consistency trade-offs involved.

Conclusion

Sharding is a powerful tool for horizontal scaling, but it adds complexity across every layer: application code, schema design, query patterns, migrations, monitoring, and incident response. The decision to shard should come only after simpler scaling options have been exhausted.

Strategy	Distribution	Flexibility	Main operational risk
Range-Based	Uneven	Low	Hotspots from monotonic writes
Hash-Based	Even	Low	Expensive rebalancing when adding nodes
Directory	Depends on mapping	High	Directory is a single point of failure
Geo-Sharding	Depends on user geography	Medium	Uneven distribution by region

If you do commit to sharding, the shard key decision is permanent and expensive to change. Spend the time upfront to model your access patterns, measure your data distribution, and validate your key choice before you start migrating data.