Database Sharding

What Is Sharding?

Sharding splits a large dataset across multiple database instances (shards). Each shard holds a subset of rows. Together they form the complete dataset.

Sharding Strategies

Range-Based

Shard by key range (e.g., user IDs 1–1M on shard 1, 1M–2M on shard 2). Simple, but creates hotspots if data is skewed.

Hash-Based

Hash the shard key and assign to shard by hash modulo N. Even distribution but range queries span all shards.

Directory-Based

A lookup service maps keys to shards. Flexible but adds a single point of failure.

The Hard Problems

Cross-shard joins: Avoid at the schema level. Denormalize or use application-level joins.
Cross-shard transactions: Distributed transactions are expensive. Design to avoid or use sagas.
Rebalancing: Adding shards requires moving data. Consistent hashing minimizes this.

When Sharding Is Wrong

Start with vertical scaling and read replicas. Sharding is operationally expensive. Only introduce it when you have a genuine single-node bottleneck.