
Choosing a shard key: user, tenant, region, or time?

Shard on the unit most requests already own. Hash to spread load inside that unit; layer in region or time when residency, lifecycle, or skew demands it. Never pick a key because it looked good on a slide.

Databases · Distributed Systems · Scalability

The shard key is not a clever hash function contest. It is the answer to: “If I could only co-locate one thing so that most reads and writes stay local, what would that thing be?”

Get that wrong and you win cross-shard joins, hot spots, and pager rotations where nobody remembers why email was the partition key. If you have ever seen Vitess or a sharded Dynamo story, you know the key is half the product contract.

Shard key dimensions

  • By user_id (the B2C default): hash(user_id) → shard A | B | C; co-locates profile, cart, and prefs per user. Pro: simple mental model, easy to debug. Con: global feeds and friend graphs can fan out. Fits consumer apps: feed, inbox, settings.
  • By tenant_id (the B2B SaaS default): hash(tenant_id) isolates noisy customers; an "elephant" tenant gets a dedicated pool or cell. Pro: blast radius per contract, tenants can be moved. Con: skew, and cross-tenant analytics get harder. Fits multi-tenant APIs and per-org data planes.
  • By region (latency + residency): EU users → EU cluster, US users → US cluster; an inner hash(user_id) inside each region is common. Pro: clean GDPR story, low RTT for locals. Con: travelers and cross-region reporting. Fits finance, health, and real-time games with locality.
  • By time + hash (logs, metrics, events): partition(date_bucket, hash(service_id)); drop cold months. Pro: lifecycle and retention are first-class. Con: pure time skews ("today" stays hottest) unless you stir with a hash. Fits audit, telemetry, and append-mostly streams.

Pick the dimension most requests already scope to—then spread load inside it.

1. First principle: start from ownership and access patterns

Ask:

  • Who owns the row in product language? A consumer, a company account, a region, a device?
  • What scope does a typical request carry? user_id, tenant_id, country, session_id?
  • What must never fan out on the hot path? Checkout is different from “global search across all tenants.”

A good shard choice:

  • keeps the majority of transactions on one shard (or a small, predictable set),
  • spreads write load instead of funneling it,
  • matches compliance and latency stories you can defend in an audit.

2. Sharding by user_id (consumer products)

Good fit when most traffic is scoped to one user: their cart, their settings, their slice of the feed.

Benefits

  • Simple mental model for engineers and support.
  • Co-locates data you usually fetch together.
  • Easy to reason about backups per cohort (with care).

Trade-offs

  • Global features (social graph, cross-user search) need fan-out or secondary indices.
  • Power users can still skew a shard if popularity is wildly uneven—mitigate with sub-sharding or virtual shards.

Heuristic: default for B2C when the product is user-centric and you are not primarily a multi-tenant control plane.
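
A minimal sketch of the user-hash routing, including the virtual-shard layer mentioned above. The shard counts and the SHA-256 choice are illustrative assumptions, not a prescription:

    import hashlib

    NUM_VSHARDS = 4096          # fixed virtual-shard count, chosen once at design time
    NUM_PHYSICAL = 8            # physical shards today; grows over time
    # Virtual → physical map; rebalancing edits this table, never the user mapping.
    VSHARD_TO_PHYSICAL = {v: v % NUM_PHYSICAL for v in range(NUM_VSHARDS)}

    def vshard_for_user(user_id: str) -> int:
        # Stable hash (not Python's per-process hash()) so routing survives restarts.
        digest = hashlib.sha256(user_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % NUM_VSHARDS

    def shard_for_user(user_id: str) -> int:
        return VSHARD_TO_PHYSICAL[vshard_for_user(user_id)]

Moving load then means reassigning virtual shards to physical nodes; the user_id → vshard mapping never changes.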


3. Sharding by tenant_id or account_id (B2B SaaS)

Good fit when the paying customer is a company, isolation is a selling point, and you bill by seat or usage per org.

Benefits

  • Noisy neighbor containment—move or throttle one tenant without guessing which user rows bled together.
  • Operational stories: export, delete-for-GDPR, per-tenant upgrades.

Trade-offs

  • Elephant tenants—one Fortune 10 can dwarf the rest; you need dedicated cells, higher quotas, or sub-partitioning inside the tenant.
  • Cross-tenant reporting belongs in the warehouse, not ad-hoc SQL across every shard at request time.

Patterns: hash tenant → shard; promote hot tenants to isolated pools; keep a catalog of where each tenant lives.
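
A sketch of that catalog, with hypothetical pool names. The default path is a hash; elephants are explicit overrides recorded in one place:

    import hashlib

    POOLS = ["pool-0", "pool-1", "pool-2", "pool-3"]
    # Catalog of promoted tenants. In production this is a small, strongly
    # consistent lookup service, not an in-process dict.
    DEDICATED = {"acme-corp": "pool-elephant-0"}

    def pool_for_tenant(tenant_id: str) -> str:
        if tenant_id in DEDICATED:
            return DEDICATED[tenant_id]      # elephants live in isolated pools
        digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
        return POOLS[int.from_bytes(digest[:8], "big") % len(POOLS)]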


4. Sharding by region or geography

Good fit when data residency, latency to humans, or regulatory boundaries are first-class requirements.

Benefits

  • Clear story for lawyers and customers (“EU data stays in EU”).
  • Users near the region see predictable RTT.

Trade-offs

  • Humans and devices travel; define what “home region” means.
  • Cross-region aggregates and DR drills are expensive—plan pipelines, not midnight SQL.

Common pattern: region as the outer boundary, then hash(user_id) inside the region for even spread.
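
A sketch of that hierarchy, with made-up region names and shard lists. Note that home_region must come from an explicit attribute on the account, never from the caller's current IP:

    import hashlib

    REGION_SHARDS = {
        "eu": ["eu-0", "eu-1"],
        "us": ["us-0", "us-1", "us-2"],
    }

    def shard_for(user_id: str, home_region: str) -> str:
        shards = REGION_SHARDS[home_region]   # residency decides the candidate set
        digest = hashlib.sha256(user_id.encode("utf-8")).digest()
        return shards[int.from_bytes(digest[:8], "big") % len(shards)]  # hash spreads inside it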


5. Sharding by time

Good fit when data is append-mostly and cold data ages out: logs, metrics, audit, click streams.

Benefits

  • TTL and archival become mechanical (drop partitions).
  • Queries usually filter on recent windows.

Trade-offs

  • “Today” is always hot unless you add a second dimension (hash) to stir writes.
  • Backfills and late events need explicit handling.

Typical pattern: date_bucket + hash(high_cardinality_id) so retention and load both behave.
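
A sketch of that composite partition name, with an illustrative bucket count. Retention drops whole days by prefix, while the hash suffix keeps "today" from being one hot partition:

    import hashlib
    from datetime import datetime, timezone

    HASH_BUCKETS = 16   # second dimension that stirs today's writes

    def partition_for(service_id: str, ts: datetime) -> str:
        day = ts.astimezone(timezone.utc).strftime("%Y%m%d")
        digest = hashlib.sha256(service_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % HASH_BUCKETS
        return f"events_{day}_{bucket:02d}"   # drop events_YYYYMMDD_* to expire a day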


6. Other dimensions: feature or domain slices

Sometimes the natural seam is bounded context, not user: orders vs billing vs inventory subsystems that do not need to share a shard map.

Use this when teams and failure domains already align with domain lines—still document cross-domain reads so nobody assumes a magical join across cells.


7. Hash, range, composite, and hierarchical keys

7.1 Hash sharding

Spreads keys pseudo-randomly across nodes—great for even write pressure when the key has high cardinality.

Weak for range scans unless the product query is always equality-shaped.

7.2 Range sharding

Keeps contiguous keys together—great for time-ordered or sorted access.

Risk: hot ranges (popular prefixes, monotonic IDs). Combine with salting or secondary sharding when needed.
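
A minimal salting sketch for a monotonic sequence. The salt count is an assumption, and the price is that readers fan out over every salt and merge:

    NUM_SALTS = 8   # spreads "now" across 8 ranges instead of one hot tail

    def salted_key(seq: int) -> str:
        salt = seq % NUM_SALTS               # deterministic, so point lookups stay cheap
        return f"{salt}:{seq:020d}"          # zero-padding keeps range order within a salt

    def scan_prefixes() -> list[str]:
        # A "recent rows" query issues one range scan per salt and merges results.
        return [f"{s}:" for s in range(NUM_SALTS)]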

7.3 Composite shard keys

Example: (region, user_id) or (tenant_id, hash(user_id)) when one dimension alone would skew or violate boundaries.

7.4 Hierarchical sharding

Outer dimension (region, tenant class) then inner hash—matches how organizations actually grow.

7.5 Practical rules

  • OLTP user-ish workloads: start with hash on the primary ownership key.
  • Logs / events / metrics: time bucket + hash of a high-cardinality dimension.
  • Compliance / locality: hierarchy first, hash inside.
  • Avoid shard keys that are volatile (email), low cardinality alone (country), or monotonic without mixing (raw created_at only).

Sharding is not the same thing as read replicas, but you often end up with both: partitioned primaries for write scale, replicas for read-heavy paths that still respect tenant boundaries. Sketch of that split:

Separate reads & writes

Read/write split: the service sends writes to a single primary, which replicates out to several replicas; reads scale out across the replicas. Reads scale horizontally. Writes scale… carefully. Replication lag is the trade-off (read-after-write may need the primary).
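
A sketch of that routing rule, with hypothetical connection handles and an assumed lag budget. The point is that recent writers keep reading from the primary until the replicas have plausibly caught up:

    import hashlib
    import time

    PRIMARY = "primary"
    REPLICAS = ["replica-0", "replica-1", "replica-2"]
    LAG_BUDGET_S = 2.0                        # assumed replication-lag budget
    _last_write_at: dict[str, float] = {}     # per-tenant last write timestamp

    def route(tenant_id: str, is_write: bool) -> str:
        now = time.monotonic()
        if is_write:
            _last_write_at[tenant_id] = now
            return PRIMARY
        if now - _last_write_at.get(tenant_id, 0.0) < LAG_BUDGET_S:
            return PRIMARY                    # read-after-write: stick to the primary
        digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
        return REPLICAS[int.from_bytes(digest[:8], "big") % len(REPLICAS)]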

8. How to choose between user, tenant, region, or something else

Work through:

  1. What is the most common request scope? That is your first-class shard dimension.
  2. What invariants cross scopes? If “always global,” you need async projections—not bigger scatter-gather in the request path.
  3. What fails if one shard melts? Pick boundaries that match a blast radius you can explain to finance.
  4. What does ops need? Tenant moves, regional failovers, per-tenant export—make those operations possible without rewriting physics.

Rules of thumb

Product shape        First hop
Consumer app         hash(user_id)
B2B SaaS             hash(tenant_id) + plan for elephants
Logs / events        time_bucket + hash(dimension)
Regulated locality   region → inner hash

9. Anti-patterns to avoid

  • Sharding on emails or usernames that users can change—your mapping becomes a migration treadmill.
  • Sharding on metrics you do not control (current QPS per key) instead of stable ownership.
  • Monotonic-only keys for writes (naive time) without a stir—everyone piles onto the last node.
  • Pretending cross-shard joins are free because the ORM still lets you type them.

10. Interview-style summary

  1. Name the unit of ownership the business already uses.
  2. Pick hash vs range vs composite from access paths, not vibes.
  3. Add region or time when law, latency, or lifecycle demands it.
  4. Call out hot spots (elephants, “today,” monotonic IDs) and how you spread them.
  5. Admit fan-out features and solve them with async indexes, warehouses, or product scope changes—not bigger SQL hope.

Summary table

Shard key     Best for               Main risk                    Pattern
user_id       B2C product surfaces   global graphs / search       hash(user_id)
tenant_id     multi-tenant SaaS      elephant tenants             hash(tenant_id) + isolate big
region        residency, RTT         travelers, cross-region BI   region → inner hash
time + hash   telemetry, audit       hot current bucket           date_bucket + hash(id)

Closing

The shard key is a product decision wearing a database costume.

Pick the dimension your requests already believe in, spread load inside it on purpose, and document the two queries that will hurt—because those are the ones that will page you at 3 a.m. anyway.