System Design Interview Checklist
Eight categories that must be addressed in every system design interview. Use this as a pre-interview audit to identify gaps, not a during-interview crutch.
How to Use This Checklist
This is a preparation audit tool, not a during-interview script. Run through it the night before. Any category where you can't speak fluently for 3 minutes is a gap you need to close.
The 8 categories below map to the phases where system design interviews most commonly expose weaknesses. They are not exhaustive. They are the minimum.
1. Requirements and Scope
Can you do this in 5 minutes or less?
- Identify 3–5 functional requirements without over-specifying
- Establish read/write ratio and dominant traffic pattern
- Confirm consistency requirement: strong, eventual, or read-your-writes
- Confirm availability target: 99.9% (8.7h downtime/year) vs 99.99% (52m/year)
- Identify the primary user action that drives load
Gap signal: You're still asking clarifying questions at minute 8.
2. Capacity Estimation
Can you produce a defensible number in under 3 minutes?
- QPS: DAU × actions/day ÷ 86,400
- Storage: write QPS × payload size × retention period (convert retention to seconds so the units cancel to bytes)
- Bandwidth: peak read QPS × payload size
- Know your units: 1M users × 1 req/day = ~12 req/sec; 100M users = ~1,200 req/sec
- Know when horizontal scale is required: a single well-tuned server handles roughly 10k–50k req/sec for cached or stateless work, far fewer once every request hits a database
Gap signal: You skip estimation and go straight to components. Every interviewer notices.
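The formulas above can be run as a back-of-envelope script. The inputs here (100M DAU, 2 actions/day, 1 KB payload, 30-day retention, 10:1 read/write ratio) are hypothetical numbers for illustration, not targets from any real system:

```python
# Back-of-envelope capacity estimate with hypothetical inputs.
SECONDS_PER_DAY = 86_400

def estimate(dau, actions_per_day, payload_bytes, retention_days, read_write_ratio=10):
    total_qps = dau * actions_per_day / SECONDS_PER_DAY
    write_qps = total_qps / (1 + read_write_ratio)   # 10:1 reads:writes
    storage_bytes = write_qps * payload_bytes * retention_days * SECONDS_PER_DAY
    peak_read_bw = (total_qps - write_qps) * 2 * payload_bytes  # assume peak = 2x average
    return total_qps, write_qps, storage_bytes, peak_read_bw

qps, writes, storage, bw = estimate(100_000_000, 2, 1_000, 30)
print(f"total QPS ~{qps:,.0f}, write QPS ~{writes:,.0f}")
print(f"storage ~{storage / 1e12:.2f} TB, peak read bandwidth ~{bw / 1e6:.1f} MB/s")
```

In the interview you do this mentally with the ~12 req/sec per million-DAU shorthand; the point of the script is to check your arithmetic the night before.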
3. Data Model and Storage
Can you justify your storage choice?
- SQL vs NoSQL: know the decision criteria (ACID transactions and rich joins vs horizontal scale and flexible schemas; structured rows vs documents)
- Primary key design: how you'll query the data drives the key
- Sharding strategy: horizontal (by user_id, by hash), vertical (by feature)
- Know when to use a separate cache vs in-DB caching
- Know when NoSQL falls short: multi-entity transactions, complex joins
Gap signal: "I'll use a database" without specifying what kind or why.
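Horizontal sharding by user_id can be sketched in a few lines (the 16-shard count is a hypothetical choice). The one subtlety worth knowing cold: the hash must be stable across processes and deploys.

```python
import hashlib

NUM_SHARDS = 16  # hypothetical shard count

def shard_for(user_id: str) -> int:
    # Use a stable hash, not Python's built-in hash(), which is
    # randomised per process and would route the same user differently.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same user always lands on the same shard:
assert shard_for("user-42") == shard_for("user-42")
```

Note the weakness of plain hash-mod: changing NUM_SHARDS remaps nearly every key, which is exactly the resharding pain that consistent hashing exists to avoid.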
4. API Design
Can you spec the core endpoints in 2 minutes?
- REST vs GraphQL vs gRPC: know when each applies (REST: public APIs; gRPC: internal high-throughput; GraphQL: flexible client queries)
- Core CRUD endpoints with request/response shape
- Pagination: cursor-based vs offset-based (cursor for large/real-time datasets)
- Rate limiting: where it goes (API gateway, not application layer)
- Auth model: JWT, session, API key, and where it's validated
Gap signal: Designing endpoints without request/response shapes.
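The cursor-vs-offset point deserves a concrete sketch. This minimal example paginates an in-memory list standing in for a table ordered by `(created_at, id)`; the `ROWS` data and field names are hypothetical. A real system would run an indexed query like `WHERE (created_at, id) < cursor ORDER BY ... LIMIT n` instead of filtering in memory:

```python
import base64
import json

# Hypothetical table: 100 rows, newest first when sorted by (created_at, id).
ROWS = [{"id": i, "created_at": 1000 + i} for i in range(100)]

def encode_cursor(row):
    # Opaque cursor = base64 of the last-seen sort key.
    return base64.urlsafe_b64encode(
        json.dumps([row["created_at"], row["id"]]).encode()).decode()

def list_items(limit=20, cursor=None):
    items = sorted(ROWS, key=lambda r: (r["created_at"], r["id"]), reverse=True)
    if cursor:
        after = json.loads(base64.urlsafe_b64decode(cursor))
        items = [r for r in items if [r["created_at"], r["id"]] < after]
    page = items[:limit]
    next_cursor = encode_cursor(page[-1]) if len(items) > limit else None
    return {"items": page, "next_cursor": next_cursor}

page1 = list_items(limit=20)
page2 = list_items(limit=20, cursor=page1["next_cursor"])
```

Unlike an offset, the cursor pins the next page to the last row actually seen, so pages stay stable even when new rows are inserted between requests.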
5. Scalability Patterns
Can you identify the bottleneck and name the fix?
- Read-heavy → read replicas, CDN, cache
- Write-heavy → write-behind cache, message queue, sharding
- Stateful services → sticky sessions or externalise state (Redis)
- Hot key problem in cache → consistent hashing, virtual nodes
- Database bottleneck → connection pooling, read replicas, CQRS
Gap signal: "Add more servers" without specifying where or why.
6. Reliability and Fault Tolerance
What breaks and how does the system recover?
- Single points of failure: identify and name the mitigation (replica, circuit breaker, fallback)
- Cascade failure: circuit breaker prevents one service failure from taking down all callers
- Data loss scenarios: what's the RPO (recovery point objective: acceptable data loss) and RTO (recovery time objective: acceptable downtime)
- Health check strategy: active (probe) vs passive (from real traffic), and what happens on failure
- Idempotency: write operations that can safely be retried
Gap signal: Designing the happy path and nothing else.
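The circuit-breaker point is worth being able to sketch, since interviewers often ask how it actually works. A minimal version with hypothetical threshold and cooldown values: open after N consecutive failures, fail fast while open, then allow a probe call after the cooldown (half-open):

```python
import time

class CircuitBreaker:
    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold   # consecutive failures before opening
        self.cooldown = cooldown     # seconds to stay open before probing
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                return fallback()            # open: fail fast, don't pile on
            self.opened_at = None            # half-open: allow one probe
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0                    # success resets the counter
        return result
```

The key interview point: while open, the breaker returns the fallback without calling the downstream service at all, which is what stops a slow dependency from exhausting every caller's threads.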
7. Consistency and Trade-offs
Can you name the consistency model and defend it?
- CAP theorem: during a network partition, does this system give up consistency or availability? ("Pick two of three" is the shorthand, but partitions aren't optional)
- Eventual consistency: acceptable lag, and what breaks if the lag is 5 minutes
- Strong consistency: when it's required (payments, inventory, auth)
- Read-your-writes: how to implement when data is replicated (sticky reads, write-through cache)
- Distributed transaction pattern: 2PC vs Saga. Know when each applies.
Gap signal: "We'll use eventual consistency" without explaining the failure scenario if that assumption breaks.
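The sticky-reads approach to read-your-writes can be sketched simply: after a user writes, route that user's reads to the primary until the assumed replication lag has elapsed. The 2-second lag bound and the routing labels are hypothetical:

```python
import time

REPLICATION_LAG_S = 2.0   # assumed worst-case replica lag
last_write_at = {}        # user_id -> monotonic timestamp of last write

def on_write(user_id):
    last_write_at[user_id] = time.monotonic()

def choose_target(user_id):
    wrote = last_write_at.get(user_id)
    if wrote is not None and time.monotonic() - wrote < REPLICATION_LAG_S:
        return "primary"   # replicas may not have this user's write yet
    return "replica"       # safe to serve from a lagging replica
```

The failure scenario to name unprompted: if actual lag exceeds the assumed bound, the user sees their own write disappear, which is why some systems track replica positions instead of a fixed timer.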
8. Observability
What metrics and alerts would you add?
- RED metrics: Rate (req/sec), Errors (error rate), Duration (P50/P95/P99 latency)
- Business metrics: the KPI the system exists to move
- Alerting thresholds: what triggers an on-call page (error rate >1%, P99 >500ms)
- Distributed tracing: where you'd add trace IDs for cross-service debugging
- Log strategy: what you'd log at each service boundary (request ID, user ID, latency)
Gap signal: No mention of observability. Interviewers at Staff+ level probe this directly.
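The P50/P95/P99 figures on a RED dashboard come from percentile calculations like this minimal nearest-rank sketch over an in-memory sample (production systems use histogram buckets, as in Prometheus, rather than storing raw latencies):

```python
def percentile(samples, p):
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value at or above the p-th percentile.
    rank = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[rank]

# Hypothetical latency sample with a long tail:
latencies_ms = [12, 15, 11, 240, 14, 13, 16, 12, 500, 15]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

The talking point this sets up: a healthy P50 alongside a bad P99 means a tail problem (retries, GC pauses, a hot shard), not uniform slowness, and averaging latencies hides it entirely.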
Pre-Interview Audit
For each category above, rate yourself 1–3:
- 3: Can speak fluently for 3+ minutes with specific examples
- 2: Know the concepts, weak on specifics
- 1: Fuzzy. Need to review before the interview.
Any category rated 1 is a gap that needs 30–60 minutes of targeted review. Do not walk into the interview loop with unresolved 1s.