System Design Interview Checklist
Eight categories that must be addressed in every system design interview. Use this as a pre-interview audit to identify gaps, not a during-interview crutch.
How to Use This Checklist
This is a preparation audit tool, not a during-interview script. Run through it the night before. Any category where you can't speak fluently for 3 minutes is a gap you need to close.
The 8 categories below map to the phases where system design interviews most commonly expose weaknesses. They are not exhaustive. They are the minimum.
1. Requirements and Scope
Can you do this in 5 minutes or less?
- Identify 3–5 functional requirements without over-specifying
- Establish read/write ratio and dominant traffic pattern
- Confirm consistency requirement: strong, eventual, or read-your-writes
- Confirm availability target: 99.9% (8.7h downtime/year) vs 99.99% (52m/year)
- Identify the primary user action that drives load
Gap signal: You're still asking clarifying questions at minute 8.
2. Capacity Estimation
Can you produce a defensible number in under 3 minutes?
- QPS: DAU × actions/day ÷ 86,400
- Storage: write QPS × payload size × retention period (convert retention to seconds so the units cancel to bytes)
- Bandwidth: peak read QPS × payload size
- Know your units: 1M users × 1 req/day = ~12 req/sec; 100M users = ~1,200 req/sec
- Know when horizontal scale is required: a single well-tuned server handles roughly 10k–50k req/sec for cached or stateless work, far fewer once every request hits a database
Gap signal: You skip estimation and go straight to components. Every interviewer notices.
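The formulas above can be run as a back-of-envelope script. The inputs here (100M DAU, 2 actions/day, 1 KB payload, 30-day retention, 10:1 read/write ratio) are hypothetical numbers for illustration, not targets from any real system:

```python
# Back-of-envelope capacity estimate with hypothetical inputs.
SECONDS_PER_DAY = 86_400

def estimate(dau, actions_per_day, payload_bytes, retention_days, read_write_ratio=10):
    total_qps = dau * actions_per_day / SECONDS_PER_DAY
    write_qps = total_qps / (1 + read_write_ratio)   # 10:1 reads:writes
    storage_bytes = write_qps * payload_bytes * retention_days * SECONDS_PER_DAY
    peak_read_bw = (total_qps - write_qps) * 2 * payload_bytes  # assume peak = 2x average
    return total_qps, write_qps, storage_bytes, peak_read_bw

qps, writes, storage, bw = estimate(100_000_000, 2, 1_000, 30)
print(f"total QPS ~{qps:,.0f}, write QPS ~{writes:,.0f}")
print(f"storage ~{storage / 1e12:.2f} TB, peak read bandwidth ~{bw / 1e6:.1f} MB/s")
```

In the interview you do this mentally with the ~12 req/sec per million-DAU shorthand; the point of the script is to check your arithmetic the night before.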
3. Data Model and Storage
Can you justify your storage choice?
- SQL vs NoSQL: know the decision criteria (ACID transactions and rich joins vs horizontal scale and flexible schemas; structured rows vs documents)
- Primary key design: how you'll query the data drives the key
- Sharding strategy: horizontal (by user_id, by hash), vertical (by feature)
- Know when to use a separate cache vs in-DB caching
- Know when NoSQL falls short: multi-entity transactions, complex joins
Gap signal: "I'll use a database" without specifying what kind or why.
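Horizontal sharding by user_id can be sketched in a few lines (the 16-shard count is a hypothetical choice). The one subtlety worth knowing cold: the hash must be stable across processes and deploys.

```python
import hashlib

NUM_SHARDS = 16  # hypothetical shard count

def shard_for(user_id: str) -> int:
    # Use a stable hash, not Python's built-in hash(), which is
    # randomised per process and would route the same user differently.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same user always lands on the same shard:
assert shard_for("user-42") == shard_for("user-42")
```

Note the weakness of plain hash-mod: changing NUM_SHARDS remaps nearly every key, which is exactly the resharding pain that consistent hashing exists to avoid.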
4. API Design
Can you spec the core endpoints in 2 minutes?
- REST vs GraphQL vs gRPC: know when each applies (REST: public APIs; gRPC: internal high-throughput; GraphQL: flexible client queries)
- Core CRUD endpoints with request/response shape
- Pagination: cursor-based vs offset-based (cursor for large/real-time datasets)
- Rate limiting: where it goes (API gateway, not application layer)
- Auth model: JWT, session, API key, and where it's validated
Gap signal: Designing endpoints without request/response shapes.
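The cursor-vs-offset point deserves a concrete sketch. This minimal example paginates an in-memory list standing in for a table ordered by `(created_at, id)`; the `ROWS` data and field names are hypothetical. A real system would run an indexed query like `WHERE (created_at, id) < cursor ORDER BY ... LIMIT n` instead of filtering in memory:

```python
import base64
import json

# Hypothetical table: 100 rows, newest first when sorted by (created_at, id).
ROWS = [{"id": i, "created_at": 1000 + i} for i in range(100)]

def encode_cursor(row):
    # Opaque cursor = base64 of the last-seen sort key.
    return base64.urlsafe_b64encode(
        json.dumps([row["created_at"], row["id"]]).encode()).decode()

def list_items(limit=20, cursor=None):
    items = sorted(ROWS, key=lambda r: (r["created_at"], r["id"]), reverse=True)
    if cursor:
        after = json.loads(base64.urlsafe_b64decode(cursor))
        items = [r for r in items if [r["created_at"], r["id"]] < after]
    page = items[:limit]
    next_cursor = encode_cursor(page[-1]) if len(items) > limit else None
    return {"items": page, "next_cursor": next_cursor}

page1 = list_items(limit=20)
page2 = list_items(limit=20, cursor=page1["next_cursor"])
```

Unlike an offset, the cursor pins the next page to the last row actually seen, so pages stay stable even when new rows are inserted between requests.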
5. Scalability Patterns
Can you identify the bottleneck and name the fix?
- Read-heavy → read replicas, CDN, cache
- Write-heavy → write-behind cache, message queue, sharding
- Stateful services → sticky sessions or externalise state (Redis)
- Hot key problem in cache → consistent hashing, virtual nodes
- Database bottleneck → connection pooling, read replicas, CQRS
Gap signal: "Add more servers" without specifying where or why.
6. Reliability and Fault Tolerance
What breaks and how does the system recover?
- Single points of failure: identify and name the mitigation (replica, circuit breaker, fallback)
- Cascade failure: circuit breaker prevents one service failure from taking down all callers
- Data loss scenarios: what's the RPO (recovery point objective: acceptable data loss) and RTO (recovery time objective: acceptable downtime)
- Health check strategy: active (probe) vs passive (from real traffic), and what happens on failure
- Idempotency: write operations that can safely be retried
Gap signal: Designing the happy path and nothing else.
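The circuit-breaker point is worth being able to sketch, since interviewers often ask how it actually works. A minimal version with hypothetical threshold and cooldown values: open after N consecutive failures, fail fast while open, then allow a probe call after the cooldown (half-open):

```python
import time

class CircuitBreaker:
    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold   # consecutive failures before opening
        self.cooldown = cooldown     # seconds to stay open before probing
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                return fallback()            # open: fail fast, don't pile on
            self.opened_at = None            # half-open: allow one probe
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0                    # success resets the counter
        return result
```

The key interview point: while open, the breaker returns the fallback without calling the downstream service at all, which is what stops a slow dependency from exhausting every caller's threads.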
7. Consistency and Trade-offs
Can you name the consistency model and defend it?
- CAP theorem: during a network partition, does this system give up consistency or availability? ("Pick two of three" is the shorthand, but partitions aren't optional)
- Eventual consistency: acceptable lag, and what breaks if the lag is 5 minutes
- Strong consistency: when it's required (payments, inventory, auth)
- Read-your-writes: how to implement when data is replicated (sticky reads, write-through cache)
- Distributed transaction pattern: 2PC vs Saga. Know when each applies.
Gap signal: "We'll use eventual consistency" without explaining the failure scenario if that assumption breaks.
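The sticky-reads approach to read-your-writes can be sketched simply: after a user writes, route that user's reads to the primary until the assumed replication lag has elapsed. The 2-second lag bound and the routing labels are hypothetical:

```python
import time

REPLICATION_LAG_S = 2.0   # assumed worst-case replica lag
last_write_at = {}        # user_id -> monotonic timestamp of last write

def on_write(user_id):
    last_write_at[user_id] = time.monotonic()

def choose_target(user_id):
    wrote = last_write_at.get(user_id)
    if wrote is not None and time.monotonic() - wrote < REPLICATION_LAG_S:
        return "primary"   # replicas may not have this user's write yet
    return "replica"       # safe to serve from a lagging replica
```

The failure scenario to name unprompted: if actual lag exceeds the assumed bound, the user sees their own write disappear, which is why some systems track replica positions instead of a fixed timer.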
8. Observability
What metrics and alerts would you add?
- RED metrics: Rate (req/sec), Errors (error rate), Duration (P50/P95/P99 latency)
- Business metrics: the KPI the system exists to move
- Alerting thresholds: what triggers an on-call page (error rate >1%, P99 >500ms)
- Distributed tracing: where you'd add trace IDs for cross-service debugging
- Log strategy: what you'd log at each service boundary (request ID, user ID, latency)
Gap signal: No mention of observability. Interviewers at Staff+ level probe this directly.
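The P50/P95/P99 figures on a RED dashboard come from percentile calculations like this minimal nearest-rank sketch over an in-memory sample (production systems use histogram buckets, as in Prometheus, rather than storing raw latencies):

```python
def percentile(samples, p):
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value at or above the p-th percentile.
    rank = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[rank]

# Hypothetical latency sample with a long tail:
latencies_ms = [12, 15, 11, 240, 14, 13, 16, 12, 500, 15]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

The talking point this sets up: a healthy P50 alongside a bad P99 means a tail problem (retries, GC pauses, a hot shard), not uniform slowness, and averaging latencies hides it entirely.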
Pre-Interview Audit
For each category above, rate yourself 1–3:
- 3: Can speak fluently for 3+ minutes with specific examples
- 2: Know the concepts, weak on specifics
- 1: Fuzzy. Need to review before the interview.
Any category rated 1 is a gap that needs 30–60 minutes of targeted review. Do not walk into the interview loop with unresolved 1s.