Design Netflix Recommendation Engine
Problem Statement
Design the recommendation system that drives Netflix content discovery. 80% of content watched on Netflix is found through recommendations. The system must personalise the homepage for 250M subscribers with updates reflecting user activity within 15 minutes.
Requirements Clarification
Functional:
- Personalise rows and ordering on each subscriber's homepage
- Rank titles by predicted play probability for that subscriber
- Reflect significant user activity (completed show, thumbed down a title) within 15 minutes
- Support A/B testing of recommendation algorithms
- Provide row labels: "Because you watched X"
Non-Functional:
- 250M subscribers
- Recommendation payload available within 200ms of homepage load
- Model retraining: daily offline; feature freshness: < 15 minutes
Feature Store
Recommendations require features for three entities: user (watch history, ratings, time-of-day patterns), item (genre, cast, popularity, freshness), and user-item pair (predicted affinity). Features are pre-computed and stored in a low-latency feature store (Netflix uses EVCache, a distributed Memcached layer). Feature lookup at serving time: ~5ms.
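The access pattern can be sketched with a minimal in-memory store. This is a toy stand-in for EVCache, not its actual API; the entity names and TTL value are illustrative assumptions:

```python
import time

class FeatureStore:
    """Toy in-memory feature store (stand-in for EVCache).

    Features are keyed by (entity_type, entity_id) and carry a TTL so
    stale pre-computed values eventually expire.
    """

    def __init__(self, ttl_seconds=900):  # 15-minute freshness window
        self._data = {}
        self._ttl = ttl_seconds

    def put(self, entity_type, entity_id, features):
        self._data[(entity_type, entity_id)] = (features, time.time())

    def get(self, entity_type, entity_id):
        entry = self._data.get((entity_type, entity_id))
        if entry is None:
            return None
        features, written_at = entry
        if time.time() - written_at > self._ttl:
            return None  # expired: caller falls back to defaults
        return features

store = FeatureStore()
store.put("user", "u42", {"avg_watch_minutes": 48.5, "fav_genre": "thriller"})
store.put("item", "t99", {"genre": "thriller", "popularity": 0.87})

user_features = store.get("user", "u42")
item_features = store.get("item", "t99")
```

Keying by entity type keeps user, item, and user-item features in one namespace so the ranker can fetch all three with the same client.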
Two-Stage Retrieval and Ranking
With 15,000+ titles, scoring all candidates at request time is too slow. The pipeline uses two stages:
Stage 1 (candidate generation, ~50ms): Collaborative filtering via matrix factorisation produces user and item embedding vectors. ANN search retrieves the 200–500 titles nearest to the user's embedding. Content-based and trending signals contribute additional candidates.
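The retrieval step can be sketched as a nearest-neighbour search over embeddings. A real system uses an ANN index; the brute-force cosine scan and toy embeddings below are stand-ins to show the idea:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def generate_candidates(user_vec, item_vecs, k):
    """Return the k titles whose embeddings are nearest to the user's.

    Brute-force scan standing in for an ANN index; the ordering is the
    same, only the lookup cost differs.
    """
    scored = sorted(item_vecs.items(),
                    key=lambda kv: cosine(user_vec, kv[1]),
                    reverse=True)
    return [title for title, _ in scored[:k]]

# Toy embeddings; in production these come from the nightly
# matrix-factorisation run.
user_vec = [0.9, 0.1, 0.3]
item_vecs = {
    "dark_thriller": [0.8, 0.2, 0.4],
    "kids_cartoon":  [0.1, 0.9, 0.1],
    "crime_doc":     [0.7, 0.1, 0.5],
    "baking_show":   [0.2, 0.8, 0.3],
}
candidates = generate_candidates(user_vec, item_vecs, k=2)
```

The user's taste vector sits near the thriller titles, so those are retrieved; cartoons and baking shows never reach the ranking stage.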
Stage 2 (ranking, ~100ms): A lightweight LightGBM model scores all candidates using pre-computed features. The top 50 go through a deep neural network that applies real-time contextual features (device type, time of day, current session length). The final ranking optimises for predicted long-term engagement: a title clicked but abandoned after 5 minutes scores lower than one watched to completion.
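The two-pass ranking can be sketched as follows. The linear first-pass scorer and the completion-weighted second pass are illustrative stand-ins for the LightGBM and DNN models, and all feature names and weights are assumptions:

```python
def light_score(features):
    """Cheap first-pass score: stand-in for the LightGBM model."""
    return 0.6 * features["affinity"] + 0.4 * features["popularity"]

def deep_score(features, context):
    """Second-pass score applying real-time context: stand-in for the DNN.

    Weighting by expected completion mirrors the long-term-engagement
    objective: a completed watch beats an abandoned click.
    """
    score = light_score(features) * features["expected_completion"]
    if context["session_minutes_left"] < features["runtime_minutes"]:
        score *= 0.5  # user is unlikely to finish this title right now
    return score

def rank(candidates, context, rerank_top=2):
    """Score everything cheaply, re-rank only the head expensively."""
    first_pass = sorted(candidates.items(),
                        key=lambda kv: light_score(kv[1]), reverse=True)
    head = sorted(first_pass[:rerank_top],
                  key=lambda kv: deep_score(kv[1], context), reverse=True)
    return [t for t, _ in head] + [t for t, _ in first_pass[rerank_top:]]

candidates = {
    "dark_thriller": {"affinity": 0.9, "popularity": 0.6,
                      "expected_completion": 0.4, "runtime_minutes": 120},
    "crime_doc":     {"affinity": 0.8, "popularity": 0.7,
                      "expected_completion": 0.9, "runtime_minutes": 45},
    "baking_show":   {"affinity": 0.3, "popularity": 0.9,
                      "expected_completion": 0.8, "runtime_minutes": 30},
}
ranking = rank(candidates, context={"session_minutes_left": 60})
```

The thriller wins the cheap first pass, but the contextual second pass demotes it: with an hour left in the session, the 45-minute documentary the user is likely to finish ranks first.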
Near-Real-Time Feature Updates
User activity events (play, complete, thumb up/down) are published to Kafka. A Flink stream processing job updates the user's feature vector in EVCache within 5–10 minutes. The next homepage load picks up the updated features.
Full model retraining runs daily offline. Real-time personalisation comes from feature freshness, not model freshness.
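The core of that stream job is folding each event into the user's feature vector. The plain function below sketches the logic only; the real pipeline runs it inside Flink, consuming from Kafka and writing back to EVCache, and the event schema and feature names are illustrative:

```python
def apply_event(features, event):
    """Fold one activity event into a user feature vector.

    Returns a new dict rather than mutating in place, since the cached
    value is replaced wholesale on write-back.
    """
    features = dict(features)
    if event["type"] == "complete":
        features["completed_titles"] = features.get("completed_titles", 0) + 1
        features.setdefault("recent_genres", []).append(event["genre"])
    elif event["type"] == "thumb_down":
        disliked = set(features.get("disliked_genres", []))
        disliked.add(event["genre"])
        features["disliked_genres"] = sorted(disliked)
    return features

features = {"completed_titles": 3}
events = [
    {"type": "complete", "genre": "thriller"},
    {"type": "thumb_down", "genre": "reality"},
]
for event in events:
    features = apply_event(features, event)
```

After these two events, the next homepage load sees one more completed title, a fresh genre signal, and a genre to downrank, without any model retraining.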
Key Concepts to Master
Consistent hashing: A distributed hashing scheme that minimises key remapping when nodes are added or removed.
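A minimal sketch of the concept, using virtual nodes to spread load; this illustrates the idea rather than any particular library's implementation:

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        # Each physical node gets many points on the ring to even out load.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get(self, key):
        # A key belongs to the first node clockwise from its hash.
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
owner_before = ring.get("user:42")
ring.add("cache-d")
# Only keys whose ring segment cache-d now covers move; everything else
# keeps its original owner.
```

With naive modulo hashing, adding a fourth node would remap roughly three quarters of all keys; here only the keys landing on the new node's segments move.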
Message queue: Asynchronous communication buffer between services. Decouples producers from consumers and provides durability during traffic spikes.
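The decoupling idea in miniature; a real deployment would use Kafka, while this in-process sketch just shows the producer returning immediately and the consumer draining at its own pace:

```python
from collections import deque

class MessageQueue:
    """Tiny in-process queue: a buffer between producer and consumer."""

    def __init__(self):
        self._buffer = deque()

    def publish(self, event):
        self._buffer.append(event)  # producer never waits on the consumer

    def poll(self, max_events=10):
        """Consumer pulls a batch at whatever rate it can sustain."""
        batch = []
        while self._buffer and len(batch) < max_events:
            batch.append(self._buffer.popleft())
        return batch

q = MessageQueue()
for i in range(25):  # traffic spike: 25 events arrive at once
    q.publish({"user": f"u{i}", "type": "play"})
first_batch = q.poll()  # consumer drains 10 at a time, unhurried
```

The spike sits safely in the buffer instead of overwhelming the consumer; Kafka adds what this sketch lacks, namely durability on disk and replay across consumer restarts.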
Distributed cache: A shared cache layer across multiple nodes that absorbs read traffic from the primary database and reduces latency on hot data paths. The difference between a 2ms and a 200ms read at scale.
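The read-through pattern behind that 2ms-vs-200ms difference can be sketched as follows; `db_fetch` is a hypothetical loader standing in for the primary database:

```python
class ReadThroughCache:
    """Serve hot reads from memory; fall back to the primary store on a
    miss and populate the cache for the next reader."""

    def __init__(self, db_fetch):
        self._db_fetch = db_fetch
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            self.hits += 1       # fast path: in-memory read
            return self._cache[key]
        self.misses += 1         # slow path: round trip to the database
        value = self._db_fetch(key)
        self._cache[key] = value
        return value

db = {"title:99": {"genre": "thriller"}}
cache = ReadThroughCache(db.get)
first = cache.get("title:99")   # miss: loads from the database
second = cache.get("title:99")  # hit: served from the cache
```

At homepage scale, the hit path is what keeps feature lookups around the ~5ms mark cited above; only cold keys ever pay the database round trip.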
Further Reading
Resources that cover this problem in depth.