Google · System Design

Design Google Search Autocomplete

Frequency: 90/100 · Scale: 10B queries/day · P99 latency < 100ms

Problem Statement

Design the autocomplete (type-ahead) system for Google Search. As a user types, return the top 5 most likely search completions within 100ms.

Requirements Clarification

Functional:

  • Return top 5 completions on every keystroke
  • Completions ranked by global query frequency, adjusted for recency
  • Support 100+ languages and regional query patterns
  • New trending queries appear in suggestions within 1 hour

Non-Functional:

  • 10B queries/day: ~115k QPS average, 500k QPS at peak
  • P99 latency < 100ms end-to-end (including network RTT from user to edge)
  • Eventual consistency for updates: stale by up to 1 hour is acceptable
  • Suggestions must not include removed or policy-violating queries

Scale Estimation

10B queries per day. Users type 4-6 characters on average before selecting a suggestion, meaning each query generates 4-6 autocomplete requests. Total autocomplete requests: ~50B per day, ~580k per second average. This is the read QPS the serving layer must handle. Write QPS (updating the suggestion index) is comparatively tiny: one hourly batch update.
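The estimates above can be checked with straightforward arithmetic (the numbers are the article's, not measured values):

```python
# Back-of-envelope check of the read QPS estimates above.
SECONDS_PER_DAY = 86_400

queries_per_day = 10e9
search_qps = queries_per_day / SECONDS_PER_DAY  # ~115.7k average search QPS

# Each search generates roughly 4-6 autocomplete requests (one per keystroke);
# use 5 as the midpoint, matching the ~50B/day figure.
keystrokes_per_query = 5
autocomplete_per_day = queries_per_day * keystrokes_per_query   # 50B/day
autocomplete_qps = autocomplete_per_day / SECONDS_PER_DAY       # ~578.7k average
```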

Core Data Structure: Trie

A prefix trie stores all indexed query strings. Each node represents one character. A node for the prefix "py" holds the top-5 completions for any query starting with "py" (python, python tutorial, python download, etc.), pre-computed and cached at the node to avoid subtree traversal on every request.

Trie lookup is O(k) where k is the prefix length. For a 5-character prefix, that is 5 pointer dereferences. Fast enough for any single-node operation.
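A minimal sketch of this structure, with the top-5 list cached at every node on the insertion path so that lookup is a pure O(k) walk with no subtree traversal (simplified; a production build would compute the per-node lists offline rather than maintain them on insert):

```python
class TrieNode:
    __slots__ = ("children", "top5")

    def __init__(self):
        self.children = {}
        self.top5 = []  # pre-computed [(query, freq)] for this prefix

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, query, freq):
        """Offline build step: register one query and update cached top-5s."""
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            node.top5.append((query, freq))
            node.top5.sort(key=lambda x: -x[1])
            del node.top5[5:]  # keep only the five most frequent

    def complete(self, prefix):
        """Serving path: O(k) pointer walk, then read the cached list."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        return [q for q, _ in node.top5]
```

For example, after inserting "python" (freq 900), "pytorch" (500), and "python tutorial" (400), `complete("py")` returns them in frequency order without visiting any descendant nodes.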

The Single-Machine Problem

The full Google query trie contains billions of unique queries across 100+ languages. It does not fit on one machine. Two distribution strategies:

Shard by prefix: A-D on shard 1, E-H on shard 2, and so on. Each shard handles a predictable keyspace. The downside: some prefix ranges are far more popular than others ("w" prefixes dominate because of "weather," "walmart," "wikipedia"). Naive alphabetic sharding creates hotspots.

Shard by hash of prefix: Distribute trie nodes by consistent hashing on the prefix string. Even distribution across shards. The downside: a single query traverses multiple shards (each character may land on a different shard). Latency compounds with hop count.
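The hash-based routing can be sketched as follows. This is a simplified mod-N stand-in for a real consistent-hash ring, with a hypothetical shard count; it illustrates the trade-off, not a production router:

```python
import hashlib

NUM_SHARDS = 64  # hypothetical shard count

def shard_for_prefix(prefix: str) -> int:
    """Route a prefix to a shard by hashing the full prefix string.

    Load is spread evenly, but note the downside described above:
    "p", "py", and "pyt" will usually hash to different shards, so a
    single typing session fans out across several machines.
    """
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```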

In practice: Google uses geographically distributed replicas of the full trie at edge locations, not per-prefix sharding. Serving happens locally at the edge; the complexity is in keeping all edge replicas current, not in routing queries across shards.

Serving Layer Architecture

Edge deployment: Trie replicas are pre-built offline and pushed to edge PoPs globally. A user in Tokyo queries a Tokyo edge node, not a US data center. This is how 100ms P99 is achievable at global scale: the network RTT alone from Tokyo to a US origin would exceed 100ms.

In-memory serving: The trie is loaded entirely into RAM at each edge node. A 10GB trie fits comfortably on modern serving hardware. All lookups are memory operations: no disk I/O in the critical path.

Client-side caching: The client caches completions for already-typed prefixes. If the user types "py" and receives suggestions, the client reuses that response when the user later types "py" again in the same session. This eliminates a significant fraction of network requests.
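The client-side cache is just a per-session prefix map in front of the network call. A minimal sketch (the class and fetch callback are hypothetical, not a real Google client API):

```python
class AutocompleteClient:
    """Per-session prefix cache in front of the autocomplete endpoint."""

    def __init__(self, fetch_fn):
        self._fetch = fetch_fn  # network call: prefix -> list of suggestions
        self._cache = {}        # prefix -> cached response for this session

    def suggest(self, prefix):
        # Re-typing a prefix already seen this session hits the cache
        # and sends no network request.
        if prefix not in self._cache:
            self._cache[prefix] = self._fetch(prefix)
        return self._cache[prefix]
```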

Update Pipeline

Keeping suggestions fresh without disrupting live serving is the operationally complex part.

Query logs from all search traffic flow into a distributed log (Pub/Sub at Google's scale). A Dataflow batch job aggregates query frequencies hourly, computing the top-K queries per prefix across all languages. The aggregation job produces a new trie snapshot.
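The core of the hourly aggregation job, reduced to a single-machine sketch (a stand-in for the distributed Dataflow pipeline, not its actual implementation):

```python
from collections import Counter, defaultdict

def build_prefix_topk(query_log, k=5):
    """Aggregate a raw query log into top-k completions per prefix.

    query_log: iterable of query strings, one per observed search.
    Returns {prefix: [top-k queries by frequency]} -- the per-node
    payload from which the new trie snapshot is built.
    """
    freq = Counter(query_log)
    per_prefix = defaultdict(Counter)
    for query, count in freq.items():
        # Credit this query's count to every prefix of itself.
        for i in range(1, len(query) + 1):
            per_prefix[query[:i]][query] = count
    return {p: [q for q, _ in c.most_common(k)] for p, c in per_prefix.items()}
```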

The new snapshot is built offline, validated (checking for policy violations, factual errors in suggestions), and pushed to edge nodes. Edge nodes perform an atomic in-memory swap: the old trie pointer is replaced with the new one in a single operation. Requests in flight during the swap complete against whichever version they started with. No downtime, no partial states.
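The atomic swap is a single reference rebind. A sketch of the edge node's serving object (using a plain dict as a stand-in for the trie; in CPython, rebinding one attribute is atomic, so readers never see a partial state):

```python
class TrieServer:
    """Edge node holding the live trie snapshot behind one reference."""

    def __init__(self, trie):
        self._trie = trie  # the live snapshot

    def lookup(self, prefix):
        trie = self._trie            # grab the current snapshot once;
        return trie.get(prefix, [])  # in-flight requests keep using it

    def swap(self, new_trie):
        # Replacing the pointer is a single atomic store: no lock,
        # no downtime, no partially-updated trie visible to readers.
        self._trie = new_trie
```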

Personalisation and Regional Bias

Global query frequency is the baseline ranking signal. Two adjustments are applied at serving time:

Regional bias: "Curry" suggests "Curry recipe" in most regions but "Stephen Curry stats" in NBA-heavy US regions. Region-specific frequency tables weight the ranking. These are pre-computed per region and stored alongside the global trie.

Session context: If the user's current query session contains searches about Python, "print" as a prefix is more likely to resolve to "print python" than "print shop near me." Session signals are applied as a lightweight re-ranking layer on top of the trie results, not baked into the trie itself.
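A minimal sketch of that re-ranking layer, with a hypothetical boost factor (the real signal combination is certainly more involved):

```python
def rerank(candidates, session_terms, boost=2.0):
    """Re-rank trie results using session context.

    candidates: [(completion, global_score)] straight from the trie node.
    session_terms: set of tokens from the user's recent queries.
    A completion sharing any token with the session gets its score boosted;
    the trie's cached lists are never modified.
    """
    def score(item):
        completion, global_score = item
        tokens = set(completion.split())
        return global_score * (boost if tokens & session_terms else 1.0)
    return [c for c, _ in sorted(candidates, key=score, reverse=True)]
```

With a session containing "python", the source's example works out: `rerank([("print shop near me", 100), ("print python", 60)], {"python"})` promotes "print python" above the globally more frequent completion.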

Safety and Policy Filtering

Query completions cannot suggest harmful, illegal, or policy-violating content. A blocklist of prohibited completions is applied as a post-processing filter before results are returned. The blocklist is separate from the trie and updated independently: adding an entry to the blocklist removes a completion from suggestions within minutes, without waiting for the next hourly trie rebuild.
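Because the filter sits outside the trie, it is trivially a post-processing step on the serving path. A sketch:

```python
def filter_suggestions(suggestions, blocklist):
    """Safety filter applied after the trie lookup, before returning results.

    The blocklist lives outside the trie, so a new entry takes effect on
    the very next request -- no trie rebuild or edge push required.
    """
    return [s for s in suggestions if s not in blocklist]
```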

Interview Tip

Most candidates describe the trie and the push-based update pipeline correctly. The question that separates strong answers is latency: "how do you achieve 100ms P99 globally?" The expected answer is edge-deployed in-memory replicas, not a central serving cluster. Candidates who rely on a single origin with a CDN in front misunderstand the problem: CDNs cache static assets, not dynamic query results. Understanding that the trie must be replicated to the edge, updated atomically, and served from RAM is what demonstrates production-level system design thinking.