Glossary: AI Systems

Attention Mechanism

A neural network operation that lets each element in a sequence selectively weight all elements of the sequence, itself included, when computing its representation, enabling context-aware processing across arbitrary distances.

Why RNNs Failed at Long Sequences

Before transformers, recurrent neural networks (RNNs and LSTMs) were the standard for sequence modeling. They processed tokens one at a time, maintaining a hidden state that carried information forward. The vanishing gradient problem meant that information from early tokens became exponentially diluted by the time it influenced predictions for tokens 100 or 200 positions later. An RNN asked to translate a sentence where the grammatical subject appears at position 1 and the verb at position 80 could not reliably connect the two. Long-range dependencies were the fundamental limitation.

Attention solves this by allowing every token to directly attend to every other token in the sequence, regardless of distance. A token at position 80 can directly access the representation of the token at position 1 without the information passing through 79 intermediate states.

Query, Key, Value

The attention mechanism projects each token into three vectors: a Query (Q), a Key (K), and a Value (V). The intuition without linear algebra:

The Query represents what this token is looking for in the sequence. The Key represents what this token contains or advertises to other tokens. The Value represents the information this token will contribute if selected.

Attention for a given query token proceeds in three steps. First, compute the dot product between the query vector and every key vector in the sequence; this produces a score for how relevant each token is to the query. Second, apply softmax to turn the scores into a probability distribution over all positions. Third, multiply each value vector by its attention weight and sum. The result is a weighted combination of all values, concentrated on the tokens most relevant to the query.

The full formula: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V, where d_k is the dimension of the key vectors.
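The steps and formula above can be sketched directly in NumPy. This is a minimal illustration with random toy inputs, not production code; the shapes and sizes are arbitrary choices for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention. Q, K: (n, d_k); V: (n, d_v) -> (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n, n) relevance scores
    weights = softmax(scores, axis=-1)    # each row is a distribution over positions
    return weights @ V                    # weighted combination of values

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8                     # toy sizes for illustration
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Each output row is a convex combination of the value vectors, which is why a token at any position can draw directly on any other position's content.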

Why the Scaling Factor Matters

The dot product QK^T grows in magnitude as d_k increases. Large dot products push the softmax into saturation: the distribution becomes near-one-hot, and gradients vanish during training. Dividing by sqrt(d_k) keeps the dot products in a range where softmax produces a meaningful distribution and gradients flow properly. This is not an approximation; it is essential for training stability.
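The saturation effect is easy to demonstrate: with random vectors of unit-variance components, dot products have standard deviation on the order of sqrt(d_k), so unscaled scores spread the softmax into a near-one-hot distribution. A small sketch with arbitrary toy dimensions:

```python
import numpy as np

def softmax(x):
    x = x - x.max()  # numerical stability
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=d_k)
keys = rng.normal(size=(8, d_k))
scores = keys @ q                        # raw dot products, std ~ sqrt(d_k)

unscaled = softmax(scores)               # near-one-hot: saturated
scaled = softmax(scores / np.sqrt(d_k))  # scores back near unit scale

print(unscaled.max(), scaled.max())      # the unscaled distribution is far peakier
```

In the saturated regime, nearly all gradient signal flows through a single position, which is the training-stability problem the sqrt(d_k) divisor prevents.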

Multi-Head Attention

A single attention head learns one type of relationship. Multi-head attention runs h attention operations in parallel, each with its own Q, K, V projection matrices. The outputs are concatenated and linearly projected. Different heads specialize: one head might capture syntactic relationships (subject-verb), another semantic similarity (synonyms), another positional proximity. In BERT-large, 16 heads operate simultaneously. The total computation is not more expensive than a single attention operation of the same total dimension, because each head operates on a d_k/h dimensional subspace.
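A sketch of the split-attend-concatenate-project structure, with random matrices standing in for learned projection weights (an illustrative assumption; real implementations use separate learned matrices per head and batch the heads into one tensor operation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (n, d_model). Each head attends over a d_model/h subspace."""
    n, d_model = X.shape
    d_head = d_model // h
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # (n, d_model) each
    outs = []
    for i in range(h):                                # one attention op per head
        s = slice(i * d_head, (i + 1) * d_head)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        outs.append(softmax(scores) @ V[:, s])        # (n, d_head)
    return np.concatenate(outs, axis=-1) @ Wo         # concat heads, final projection

rng = np.random.default_rng(0)
n, d_model, h = 6, 32, 4                              # toy sizes
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, h).shape)  # (6, 32)
```

Because each head works in a d_model/h subspace, the h heads together cost roughly the same as one full-dimension attention operation.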

Quadratic Complexity

The core bottleneck: computing QK^T requires comparing every token to every other token, producing an n x n attention matrix where n is the sequence length. Memory and computation both grow as O(n^2). At a sequence length of 1,024 tokens, the attention matrix has 1M entries. At 128K tokens (a long document), it has 16 billion entries, which is infeasible with standard attention.
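The entry counts above follow from simple arithmetic; a back-of-envelope sizing, assuming fp16 (2 bytes per entry) for the memory figures:

```python
# n x n attention matrix: entry count and naive fp16 memory footprint
for n in (1_024, 128_000):
    entries = n * n
    print(f"n={n:>7,}: {entries:,} entries, {entries * 2 / 1e9:.3f} GB at fp16")
```

At 128K tokens the matrix alone would exceed the memory of most accelerators, which is why the full matrix is never materialized in practice.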

This quadratic bottleneck has driven significant research. Flash Attention reduces memory from O(n^2) to O(n) by computing attention in tiles and never materializing the full matrix, making 100K-token contexts practical. Sparse attention variants (Longformer, BigBird) attend only to a local window plus selected global tokens, reducing complexity to O(n log n) or O(n). Linear attention approximations replace the softmax with alternative kernel functions that allow factoring the computation.

Positional Encoding

Self-attention has no inherent notion of order: shuffling the input tokens produces the same attention weights (the mechanism is permutation-equivariant). Positional encoding adds position information. Original transformers used fixed sinusoidal encodings added to the token embeddings. Modern LLMs use Rotary Position Embedding (RoPE), which encodes relative positions by rotating Q and K vectors, allowing the model to generalize to sequence lengths longer than those seen during training.
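The original sinusoidal scheme can be written in a few lines: even dimensions get sin(pos / 10000^(2i/d)), odd dimensions the matching cosine, and the result is added to the token embeddings. A minimal sketch with arbitrary toy sizes:

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model):
    """Fixed sinusoidal positional encodings, shape (n_positions, d_model)."""
    pos = np.arange(n_positions)[:, None]        # (n, 1) position indices
    i = np.arange(0, d_model, 2)[None, :]        # (1, d/2) even dimension indices
    angles = pos / (10000 ** (i / d_model))      # geometric range of wavelengths
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dims: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dims: cosine
    return pe

pe = sinusoidal_encoding(128, 64)
print(pe.shape)  # (128, 64)
# pe is added elementwise to the token embeddings before the first layer
```

Each position gets a unique pattern of phases, and any fixed offset between positions corresponds to a linear transformation of the encoding, which is what lets attention learn relative-position patterns.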

Beyond Language Models

Attention appears in computer vision through Vision Transformers (ViT), which split an image into 16x16 pixel patches and apply self-attention across patches. AlphaFold uses attention across amino acid sequences and between sequences and multiple sequence alignments. Diffusion models use cross-attention between text embeddings and image feature maps to condition image generation on text prompts.

Interview Tip

The Q/K/V intuition without linear algebra is a standard probe. Many candidates can recite the formula but cannot explain the query-as-search, key-as-advertisement, value-as-content framing. The second probe: why does quadratic complexity limit context length, and what approaches address it? Flash Attention (IO-aware tiling, same asymptotic complexity but 3-5x faster with 10x less memory) is the most important answer because it is what made 100K+ token contexts practical in production systems. Mentioning the distinction between Flash Attention (algorithmic, exact) and sparse/linear attention (approximate, changes the computation) demonstrates depth that separates L5 from L6 answers.