Embeddings
Numerical vector representations of text, images, or other data that encode semantic meaning. The translation layer that converts unstructured content into a form that can be compared mathematically.
What Embeddings Are and Why They Exist
A string comparison between "dog" and "canine" returns no match, even though the two words mean nearly the same thing. Their embeddings will be close in vector space because the embedding model has learned that these concepts are semantically related. Embeddings close the symbolic gap between tokens by turning semantic relatedness into geometric proximity.
More precisely: an embedding model maps an input (token, sentence, image, audio clip) to a fixed-length vector of floating-point numbers, typically 384 to 3072 dimensions depending on the model. The model is trained so that semantically similar inputs produce vectors with high cosine similarity.
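"High cosine similarity" is easy to compute directly. A minimal sketch, using toy 4-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and the values below are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors: "dog" and "canine" point in nearly the same direction,
# "car" does not.
dog = [0.8, 0.1, 0.3, 0.5]
canine = [0.7, 0.2, 0.3, 0.6]
car = [-0.4, 0.9, -0.2, 0.1]

print(cosine_similarity(dog, canine))  # close to 1.0
print(cosine_similarity(dog, car))     # much lower
```

Cosine similarity ranges from -1 to 1 and ignores vector magnitude, which is why many embedding models ship vectors pre-normalised to unit length (making cosine similarity equal to a plain dot product).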
How They Are Generated
Embedding models are typically transformer-based encoders trained with contrastive learning objectives. During training, pairs of semantically similar inputs are pushed toward each other in vector space; dissimilar pairs are pushed apart. The resulting model generalises this geometric structure to inputs it has never seen.
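The "push together / push apart" objective can be made concrete with a toy InfoNCE-style loss. This is an illustrative sketch of the loss computed per anchor, not any particular model's training code; the temperature value is a common default, not a fixed standard:

```python
import math

def info_nce_loss(anchor, positive, negatives, temperature=0.07):
    """Toy InfoNCE-style contrastive loss for one anchor.

    The loss is low when the anchor scores its positive pair higher
    than all negatives; minimising it pulls positives together and
    pushes negatives apart in vector space.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Similarity of the anchor to the positive, then to each negative.
    scores = [dot(anchor, positive) / temperature] + [
        dot(anchor, n) / temperature for n in negatives
    ]
    # Cross-entropy against the positive: -log softmax(scores)[0].
    max_s = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - max_s) for s in scores]
    return -math.log(exps[0] / sum(exps))
```

With a well-placed positive the loss is near zero; if a negative outranks the positive, the loss is large, and gradient descent on it reshapes the encoder's geometry accordingly.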
For text, the process is: tokenise the input, run it through the encoder, then pool the token-level representations into a single fixed-length vector (usually by averaging them, or by taking the [CLS] token's representation). For images, a vision encoder (ViT- or CNN-based) maps pixel data into the same geometric space when the system is multimodal.
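The pooling step is the simplest part of the pipeline to show directly. A minimal sketch of mean pooling, assuming the encoder has already produced one vector per token:

```python
def mean_pool(token_vectors):
    """Average token-level vectors into one fixed-length sentence vector.

    token_vectors: list of equal-length lists, one per token.
    Returns a single vector with the same dimensionality.
    """
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

# Three toy token vectors -> one 2-dimensional sentence vector.
sentence_vec = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(sentence_vec)  # [3.0, 4.0]
```

Production implementations also mask out padding tokens before averaging so that padded positions do not dilute the result.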
Choosing the Right Embedding Model
The choice determines your retrieval quality ceiling. Three dimensions matter:
Dimensionality: Higher-dimensional embeddings can encode more information but cost more to store and query. text-embedding-3-large (OpenAI) produces 3072-dimensional vectors; all-MiniLM-L6-v2 (open-source, 384 dimensions) is 8x smaller, so roughly 8x cheaper to store and query, with modest quality trade-offs. For most RAG workloads, 768 dimensions is the practical sweet spot.
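The storage cost difference is simple arithmetic. A sketch, assuming float32 vectors and ignoring index overhead and metadata:

```python
def index_size_gb(num_vectors, dims, bytes_per_float=4):
    """Raw vector storage in GB (float32 by default),
    excluding index structures and metadata."""
    return num_vectors * dims * bytes_per_float / 1e9

# 10M documents:
print(index_size_gb(10_000_000, 3072))  # 122.88 GB at 3072 dims
print(index_size_gb(10_000_000, 384))   # 15.36 GB at 384 dims
```

The 8x ratio compounds: query latency, RAM for in-memory indexes, and network transfer all scale with dimensionality, which is why the dimension choice is made before the corpus is embedded, not after.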
Context window: Embedding models have input length limits (typically 512–8192 tokens). Text exceeding this limit must be chunked before embedding. The chunking boundary affects retrieval quality: a poorly placed chunk break splits a logical unit and degrades the embedding quality for that chunk.
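A naive chunker makes the boundary problem visible. This sketch splits a token sequence into overlapping fixed-size windows; the window and overlap sizes are illustrative defaults, and the naivety (ignoring sentence and section boundaries) is exactly what degrades embedding quality in practice:

```python
def chunk_tokens(tokens, max_len=512, overlap=64):
    """Split a token sequence into overlapping windows that fit the
    embedding model's input limit.

    Naive fixed-stride chunking: it can split a sentence or logical
    unit mid-thought, which is the boundary-placement problem the
    text describes. Overlap softens, but does not fix, that problem.
    """
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
    return chunks

# A 1000-token document with a 512-token limit and 64-token overlap
# yields three chunks; consecutive chunks share their boundary tokens.
chunks = chunk_tokens(list(range(1000)))
print([len(c) for c in chunks])
```

Smarter strategies (sentence-aware splitting, recursive splitting on structural markers) keep the same windowing idea but choose break points that respect logical units.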
Domain fit: General-purpose embeddings underperform on domain-specific corpora. A legal document retrieval system will see materially better recall using a model fine-tuned on legal text versus text-embedding-3-large out of the box. The MTEB leaderboard benchmarks models across domains and tasks.
The Model Pinning Problem
All vectors in a collection must be produced by the same model at the same version. If you switch models or update the model version, every stored vector is incompatible with new query vectors. A full re-embedding of the corpus is required. At 10M documents, that is a significant cost and operational event. Pin the embedding model version before going to production and treat changes as migrations.
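One way to enforce pinning is to store the model identity alongside every vector and refuse cross-model comparisons. A hypothetical sketch; the record shape, field names, and version label are all illustrative, not any particular vector database's schema:

```python
# Pinned at deployment time; changing either value is a migration
# that requires re-embedding the full corpus.
EMBEDDING_MODEL = "text-embedding-3-large"
EMBEDDING_MODEL_VERSION = "v1"  # hypothetical version label

def make_record(doc_id, vector):
    """Store the model identity with every vector at write time."""
    return {
        "id": doc_id,
        "vector": vector,
        "model": EMBEDDING_MODEL,
        "model_version": EMBEDDING_MODEL_VERSION,
    }

def check_compatible(record):
    """Reject vectors produced by a different model or version:
    their geometry is meaningless against current query vectors."""
    found = (record["model"], record["model_version"])
    expected = (EMBEDDING_MODEL, EMBEDDING_MODEL_VERSION)
    if found != expected:
        raise ValueError(
            f"vector from {found} is incompatible with queries from {expected}"
        )
```

The guard turns a silent quality regression (mixed-model similarity scores that look plausible but are garbage) into a loud failure at query time.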
Embedding vs Fine-Tuning
Embeddings power retrieval (finding relevant documents). Fine-tuning adjusts the model's weights to change its generation behaviour. These are separate mechanisms targeting different failure modes: if the model retrieves correctly but generates poorly, fine-tune. If the model generates plausibly but retrieves wrong content, improve the embedding pipeline. Conflating the two is a common architectural mistake.
Interview Tip
The question worth preparing for: "Your RAG pipeline has high recall but users still get bad answers. Where do you look?" The expected answer works through the pipeline systematically: embedding model fit for the domain, chunking strategy (chunk size, overlap, boundary placement), retrieval stage (hybrid dense+sparse search, reranking), and finally the generation prompt. Candidates who jump straight to "fine-tune the LLM" have skipped three prior failure modes.
Related Concepts
A shared cache layer across multiple nodes used to absorb read traffic from the primary database and reduce latency on hot data paths. The difference between a 2ms and a 200ms read at scale.
Object storage for unstructured binary data: images, videos, documents, ML model weights. Designed for durability and throughput at scale, not low-latency random access.
A database purpose-built to store and query high-dimensional embedding vectors. The retrieval layer that makes semantic search and RAG pipelines possible at production scale.