Bloom Filter
A space-efficient probabilistic data structure that tests set membership with no false negatives and a tunable false positive rate.
The Problem It Solves
Checking whether an element exists in a set is trivial with a hash table, but hash tables require storing the elements themselves. For 1 billion URLs (averaging 100 bytes each), a hash set requires roughly 100GB of memory. At the scale of a web crawler, a CDN cache, or a distributed database, that storage cost is prohibitive. The question is whether you can answer "have I seen this URL before?" without storing the URLs.
A Bloom filter answers this question using approximately 10 bits per element at a 1% false positive rate: 1 billion URLs fit in about 1.2GB, an 80x reduction. The trade-off: the filter can say "definitely not in the set" or "probably in the set." It cannot say "definitely in the set." This sounds limiting, but for most use cases (avoiding an expensive lookup), a small false positive rate is an acceptable price for the memory savings.
How It Works
A Bloom filter is a bit array of m bits, initialized to zero. It uses k independent hash functions, each producing a position in the bit array.
To insert an element: apply all k hash functions to the element, producing k positions. Set the bit at each position to 1.
To query an element: apply the same k hash functions, producing k positions. If all k bits are 1, the element is probably in the set. If any bit is 0, the element is definitely not in the set.
False positives occur because bits set by other elements can coincidentally form a positive pattern for a queried element. False negatives are impossible: if an element was inserted, all k of its bits were set to 1, and they remain 1 (bits are never cleared). A query will always find them.
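The insert and query paths above can be sketched in a few lines. This is an illustrative minimal implementation, not a production one: it simulates k independent hash functions by deriving them from two halves of a single SHA-256 digest (the Kirsch-Mitzenmacher double-hashing trick, a common technique, though the original text does not specify a hashing scheme).

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: an m-bit array and k derived hash positions."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # m bits, initialized to zero

    def _positions(self, item: str):
        # Derive k positions from two base hashes (double hashing), a
        # standard way to simulate k independent hash functions.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)  # set each of the k bits

    def __contains__(self, item: str) -> bool:
        # All k bits set -> "probably present"; any bit zero -> "definitely absent"
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))

bf = BloomFilter(m=1000, k=7)
bf.add("https://example.com")
print("https://example.com" in bf)  # True: inserted elements are always found
```

Note that `add` only ever sets bits to 1, which is exactly why false negatives are impossible in the standard filter.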
False Positive Rate
The false positive rate depends on three variables: the number of hash functions k, the number of bits m, and the number of elements inserted n. The approximation is (1 - e^(-kn/m))^k. For a 1% false positive rate, you need roughly 9.6 bits per element and 7 hash functions, independent of the absolute set size. For 0.1%, it is 14.4 bits per element. The optimal number of hash functions for a given m and n is (m/n) * ln(2).
These parameters are chosen at construction time based on the expected number of elements and the tolerable false positive rate. The filter cannot be resized without rebuilding.
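The construction-time sizing follows directly from the formulas above. A short sketch, using the standard derivation m = -n·ln(p)/ln(2)² together with the optimal k = (m/n)·ln(2) stated in the text (the function name is my own):

```python
import math

def bloom_parameters(n: int, p: float) -> tuple[int, int]:
    """Return (m, k): bit-array size and hash count for n expected
    elements and target false positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # bits needed
    k = round((m / n) * math.log(2))                      # optimal hash count
    return m, k

# 1% target reproduces the figures above: ~9.6 bits/element, 7 hashes
m, k = bloom_parameters(1_000_000, 0.01)
print(m / 1_000_000, k)  # -> 9.585059 7
```

Because m and k are fixed at construction, inserting far more than n elements silently degrades the false positive rate, which is why the filter must be rebuilt rather than resized.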
Cannot Delete Elements
Standard Bloom filters do not support deletion. Clearing a bit to "remove" an element would also clear bits shared with other elements, producing false negatives. The counting Bloom filter variant replaces each bit with a small counter (typically 4 bits), supporting deletions at the cost of 4x memory. This variant is rarely used in practice because it loses the primary space advantage.
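The counting variant swaps each bit for a counter, so a shared slot decremented by one deletion still stays positive for the other elements that set it. A hypothetical minimal sketch (plain Python ints stand in for the 4-bit counters a real implementation would pack):

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter sketch: one counter per slot instead of one bit."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counters = [0] * m  # real implementations pack 4-bit counters

    def _positions(self, item: str):
        # Same double-hashing scheme as a standard filter (an assumption,
        # not mandated by the variant itself).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.counters[pos] += 1  # increment instead of set

    def remove(self, item: str) -> None:
        for pos in self._positions(item):
            self.counters[pos] -= 1  # shared slots remain > 0

    def __contains__(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))

cbf = CountingBloomFilter(m=1000, k=7)
cbf.add("alpha")
cbf.add("beta")
cbf.remove("alpha")  # "beta" remains queryable; "alpha" is gone
```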
Production Usage
Cassandra uses Bloom filters to avoid unnecessary disk reads. When a query arrives for a row, Cassandra first checks the Bloom filter for each SSTable. If the filter says "not present," the SSTable is skipped entirely, avoiding a disk read that would return nothing. This reduces read I/O by 50-90% for missing-key queries in typical workloads.
Web crawlers use Bloom filters for URL deduplication. Instead of querying a database for every discovered URL to check if it has been seen, the crawler checks the Bloom filter. At a 0.1% false positive rate, 0.1% of unseen URLs are incorrectly skipped, an acceptable trade for eliminating database lookups.
Interview Tip
Two things separate strong answers. First: explain why false negatives are impossible (inserted elements' bits are permanently set; a query will always find them). Second: address the space trade-off with actual numbers. "It's space-efficient" is too vague. "At a 1% false positive rate, you need approximately 10 bits per element, compared to 800 bits for a 100-byte element in a hash set" is the answer that signals production-level thinking. Bonus: name a real system that uses them (Cassandra, BigTable, and HBase all use Bloom filters to avoid disk reads for missing keys). The senior-level (L5+) addition: explain the parameter choice trade-off. A lower false positive rate requires more bits and more hash functions, increasing CPU cost per lookup. For a given m and n, k = (m/n) * ln(2) minimizes the false positive rate; adding hash functions beyond that costs CPU while making the rate worse. This is why pre-computing Bloom filter parameters from the expected set size and acceptable false positive rate is standard practice rather than guessing.
Related Concepts
Sharding is the horizontal partitioning of a database across multiple machines to distribute load beyond a single server's capacity.
A distributed cache is a shared cache layer across multiple nodes used to absorb read traffic from the primary database and reduce latency on hot data paths. The difference between a 2ms and a 200ms read at scale.
Log-Structured Merge Tree is a storage structure that converts random writes to sequential writes by buffering mutations in memory before flushing sorted files to disk.