Design Azure Blob Storage
Problem Statement
Design a distributed object storage system equivalent to Azure Blob Storage. The system stores arbitrary binary objects identified by a key within a container. It must provide 11 nines of durability, high availability, and throughput at exabyte scale.
Requirements Clarification
Functional:
- Upload and download blobs up to 5TB
- List blobs within a container
- Metadata on blobs (content type, custom key-value pairs)
- Multipart upload for large blobs
- Access control at blob, container, and account level
Non-Functional:
- Durability: 11 nines (99.999999999%)
- Availability: 99.99% reads, 99.9% writes
- Strong read-after-write consistency within a region
- Throughput: millions of requests/second globally
Architecture
Partition Layer
Blobs are addressed by (account, container, blob_name). The partition layer maps this triple to a storage node using consistent hashing. A distributed partition map service (backed by Paxos) tracks which storage node owns each key range. The partition layer handles routing, load balancing, and partition splits when a key range becomes a hotspot.
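The routing step can be sketched as a consistent-hash ring. This is a minimal illustration, not the production data structure; the class and method names (`ConsistentHashRing`, `node_for`) are hypothetical, and virtual nodes are included because they smooth load when the physical node count is small.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash of a string key; MD5 is fine for placement (not security).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes, vnodes=100):
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add_node(node, vnodes)

    def add_node(self, node, vnodes=100):
        # Each physical node owns many points on the ring, so adding or
        # removing a node remaps only a small fraction of keys.
        for i in range(vnodes):
            self._ring.append((_hash(f"{node}#{i}"), node))
        self._ring.sort()

    def node_for(self, account, container, blob_name):
        # Route the (account, container, blob_name) triple to the first
        # ring point at or after the key's hash, wrapping around.
        key_hash = _hash(f"{account}/{container}/{blob_name}")
        hashes = [h for h, _ in self._ring]
        idx = bisect.bisect_right(hashes, key_hash) % len(self._ring)
        return self._ring[idx][1]
```

In a real system the ring (or range map) lives in the Paxos-backed partition map service, and clients cache it; the sketch keeps it in-process for clarity.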
Storage Layer
Each storage node writes blobs to local drives. Small blobs (under 64MB) are written directly. Large blobs are split into 4MB chunks: each chunk is written synchronously to three storage nodes before the client receives an acknowledgment. This is the durability guarantee: the write is not complete until three independent copies exist.
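The chunked, fully-acknowledged write path can be sketched as follows. This is a hedged illustration under assumed interfaces: `node.store` and `placement.replicas_for` are hypothetical names standing in for the real storage-node RPC and the partition layer's placement lookup.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB chunks for large blobs

def write_chunk(chunk: bytes, replicas) -> bool:
    """Write one chunk to every replica in parallel; succeed only if all ack."""
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        acks = list(pool.map(lambda node: node.store(chunk), replicas))
    return all(acks)

def put_blob(data: bytes, placement) -> bool:
    """Split a large blob into 4 MB chunks and replicate each synchronously."""
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        if not write_chunk(chunk, placement.replicas_for(offset)):
            # Durability guarantee not met: fewer than three copies exist,
            # so the client must see a failure rather than a success.
            return False
    return True
```

Waiting for all three acknowledgments (rather than a quorum of two) is what makes the durability claim hold at write time; the trade-off is higher tail latency on slow replicas.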
Geo-redundancy adds asynchronous replication to a secondary region 500+ miles away. Three synchronous copies cover node and drive failures. The geo-secondary covers data center and regional disasters.
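The asynchronous geo-replication step can be sketched as a background queue drained toward the secondary region. This is a simplified single-process stand-in (names like `GeoReplicator` and `secondary.store` are hypothetical); a real system ships a durable replication log, not an in-memory queue.

```python
import queue
import threading

class GeoReplicator:
    """Async replication to a secondary region via a background queue (sketch)."""

    def __init__(self, secondary):
        self._pending = queue.Queue()
        self._secondary = secondary
        threading.Thread(target=self._drain, daemon=True).start()

    def enqueue(self, blob_name, data):
        # The primary region acks the client before this entry is applied
        # remotely, so the geo-secondary may lag behind the primary.
        self._pending.put((blob_name, data))

    def _drain(self):
        while True:
            blob_name, data = self._pending.get()
            self._secondary.store(blob_name, data)
            self._pending.task_done()
```

The key property this models: client-visible write latency depends only on the three synchronous in-region copies, while the cross-region copy trades recency for distance.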
Erasure Coding for Cold Tiers
Hot blobs use 3x replication for low-latency reads. Cold blobs transition to erasure coding (Reed-Solomon 12+4): the blob is split into 12 data fragments plus 4 parity fragments stored across 16 drives. Storage overhead drops from 3x to 1.33x, and the system still tolerates the loss of any 4 fragments, since any 12 of the 16 suffice to reconstruct the blob. Read latency increases because a read that hits a failed drive must rebuild the missing fragment from 12 surviving fragments, but cold-tier SLAs (minutes, not milliseconds) make this acceptable.
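The overhead and fault-tolerance arithmetic is worth making explicit. A small sketch (function names are illustrative, not a real library API):

```python
def rs_overhead(data_fragments: int, parity_fragments: int) -> float:
    """Storage overhead of a Reed-Solomon (k+m) layout relative to raw size."""
    return (data_fragments + parity_fragments) / data_fragments

def tolerated_losses(parity_fragments: int) -> int:
    """RS(k+m) reconstructs from any k fragments, so m losses are survivable."""
    return parity_fragments

print(round(rs_overhead(12, 4), 2))  # 12+4 cold-tier layout -> 1.33
print(rs_overhead(1, 2))             # 3x replication in the same terms -> 3.0
```

Note that 3x replication also tolerates 2 lost copies, so RS 12+4 gives strictly better fault tolerance (4 losses) at less than half the storage cost; what it gives up is read and repair latency.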
Multipart Upload
For blobs over 100MB, the client splits the blob into chunks, uploads each part with a part number, then sends a complete-multipart request. The service stores chunk pointers in a manifest without copying data. Assembly is O(1) regardless of blob size. The client can upload chunks in parallel, which saturates available bandwidth on large files.
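The manifest-based assembly can be sketched as follows. All names here (`MultipartUpload`, `initiate`, `upload_part`, `complete`) are hypothetical stand-ins for the service API; the point is that `complete` records chunk pointers and copies no data.

```python
import uuid

class MultipartUpload:
    """Sketch of manifest-based multipart upload (hypothetical API names)."""

    def __init__(self):
        self._uploads = {}  # upload_id -> {part_number: chunk_pointer}
        self._blobs = {}    # blob_name -> ordered list of chunk pointers

    def initiate(self) -> str:
        upload_id = uuid.uuid4().hex
        self._uploads[upload_id] = {}
        return upload_id

    def upload_part(self, upload_id: str, part_number: int, chunk_pointer: str):
        # Parts may arrive in any order and in parallel; the part number
        # determines their final position in the blob.
        self._uploads[upload_id][part_number] = chunk_pointer

    def complete(self, upload_id: str, blob_name: str):
        parts = self._uploads.pop(upload_id)
        # Assemble by recording pointers in part order. No chunk data moves,
        # so completion cost does not grow with blob size.
        self._blobs[blob_name] = [parts[n] for n in sorted(parts)]
```

Because parts are independent, the client can retry a single failed part without restarting the whole transfer, which is the other practical win of multipart upload.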
Key Concepts to Master
- Consistent hashing: a distributed hashing scheme that minimizes key remapping when nodes are added or removed.
- Distributed consensus: the problem of getting distributed nodes to agree on a single value despite network failures and partial outages. Protocols like Paxos and Raft are the theoretical foundation behind etcd, ZooKeeper, and Kafka leader election.
- Object storage: storage for unstructured binary data (images, videos, documents, ML model weights), designed for durability and throughput at scale, not low-latency random access.
Further Reading
Resources that cover this problem in depth.