Microsoft · System Design

Design Azure Blob Storage

Frequency: 80/100
Scale: Exabytes of data, millions of requests/second globally

Problem Statement

Design a distributed object storage system equivalent to Azure Blob Storage. The system stores arbitrary binary objects identified by a key within a container. It must provide 11 nines of durability, high availability, and throughput at exabyte scale.

Requirements Clarification

Functional:

  • Upload and download blobs up to 5TB
  • List blobs within a container
  • Metadata on blobs (content type, custom key-value pairs)
  • Multipart upload for large blobs
  • Access control at blob, container, and account level

Non-Functional:

  • Durability: 11 nines (99.999999999%)
  • Availability: 99.99% reads, 99.9% writes
  • Strong read-after-write consistency within a region
  • Throughput: millions of requests/second globally

Architecture

Partition Layer

Blobs are addressed by (account, container, blob_name). The partition layer maps this triple to a storage node using consistent hashing. A distributed partition map service (backed by Paxos) tracks which storage node owns each key range. The partition layer handles routing, load balancing, and partition splits when a key range becomes a hotspot.
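The routing step can be sketched as a consistent-hash ring. This is a toy illustration, not the actual Azure partition map: the `PartitionMap` class, virtual-node count, and node names are all hypothetical.

```python
import bisect
import hashlib

class PartitionMap:
    """Toy consistent-hash ring mapping (account, container, blob) keys
    to storage nodes. Hypothetical sketch, not Azure's real partition map."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node gets many virtual points on the ring to
        # smooth out load imbalance across nodes.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.hashes = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def owner(self, account: str, container: str, blob: str) -> str:
        # The first ring point clockwise from the key's hash owns the key.
        h = self._hash(f"{account}/{container}/{blob}")
        idx = bisect.bisect_right(self.hashes, h) % len(self.ring)
        return self.ring[idx][1]

pmap = PartitionMap(["node-a", "node-b", "node-c"])
owner = pmap.owner("acct1", "photos", "cat.jpg")
```

Because a key's position on the ring is stable, adding or removing a node only remaps the key ranges adjacent to that node's points, which is what makes partition splits cheap.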

Storage Layer

Each storage node writes blobs to local drives. Small blobs (under 64MB) are written directly. Large blobs are split into 4MB chunks: each chunk is written synchronously to three storage nodes before the client receives an acknowledgment. This is the durability guarantee: the write is not complete until three independent copies exist.
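The write path above can be sketched in a few lines. This is an in-memory toy under stated assumptions: `StorageNode`, round-robin placement, and the manifest shape are hypothetical stand-ins, but the core rule matches the design: no ack until three copies are durable.

```python
import itertools
import uuid

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB chunks, per the design above
REPLICAS = 3                  # synchronous copies required before ack

class StorageNode:
    """Toy in-memory node (hypothetical stand-in for a real storage server)."""
    def __init__(self, name):
        self.name = name
        self.chunks = {}

    def store(self, chunk_id, data) -> bool:
        self.chunks[chunk_id] = data
        return True  # synchronous ack

def write_blob(data: bytes, nodes: list) -> list:
    """Split the blob into 4MB chunks; each chunk must be acknowledged
    by three nodes before the write proceeds. Returns a chunk manifest."""
    rr = itertools.cycle(nodes)  # naive round-robin placement
    manifest = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        chunk_id = str(uuid.uuid4())
        replicas = [next(rr) for _ in range(REPLICAS)]
        acks = sum(node.store(chunk_id, chunk) for node in replicas)
        if acks < REPLICAS:
            raise IOError("write not durable: fewer than 3 copies exist")
        manifest.append((offset, chunk_id, [n.name for n in replicas]))
    return manifest
```

A real system would write the three replicas in parallel and choose placement by fault domain rather than round-robin, but the ack condition is the same.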

Geo-redundancy adds asynchronous replication to a secondary region 500+ miles away. Three synchronous copies cover node and drive failures. The geo-secondary covers data center and regional disasters.
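The asynchronous half of this can be sketched as a replication log that the primary drains in the background. The `GeoReplicator` class and its dict-backed "secondary region" are hypothetical; the point is that `on_write` never blocks the client's ack.

```python
from collections import deque

class GeoReplicator:
    """Hypothetical async geo-replication: the primary acks writes after
    the local synchronous copies exist, then ships them to the secondary
    region from a background log."""

    def __init__(self):
        self.pending = deque()   # ordered log of writes not yet geo-replicated
        self.secondary = {}      # stand-in for the remote region's store

    def on_write(self, key, value):
        # Called after the three synchronous local copies are durable.
        # Appending to the log does NOT block the client acknowledgment.
        self.pending.append((key, value))

    def drain(self):
        # Background task: replay the log to the secondary region in order.
        while self.pending:
            key, value = self.pending.popleft()
            self.secondary[key] = value
```

The depth of the pending log bounds how much data a regional disaster can lose, which is why geo-redundant storage has a non-zero recovery point objective.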

Erasure Coding for Cold Tiers

Hot blobs use 3x replication for low-latency reads. Cold blobs transition to erasure coding (Reed-Solomon 12+4): the blob is split into 12 data fragments plus 4 parity fragments stored across 16 drives. Storage overhead drops from 3x to 1.33x. Read latency increases because a failed drive means reconstructing the missing fragment from any 12 surviving fragments, but cold-tier SLAs (minutes, not milliseconds) make this acceptable.
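The overhead arithmetic is worth making concrete. The helper below is illustrative (the function name and defaults are mine, not an Azure API): for any k+m scheme it computes the fragment size and the stored-bytes-to-logical-bytes ratio.

```python
def ec_layout(blob_bytes: int, data_frags: int = 12, parity_frags: int = 4):
    """Fragment size and storage overhead for a k+m erasure code.
    The 12+4 defaults match the Reed-Solomon scheme described above."""
    frag = -(-blob_bytes // data_frags)            # ceiling division
    total = (data_frags + parity_frags) * frag     # bytes actually stored
    return frag, total / blob_bytes

frag, overhead = ec_layout(12 * 1024 * 1024)
# A 12MB blob yields 1MB fragments; 16 fragments stored = 16/12 = 1.33x
```

Any 4 of the 16 fragments can be lost and the blob is still recoverable, which is how 1.33x overhead matches the failure tolerance of storing multiple full copies.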

Multipart Upload

For blobs over 100MB, the client splits the blob into chunks, uploads each part with a part number, then sends a complete-multipart request. The service stores chunk pointers in a manifest without copying data. Assembly is O(1) regardless of blob size. The client can upload chunks in parallel, which saturates available bandwidth on large files.
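A minimal sketch of that flow, assuming an in-memory chunk store (the `MultipartUpload` class and its method names are hypothetical, not the Blob Storage API): parts arrive in any order, and completion only records ordered pointers.

```python
import uuid

class MultipartUpload:
    """Hypothetical multipart flow: each part is stored as an independent
    chunk; completing the upload writes a manifest of pointers without
    copying any data."""

    def __init__(self):
        self.parts = {}          # part_number -> chunk_id
        self.chunk_store = {}    # chunk_id -> bytes

    def upload_part(self, part_number: int, data: bytes) -> str:
        chunk_id = str(uuid.uuid4())
        self.chunk_store[chunk_id] = data
        self.parts[part_number] = chunk_id
        return chunk_id

    def complete(self) -> list:
        # O(1) in blob size: the manifest is just chunk ids in part order.
        return [self.parts[n] for n in sorted(self.parts)]

    def read(self, manifest: list) -> bytes:
        return b"".join(self.chunk_store[cid] for cid in manifest)
```

Because `complete` never touches the chunk bytes, assembly cost depends only on the part count, which is why parallel part uploads scale to multi-terabyte blobs.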