Load Balancing
Distributes incoming traffic across multiple servers to prevent any single node from becoming a bottleneck. The mechanism that makes horizontal scaling functional in practice.
What Load Balancing Actually Solves
Horizontal scaling adds nodes. Load balancing makes those nodes useful. Without it, you can have 10 servers and still route 90% of traffic to one of them. The problem isn't capacity; it's distribution.
L4 vs L7: Where the Decision Happens
L4 (transport layer) load balancers route based on IP and TCP/UDP port. They're fast (sub-millisecond overhead) but blind to application content. L7 (application layer) balancers can route based on HTTP headers, URL paths, cookies, or request body content. That flexibility costs ~1–5ms per hop but unlocks path-based routing, sticky sessions, and A/B traffic splitting.
In practice: use L4 for raw TCP throughput (databases, game servers), L7 for HTTP services where you need routing intelligence.
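To make the difference concrete, here is a minimal sketch of the routing decision each layer can make. The backend names and prefix table are hypothetical; a real L4 balancer hashes the connection 5-tuple in the kernel or NIC, and a real L7 balancer parses the full HTTP request.

```python
def l4_route(src_ip: str, src_port: int, backends: list[str]) -> str:
    # L4: only connection metadata (IPs, ports) is visible.
    # Hash it to pick a backend; the payload is never inspected.
    return backends[hash((src_ip, src_port)) % len(backends)]

def l7_route(path: str, prefix_table: dict[str, str]) -> str:
    # L7: the request is parsed, so the URL path can drive the choice.
    # Prefix order matters: list more specific prefixes first.
    for prefix, backend in prefix_table.items():
        if path.startswith(prefix):
            return backend
    return prefix_table["/"]
```

The L4 function can decide per-connection in nanoseconds; the L7 function requires terminating the connection and parsing the request, which is where the extra per-hop latency comes from.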
Algorithms
Round Robin: Requests cycle across nodes in order. Works well when requests are roughly uniform in cost. Breaks badly when they aren't: the rotation keeps sending new requests to a node still grinding through an expensive one, so its queue grows while idle peers wait their turn.
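The core of round robin is just a cycling iterator; a minimal sketch with hypothetical node names:

```python
from itertools import cycle

# Round robin: hand out nodes in a fixed rotation, wrapping at the end.
nodes = cycle(["node-a", "node-b", "node-c"])

def next_node() -> str:
    return next(nodes)
```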
Least Connections: Routes to the node with the fewest active connections. Better for variable-cost workloads; most production HTTP load balancers support it, though plain round robin is still a common default (e.g., in NGINX and HAProxy).
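The selection itself is a one-line minimum over per-node in-flight counts; a sketch with hypothetical counts (a real balancer maintains these counters as connections open and close):

```python
def pick_least_connections(active: dict[str, int]) -> str:
    # Choose the node currently handling the fewest in-flight requests.
    return min(active, key=active.get)

# Hypothetical snapshot of active connection counts per node.
snapshot = {"node-a": 12, "node-b": 3, "node-c": 7}
```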
Consistent Hashing: Routes requests with the same key (often the client or session ID) to the same node, and remaps only a small fraction of keys when nodes join or leave. Essential when nodes hold local state: distributed caches, session data, or WebSocket connections.
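A minimal hash ring sketch, assuming MD5 as the hash and virtual nodes to smooth the key distribution (the class name and vnode count are illustrative, not any particular library's API):

```python
import bisect
import hashlib

def _h(key: str) -> int:
    # Stable hash (unlike Python's built-in hash(), which is salted per process).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes: list[str], vnodes: int = 100):
        # Each physical node gets `vnodes` points on the ring.
        self.ring = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.hashes = [h for h, _ in self.ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first vnode at or past the key's hash.
        idx = bisect.bisect(self.hashes, _h(key)) % len(self.ring)
        return self.ring[idx][1]
```

The property that matters: removing a node only remaps the keys that were on that node; everything else keeps its assignment, which is exactly what a cache cluster needs to avoid a miss storm.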
Weighted Round Robin: Assigns request fractions proportional to node capacity. Use this during rolling deploys or when nodes have heterogeneous specs.
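A naive weighted schedule can be sketched by expanding each node into the rotation proportionally to its weight (the names and weights are hypothetical). Note this clusters consecutive picks on the heavy node; production balancers such as NGINX use a "smooth" weighted round robin that interleaves them.

```python
from itertools import cycle

# Hypothetical capacities: big-node gets 3 of every 4 requests.
weights = {"big-node": 3, "small-node": 1}

# Expand weights into a repeating schedule.
schedule = cycle([n for n, w in weights.items() for _ in range(w)])

def next_node() -> str:
    return next(schedule)
```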
The Thundering Herd on Failover
When a node fails and is removed from rotation, its in-flight connections drop and its traffic redistributes. If the surviving nodes were already near capacity, the redistribution causes cascade failures. I've seen this take down clusters that had technically sufficient total capacity. The issue was the rebalancing spike, not the absolute load.
Mitigations: connection draining (let in-flight requests complete before removing a node), circuit breakers on downstream services, and keeping per-node utilization below ~60% under normal load.
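The ~60% headroom figure falls out of simple arithmetic: when one of n nodes fails and its traffic spreads evenly over the survivors, each survivor's utilization is multiplied by n/(n-1). A back-of-envelope helper:

```python
def post_failover_utilization(n: int, u: float) -> float:
    # Failed node's share spreads evenly across the n-1 survivors.
    return u * n / (n - 1)

# 4 nodes at 60% -> survivors land at 80%: tight but survivable.
# 4 nodes at 75% -> survivors land at 100%: the cascade begins.
```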
Health Checks
Active health checks poll nodes on a configured interval (typically every 5–10 seconds). Passive checks detect failures from real traffic: a node that returns 5xx responses gets deprioritised without an explicit probe. Most production systems use both. The active check interval sets your recovery latency floor: if you check every 30 seconds, a failed node serves traffic for up to 30 seconds after it dies.
Interview Tip
Interviewers expect you to distinguish L4 from L7 and explain why you'd choose one over the other for the specific system being designed. The second check is whether you mention consistent hashing for stateful workloads. That's where most candidates go shallow. Mentioning health check intervals and the thundering herd failure mode puts you in the top tier of answers at L5+.
Related Concepts
Consistent Hashing: A distributed hashing scheme that minimizes key remapping when nodes are added or removed.
Rate Limiting: Controls the rate of requests to a service. Common algorithms: Token Bucket, Leaky Bucket, Fixed Window, and Sliding Window.
Distributed Caching: A shared cache layer across multiple nodes used to absorb read traffic from the primary database and reduce latency on hot data paths. The difference between a 2ms and a 200ms read at scale.