Load Balancing
Distributes incoming traffic across multiple servers to prevent any single node from becoming a bottleneck. The mechanism that makes horizontal scaling functional in practice.
What Load Balancing Actually Solves
Horizontal scaling adds nodes. Load balancing makes those nodes useful. Without it, you can have 10 servers and still route 90% of traffic to one of them. The problem isn't capacity; it's distribution.
L4 vs L7: Where the Decision Happens
L4 (transport layer) load balancers route based on IP and TCP/UDP port. They're fast (sub-millisecond overhead) but blind to application content. L7 (application layer) balancers can route based on HTTP headers, URL paths, cookies, or request body content. That flexibility costs ~1–5ms per hop but unlocks path-based routing, sticky sessions, and A/B traffic splitting.
In practice: use L4 for raw TCP throughput (databases, game servers), L7 for HTTP services where you need routing intelligence.
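To make the difference concrete, here is a minimal sketch of the routing decision each layer can make. The backend names and prefix table are hypothetical; a real L4 balancer hashes the connection 5-tuple in the kernel or NIC, and a real L7 balancer parses the full HTTP request.

```python
def l4_route(src_ip: str, src_port: int, backends: list[str]) -> str:
    # L4: only connection metadata (IPs, ports) is visible.
    # Hash it to pick a backend; the payload is never inspected.
    return backends[hash((src_ip, src_port)) % len(backends)]

def l7_route(path: str, prefix_table: dict[str, str]) -> str:
    # L7: the request is parsed, so the URL path can drive the choice.
    # Prefix order matters: list more specific prefixes first.
    for prefix, backend in prefix_table.items():
        if path.startswith(prefix):
            return backend
    return prefix_table["/"]
```

The L4 function can decide per-connection in nanoseconds; the L7 function requires terminating the connection and parsing the request, which is where the extra per-hop latency comes from.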
Algorithms
Round Robin: Requests cycle across nodes in order. Works well when requests are roughly uniform in cost. Breaks badly when they aren't: the rotation keeps sending new requests to a node still grinding through an expensive one, so its queue grows while idle peers wait their turn.
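The core of round robin is just a cycling iterator; a minimal sketch with hypothetical node names:

```python
from itertools import cycle

# Round robin: hand out nodes in a fixed rotation, wrapping at the end.
nodes = cycle(["node-a", "node-b", "node-c"])

def next_node() -> str:
    return next(nodes)
```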
Least Connections: Routes to the node with the fewest active connections. Better for variable-cost workloads; most production HTTP load balancers support it, though plain round robin is still a common default (e.g., in NGINX and HAProxy).
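The selection itself is a one-line minimum over per-node in-flight counts; a sketch with hypothetical counts (a real balancer maintains these counters as connections open and close):

```python
def pick_least_connections(active: dict[str, int]) -> str:
    # Choose the node currently handling the fewest in-flight requests.
    return min(active, key=active.get)

# Hypothetical snapshot of active connection counts per node.
snapshot = {"node-a": 12, "node-b": 3, "node-c": 7}
```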
Consistent Hashing: Routes requests with the same key (often the client or session ID) to the same node, and remaps only a small fraction of keys when nodes join or leave. Essential when nodes hold local state: distributed caches, session data, or WebSocket connections.
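A minimal hash ring sketch, assuming MD5 as the hash and virtual nodes to smooth the key distribution (the class name and vnode count are illustrative, not any particular library's API):

```python
import bisect
import hashlib

def _h(key: str) -> int:
    # Stable hash (unlike Python's built-in hash(), which is salted per process).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes: list[str], vnodes: int = 100):
        # Each physical node gets `vnodes` points on the ring.
        self.ring = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.hashes = [h for h, _ in self.ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first vnode at or past the key's hash.
        idx = bisect.bisect(self.hashes, _h(key)) % len(self.ring)
        return self.ring[idx][1]
```

The property that matters: removing a node only remaps the keys that were on that node; everything else keeps its assignment, which is exactly what a cache cluster needs to avoid a miss storm.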
Weighted Round Robin: Assigns request fractions proportional to node capacity. Use this during rolling deploys or when nodes have heterogeneous specs.
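A naive weighted schedule can be sketched by expanding each node into the rotation proportionally to its weight (the names and weights are hypothetical). Note this clusters consecutive picks on the heavy node; production balancers such as NGINX use a "smooth" weighted round robin that interleaves them.

```python
from itertools import cycle

# Hypothetical capacities: big-node gets 3 of every 4 requests.
weights = {"big-node": 3, "small-node": 1}

# Expand weights into a repeating schedule.
schedule = cycle([n for n, w in weights.items() for _ in range(w)])

def next_node() -> str:
    return next(schedule)
```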
The Thundering Herd on Failover
When a node fails and is removed from rotation, its in-flight connections drop and its traffic redistributes. If the surviving nodes were already near capacity, the redistribution causes cascade failures. I've seen this take down clusters that had technically sufficient total capacity. The issue was the rebalancing spike, not the absolute load.
Mitigations: connection draining (let in-flight requests complete before removing a node), circuit breakers on downstream services, and keeping per-node utilization below ~60% under normal load.
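The ~60% headroom figure falls out of simple arithmetic: when one of n nodes fails and its traffic spreads evenly over the survivors, each survivor's utilization is multiplied by n/(n-1). A back-of-envelope helper:

```python
def post_failover_utilization(n: int, u: float) -> float:
    # Failed node's share spreads evenly across the n-1 survivors.
    return u * n / (n - 1)

# 4 nodes at 60% -> survivors land at 80%: tight but survivable.
# 4 nodes at 75% -> survivors land at 100%: the cascade begins.
```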
Health Checks
Active health checks poll nodes on a configured interval (typically every 5–10 seconds). Passive checks detect failures from real traffic: a node that returns 5xx responses gets deprioritised without an explicit probe. Most production systems use both. The active check interval sets your recovery latency floor: if you check every 30 seconds, a failed node serves traffic for up to 30 seconds after it dies.
Interview Tip
Interviewers expect you to distinguish L4 from L7 and explain why you'd choose one over the other for the specific system being designed. The second check is whether you mention consistent hashing for stateful workloads. That's where most candidates go shallow. Mentioning health check intervals and the thundering herd failure mode puts you in the top tier of answers at L5+.
Related Concepts
Consistent Hashing: A distributed hashing scheme that minimizes key remapping when nodes are added or removed.
Rate Limiting: Controls the rate of requests to a service. Common algorithms: Token Bucket, Leaky Bucket, Fixed Window, and Sliding Window.
Distributed Caching: A shared cache layer across multiple nodes used to absorb read traffic from the primary database and reduce latency on hot data paths. The difference between a 2ms and a 200ms read at scale.