Glossary: Distributed Systems

Circuit Breaker

A fault tolerance pattern that stops forwarding calls to a failing dependency, giving it time to recover while preventing cascading failures across callers.

The Cascading Failure Problem

A single slow service can bring down its callers. Consider a payment service that calls a fraud detection service with a 200ms timeout. If fraud detection starts responding in 800ms instead of 50ms, every payment call now holds a thread for the full 200ms before timing out. At 500 requests per second, that is 100 threads blocked at any moment. Thread pools exhaust. The payment service begins dropping requests. Its callers, such as the order service, start timing out too. The failure propagates upstream through the call graph. The original fault was in fraud detection; the casualty is the entire checkout flow.
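The thread-occupancy figure above follows from Little's law: average concurrency equals arrival rate times time spent per call. A quick sketch of that arithmetic, using the numbers from the example (the function name is ours, not a standard API):

```python
# Little's law: average concurrency = arrival rate x time in system.
def blocked_threads(requests_per_second: float, seconds_per_call: float) -> float:
    """Average number of threads held at any instant."""
    return requests_per_second * seconds_per_call

healthy = blocked_threads(500, 0.050)   # fraud detection answering in 50ms
degraded = blocked_threads(500, 0.200)  # every call now waits out the 200ms timeout

print(healthy)   # 25.0 threads busy when healthy
print(degraded)  # 100.0 threads blocked during degradation
```

The 4x jump in held threads comes entirely from latency, not from any change in traffic.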

A circuit breaker prevents this by making the failure fast and local. When a downstream service is degraded, the circuit breaker stops forwarding calls to it and returns an error immediately, in under 1ms, instead of waiting for a timeout.

The State Machine

A circuit breaker has three states:

Closed: normal operation. All requests pass through to the downstream service. The breaker tracks the error rate over a rolling window (for example, 10 failures in the last 20 calls, or failure rate above 50% in the last 10 seconds). When the threshold is crossed, the breaker opens.

Open: fail-fast mode. All requests are rejected immediately without contacting the downstream service. The caller receives an error or a fallback response. This state persists for a configured duration, typically 30-60 seconds.

Half-open: probe mode. After the open duration expires, the breaker allows a small number of probe requests through (commonly 1-5). If those succeed, the breaker resets to closed. If they fail, it returns to open. This is the state that separates a correct implementation from a naive one. Without half-open, the breaker either never recovers or floods a recovering service with the full request volume the moment it reopens.

Retry vs Circuit Breaker

Retries are optimistic: they assume the failure is transient and that trying again will succeed. Circuit breakers are pessimistic: they assume the service is degraded and protect both the caller and the downstream from additional load. They are complementary. Retries handle momentary blips (a packet drop, a brief GC pause). Circuit breakers handle sustained degradation (an overloaded service, a bad deployment). Using retries without circuit breakers amplifies load on a struggling service; using circuit breakers without retries abandons recoverable transient errors too aggressively.
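The amplification claim can be made concrete with a little expected-value arithmetic. A sketch (the function is ours) of how many downstream requests one logical call generates when each failed attempt is retried, up to a maximum:

```python
def load_multiplier(max_attempts: int, failure_rate: float) -> float:
    """Expected downstream requests per logical call when every failed
    attempt is retried, up to max_attempts total attempts."""
    # Attempt k happens only if the previous k attempts all failed:
    # 1 + p + p^2 + ... up to max_attempts terms.
    return sum(failure_rate ** k for k in range(max_attempts))

print(load_multiplier(3, 1.0))  # 3.0: total outage triples the offered load
print(load_multiplier(3, 0.1))  # 1.11: occasional blips cost almost nothing
```

During a sustained outage (failure rate near 1.0), every caller sends its full retry budget, which is exactly the extra load a struggling service cannot absorb; a circuit breaker caps that multiplier back to roughly 1.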

Configuration Parameters

Failure threshold: the percentage or count of failures that triggers opening. Set it too low and the breaker flaps on normal variance; set it too high and the breaker opens too late to prevent cascading failures.

Open duration: how long the breaker stays open before transitioning to half-open. This must exceed the expected recovery time of the downstream service.

Probe count in half-open: how many probe requests to allow before deciding to close or reopen. More probes give higher confidence but extend the recovery time.
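The parameters above, plus a minimum-sample guard, can be collected into a small config sketch. Names and default values are illustrative, not recommendations:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BreakerConfig:
    # Illustrative defaults; tune against the downstream's actual behavior.
    failure_rate_threshold: float = 0.5   # open when >50% of recent calls fail
    minimum_calls: int = 20               # don't judge on a tiny sample
    open_duration_seconds: float = 30.0   # must exceed downstream recovery time
    half_open_probe_count: int = 3        # probes before closing or reopening

def should_open(cfg: BreakerConfig, failures: int, total: int) -> bool:
    """Open only once the window holds enough calls to be meaningful."""
    if total < cfg.minimum_calls:
        return False
    return failures / total > cfg.failure_rate_threshold

cfg = BreakerConfig()
print(should_open(cfg, failures=3, total=5))    # False: sample too small
print(should_open(cfg, failures=12, total=20))  # True: 60% over a full window
```

The `minimum_calls` guard is what keeps a percentage threshold from tripping during quiet periods, where two failures out of three calls would otherwise read as a 67% failure rate.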

Fallback Responses

When the circuit is open and calls are rejected, the caller needs a response. Fallback responses are degraded but functional: return a cached value, a default, or a user-facing message indicating the feature is temporarily unavailable. A product recommendation service with an open circuit might return an empty list rather than an error. A payment status check might return "status unknown, check back in a moment" rather than a 500. Fallbacks make the circuit breaker pattern user-visible but graceful rather than opaque and broken. Design fallbacks before you design the breaker.
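The recommendation-service example above can be sketched as a thin fallback wrapper. `CircuitOpenError` is a hypothetical exception standing in for whatever a breaker raises in fail-fast mode; real libraries expose their own types:

```python
class CircuitOpenError(Exception):
    """Hypothetical error a breaker raises while the circuit is open."""

def recommendations_with_fallback(fetch):
    """Degraded but functional: an empty list instead of an error page."""
    try:
        return fetch()
    except CircuitOpenError:
        return []

def rejected():
    raise CircuitOpenError("recommendation circuit is open")

print(recommendations_with_fallback(rejected))         # []
print(recommendations_with_fallback(lambda: ["book"])) # ['book']
```

The key design decision is made here, in the caller, not in the breaker: the wrapper decides what "degraded" means for this feature, which is why fallbacks should be designed before the breaker itself.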

Libraries and Infrastructure Options

Netflix built Hystrix to address exactly this problem in its microservices fleet. Hystrix is now in maintenance mode. Resilience4j is the current standard for Java services and provides circuit breakers, bulkheads, and rate limiters as composable decorators. Polly serves the .NET ecosystem with a similar policy-based API. Most service mesh sidecars (Envoy-based) implement circuit breaking at the infrastructure layer without requiring a library, which means the protection applies uniformly regardless of what language the service is written in.

Library vs Mesh

Using a library (Resilience4j) gives per-call configurability and access to the circuit state in application code (for fallback logic). Using a mesh (Envoy) applies the pattern without code changes, but the fallback behavior is limited to what the proxy supports (drop the connection, return a 503). Production systems sometimes use both: the mesh provides baseline protection; the library provides application-aware fallbacks.

Interview Tip

The half-open state is the probe most interviewers use to separate candidates. Most answers describe the closed and open states correctly, then vaguely say "and then it recovers." The full answer explains: after the open timer expires, the breaker allows N probe requests; if they succeed, it resets to closed; if they fail, it returns to open. A senior-level answer also addresses the threshold configuration challenge: in a high-variance traffic environment, a percentage-based threshold with a minimum request count window prevents false positives during low-traffic periods. Mentioning fallback design signals that the candidate thinks about the user experience during degradation, not just the protection mechanism.