Service Mesh
An infrastructure layer that manages service-to-service communication through sidecar proxies, providing retries, mTLS, and observability without application code changes.
The Problem Without One
Raw HTTP calls between microservices are unreliable. A service that calls 5 downstream dependencies without a retry policy will fail whenever any dependency has a transient blip. Without mutual TLS, traffic between services is unencrypted and unauthenticated. Without consistent instrumentation, tracing a request across 8 services requires each team to have independently added the same tracing headers. These three problems (reliability, security, observability) are identical across every service in the fleet, yet without a mesh, each team solves them differently or not at all.
Sidecar Proxy Pattern
A service mesh injects a sidecar proxy (most commonly Envoy) into every service pod. The sidecar intercepts all inbound and outbound network traffic for its co-located service. The application calls localhost; the sidecar handles everything else. The application does not know retries, circuit breaking, or TLS are happening.
The architecture splits into two planes:
Data plane: the collection of all sidecar proxies. They execute the actual traffic handling: load balancing, retries, TLS termination, metric collection.
Control plane: a centralized system (Istio's Istiod, for example) that pushes configuration to all sidecars. When a policy changes (new retry budget, new mTLS certificate), the control plane distributes it without restarting any service.
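The division of responsibility can be sketched as a toy control plane pushing versioned config to registered proxies. The class and field names here are illustrative, not Istio's actual API:

```python
class Sidecar:
    """Data-plane proxy: applies whatever config the control plane last pushed."""
    def __init__(self, name):
        self.name = name
        self.config = {}           # e.g. retry budgets, routing weights
        self.config_version = 0

    def apply(self, version, config):
        self.config_version = version
        self.config = dict(config)


class ControlPlane:
    """Pushes new policy to every sidecar; no service restarts involved."""
    def __init__(self):
        self.sidecars = []
        self.version = 0

    def register(self, sidecar):
        # a real control plane would also replay current config to late joiners
        self.sidecars.append(sidecar)

    def push(self, config):
        self.version += 1
        for sidecar in self.sidecars:
            sidecar.apply(self.version, config)


cp = ControlPlane()
proxies = [Sidecar(f"svc-{i}") for i in range(3)]
for p in proxies:
    cp.register(p)

# one policy change reaches every proxy in the fleet
cp.push({"retry_budget": 0.2, "mtls": True})
```

In a real mesh the push happens over a streaming protocol (Istio uses the xDS API over gRPC) rather than direct method calls, but the shape is the same: one writer, many subscribed proxies.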
mTLS for Zero-Trust Networking
A service mesh issues short-lived certificates to each workload and enforces mutual TLS on every service-to-service call. Both sides authenticate each other: the caller proves its identity, and so does the callee. This eliminates the category of lateral movement attacks where a compromised internal service impersonates another. The certificate rotation happens automatically through the control plane. Application code changes nothing.
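The rotation logic can be sketched as a certificate authority that reissues certs once they pass a rotation threshold, so a workload never presents a near-expired cert. The SPIFFE-style identity string and the half-life rotation policy are illustrative assumptions, not any specific mesh's defaults:

```python
from dataclasses import dataclass


@dataclass
class Certificate:
    identity: str      # workload identity (SPIFFE-style string, illustrative)
    issued_at: float   # seconds since some epoch
    ttl: float         # short lifetime, e.g. one hour


class CertAuthority:
    """Control-plane CA: issues short-lived certs, rotates them proactively."""
    def __init__(self, ttl=3600.0, rotate_fraction=0.5):
        self.ttl = ttl
        self.rotate_fraction = rotate_fraction  # rotate at cert half-life

    def issue(self, identity, now):
        return Certificate(identity, issued_at=now, ttl=self.ttl)

    def maybe_rotate(self, cert, now):
        # Reissue once the cert is past its rotation threshold, well before
        # expiry, so handshakes never race against an expiring cert.
        if now - cert.issued_at >= cert.ttl * self.rotate_fraction:
            return self.issue(cert.identity, now)
        return cert


ca = CertAuthority(ttl=3600.0)
cert = ca.issue("spiffe://mesh.local/ns/default/sa/checkout", now=0.0)
fresh = ca.maybe_rotate(cert, now=100.0)      # well before half-life: unchanged
rotated = ca.maybe_rotate(cert, now=2000.0)   # past half-life: reissued
```

The application never sees any of this: the sidecar holds the cert, performs the mutual handshake, and picks up rotated certs from the control plane.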
Automatic Retries and Circuit Breaking
The mesh proxy can retry on connection failures or 5xx responses with configurable retry budgets. It can also open a circuit breaker when a destination's error rate exceeds a threshold, preventing retries from amplifying a downstream failure into a fleet-wide storm. This is the same circuit breaker logic described in dedicated patterns, but applied uniformly at the mesh layer rather than per-service.
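A minimal sketch of that call path, with a per-request retry cap standing in for Envoy's fleet-wide retry budget and a rolling-window error rate driving the breaker (thresholds and window size are illustrative):

```python
class CircuitBreaker:
    """Tracks recent outcomes; opens when the error rate crosses a threshold."""
    def __init__(self, threshold=0.5, window=10):
        self.threshold = threshold
        self.window = window
        self.results = []                      # rolling window of True/False

    def record(self, ok):
        self.results = (self.results + [ok])[-self.window:]

    @property
    def is_open(self):
        if len(self.results) < self.window:    # not enough signal yet
            return False
        error_rate = self.results.count(False) / len(self.results)
        return error_rate > self.threshold


def proxy_call(send, breaker, max_retries=2):
    """Sidecar-style call: fail fast if the circuit is open, otherwise retry
    transient failures up to a per-request cap."""
    if breaker.is_open:
        raise RuntimeError("circuit open: failing fast")
    for _ in range(max_retries + 1):
        ok = send()                            # send() models one attempt
        breaker.record(ok)
        if ok:
            return "200 OK"
    raise RuntimeError("retries exhausted")


# a single transient blip is absorbed by the retry, invisibly to the caller
responses = iter([False, True])
breaker = CircuitBreaker(window=4)
result = proxy_call(lambda: next(responses), breaker)
```

The point of the open circuit is the fail-fast path: once the destination is known to be unhealthy, the proxy stops sending retries at it, which is exactly what prevents the retry amplification storm.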
Observability Without Code Changes
Every sidecar proxy emits request metrics (latency percentiles, error rates, request counts), traces (via OpenTelemetry-compatible headers), and access logs for every call it proxies. Because this happens at the infrastructure layer, the application developer writes no instrumentation code. A fresh service deployment immediately appears in dashboards and distributed traces. Without a mesh, adding a new service to the observability system requires the team to integrate a tracing library, configure metric emission, and ensure consistent header propagation.
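A toy version of what each sidecar accumulates per destination. Real proxies use histograms rather than storing raw samples; the nearest-rank percentile here is a simplification:

```python
class SidecarMetrics:
    """Metrics a sidecar records for every request it proxies."""
    def __init__(self):
        self.latencies_ms = []
        self.errors = 0

    def record(self, latency_ms, status):
        self.latencies_ms.append(latency_ms)
        if status >= 500:
            self.errors += 1

    def percentile(self, p):
        # nearest-rank percentile over raw samples (real proxies use histograms)
        ordered = sorted(self.latencies_ms)
        rank = max(1, round(p / 100 * len(ordered)))
        return ordered[rank - 1]

    def snapshot(self):
        n = len(self.latencies_ms)
        return {
            "requests": n,
            "error_rate": self.errors / n,
            "p50_ms": self.percentile(50),
            "p99_ms": self.percentile(99),
        }


m = SidecarMetrics()
for latency in range(1, 101):                  # 100 requests, 1ms..100ms
    m.record(latency, status=500 if latency == 7 else 200)
stats = m.snapshot()
```

Because the proxy sits on every call, these numbers exist for every service the moment it joins the mesh, with no instrumentation code in the application.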
Traffic Shaping and Canary Deployments
The control plane can instruct sidecars to split traffic between service versions. Route 5% of requests to v2 of a service and 95% to v1, based on header values, source service identity, or random sampling. This enables canary deployments where a new version receives a small fraction of live traffic while metrics are monitored. If error rates or latency rise, the control plane rolls back the routing weight instantly, without a deployment. This capability is available to all services in the mesh without any code changes.
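The weighted split can be sketched with hash-based bucketing, one of several strategies a proxy can use (random sampling and header matching being the others). Hashing the request id makes routing sticky, so retries and traces for one request stay on one version; the weights and helper name are illustrative:

```python
import zlib


def route(request_id, weights):
    """Deterministic weighted routing: hash the request id into one of 100
    buckets, then walk the cumulative weights. The same id always lands on
    the same version."""
    bucket = zlib.crc32(request_id.encode()) % 100
    cumulative = 0
    for version, weight in weights:            # e.g. [("v1", 95), ("v2", 5)]
        cumulative += weight
        if bucket < cumulative:
            return version
    return weights[-1][0]                      # weights summing to 100 end here


weights = [("v1", 95), ("v2", 5)]
counts = {"v1": 0, "v2": 0}
for i in range(10_000):
    counts[route(f"req-{i}", weights)] += 1
# counts["v2"] lands near 5% of traffic; rolling back the canary is just
# a weight change pushed from the control plane, not a redeploy
```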
When Not to Use One
A mesh adds proxy overhead on every hop: each request traverses two sidecars (the caller's and the callee's), which typically costs a few milliseconds of added latency, with worse tail behavior under load plus per-pod CPU and memory for each proxy. For latency-sensitive paths with sub-10ms budgets, that overhead can consume a large share of the budget. The operational complexity is substantial: the control plane is a new distributed system to operate, certificate management adds surface area, and debugging traffic through sidecars is non-trivial. For deployments with fewer than 10 services, the return rarely justifies the investment; a lightweight approach (a shared RPC library with retry logic, mutual TLS via cert-manager) is usually more appropriate.
Interview Tip
The distinction most candidates miss: a service mesh and an API gateway solve different boundaries. The gateway handles north-south traffic (external clients to internal services). The mesh handles east-west traffic (internal service to internal service). Saying "I'd use an API gateway for service-to-service calls" signals that the candidate has not operated a microservices fleet in production. A strong answer names both, explains what each handles, and identifies the failure mode each prevents. Naming Envoy as the dominant sidecar proxy and explaining the xDS API (the protocol the control plane uses to push configuration to proxies) signals production-level familiarity rather than blog-level knowledge.
Related Concepts
Load Balancer: Distributes incoming traffic across multiple servers to prevent any single node from becoming a bottleneck. The mechanism that makes horizontal scaling functional in practice.
API Gateway: A single entry point that handles cross-cutting concerns such as auth, rate limiting, and routing for a set of backend services.
Circuit Breaker: A fault tolerance pattern that stops forwarding calls to a failing dependency, giving it time to recover while preventing cascading failures across callers.