Design a Load Balancer – System Design Walkthrough

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 25	Classic infra question — distinguish L4 vs L7 clearly, explain consistent hashing for stateful backends, health check design, connection draining, and when sticky sessions help vs hurt.
Arch 50	Add GSLB/DNS routing, TLS termination, weighted routing for canaries, and failure detection timing math.
Arch 75	Staff: discuss anycast vs DNS GSLB, SYN flood handling, and load balancer as bottleneck at 1M+ conn/sec.

Interview Prompt

Design a load balancer that distributes incoming traffic across a pool of backend servers. Support health checking, graceful connection draining during deploys, session affinity where needed, and both Layer 4 (TCP) and Layer 7 (HTTP) routing modes.

Clarifying Questions (ask before designing)

Question	Why it matters
L4 or L7 — or both?	L4 for raw throughput (gaming, DB proxy); L7 for path-based routing, TLS, HTTP/2.
Are backends stateful or stateless?	Stateful requires sticky sessions or consistent hashing; stateless allows any algorithm.
What's the connection rate and bandwidth?	1M conn/sec pushes toward kernel bypass (DPDK) or hardware rather than a kernel-stack software LB; bandwidth drives NIC sizing.
Single region or global traffic?	GSLB (GeoDNS/anycast) needed for multi-region; single region is simpler.

Scope

In scope

L4 vs L7 load balancing
Consistent hashing for session affinity
Health checks (active/passive)
Connection draining for zero-downtime deploys
Sticky sessions (cookie-based and IP hash)
Global Server Load Balancing (GSLB)

Out of scope (state explicitly)

Building custom hardware LB ASIC
Full service mesh (Istio) — mention as L7 evolution
DDoS mitigation at edge (WAF/CDN layer)

Assumptions

100K requests/sec peak, 500K concurrent connections
Mixed stateless API (90%) and stateful WebSocket (10%)
Backend pool: 50-500 instances with autoscaling
99.99% LB availability (LB failure = total outage)

Functional Requirements

Distribute traffic: Route incoming requests across a pool of backend servers
Health checks: Detect unhealthy servers and stop routing traffic to them
Multiple algorithms: Support Round Robin, Weighted Round Robin, Least Connections, IP Hash, etc.
Session affinity (sticky sessions): Route requests from the same client to the same server
SSL/TLS termination: Decrypt HTTPS at the load balancer (offload from backends)
Auto-scaling integration: Dynamically add/remove backend servers
Rate limiting: Limit requests per client/IP
Request routing: Route by URL path, headers, or other request attributes (Layer 7)
Connection draining: Gracefully drain connections before removing a server

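Capacity Estimates

These estimates size the design for the high end of the tracks above (1M requests/sec, 10M concurrent connections), well beyond the baseline listed under Assumptions.

Metric	Calculation	Value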
Requests / sec	Derived from daily volume ÷ 86400 (+ peak factor)	1M
Concurrent connections	Given	10M
Backend servers	Given	1,000
Bandwidth	Given	100 Gbps
Health check interval	Given	5 seconds
Health checks / sec	1,000 servers ÷ 5s interval	200

Layer 4 (Transport Layer: TCP/UDP)

How: Makes routing decision based on IP address and TCP/UDP port
Does NOT inspect packet content (no HTTP headers, no URL)
Mechanism: Network Address Translation (NAT) or Direct Server Return (DSR)
Performance: Extremely fast (kernel-level, hardware-accelerated)
Use cases: High throughput TCP traffic, database connections, gaming servers
Examples: AWS NLB, F5 BIG-IP, LVS (Linux Virtual Server)

NAT-Based L4 Load Balancing:

Client → LB (VIP: 10.0.0.1:80) → LB rewrites dst IP to backend (10.0.1.5:8080)
Backend → LB → LB rewrites src IP back to VIP → Client

Client sees: 10.0.0.1 (doesn't know about backend)

Direct Server Return (DSR):

Client → LB → LB forwards packet to backend (only changes MAC, not IP)
Backend → responds DIRECTLY to client (bypasses LB on return path!)

Advantage: LB only handles inbound traffic → often 10x or more throughput per LB
           (responses are usually much larger than requests for web traffic)

Layer 7 (Application Layer: HTTP/HTTPS)

How: Inspects HTTP headers, URL, cookies, body to make routing decisions
Can do: URL-based routing, header-based routing, cookie-based sticky sessions, A/B testing, request rewriting, compression, caching
Performance: Slower than L4 (must parse HTTP), but much more flexible
Use cases: Web applications, microservices, API gateways
Examples: AWS ALB, NGINX, HAProxy, Envoy

Content-Based Routing (L7):

Rule 1: /api/v1/users/*    → User Service (servers 1-5)
Rule 2: /api/v1/orders/*   → Order Service (servers 6-10)
Rule 3: /static/*          → CDN / Static Server
Rule 4: /admin/*           → Admin Service (servers 11-12)
Default:                    → Frontend Service
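
A minimal sketch of how rules like these might be evaluated, assuming simple prefix matching where the longest matching prefix wins. The pool names and the `route` function are illustrative; real L7 balancers each have their own rule syntax and precedence.

```python
# Hypothetical rule table mirroring the rules above: (path prefix, backend pool).
RULES = [
    ("/api/v1/users/", "user-service"),
    ("/api/v1/orders/", "order-service"),
    ("/static/", "static-servers"),
    ("/admin/", "admin-service"),
]
DEFAULT_POOL = "frontend-service"


def route(path: str) -> str:
    """Return the backend pool for a request path (longest prefix wins)."""
    best_prefix, best_pool = "", DEFAULT_POOL
    for prefix, pool in RULES:
        if path.startswith(prefix) and len(prefix) > len(best_prefix):
            best_prefix, best_pool = prefix, pool
    return best_pool


if __name__ == "__main__":
    for path in ("/api/v1/users/123", "/static/app.js", "/checkout"):
        print(path, "->", route(path))
```

Longest-prefix matching keeps the result independent of rule order, which matters once rules overlap (for example `/api/` and `/api/v1/users/`).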

Load Balancing Algorithms: Deep Dive

1. Round Robin

Each request goes to the next server in rotation: S1 → S2 → S3 → S1 → ...
Pros: Simple, even distribution when servers are identical
Cons: Ignores server capacity and current load

2. Weighted Round Robin

Each server has a weight. Server with weight 3 gets 3x more traffic than weight 1
Use case: Different-sized servers (8-core vs 16-core)

Weights: S1=3, S2=1, S3=2
Sequence: S1, S1, S1, S2, S3, S3, S1, S1, S1, S2, S3, S3, ...
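
The sequence above sends each server its whole share in a burst. A common "smooth" variant interleaves the picks while keeping the same 3:1:2 ratio. The sketch below is one way to write it; the function name is an assumption for illustration.

```python
def smooth_weighted_round_robin(weights: dict[str, int], n: int) -> list[str]:
    """Return the first n picks of smooth weighted round robin."""
    total = sum(weights.values())
    current = {server: 0 for server in weights}
    picks = []
    for _ in range(n):
        # Every server earns its weight; the current leader is chosen and
        # pays back the total, which spreads its turns across the cycle.
        for server, weight in weights.items():
            current[server] += weight
        chosen = max(current, key=current.get)  # ties go to the first server
        current[chosen] -= total
        picks.append(chosen)
    return picks


if __name__ == "__main__":
    print(smooth_weighted_round_robin({"S1": 3, "S2": 1, "S3": 2}, 6))
    # → ['S1', 'S3', 'S1', 'S2', 'S3', 'S1']
```

Each cycle of 6 still gives S1 three picks, S2 one, and S3 two, but no server receives more than one request in a row here.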

3. Least Connections ⭐ (recommended for long-lived connections)

Route to the server with the fewest active connections
Why: Some requests take longer than others. Least connections naturally avoids overloading slow servers
Weighted variant: effective_load = active_connections / weight

Server 1: 150 connections, weight 3 → effective = 50
Server 2: 80 connections, weight 1 → effective = 80
Server 3: 90 connections, weight 2 → effective = 45  ← route here
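
The weighted selection above can be sketched in a few lines. The `Backend` type and `pick_least_loaded` name are illustrative, and a real balancer would update the connection counts as connections open and close.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    active_connections: int
    weight: int

    @property
    def effective_load(self) -> float:
        # effective_load = active_connections / weight, as defined above
        return self.active_connections / self.weight


def pick_least_loaded(backends: list[Backend]) -> Backend:
    """Choose the backend with the lowest effective load."""
    return min(backends, key=lambda b: b.effective_load)


POOL = [
    Backend("Server 1", 150, 3),
    Backend("Server 2", 80, 1),
    Backend("Server 3", 90, 2),
]

if __name__ == "__main__":
    chosen = pick_least_loaded(POOL)
    print(chosen.name, chosen.effective_load)  # → Server 3 45.0
```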

4. IP Hash (Consistent)

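Naive form: server = hash(client_IP) % num_servers
Same client always goes to the same server → natural sticky sessions
Caveat: with plain modulo, changing num_servers remaps most clients (about 80% when going from 4 to 5 servers)
With consistent hashing: Adding/removing one of N servers remaps only about 1/N of clients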
Use case: Caching servers where cache locality matters
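
To make the difference concrete, the sketch below compares how many clients change servers when a fifth server joins, under plain modulo hashing and under a hash ring with virtual nodes. The class and function names, the 100 virtual nodes per server, and the use of MD5 as a stable hash are all assumptions for illustration.

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per run)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def modulo_pick(client_ip: str, servers: list[str]) -> str:
    return servers[stable_hash(client_ip) % len(servers)]


class HashRing:
    def __init__(self, servers: list[str], vnodes: int = 100):
        # Each server owns `vnodes` points on the ring.
        self._points = sorted(
            (stable_hash(f"{s}#{i}"), s) for s in servers for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._points]

    def pick(self, client_ip: str) -> str:
        # First point clockwise from the client's hash, wrapping at the end.
        idx = bisect.bisect_right(self._positions, stable_hash(client_ip))
        return self._points[idx % len(self._points)][1]


def moved_fraction(pick_before, pick_after, keys) -> float:
    moved = sum(1 for k in keys if pick_before(k) != pick_after(k))
    return moved / len(keys)


CLIENTS = [f"10.0.{i // 256}.{i % 256}" for i in range(10000)]
OLD = ["s1", "s2", "s3", "s4"]
NEW = OLD + ["s5"]

if __name__ == "__main__":
    old_ring, new_ring = HashRing(OLD), HashRing(NEW)
    print("modulo moved:", moved_fraction(
        lambda k: modulo_pick(k, OLD), lambda k: modulo_pick(k, NEW), CLIENTS))
    print("ring moved:  ", moved_fraction(old_ring.pick, new_ring.pick, CLIENTS))
```

With the ring, the only clients that move are the ones now owned by the new server; everyone else keeps their cache locality.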

5. Least Response Time

Route to the server with the lowest average response time
LB tracks response times per server (exponential moving average)
Best for: Heterogeneous servers with different performance characteristics

6. Random

Randomly pick a server. Statistically even distribution at scale
Power of Two Random Choices: Pick 2 random servers → route to the one with fewer connections. Surprisingly effective: with n requests spread over n servers, the expected maximum load drops from Θ(log n / log log n) for a single random choice to Θ(log log n)
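
A small simulation can illustrate the gap. This sketch assumes connections never close (so load only grows) and uses a fixed seed; `simulate` and its parameters are illustrative names.

```python
import random


def simulate(num_servers: int, num_requests: int, choices: int, seed: int = 42) -> int:
    """Return the maximum per-server load after assigning all requests.

    choices=1 is plain random; choices=2 is power of two random choices.
    """
    rng = random.Random(seed)
    load = [0] * num_servers
    for _ in range(num_requests):
        # Sample `choices` distinct servers and send the request to the
        # least loaded of them.
        candidates = rng.sample(range(num_servers), choices)
        target = min(candidates, key=lambda s: load[s])
        load[target] += 1
    return max(load)


if __name__ == "__main__":
    print("random max load:     ", simulate(50, 20000, choices=1))
    print("two-choices max load:", simulate(50, 20000, choices=2))
```

The average load in this run is 400 per server; the two-choice variant should stay very close to that, while plain random leaves a noticeably hotter server.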

7. Consistent Hashing with Bounded Loads

Consistent hashing + cap: no server gets more than (1 + ε) * average_load
If the target server is overloaded → spill to the next server on the ring
Published by Google researchers; implemented in HAProxy (hash-balance-factor) and Envoy (hash_balance_factor)

Health Checks

Active Health Checks:

Every 5 seconds:
  For each backend server:
    Send HTTP GET /health
    If response 200 within 2 seconds → healthy
    If 3 consecutive failures → mark UNHEALTHY → stop routing
    If 2 consecutive successes after unhealthy → mark HEALTHY → resume routing
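
The thresholds above amount to a small state machine per backend. A minimal sketch, assuming the probe result is supplied by the caller (the HTTP GET and its 2-second timeout are outside this snippet); `HealthState` is an illustrative name.

```python
from dataclasses import dataclass


@dataclass
class HealthState:
    unhealthy_threshold: int = 3  # consecutive failures before marking UNHEALTHY
    healthy_threshold: int = 2    # consecutive successes before marking HEALTHY
    healthy: bool = True
    failures: int = 0
    successes: int = 0

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result; return whether the backend is routable."""
        if probe_ok:
            self.failures = 0
            self.successes += 1
            if not self.healthy and self.successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self.successes = 0
            self.failures += 1
            if self.healthy and self.failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy


if __name__ == "__main__":
    state = HealthState()
    for ok in (False, False, False, True, True):
        print(ok, "->", "HEALTHY" if state.record(ok) else "UNHEALTHY")
```

Requiring consecutive results in both directions is what prevents a single slow probe from ejecting a backend, and a single lucky probe from readmitting one.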

Passive Health Checks:

Monitor actual request/response patterns
If server returns 5xx for > 50% of requests in last 30 seconds → circuit-break → mark unhealthy
Advantage: No extra health check traffic; detects issues faster

Combination: Use both active (guaranteed detection) and passive (faster detection)
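
A passive check along these lines can be sketched as a sliding window over real responses. The `min_requests` floor is an added assumption (not part of the rule above) so that a single failed request on a quiet backend does not trip the breaker; timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import deque


class PassiveHealth:
    def __init__(self, window_sec: float = 30.0, error_ratio: float = 0.5,
                 min_requests: int = 20):
        self.window_sec = window_sec
        self.error_ratio = error_ratio
        self.min_requests = min_requests  # assumption: ignore tiny samples
        self._events = deque()            # (timestamp, is_5xx)

    def record(self, now: float, status_code: int) -> None:
        self._events.append((now, status_code >= 500))
        self._expire(now)

    def _expire(self, now: float) -> None:
        # Drop observations older than the window.
        while self._events and self._events[0][0] <= now - self.window_sec:
            self._events.popleft()

    def is_tripped(self, now: float) -> bool:
        """True when more than error_ratio of recent responses were 5xx."""
        self._expire(now)
        total = len(self._events)
        if total < self.min_requests:
            return False
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / total > self.error_ratio
```

Because it relies on live traffic, this check sees nothing on an idle backend, which is one reason to pair it with active probes.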

High Availability via VRRP (Virtual Router Redundancy Protocol)

Two load balancer instances share a Virtual IP (VIP). Only the active node responds to traffic on the VIP.

LB 1 (Active): Owns the VIP (e.g. 10.0.0.1), processes all traffic, sends VRRP heartbeat to LB 2 every ~1 second.
LB 2 (Passive): Monitors heartbeats. If about 3 consecutive heartbeats are missed (roughly 3 seconds at a 1-second interval), LB 2 assumes LB 1 has failed.
Failover: LB 2 broadcasts a gratuitous ARP to claim the VIP. Clients keep using the same IP address, so no DNS change is needed, but connections that were in flight may reset unless connection state is synced to the standby.

Failover time is typically about 3 seconds with these settings. For sub-second failover, use BFD (Bidirectional Forwarding Detection) alongside VRRP.

LB 1 (Active)  ←── VRRP heartbeat ──→  LB 2 (Passive)
     │                                       │
  VIP: 10.0.0.1                        VIP: (standby)
  
If LB 1 fails:
  LB 2 detects missed heartbeats (about 3 seconds)
  LB 2 takes over VIP: 10.0.0.1
  Clients keep the same IP; in-flight connections may need to reconnect

Implementation: Keepalived daemon on both LB instances
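
For the timing math, the backup's detection delay can be estimated from the master-down formula in the VRRP specification: three advertisement intervals plus a priority-based skew. A sketch, assuming the VRRPv3 form of the formula; the function name is illustrative and real failover also includes ARP propagation time.

```python
def master_down_interval(advert_interval_sec: float = 1.0, priority: int = 100) -> float:
    """Seconds of silence after which a backup declares the master dead.

    Skew gives higher-priority backups a slightly shorter wait so they
    win the election when several backups exist.
    """
    skew_time = (256 - priority) * advert_interval_sec / 256
    return 3 * advert_interval_sec + skew_time


if __name__ == "__main__":
    print(master_down_interval(1.0, priority=100))  # → 3.609375
    print(master_down_interval(0.1, priority=100))  # shorter interval, faster failover
```

With a 1-second interval the backup waits a little over 3 seconds, which is why the figure quoted above is an approximation rather than a hard upper bound.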

Active-Active Setup:

DNS returns multiple LB IPs (DNS round robin)
Both LBs handle traffic simultaneously
If one fails, DNS health check removes it (slow: DNS TTL)
Better: Anycast → both LBs announce same IP via BGP

Connection Draining (Graceful Shutdown)

When removing a server from the pool:

Stop sending NEW connections to the server
Allow EXISTING connections to complete (with a timeout, e.g., 30 seconds)
After timeout → forcefully close remaining connections
Remove server from the pool
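
The four steps can be sketched as bookkeeping on a pool object. This is a simplified model, assuming an injected clock and plain connection counters; `DrainingPool` and its methods are illustrative names, not a real LB API.

```python
class DrainingPool:
    """Tracks open connections per backend and drains backends on removal."""

    def __init__(self, clock):
        self._clock = clock   # callable returning the current time in seconds
        self.active = {}      # backend -> number of open connections
        self.deadline = {}    # draining backend -> time of forced close

    def add(self, backend):
        self.active[backend] = 0

    def eligible(self):
        # Step 1: draining backends receive no NEW connections.
        return [b for b in self.active if b not in self.deadline]

    def open_connection(self, backend):
        if backend not in self.eligible():
            raise ValueError(f"{backend} is not accepting new connections")
        self.active[backend] += 1

    def close_connection(self, backend):
        # Step 2: existing connections finish on their own.
        self.active[backend] -= 1

    def start_drain(self, backend, timeout_sec=30.0):
        self.deadline[backend] = self._clock() + timeout_sec

    def reap(self):
        """Steps 3 and 4: remove backends that are empty or past the timeout.

        Returns {backend: connections that had to be force-closed}.
        """
        removed = {}
        for backend, deadline in list(self.deadline.items()):
            remaining = self.active[backend]
            if remaining == 0 or self._clock() >= deadline:
                removed[backend] = remaining
                del self.deadline[backend]
                del self.active[backend]
        return removed
```

Calling `reap()` periodically lets a backend leave early once its last connection closes, instead of always waiting out the full timeout.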

Backend Management

POST /api/v1/backends
{
  "address": "10.0.1.5:8080",
  "weight": 3,
  "max_connections": 1000,
  "health_check": {
    "path": "/health",
    "interval_sec": 5,
    "timeout_sec": 2,
    "unhealthy_threshold": 3,
    "healthy_threshold": 2
  }
}

DELETE /api/v1/backends/{backend_id}?drain_timeout=30

GET /api/v1/backends
Response: 200 OK
{
  "backends": [
    {
      "address": "10.0.1.5:8080",
      "status": "healthy",
      "active_connections": 150,
      "weight": 3,
      "requests_total": 1500000,
      "avg_response_time_ms": 45
    }
  ]
}

LB Configuration

PUT /api/v1/config
{
  "algorithm": "least_connections",
  "sticky_sessions": {
    "enabled": true,
    "cookie_name": "SERVERID",
    "ttl_seconds": 3600
  },
  "ssl": {
    "certificate": "...",
    "key": "..."
  },
  "routing_rules": [
    {"path": "/api/*", "backend_pool": "api-servers"},
    {"path": "/static/*", "backend_pool": "static-servers"}
  ]
}

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Server Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff

Concern	Solution
LB failure	Active-passive pair with VRRP/Keepalived; failover in about 3 seconds
Backend failure	Health check detects → removed from pool; traffic redistributed
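Connection table loss	L4 NAT: connections drop on LB failover unless the table is synced to the standby (clients reconnect). DSR: the return path is unaffected, but the new LB must still map inbound packets to the same backend (synced table or consistent hashing on the 5-tuple)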
Thundering herd on server recovery	Slow start: gradually increase traffic to recovered server (start at 10%, ramp to 100% over 30 seconds)
LB overloaded	DNS-based multi-LB, Anycast, or hardware LB for extreme scale
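
The slow-start row can be made concrete with a linear ramp. A sketch, assuming the ramp is linear from 10% to 100% of the configured weight over 30 seconds; the function and parameter names are illustrative.

```python
def slow_start_weight(weight: float, seconds_since_recovery: float,
                      ramp_sec: float = 30.0, floor: float = 0.10) -> float:
    """Effective weight of a backend that was just marked healthy again."""
    # Clamp progress to [0, 1] so the weight never exceeds the configured value.
    progress = min(1.0, max(0.0, seconds_since_recovery / ramp_sec))
    return weight * (floor + (1.0 - floor) * progress)


if __name__ == "__main__":
    for t in (0, 15, 30, 60):
        print(t, slow_start_weight(10, t))
```

Feeding this effective weight into weighted round robin or weighted least connections gives a recovering server time to warm caches and connection pools before it takes its full share.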

Specific: Graceful Handling of Backend Crashes

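Timeline (assumes the crash lands just after a successful probe and failed probes are refused immediately):
  T=0:  Server 3 crashes
  T=0:  In-flight requests to Server 3 fail → LB retries idempotent requests on Server 1 or 2
  T=0-15: New requests can still land on Server 3 and fail until it is marked UNHEALTHY
  T=5:  Active health check fails (1st attempt)
  T=10: Active health check fails (2nd attempt)
  T=15: Active health check fails (3rd attempt) → Server 3 marked UNHEALTHY
  T=15: All new traffic routed to Server 1 and 2 only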
  
  Passive health check would detect it faster:
  T=0-2: > 50% of requests to Server 3 return errors → circuit-break → immediate UNHEALTHY
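
The active-check timeline generalizes to a simple bound. This sketch assumes probes fire on a fixed interval, that the best case is a crash right before a probe with instant connection refusal, and that the worst case is a crash right after a probe with every later probe hanging until the timeout.

```python
def detection_window(interval_sec: float = 5.0, timeout_sec: float = 2.0,
                     unhealthy_threshold: int = 3) -> tuple[float, float]:
    """Return (best_case, worst_case) seconds from crash to UNHEALTHY."""
    # Best case: first failing probe is immediate, so only the gaps
    # between the remaining probes count.
    best = (unhealthy_threshold - 1) * interval_sec
    # Worst case: a full interval passes before each failing probe, and
    # the last probe has to time out before it counts as a failure.
    worst = unhealthy_threshold * interval_sec + timeout_sec
    return best, worst


if __name__ == "__main__":
    print(detection_window())          # → (10.0, 17.0)
    print(detection_window(2.0, 1.0))  # tighter settings detect faster
```

With the settings used in this walkthrough, detection lands between 10 and 17 seconds, which is the window in which retries and passive checks have to carry the load.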

Global Server Load Balancing (GSLB)

For multi-datacenter deployments:

DNS resolves to the LB IP in the nearest datacenter
If a DC goes down, DNS removes it → traffic shifts to surviving DCs
Also used for: disaster recovery, compliance (data sovereignty), latency optimization

TLS/SSL Termination

Client ──── HTTPS (encrypted) ──── LB ──── HTTP (unencrypted) ──── Backend

Benefits:
- Backend servers don't need to handle TLS (CPU savings)
- Centralized certificate management
- LB can inspect HTTP headers for L7 routing

Modern Load Balancers: Service Mesh

In microservices architectures, instead of a centralized LB:

Service A → Sidecar Proxy (Envoy) → Service B's Sidecar Proxy → Service B

Each service has a sidecar proxy that handles:
- Load balancing (client-side)
- Service discovery
- Circuit breaking
- Retry/timeout
- mTLS (mutual TLS)
- Observability

Istio, Linkerd: Service mesh implementations
Envoy: The sidecar proxy used by most service meshes
Eliminates the centralized LB bottleneck for east-west (service-to-service) traffic

Comparison of LB Solutions

Solution	Type	Layer	Scale	Use Case
HAProxy	Software	L4/L7	Millions of conns	General purpose
NGINX	Software	L7	100K+ req/s	Web, reverse proxy
Envoy	Software	L4/L7	Service mesh	Microservices
AWS ALB	Managed	L7	Auto-scales	Cloud-native HTTP
AWS NLB	Managed	L4	Millions of conns	TCP/UDP, ultra-low latency
F5 BIG-IP	Hardware	L4/L7	100 Gbps+	Enterprise, legacy
Maglev	Software (Google)	L3/L4	Millions of pps	Google's internal LB

Monitoring

Requests/sec per backend (detect imbalanced distribution)
Error rate per backend (detect unhealthy servers)
Latency percentiles (p50, p95, p99) per backend
Active connections per backend
LB CPU and memory (is the LB itself a bottleneck?)
Health check success rate
Connection queue depth (requests waiting for a backend connection)

Interview Walkthrough

Clarify L4 vs L7 upfront — L4 routes by IP/port (fast, no TLS termination), L7 inspects HTTP headers/cookies (sticky sessions, path routing).
Compare algorithms: round-robin (simple), least connections (handles long-lived requests), consistent hash (minimal disruption on scale-down).
Explain health checks: active HTTP probes vs passive (mark unhealthy after N consecutive 5xx) — discuss grace period during deploys.
Cover TLS termination at the LB vs passthrough — terminating at LB enables HTTP/2 and certificate centralization but adds a decrypt hop.
Discuss sticky sessions: cookie-based affinity vs shared session store — sticky breaks when backends are replaced unevenly.
Mention DSR (Direct Server Return) for L4 when LB bandwidth is the bottleneck — response bypasses the LB on the return path.
Common pitfall: round-robin to backends with unequal capacity — one powerful server gets the same share as a weak one, skewing p99 latency.

L4 vs L7: Packet Path Walkthrough

L4 Load Balancer (NAT mode) — what happens to each packet:

  Client sends:
    SRC: 198.51.100.5:4321  DST: 10.0.0.1:80 (VIP)
    
  LB receives, selects Server 2 (10.0.1.6:8080):
    Rewrites DST: 10.0.0.1:80 → 10.0.1.6:8080
    Stores mapping in connection table:
      {198.51.100.5:4321} → {10.0.1.6:8080}
    Forwards packet:
    SRC: 198.51.100.5:4321  DST: 10.0.1.6:8080
    
  Server 2 responds (its return route must go back through the LB, e.g. LB as default gateway):
    SRC: 10.0.1.6:8080  DST: 198.51.100.5:4321
    
  LB receives, looks up connection table:
    Rewrites SRC: 10.0.1.6:8080 → 10.0.0.1:80
    Forwards to client:
    SRC: 10.0.0.1:80  DST: 198.51.100.5:4321
    
  Client sees response from 10.0.0.1:80 (VIP) → transparent
  
  Cost: one header rewrite per packet in each direction (DNAT inbound, reverse translation outbound)
  Throughput: kernel-level NAT → millions of packets/sec
  Limitation: LB sees ALL traffic (both directions) → bottleneck for 
              bandwidth-heavy responses (e.g., video streaming)
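
The NAT walkthrough above can be modeled with a dictionary as the connection table. This is a toy sketch: addresses are `(ip, port)` tuples, backend selection is plain round robin, and the class name is illustrative.

```python
class NatLoadBalancer:
    def __init__(self, vip, backends):
        self.vip = vip            # (ip, port) that clients connect to
        self.backends = backends  # list of (ip, port)
        self.table = {}           # client (ip, port) -> backend (ip, port)
        self._next = 0

    def inbound(self, src, dst):
        """Client → backend: rewrite the destination (DNAT)."""
        if dst != self.vip:
            raise ValueError("packet is not addressed to the VIP")
        if src not in self.table:
            # First packet of a connection: pick a backend and remember it.
            self.table[src] = self.backends[self._next % len(self.backends)]
            self._next += 1
        return src, self.table[src]

    def outbound(self, src, dst):
        """Backend → client: rewrite the source back to the VIP."""
        if self.table.get(dst) != src:
            raise KeyError("no connection table entry for this flow")
        return self.vip, dst


if __name__ == "__main__":
    lb = NatLoadBalancer(("10.0.0.1", 80), [("10.0.1.6", 8080)])
    client = ("198.51.100.5", 4321)
    print(lb.inbound(client, ("10.0.0.1", 80)))
    print(lb.outbound(("10.0.1.6", 8080), client))
```

Every later packet of the same connection hits the existing table entry, which is why losing the table on failover breaks established connections.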

L4 Load Balancer (DSR mode) — bypasses LB on return:

  Client sends:
    SRC: 198.51.100.5:4321  DST: 10.0.0.1:80 (VIP)
    
  LB receives, selects Server 2:
    Changes L2 header (MAC address) to Server 2's MAC
    Does NOT rewrite IP addresses
    Packet arrives at Server 2 with DST still = 10.0.0.1 (VIP)
    (Server 2 must have VIP configured on loopback interface)
    
  Server 2 responds DIRECTLY to client:
    SRC: 10.0.0.1:80 (VIP)  DST: 198.51.100.5:4321
    (Response goes straight to client, NOT through LB)
    
  Advantage: LB only handles inbound traffic
    For web traffic: request = 1 KB, response = 100 KB
    LB handles ~1% of total bytes → roughly 100x more bandwidth headroom than NAT mode
    
  Requirement: for L2 DSR, LB and backends must be in the same L2 network (same VLAN)
    MAC rewriting doesn't work across subnets; tunnel-based DSR (IP-in-IP or GRE encapsulation) lifts this limit

L7 Load Balancer — full HTTP inspection:

  Client sends:
    TLS ClientHello → LB terminates TLS → decrypts HTTP
    LB reads: GET /api/v1/users/123  Host: api.example.com
    
  LB routing decision:
    Path matches "/api/v1/users/*" → backend pool "user-service"
    Algorithm: least connections → Server 3 (fewest active)
    
  LB opens NEW TCP connection to Server 3:
    SRC: LB_internal_IP  DST: 10.0.1.7:8080
    Sends proxied HTTP request (adds X-Forwarded-For header)
    
  Server 3 responds → LB receives → re-encrypts → sends to client
  
  Cost: full TCP termination + TLS handshake per client connection, plus HTTP parsing on EVERY request
  Throughput: ~100K req/sec per LB instance (vs millions for L4)
  Advantage: content-based routing, caching, compression, WAF

Connection Table Scaling

Problem: L4 NAT LB maintains a connection table entry for EVERY active connection
  10M concurrent connections × 128 bytes per entry = 1.28 GB
  
  Connection table operations:
    Insert: O(1) hash table insert per new connection
    Lookup: O(1) per packet (hash by 5-tuple: src_ip, dst_ip, src_port, dst_port, proto)
    Delete: O(1) on connection close or timeout
    
  Memory pressure:
    10M entries fits in RAM easily
    But: hash table with 10M entries → cache misses for random access
    At 1M packets/sec → 1M hash lookups/sec → CPU cache thrashing
    
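  Solution: Kernel bypass (DPDK) or an in-kernel fast path (XDP)
    DPDK: bypass the kernel network stack and process packets in user space with poll-mode drivers
    XDP: run an eBPF program in the NIC driver, before the kernel network stack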
    Pre-allocate connection table in huge pages (no TLB misses)
    Pin to dedicated CPU cores (no context switches)
    Result: 10M+ packets/sec on commodity hardware

  Connection timeout (example values; defaults vary by implementation):
    TCP established: 300 seconds
    TCP time_wait: 120 seconds
    UDP: 30 seconds
    Aggressive timeouts free table entries faster but may drop slow clients

  Connection table failover:
    Active LB syncs connection table to standby via multicast
    On failover: standby has ~99% of connections → most clients unaffected
    Missing entries: those connections reset → client reconnects
    For DSR: no connection table needed on return path → cleaner failover

Maglev Hashing and Consistent Hashing with Bounded Loads

Problem with standard consistent hashing for LB:
  Server S3 has hash position right after a large gap on the ring
  → S3 gets 40% of traffic while S1 and S2 get 30% each
  Virtual nodes help but don't guarantee bounds

Google's Maglev hashing:
  Build a lookup table of size M (prime, e.g., 65537)
  Each backend gets a "preference list" of table positions:
    preference[i] = (offset + i × skip) mod M
    offset = hash1(backend_name) mod M
    skip = hash2(backend_name) mod (M-1) + 1
    
  Fill the table by letting backends take turns, each claiming its next free preference:
    Turn 1: Backend A takes its first preferred position (say, position 0)
    Turn 2: Backend B takes its first preferred position that is still empty (say, position 1)
    Turn 3: Backend C does the same (say, position 2)
    ...repeat the turns until all M entries are filled
    
  Lookup: table[hash(5-tuple) mod M] → backend
  
  Properties:
    Disruption: adding/removing 1 of N backends moves roughly 1/N of entries (slightly more than the theoretical minimum; Maglev favors even balance over minimal disruption)
    Uniformity: each backend gets M/N ± 1 entries (near-perfect balance)
    Speed: single array lookup → O(1) per packet (cache-friendly)
    
  Why it's better than ring-based consistent hashing:
    Ring: 100-200 virtual nodes per backend needed for reasonable balance → memory
    Maglev: one 65K-entry table → ~512 KB at 8 bytes per entry → fits in L2 cache on many CPUs
    Ring: O(log N) binary search for closest node
    Maglev: O(1) array lookup
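
The table-population procedure can be sketched as follows. This follows the offset/skip scheme described above; the use of SHA-256 with different salts for the two hash functions and the function names are assumptions for illustration.

```python
import hashlib


def _h(value: str, salt: str) -> int:
    digest = hashlib.sha256((salt + value).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def build_maglev_table(backends: list[str], m: int = 65537) -> list[str]:
    """Build the lookup table; m must be prime and larger than len(backends)."""
    if not backends:
        raise ValueError("need at least one backend")
    n = len(backends)
    offsets = [_h(b, "offset:") % m for b in backends]
    skips = [_h(b, "skip:") % (m - 1) + 1 for b in backends]
    next_pref = [0] * n   # how far each backend is into its preference list
    table = [None] * m
    filled = 0
    while True:
        for i in range(n):
            # Advance backend i to its first preferred slot that is still free.
            pos = (offsets[i] + next_pref[i] * skips[i]) % m
            while table[pos] is not None:
                next_pref[i] += 1
                pos = (offsets[i] + next_pref[i] * skips[i]) % m
            table[pos] = backends[i]
            next_pref[i] += 1
            filled += 1
            if filled == m:
                return table


def lookup(table: list[str], flow_key: str) -> str:
    """Map a flow (for example a stringified 5-tuple) to a backend."""
    return table[_h(flow_key, "flow:") % len(table)]
```

Because m is prime, every skip value walks through all m positions, so each backend always finds a free slot while the table is not yet full. Taking turns is what keeps the per-backend entry counts within one of each other.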

Bounded loads (a separate technique, layered on ring-based consistent hashing):
  Hard limit: no server gets more than (1 + ε) × average load
  ε = 0.25 → max 25% over average
  
  If selected server exceeds bound:
    Walk to next server on the ring
    Repeat until finding one below bound
    
  Provides: strict load balancing guarantee + consistent hashing benefits
  Used by: HAProxy (hash-balance-factor), Envoy (hash_balance_factor); the algorithm was published by Google researchers and is separate from Maglev
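
A minimal sketch of the bounded-load rule on a hash ring, assuming load is counted as assigned keys that never leave and that the cap is rounded up to a whole number; the class name and virtual-node count are illustrative.

```python
import bisect
import hashlib
import math


def _hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class BoundedLoadRing:
    def __init__(self, servers: list[str], epsilon: float = 0.25, vnodes: int = 100):
        self.epsilon = epsilon
        self.load = {s: 0 for s in servers}
        self.total = 0
        self._points = sorted(
            (_hash(f"{s}#{i}"), s) for s in servers for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._points]

    def assign(self, key: str) -> str:
        """Assign a key to the first server clockwise that is under the cap."""
        # Cap = (1 + ε) × average load, counting the key being placed.
        cap = math.ceil((1 + self.epsilon) * (self.total + 1) / len(self.load))
        start = bisect.bisect_right(self._positions, _hash(key))
        for step in range(len(self._points)):
            server = self._points[(start + step) % len(self._points)][1]
            if self.load[server] < cap:
                self.load[server] += 1
                self.total += 1
                return server
        # Not reachable: the least-loaded server is always below the cap.
        raise RuntimeError("no server below the load cap")


if __name__ == "__main__":
    ring = BoundedLoadRing(["s1", "s2", "s3", "s4"], epsilon=0.25)
    for i in range(1000):
        ring.assign(f"client-{i}")
    print(ring.load)
```

Most keys still land on their natural ring owner; only keys that would push a server over the cap spill to the next one, so affinity is preserved except where it would cause overload.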

The LB as Single Point of Failure: Defense in Depth

Layer 1: Active-Passive (VRRP/Keepalived)
  Two LB instances share a VIP
  Active handles all traffic
  Passive monitors Active via heartbeat (every 1 second)
  Active dies → Passive claims VIP via gratuitous ARP (about 3 seconds)
  ✓ Simple, proven
  ✗ Passive wastes resources (idle), 50% hardware utilization
  ✗ Failover gap: 1-3 seconds of dropped connections

Layer 2: Active-Active (DNS round-robin)
  DNS returns multiple LB IPs: [10.0.0.1, 10.0.0.2]
  Both LBs handle traffic simultaneously
  If one fails → DNS health check removes it
  ✓ 100% hardware utilization
  ✗ DNS TTL means clients cache stale IP → traffic to dead LB for TTL duration
  ✗ Uneven distribution (DNS round-robin is not load-aware)

Layer 3: Anycast (BGP)
  Both LBs advertise the SAME IP via BGP
  Network routes packets to the NEAREST LB (by BGP path)
  If one LB fails → BGP withdraws route → traffic converges to other LB
  ✓ Automatic, network-level failover (< 30 seconds)
  ✓ Natural geographic load distribution
  ✗ BGP convergence can take 10-30 seconds
  ✗ Requires BGP-capable network infrastructure
  Used by: Cloudflare, Google Cloud Load Balancing, AWS Global Accelerator

Layer 4: Distributed LB (no single LB)
  Service mesh: every service has a sidecar proxy (Envoy)
  Each sidecar does its own load balancing (client-side LB)
  No centralized LB → no SPOF
  ✓ Eliminates LB as bottleneck entirely
  ✗ Every service needs sidecar → operational complexity
  ✗ Sidecar resource overhead (CPU/memory per pod)
  Used by: Kubernetes with Istio/Linkerd

SLOs & Error Budgets

Metric	Target	Rationale
LB availability	99.99%	LB down = entire service down
Request routing latency overhead	< 1ms p99	LB adds hop — must be negligible vs app latency
Unhealthy backend detection time	< 30s	Faster = less traffic to dead nodes
Connection drain success rate	100%	Zero dropped connections during deploys
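
To turn the availability target into an error budget, multiply the window length by the allowed unavailability. A short sketch; the 30-day window and the 3-second failover figure are assumptions carried over from earlier in this walkthrough.

```python
def error_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given availability SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    budget = error_budget_minutes(99.99)
    print(f"30-day budget: {budget:.2f} min")                             # 4.32 min
    print(f"365-day budget: {error_budget_minutes(99.99, 365):.2f} min")  # 52.56 min
    # How many ~3-second active-passive failovers fit in the monthly budget?
    print(f"failovers per month: {budget * 60 / 3:.0f}")
```

At 99.99% the monthly budget is a little over four minutes, so a single slow DNS-TTL failover can consume most of it, while VRRP-style failovers of a few seconds leave comfortable room.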

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Backend flapping — alternates healthy/unhealthy	Backend join/leave events > 10/min; error rate oscillates	Increase health check threshold (3→5 failures); increase interval (5→10s); investigate backend OOM/GC pauses causing slow health response
All traffic routed to one backend (sticky session storm)	One backend at 100% CPU, others idle; corporate NAT IP identified	Switch from IP hash to cookie-based sticky; or disable sticky and migrate state to Redis; add backends to consistent hash ring
Regional LB failure during peak	GSLB health check fails; DNS still resolving to dead region	Remove region from GSLB pool; TTL-dependent failover to secondary region; pre-warm secondary region capacity for 2× overflow

Cost Drivers (Staff lens)

LB instances: 4-8 large instances for HA ≈ $5-10K/month
Cross-AZ traffic: LB ↔ backend bandwidth (often free within AZ, charged cross-AZ)
TLS certificates and HSM for key management

Multi-Region & DR

GSLB routes to nearest healthy region. Each region has independent LB + backend pool. Failover: GSLB removes unhealthy region; clients retry (DNS TTL delay). Active-active preferred over active-passive for LB tier — no cold standby delay.