Interview Prompt
Design a Circuit Breaker.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: State machine (closed → open → half-open), Failure rate thresholds, Bulkhead pattern? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- State machine (closed → open → half-open)
- Failure rate thresholds
- Bulkhead pattern
- Fallback strategies
- Integration with service mesh
- Capacity estimation with shown math
Out of scope (state explicitly)
- Full service mesh control plane
- Application business logic in downstream services
- Building a custom service registry from scratch when managed options exist
Assumptions
- Clarify scale (DAU, QPS, data volume) for circuit breaker in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
Functional Requirements
- Monitor health of downstream service calls (success/failure rates)
- Automatically stop sending requests to unhealthy services (circuit OPEN)
- Periodically probe to check if service has recovered (HALF-OPEN state)
- Resume traffic when service recovers (circuit CLOSED)
- Configurable thresholds: failure rate, slow-call rate, minimum call volume
- Support per-service, per-endpoint, per-client circuit breakers
- Dashboard showing circuit breaker states across all services
- Integration with service mesh (Envoy/Istio sidecar) or application library
- Fallback responses when circuit is open (cached response, default value, degraded mode)
- Manual override: force-open or force-close circuits
Non-Functional Requirements
- Ultra-Low Overhead: < 1µs per call decision (in-process check)
- Fast Detection: Detect failures within 5–10 seconds
- Fast Recovery: Resume traffic within seconds of downstream recovery
- No SPOF: Circuit breaker itself must not become a single point of failure
- Consistency: All instances of a service should have similar circuit state (eventual)
- Observability: Emit metrics for open/close transitions, fallback invocations
Capacity Estimation
| Metric | Calculation | Value |
|---|---|---|
| Services in production | Given | 500 |
| Inter-service RPC calls / sec | Derived from daily volume ÷ 86400 (+ peak factor) | 10M |
| Circuit breakers (per service × endpoint) | Given | ~5,000 |
| State transitions / min (normal) | Given | < 10 |
| Metric data points / sec | Derived from daily volume ÷ 86400 (+ peak factor) | 50K |
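As a rough sanity check on the table above, the totals can be converted into per-service and per-breaker averages. This assumes traffic is spread evenly across services and breakers, which real systems rarely satisfy, so treat the results as order-of-magnitude only. The 100-call window is the sliding_window_size used in the configuration example later in this sheet.

```latex
\begin{aligned}
\text{calls per service} &\approx \frac{10^{7}\ \text{calls/s}}{500} = 2\times10^{4}\ \text{calls/s} \\
\text{calls per breaker} &\approx \frac{10^{7}\ \text{calls/s}}{5{,}000} = 2\times10^{3}\ \text{calls/s} \\
\text{time to fill a 100-call window} &\approx \frac{100\ \text{calls}}{2\times10^{3}\ \text{calls/s}} = 50\ \text{ms}
\end{aligned}
```

These rates are fleet-wide. Each instance sees only its share, so with 10 instances per service a 100-call window would fill in roughly 500 ms, still well inside the 5 to 10 second detection target.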
High-Level Design
An in-process circuit breaker library wraps calls to downstream services. The state machine transitions between CLOSED (normal operation), OPEN (rejecting requests), and HALF-OPEN (probing recovery). Metrics are emitted to Prometheus for dashboarding.
State Machine Deep Dive
CLOSED → OPEN: Triggered when failure_rate > threshold (e.g., 50%) over sliding window AND minimum_calls met (e.g., 20). Sliding window types: count-based (last N calls) or time-based (last T seconds). Failure types counted: exceptions/HTTP 5xx, timeouts, and slow calls (> slow_call_threshold).
OPEN → HALF-OPEN: After wait_duration (e.g., 30 seconds) to give downstream time to recover.
HALF-OPEN → CLOSED: If permitted probe calls succeed (e.g., 5/5). Reset failure metrics, resume normal traffic.
HALF-OPEN → OPEN: If any probe fails. Restart wait_duration timer. Optimization: exponential backoff (30s → 60s → 120s → max 300s).
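The exponential backoff mentioned for repeated HALF-OPEN → OPEN transitions can be sketched as a small helper. This is a minimal illustration, not a library API: the class name OpenWaitBackoff and its method are hypothetical, and the doubling-with-cap rule is an assumption based on the 30s → 60s → 120s → max 300s sequence above.

```java
import java.time.Duration;

/** Hypothetical helper: how long the breaker stays OPEN after repeated failed probe rounds. */
final class OpenWaitBackoff {
    private final Duration base;
    private final Duration max;

    OpenWaitBackoff(Duration base, Duration max) {
        this.base = base;
        this.max = max;
    }

    /**
     * @param failedProbeRounds consecutive HALF-OPEN -> OPEN transitions so far
     * @return base * 2^failedProbeRounds, capped at max
     */
    Duration waitFor(int failedProbeRounds) {
        long capMillis = max.toMillis();
        long waitMillis = base.toMillis();
        // Stop doubling once the cap is reached so the value cannot overflow.
        for (int i = 0; i < failedProbeRounds && waitMillis < capMillis; i++) {
            waitMillis *= 2;
        }
        return Duration.ofMillis(Math.min(waitMillis, capMillis));
    }
}
```

With a 30-second base and a 300-second cap, the first failed probe round doubles the wait to 60 seconds, and the counter would be reset to zero once the breaker closes again.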
In-Process Sliding Window (Count-Based)
Implemented as a ring buffer: O(1) per call, fixed memory. Each slot stores outcome: 0=success, 1=failure, 2=slow. On each call result, update ring buffer head, increment/decrement counters, check threshold, and possibly trip to OPEN.
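The time-based alternative (last T seconds) can reuse the same ring idea with one bucket per second instead of one slot per call. The sketch below is an assumption about how such a window might look, not a specific library's implementation; the class TimeWindow and its methods are hypothetical, and timestamps are passed in explicitly so the behaviour is easy to test.

```java
import java.util.Arrays;

/** Hypothetical time-based window: one bucket per second, covering the last N seconds. */
final class TimeWindow {
    private final long[] bucketSecond; // the second each bucket currently represents, -1 if unused
    private final int[] calls;
    private final int[] failures;

    TimeWindow(int windowSeconds) {
        bucketSecond = new long[windowSeconds];
        calls = new int[windowSeconds];
        failures = new int[windowSeconds];
        Arrays.fill(bucketSecond, -1);
    }

    /** Records one outcome at nowSecond (non-negative seconds since an arbitrary epoch). */
    void record(long nowSecond, boolean failed) {
        int i = (int) (nowSecond % bucketSecond.length);
        if (bucketSecond[i] != nowSecond) { // bucket still holds data from an earlier lap
            bucketSecond[i] = nowSecond;
            calls[i] = 0;
            failures[i] = 0;
        }
        calls[i]++;
        if (failed) failures[i]++;
    }

    /** Failure rate in percent over buckets still inside the window, or -1 below minimumCalls. */
    double failureRate(long nowSecond, int minimumCalls) {
        int totalCalls = 0;
        int totalFailures = 0;
        for (int i = 0; i < bucketSecond.length; i++) {
            boolean fresh = bucketSecond[i] >= 0 && nowSecond - bucketSecond[i] < bucketSecond.length;
            if (fresh) {
                totalCalls += calls[i];
                totalFailures += failures[i];
            }
        }
        if (totalCalls == 0 || totalCalls < minimumCalls) return -1;
        return 100.0 * totalFailures / totalCalls;
    }
}
```

Memory stays fixed at three arrays of T entries, and recording is O(1). Reading the rate is O(T), which a production version would likely avoid by maintaining running totals.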
Distributed Circuit Breaker State
Redis can be used for cross-instance CB state sharing: HSET cb:payment:/charge state "OPEN". Each instance publishes local metrics every 5 seconds. Trade-off: adds Redis latency to hot path. Recommendation: keep CB in-process, share state for visibility only.
Configuration API
```
PUT /api/circuit-breakers/{service}/{endpoint}
{
  "failure_rate_threshold": 50,
  "slow_call_rate_threshold": 80,
  "slow_call_duration_ms": 5000,
  "sliding_window_type": "COUNT",
  "sliding_window_size": 100,
  "minimum_calls": 20,
  "wait_duration_open_ms": 30000,
  "permitted_calls_half_open": 5,
  "fallback_type": "CACHE",
  "manual_override": null
}

GET /api/circuit-breakers/status → All CB states

POST /api/circuit-breakers/{service}/{endpoint}/override
{ "state": "FORCE_OPEN" }
```
Metrics API
```
GET /api/circuit-breakers/metrics?service=payment-service
{
  "circuit_state": "OPEN",
  "failure_rate": 67.5,
  "total_calls": 1500,
  "successful_calls": 487,
  "failed_calls": 1013,
  "not_permitted_calls": 2300,
  "fallback_calls": 2300,
  "state_transitions": [
    {"from": "CLOSED", "to": "OPEN", "at": "2026-03-14T10:05:00Z"}
  ]
}
```
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
Ring Buffer Implementation (Count-Based)
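```java
import java.util.function.Supplier;

class CircuitBreaker<T> {
    enum State { CLOSED, OPEN, HALF_OPEN }
    // Java 16+ record. Durations are in nanoseconds to compare directly with System.nanoTime().
    record Config(int windowSize, int minimumCalls, double failureRateThreshold,
                  long waitDurationNanos, long slowCallNanos, int permittedCallsHalfOpen) {}
    private final Config config;
    private final int[] outcomes;   // ring buffer: 0=success, 1=failure, 2=slow
    private int head = 0;
    private int totalFailures = 0;  // failed + slow calls currently in the window
    private int totalCalls = 0;     // calls currently in the window (at most windowSize)
    private State state = State.CLOSED;
    private long openedAt;
    private int halfOpenPermits, halfOpenSuccesses;
    long notPermitted = 0;          // stand-in for metrics.increment("not_permitted")

    CircuitBreaker(Config config) {
        this.config = config;
        this.outcomes = new int[config.windowSize()];
    }

    T execute(Supplier<T> call, Supplier<T> fallback) {
        if (!tryAcquire()) return fallback.get();
        long start = System.nanoTime();
        try {
            T r = call.get();
            recordOutcome(System.nanoTime() - start > config.slowCallNanos() ? 2 : 0);
            return r;
        } catch (Exception e) {
            recordOutcome(1);
            return fallback.get();
        }
    }

    private synchronized boolean tryAcquire() {
        if (state == State.OPEN && System.nanoTime() - openedAt > config.waitDurationNanos()) {
            state = State.HALF_OPEN;
            halfOpenPermits = config.permittedCallsHalfOpen();
            halfOpenSuccesses = 0;
        }
        if (state == State.CLOSED) return true;
        if (state == State.HALF_OPEN && halfOpenPermits > 0) { halfOpenPermits--; return true; }
        notPermitted++;
        return false;
    }

    private synchronized void recordOutcome(int outcome) {
        if (state == State.HALF_OPEN) {          // any failed or slow probe re-opens
            if (outcome != 0) trip();
            else if (++halfOpenSuccesses == config.permittedCallsHalfOpen()) reset();
        } else if (state == State.CLOSED) {      // results arriving while OPEN are ignored
            if (totalCalls < outcomes.length) totalCalls++;
            else if (outcomes[head] != 0) totalFailures--;   // evict the oldest outcome
            outcomes[head] = outcome;
            if (outcome != 0) totalFailures++;
            head = (head + 1) % outcomes.length;
            if (totalCalls >= config.minimumCalls()
                    && 100.0 * totalFailures / totalCalls > config.failureRateThreshold()) trip();
        }
    }

    private void trip() { state = State.OPEN; openedAt = System.nanoTime(); }
    private void reset() { state = State.CLOSED; head = totalFailures = totalCalls = 0; }
}
```
Prometheus Metrics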
```
# Circuit breaker state (0=closed, 1=open, 2=half-open)
circuit_breaker_state{service="payment", endpoint="/charge"} 1
# Call outcomes
circuit_breaker_calls_total{service="payment", outcome="success"} 487
circuit_breaker_calls_total{service="payment", outcome="failure"} 1013
circuit_breaker_calls_total{service="payment", outcome="not_permitted"} 2300
# State transitions
circuit_breaker_transitions_total{service="payment", from="closed", to="open"} 3
```
Cascading Failure Prevention
Without CB: Service C is down → B gets timeouts → B's thread pool exhausted → A's thread pool exhausted → TOTAL SYSTEM DOWN. With CB: C is down → B's CB trips after 50% failures → B returns fallback immediately → B's thread pool stays healthy → A stays healthy. Key insight: CB prevents resource exhaustion by failing fast.
Bulkhead Pattern (Complementary to CB)
Even with CB, a slow service can consume all threads. Bulkhead isolates thread pools per downstream service (e.g., 20 threads for B, 10 for C, 15 for D). If C is slow, only 10 threads blocked. Implementations: Hystrix-style thread pool isolation, semaphore isolation (lighter weight), Envoy max_connections per upstream.
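Semaphore isolation, the lighter-weight option named above, can be sketched with java.util.concurrent.Semaphore. The Bulkhead class and its method names are hypothetical and only illustrate the idea: a fixed number of permits per downstream, with immediate rejection when the compartment is full.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical semaphore bulkhead: caps concurrent in-flight calls to one downstream. */
final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T execute(Supplier<T> call, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // compartment full: reject immediately instead of queueing
        }
        try {
            return call.get();
        } finally {
            permits.release(); // always hand the permit back, even if the call throws
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

One instance would be created per downstream, for example `new Bulkhead(10)` for service C, so a slow C can hold at most 10 caller threads. Unlike thread pool isolation, the call still runs on the caller's thread, so this limits concurrency but cannot interrupt a call that is already hanging; a timeout is still needed.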
Timeout Strategy
Layered timeouts (defense in depth): connection timeout (1s), read timeout (5s), CB slow-call threshold (3s), retry budget (max 3, total 10s), deadline propagation. Without layered timeouts: client timeout of 30s waits 30s for each failing call. With CB + timeouts: after 20 failures (each 5s timeout) → CB trips → instant fallback.
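Deadline propagation can be illustrated with a small budget object that every attempt consults before it starts. This is a sketch under assumptions: the Deadline class is hypothetical, and the clock value is passed in explicitly (in practice it would come from System.nanoTime()).

```java
import java.time.Duration;

/** Hypothetical deadline: one time budget shared by all attempts of a logical request. */
final class Deadline {
    private final long expiresAtNanos;

    private Deadline(long expiresAtNanos) {
        this.expiresAtNanos = expiresAtNanos;
    }

    static Deadline after(Duration budget, long nowNanos) {
        return new Deadline(nowNanos + budget.toNanos());
    }

    /** Time left in the overall budget, never negative. */
    Duration remaining(long nowNanos) {
        return Duration.ofNanos(Math.max(0, expiresAtNanos - nowNanos));
    }

    /** Timeout for the next attempt: the per-call read timeout, clipped to what is left. */
    Duration nextAttemptTimeout(Duration readTimeout, long nowNanos) {
        Duration left = remaining(nowNanos);
        return left.compareTo(readTimeout) < 0 ? left : readTimeout;
    }
}
```

With the 10-second total budget and 5-second read timeout from the list above, an attempt starting 7 seconds into the request would get only 3 seconds, and an attempt after 10 seconds would get zero and should be skipped. Passing the remaining budget to the downstream (for example in a request header) lets it stop work the caller will no longer wait for.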
Distributed CB Coordination
Problem: 10 instances may each see different failure rates. Options: 1) No coordination (simplest, eventually converge). 2) Shared metrics via Redis (consistent but adds dependency). 3) Control plane gossip (eventual consistency, no critical-path dependency). Recommendation: Option 1 for most cases, Option 3 for critical paths.
Retry vs Circuit Breaker Interaction
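Bad: retry(3, circuit_breaker(call)). Every retry is still attempted when the CB is open, so attempts (and any backoff delay between them) are wasted on calls that are rejected immediately. Good: circuit_breaker(retry(3, call)). The CB wraps the retries: when it is open, no retries are attempted, and when failures persist, the CB trips and stops all attempts.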
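The difference between the two orderings can be made concrete by counting retry attempts while the breaker is already open. Everything in this sketch is hypothetical and deliberately simplified: TinyBreaker opens after a fixed number of consecutive failures and never recovers, and retry has no backoff.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class RetryOrderingDemo {
    /** Minimal stand-in breaker: opens after `threshold` consecutive failures and stays open. */
    static final class TinyBreaker {
        private final int threshold;
        private int consecutiveFailures = 0;

        TinyBreaker(int threshold) { this.threshold = threshold; }

        String call(Supplier<String> supplier) {
            if (consecutiveFailures >= threshold) {
                throw new IllegalStateException("circuit open"); // fail fast, supplier not invoked
            }
            try {
                String result = supplier.get();
                consecutiveFailures = 0;
                return result;
            } catch (RuntimeException e) {
                consecutiveFailures++;
                throw e;
            }
        }
    }

    /** Invokes supplier up to maxAttempts (assumed >= 1) times, counting every attempt. */
    static String retry(int maxAttempts, AtomicInteger attempts, Supplier<String> supplier) {
        RuntimeException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            attempts.incrementAndGet();
            try {
                return supplier.get();
            } catch (RuntimeException e) {
                last = e;
            }
        }
        throw last;
    }

    /** Returns how many retry attempts run for one request while the breaker is already open. */
    static int attemptsWhileOpen(boolean breakerWrapsRetry) {
        Supplier<String> failing = () -> { throw new RuntimeException("downstream down"); };
        TinyBreaker breaker = new TinyBreaker(1);
        try { breaker.call(failing); } catch (RuntimeException expected) { /* breaker now open */ }

        AtomicInteger attempts = new AtomicInteger();
        try {
            if (breakerWrapsRetry) {
                breaker.call(() -> retry(3, attempts, failing));   // circuit_breaker(retry(3, call))
            } else {
                retry(3, attempts, () -> breaker.call(failing));   // retry(3, circuit_breaker(call))
            }
        } catch (RuntimeException expected) { /* both orderings fail while downstream is down */ }
        return attempts.get();
    }
}
```

In this sketch `attemptsWhileOpen(true)` returns 0 because the open breaker rejects the request before the retry loop starts, while `attemptsWhileOpen(false)` returns 3 because each retry is rejected individually. A side effect of the wrapping order is that the breaker records one outcome per logical request rather than one per attempt, which changes how quickly the failure-rate threshold is reached.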
Real-World Configurations
| Service | Failure Threshold | Wait Duration | Fallback |
|---|---|---|---|
| Low-risk (catalog) | 70% | 10s | cached catalog data |
| High-risk (payment) | 30% | 60s | queue for later |
| Internal microservice | 50% | 30s | return error |
Testing Circuit Breakers
Chaos engineering: kill downstream service → verify CB trips, inject latency → verify slow-call detection. Load testing: verify CB doesn't trip under normal load, trips quickly under failure. Integration test: configure low thresholds, send mix of success/failure, assert state transitions.
Interview Walkthrough
- Frame as fail-fast protection against cascading failures — a slow downstream must not exhaust the caller's thread pool and take down upstream services.
- Walk through the three states: CLOSED (normal) → OPEN (reject + fallback after threshold breach) → HALF-OPEN (probe recovery with limited calls).
- Explain the sliding window: count-based ring buffer or time-based window with minimum_calls before tripping, which avoids false positives on cold start.
- Count failures broadly: exceptions, HTTP 5xx, timeouts, and slow calls exceeding slow_call_duration_ms.
- Pair with bulkheads (isolated thread pools per downstream) and layered timeouts. CB alone does not prevent resource exhaustion from slow calls below the trip threshold.
- Critical ordering: wrap retries inside the circuit breaker (circuit_breaker(retry(call))), not the reverse. Retries must not fire when the breaker is open.
- Common pitfall: sharing circuit breaker state via Redis on the hot path, which adds latency and introduces a new failure dependency. Keep CB in-process and share metrics for visibility only.
Circuit Breaker vs Bulkhead vs Rate Limiter vs Timeout
| Pattern | Problem | Solution |
|---|---|---|
| Timeout | Downstream hangs → threads pile up | Set max wait time (e.g., 500ms). ALWAYS use this. |
| Retry | Transient failures | Retry with exponential backoff + jitter. Don't retry 4xx. |
| Circuit Breaker | Consistent failures → wasted resources | Trip breaker → fast fail for 30-60s → probe recovery |
| Bulkhead | One slow dependency exhausts thread pool | Separate thread pools per dependency |
| Rate Limiter | Caller overloads the callee | Limit calls per second to downstream |
Half-Open State: The Critical Recovery Mechanism
Without Half-Open: once OPEN, circuit never recovers. With Half-Open: wait duration elapses → allow probe → if succeeds, CLOSED; if fails, OPEN again. Two approaches:
Conservative (1 probe at a time): minimal risk but slow recovery if traffic is low.
Aggressive (sliding window, e.g., 10% of requests): faster recovery, more risk.
Resilience4j default: 10 permitted calls in half-open. Best practice: set wait_duration slightly longer than P95 downstream latency.
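Whichever approach is chosen, the half-open state has to admit a bounded number of probes even when many threads arrive at once. The sketch below shows one way to do that with atomics; HalfOpenGate and its methods are hypothetical names, and a real breaker would create a fresh gate each time it enters HALF-OPEN.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical gate that admits a fixed number of probe calls while HALF-OPEN. */
final class HalfOpenGate {
    private final int permittedCalls;
    private final AtomicInteger admitted = new AtomicInteger();
    private final AtomicInteger succeeded = new AtomicInteger();

    HalfOpenGate(int permittedCalls) {
        this.permittedCalls = permittedCalls;
    }

    /** True if this caller may send a probe; the counter is capped so it cannot overflow. */
    boolean tryAdmit() {
        int before = admitted.getAndUpdate(n -> Math.min(n + 1, permittedCalls));
        return before < permittedCalls;
    }

    /** Returns true when every permitted probe has succeeded, i.e. the breaker may close. */
    boolean recordSuccess() {
        return succeeded.incrementAndGet() == permittedCalls;
    }
}
```

Setting permittedCalls to 1 gives the conservative single-probe behaviour, and callers that are not admitted go straight to the fallback. A failed probe is not modelled here; per the state machine above it would send the breaker back to OPEN immediately.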
Service Mesh vs Application Code
Application code (Resilience4j, Hystrix): fine-grained per method, custom fallbacks in app logic, sub-microsecond in-process check. Cons: per-language implementation, code coupling.
Service Mesh (Istio/Envoy): language-agnostic, centralized observability, no code changes. Cons: coarser granularity, network-level only, ~1ms overhead.
Best practice: service mesh for basic CB + library for complex fallback logic. Istio handles connection-level failures; Resilience4j handles business logic exceptions.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core circuit breaker flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.