Design a Circuit Breaker

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Show domain depth beyond the baseline: async pipelines, consistency semantics, and operational SLOs.
Arch 75	Staff angles: partition behavior, cost drivers, and MVP → production evolution with clear triggers.

Interview Prompt

Design a circuit breaker.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: State machine (closed → open → half-open), Failure rate thresholds, Bulkhead pattern?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

State machine (closed → open → half-open)
Failure rate thresholds
Bulkhead pattern
Fallback strategies
Integration with service mesh
Capacity estimation with shown math

Out of scope (state explicitly)

Full service mesh control plane
Application business logic in downstream services
Building a custom service registry from scratch when managed options exist

Assumptions

Clarify scale (DAU, QPS, data volume) for circuit breaker in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Functional Requirements

Monitor health of downstream service calls (success/failure rates)
Automatically stop sending requests to unhealthy services (circuit OPEN)
Periodically probe to check if service has recovered (HALF-OPEN state)
Resume traffic when service recovers (circuit CLOSED)
Configurable thresholds: failure rate, slow-call rate, minimum call volume
Support per-service, per-endpoint, per-client circuit breakers
Dashboard showing circuit breaker states across all services
Integration with service mesh (Envoy/Istio sidecar) or application library
Fallback responses when circuit is open (cached response, default value, degraded mode)
Manual override: force-open or force-close circuits
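
The fallback requirement (cached response, then default value) can be sketched as follows. This is a minimal illustration, not a prescribed design: the class name CachedFallback and its methods are hypothetical, and cache expiry (TTL, staleness limits) is deliberately left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a cache-then-default fallback; class and method names are illustrative.
final class CachedFallback<K, V> {
    private final Map<K, V> lastGood = new ConcurrentHashMap<>();
    private final V defaultValue;

    CachedFallback(V defaultValue) {
        this.defaultValue = defaultValue;
    }

    // Call after each successful live response (value must be non-null).
    void remember(K key, V value) {
        lastGood.put(key, value);
    }

    // Call when the circuit is open or the live call failed:
    // returns the last good value if one exists, otherwise the default.
    V fallback(K key) {
        return lastGood.getOrDefault(key, defaultValue);
    }
}
```

A caller would invoke remember(...) on every successful response and pass fallback(...) as the breaker's fallback supplier, so an open circuit serves slightly stale data instead of an error.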

Capacity Estimates

Metric	Calculation	Value
Services in production	Given	500
Inter-service RPC calls / sec	Daily volume ÷ 86,400 s, plus a peak factor; 10M/s sustained corresponds to about 864B calls/day	10M
Circuit breakers (per service × endpoint)	Given	~5,000
State transitions / min (normal)	Given	< 10
Metric data points / sec	Daily volume ÷ 86,400 s, plus a peak factor; 50K/s sustained corresponds to about 4.32B points/day	50K
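
The table's figures imply a useful per-breaker number. The sketch below works it out under one loud assumption that is not in the source: calls are spread evenly across the ~5,000 breakers (real traffic is skewed). The window size of 100 is the sliding_window_size from the Configuration API example; the class name CapacityMath is illustrative.

```java
// Back-of-envelope numbers derived from the capacity table.
// Assumption (not from the source): calls are spread evenly across breakers.
final class CapacityMath {
    static final long CALLS_PER_SEC = 10_000_000L; // inter-service RPC calls / sec
    static final long BREAKERS = 5_000L;           // service x endpoint breakers
    static final int WINDOW_SIZE = 100;            // sliding_window_size in the config example

    // Average calls each breaker records per second.
    static long callsPerBreakerPerSec() {
        return CALLS_PER_SEC / BREAKERS;
    }

    // Milliseconds of traffic that a count-based window covers at the average rate.
    static double windowCoverageMillis() {
        return WINDOW_SIZE * 1000.0 / callsPerBreakerPerSec();
    }
}
```

Under that assumption each breaker sees about 2,000 calls/s, so a 100-call window spans roughly 50 ms of traffic. That is one argument for a larger count window or a time-based window on hot endpoints, where a very short burst of errors could otherwise fill the whole window.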

High-Level Design

An in-process circuit breaker library wraps calls to downstream services. The state machine transitions between CLOSED (normal operation), OPEN (rejecting requests), and HALF-OPEN (probing recovery). Metrics are emitted to Prometheus for dashboarding.
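
The three-state machine can be written down as a small transition function. This is a sketch only: the event names (THRESHOLD_BREACHED, WAIT_ELAPSED, PROBES_SUCCEEDED, PROBE_FAILED) are illustrative and do not come from any particular library.

```java
// Minimal transition table for the three-state breaker.
// Event names are illustrative, not taken from any library.
final class BreakerStateMachine {
    enum State { CLOSED, OPEN, HALF_OPEN }
    enum Event { THRESHOLD_BREACHED, WAIT_ELAPSED, PROBES_SUCCEEDED, PROBE_FAILED }

    // Returns the next state; events that do not apply leave the state unchanged.
    static State next(State current, Event event) {
        switch (current) {
            case CLOSED:
                return event == Event.THRESHOLD_BREACHED ? State.OPEN : current;
            case OPEN:
                return event == Event.WAIT_ELAPSED ? State.HALF_OPEN : current;
            case HALF_OPEN:
                if (event == Event.PROBES_SUCCEEDED) return State.CLOSED;
                if (event == Event.PROBE_FAILED) return State.OPEN;
                return current;
            default:
                throw new IllegalStateException("unknown state " + current);
        }
    }
}
```

Keeping the transitions in one pure function makes them easy to unit-test separately from timing and metrics.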


Configuration API

PUT /api/circuit-breakers/{service}/{endpoint}
{
  "failure_rate_threshold": 50,
  "slow_call_rate_threshold": 80,
  "slow_call_duration_ms": 5000,
  "sliding_window_type": "COUNT",
  "sliding_window_size": 100,
  "minimum_calls": 20,
  "wait_duration_open_ms": 30000,
  "permitted_calls_half_open": 5,
  "fallback_type": "CACHE",
  "manual_override": null
}

GET /api/circuit-breakers/status → All CB states
POST /api/circuit-breakers/{service}/{endpoint}/override
  { "state": "FORCE_OPEN" }
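
A server accepting the PUT body above would likely validate it before applying it. The sketch below shows one possible set of checks. The class name BreakerConfig and the specific rules are assumptions for illustration, in particular the rule that minimum_calls may not exceed sliding_window_size (otherwise a COUNT window could never hold enough calls to evaluate the rate).

```java
// Hypothetical in-memory form of the PUT body, with basic validation.
// The rules below are illustrative choices, not a documented contract.
final class BreakerConfig {
    final int failureRateThreshold;   // percent, 1..100
    final int slidingWindowSize;      // calls held in a COUNT window
    final int minimumCalls;           // calls needed before the rate is evaluated
    final long waitDurationOpenMs;    // time spent OPEN before probing
    final int permittedCallsHalfOpen; // probe budget in HALF-OPEN

    BreakerConfig(int failureRateThreshold, int slidingWindowSize, int minimumCalls,
                  long waitDurationOpenMs, int permittedCallsHalfOpen) {
        if (failureRateThreshold < 1 || failureRateThreshold > 100)
            throw new IllegalArgumentException("failure_rate_threshold must be 1..100");
        if (slidingWindowSize < 1)
            throw new IllegalArgumentException("sliding_window_size must be >= 1");
        if (minimumCalls < 1 || minimumCalls > slidingWindowSize)
            throw new IllegalArgumentException("minimum_calls must be 1..sliding_window_size");
        if (waitDurationOpenMs <= 0)
            throw new IllegalArgumentException("wait_duration_open_ms must be > 0");
        if (permittedCallsHalfOpen < 1)
            throw new IllegalArgumentException("permitted_calls_half_open must be >= 1");
        this.failureRateThreshold = failureRateThreshold;
        this.slidingWindowSize = slidingWindowSize;
        this.minimumCalls = minimumCalls;
        this.waitDurationOpenMs = waitDurationOpenMs;
        this.permittedCallsHalfOpen = permittedCallsHalfOpen;
    }
}
```

Rejecting a bad configuration with 400 or 422 at write time is cheaper than discovering at 2am that a breaker can never trip.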

Metrics API

JSON

GET /api/circuit-breakers/metrics?service=payment-service
{
  "circuit_state": "OPEN",
  "failure_rate": 67.5,
  "total_calls": 1500,
  "successful_calls": 487,
  "failed_calls": 1013,
  "not_permitted_calls": 2300,
  "fallback_calls": 2300,
  "state_transitions": [
    {"from": "CLOSED", "to": "OPEN", "at": "2026-03-14T10:05:00Z"}
  ]
}

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff

Ring Buffer Implementation (Count-Based)

JAVA

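import java.util.function.Supplier;

class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    // Illustrative settings; names mirror the Configuration API fields.
    static class Config {
        int failureRateThreshold = 50;                // percent
        int slidingWindowSize = 100;                  // calls kept in the ring buffer
        int minimumCalls = 20;                        // calls required before evaluating the rate
        int permittedCallsHalfOpen = 5;               // probe budget in HALF_OPEN
        long waitDurationNanos = 30_000_000_000L;     // 30 s in OPEN before probing
        long slowCallDurationNanos = 5_000_000_000L;  // simplification: slow calls count as failures
    }

    private final Config config;
    private final boolean[] failed;   // ring buffer: true = failure or slow call
    private int head, recorded, failures;
    private State state = State.CLOSED;
    private long openedAt;
    private int halfOpenPermits, halfOpenSuccesses;
    private long notPermittedCalls;   // stand-in for the not_permitted metric

    CircuitBreaker(Config config) {
        this.config = config;
        this.failed = new boolean[config.slidingWindowSize];
    }

    <T> T execute(Supplier<T> call, Supplier<T> fallback) {
        if (!tryAcquire()) return fallback.get();
        long start = System.nanoTime();
        try {
            T result = call.get();
            record(System.nanoTime() - start > config.slowCallDurationNanos);
            return result;
        } catch (RuntimeException e) {
            record(true);
            return fallback.get();
        }
    }

    // Admits or rejects a call; moves OPEN to HALF_OPEN once the wait has elapsed.
    private synchronized boolean tryAcquire() {
        if (state == State.OPEN && System.nanoTime() - openedAt >= config.waitDurationNanos) {
            state = State.HALF_OPEN;
            halfOpenPermits = config.permittedCallsHalfOpen;
            halfOpenSuccesses = 0;
        }
        if (state == State.CLOSED) return true;
        if (state == State.HALF_OPEN && halfOpenPermits > 0) {
            halfOpenPermits--;   // the original never consumed a permit
            return true;
        }
        notPermittedCalls++;
        return false;
    }

    private synchronized void record(boolean failure) {
        if (state == State.OPEN) return;  // result of a call admitted before the trip
        if (state == State.HALF_OPEN) {
            if (failure) trip();
            else if (++halfOpenSuccesses >= config.permittedCallsHalfOpen) close();
            return;
        }
        if (recorded == failed.length && failed[head]) failures--;  // evict the oldest outcome
        if (recorded < failed.length) recorded++;
        failed[head] = failure;
        if (failure) failures++;
        head = (head + 1) % failed.length;
        if (recorded >= config.minimumCalls
                && 100.0 * failures / recorded >= config.failureRateThreshold) trip();
    }

    private void trip() { state = State.OPEN; openedAt = System.nanoTime(); }
    private void close() { state = State.CLOSED; head = recorded = failures = 0; }
}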

Prometheus Metrics

# Circuit breaker state (0=closed, 1=open, 2=half-open)
circuit_breaker_state{service="payment", endpoint="/charge"} 1

# Call outcomes
circuit_breaker_calls_total{service="payment", outcome="success"} 487
circuit_breaker_calls_total{service="payment", outcome="failure"} 1013
circuit_breaker_calls_total{service="payment", outcome="not_permitted"} 2300

# State transitions
circuit_breaker_transitions_total{service="payment", from="closed", to="open"} 3
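
For illustration, the state gauge line can be produced by hand as below. This is plain string formatting that mirrors the sample line; a real service would more likely use a Prometheus client library. The class name StateGauge is hypothetical, and escaping of label values (quotes, backslashes) is omitted.

```java
import java.util.Locale;

// Renders the state gauge line by hand, matching the sample output.
// Label-value escaping is omitted in this sketch.
final class StateGauge {
    enum State { CLOSED, OPEN, HALF_OPEN }

    // Encoding from the metric comment: 0=closed, 1=open, 2=half-open.
    static int encode(State state) {
        switch (state) {
            case CLOSED: return 0;
            case OPEN: return 1;
            default: return 2;
        }
    }

    static String render(String service, String endpoint, State state) {
        return String.format(Locale.ROOT,
                "circuit_breaker_state{service=\"%s\", endpoint=\"%s\"} %d",
                service, endpoint, encode(state));
    }
}
```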

Real-World Configurations

Service	Failure Threshold	Wait Duration	Fallback
Low-risk (catalog)	70%	10s	cached catalog data
High-risk (payment)	30%	60s	queue for later
Internal microservice	50%	30s	return error

Testing Circuit Breakers

Chaos engineering: kill the downstream service and verify the CB trips; inject latency and verify slow-call detection.
Load testing: verify the CB does not trip under normal load and trips quickly under failure.
Integration test: configure low thresholds, send a mix of successes and failures, and assert the state transitions.

Interview Walkthrough

Frame as fail-fast protection against cascading failures — a slow downstream must not exhaust the caller's thread pool and take down upstream services.
Walk through the three states: CLOSED (normal) → OPEN (reject + fallback after threshold breach) → HALF-OPEN (probe recovery with limited calls).
Explain the sliding window: count-based ring buffer or time-based window with minimum_calls before tripping — avoids false positives on cold start.
Count failures broadly: exceptions, HTTP 5xx, timeouts, and slow calls exceeding slow_call_duration_ms.
Pair with Bulkheads (isolated thread pools per downstream) and layered timeouts — CB alone does not prevent resource exhaustion from slow calls below the trip threshold.
Critical ordering: wrap retries inside the circuit breaker (circuit_breaker(retry(call))), not the reverse — retries must not fire when the breaker is open.
Common pitfall: sharing circuit breaker state via Redis on the hot path — adds latency and introduces a new failure dependency; keep CB in-process, share metrics for visibility only.
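
The ordering circuit_breaker(retry(call)) from the walkthrough can be sketched as follows. Gate is a hypothetical stand-in for a real breaker's permit check and outcome recording; backoff and jitter are omitted to keep the example short.

```java
import java.util.function.Supplier;

// Sketch of the ordering circuit_breaker(retry(call)).
// Gate is a hypothetical stand-in for a real breaker.
final class RetryInsideBreaker {
    interface Gate {
        boolean allows();               // false while the breaker is OPEN
        void onResult(boolean success); // one outcome per logical call
    }

    // Inner layer: retries the call up to maxAttempts times (no backoff in this sketch).
    static <T> T retry(Supplier<T> call, int maxAttempts) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
            }
        }
        throw last;
    }

    // Outer layer: the breaker decides first, so an open breaker means no attempt and no retry.
    static <T> T guarded(Gate gate, Supplier<T> call, Supplier<T> fallback, int maxAttempts) {
        if (!gate.allows()) return fallback.get();
        try {
            T result = retry(call, maxAttempts);
            gate.onResult(true);
            return result;
        } catch (RuntimeException e) {
            gate.onResult(false); // the whole retried sequence counts as one failure
            return fallback.get();
        }
    }
}
```

One consequence of this ordering is that the breaker sees a single outcome per logical call, so thresholds should be tuned with that in mind.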

Circuit Breaker vs Bulkhead vs Rate Limiter vs Timeout

Pattern	Problem	Solution
Timeout	Downstream hangs → threads pile up	Set max wait time (e.g., 500ms). ALWAYS use this.
Retry	Transient failures	Retry with exponential backoff + jitter. Don't retry 4xx.
Circuit Breaker	Consistent failures → wasted resources	Trip breaker → fast fail for 30-60s → probe recovery
Bulkhead	One slow dependency exhausts thread pool	Separate thread pools per dependency
Rate Limiter	Caller overloads the callee	Limit calls per second to downstream
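
As a small illustration of the Bulkhead row, the sketch below caps concurrent calls to one dependency. Note that it uses a semaphore rather than the separate thread pools named in the table: it is a lighter variant of the same isolation idea, and the class name SemaphoreBulkhead is illustrative.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Semaphore-based bulkhead: caps in-flight calls to one dependency.
// A lighter variant of the thread-pool-per-dependency form in the table.
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T execute(Supplier<T> call, Supplier<T> fallback) {
        // Non-blocking acquire: when the compartment is full, shed the call immediately.
        if (!permits.tryAcquire()) return fallback.get();
        try {
            return call.get();
        } finally {
            permits.release();
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Unlike a dedicated thread pool, this variant runs the call on the caller's thread, so it still needs a timeout to bound how long each permit is held.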

Half-Open State: The Critical Recovery Mechanism

Without HALF-OPEN, an OPEN circuit never recovers on its own. With HALF-OPEN, once the wait duration elapses the breaker allows probe calls: if they succeed it returns to CLOSED; if they fail it goes back to OPEN. Two approaches:

Conservative (1 probe at a time): minimal risk but slow recovery if traffic is low.

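Aggressive (sliding window, e.g., 10% of requests): faster recovery, more risk. Resilience4j default: 10 permitted calls in half-open.

Best practice: size wait_duration to the downstream's expected recovery time (10s to 60s in the configurations above). It should sit well above the downstream's P95 latency; a wait that short would only re-probe a service that has had no time to recover.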
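
A probe budget for HALF-OPEN can be kept safe under concurrent callers with an atomic counter, as in the sketch below. The policy shown (any failed probe reopens; the circuit closes once every permitted probe has succeeded) is an assumption chosen for simplicity. It is stricter than evaluating a failure rate over the probes, and the names HalfOpenProbes and Verdict are illustrative.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a probe budget for HALF-OPEN; names and policy are illustrative.
final class HalfOpenProbes {
    enum Verdict { PENDING, CLOSE, REOPEN }

    private final int permitted;
    private final AtomicInteger remaining;
    private final AtomicInteger successes = new AtomicInteger();

    HalfOpenProbes(int permittedCalls) {
        if (permittedCalls < 1) throw new IllegalArgumentException("permittedCalls must be >= 1");
        this.permitted = permittedCalls;
        this.remaining = new AtomicInteger(permittedCalls);
    }

    // Hands out at most `permitted` probe slots, even with concurrent callers.
    boolean tryAcquire() {
        while (true) {
            int current = remaining.get();
            if (current <= 0) return false;
            if (remaining.compareAndSet(current, current - 1)) return true;
        }
    }

    // Any failed probe reopens; the circuit closes once all permitted probes have succeeded.
    Verdict onProbeResult(boolean success) {
        if (!success) return Verdict.REOPEN;
        return successes.incrementAndGet() >= permitted ? Verdict.CLOSE : Verdict.PENDING;
    }
}
```

With permittedCalls set to 1 this is the conservative approach; larger values trade some risk for faster recovery.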

Service Mesh vs Application Code

Application code (Resilience4j, Hystrix): fine-grained per method, custom fallbacks in app logic, and an in-process state check that adds no network hop. Cons: per-language implementation, code coupling.

Service Mesh (Istio/Envoy): language-agnostic, centralized observability, no code changes. Cons: coarser granularity, network-level only, ~1ms overhead.

Best practice: service mesh for basic CB + library for complex fallback logic. Istio handles connection-level failures; Resilience4j handles business logic exceptions.

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
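
To make the availability target concrete, the sketch below converts it into an error budget in minutes. The 30-day window is an assumption (the table does not state a measurement window), and the class name ErrorBudget is illustrative.

```java
// Converts an availability target into allowed downtime.
// The window length is an assumption supplied by the caller.
final class ErrorBudget {
    // Minutes of allowed unavailability in a window, given an availability target in percent.
    static double allowedDowntimeMinutes(double availabilityPercent, int windowDays) {
        double windowMinutes = windowDays * 24.0 * 60.0;
        return windowMinutes * (100.0 - availabilityPercent) / 100.0;
    }
}
```

For 99.95% over 30 days this gives about 21.6 minutes, which is the budget that planned maintenance and incidents have to share.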

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
