Interview Prompt
Design Stock Exchange Matching Engine.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Order book data structure (price-time priority), Matching algorithm, Lock-free queues? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Order book data structure (price-time priority)
- Matching algorithm
- Lock-free queues
- Nanosecond latency
- Market data dissemination
- Regulatory audit trail
Out of scope (state explicitly)
- Retail brokerage mobile app
- Regulatory reporting to SEC/FINRA
- Blockchain settlement layer
Assumptions
- Clarify scale (DAU, QPS, data volume) for stock exchange matching engine in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Order types: Limit orders, market orders, stop orders, stop-limit orders, IOC, FOK, GTC
- Order matching: Match buy and sell orders by price-time priority (FIFO within same price)
- Order book: Maintain real-time order book per instrument (bids and asks sorted by price)
- Trade execution: Execute matched orders; generate trade records with unique trade IDs
- Order lifecycle: New, acknowledged, partially filled, filled, cancelled, expired, rejected
- Market data: Publish real-time L1 (BBO), L2 (depth), L3 (full order book)
- Order modification: Amend quantity (reduce only) or price (loses time priority on price change)
- Cancel: Cancel resting orders; cancel-on-disconnect for algo traders
- Opening/closing auctions: Batch matching at market open and close (call auction)
- Pre-trade risk checks: Position limits, order rate limits, fat-finger checks
- Ultra-Low Latency: Order-to-acknowledgement in < 10 microseconds (p99.9)
- Deterministic: Same input sequence always produces same output (for regulatory replay)
- Throughput: 1M+ orders/sec across all instruments; 10K+/sec per hot instrument
- Fairness: Strict price-time priority (FIFO at each price level)
- Durability: Every order and trade persisted to WAL before acknowledgement
- Availability: 99.99% during market hours (9:30 AM – 4:00 PM ET)
- Consistency: No phantom fills, no double fills, no lost orders
| Metric | Calculation | Value |
|---|---|---|
| Instruments | Given (assumption documented in value) | 10,000 |
| Orders / sec (peak) | 10,000 instruments × ~100 orders/sec average | 1M across all instruments |
| Orders per instrument / sec | Given (assumption) | 100 avg, 10K for hot stocks |
| Trades / sec | Given (about 10% of the peak order rate) | 100K |
| Cancels / sec | Given (about 50% of the peak order rate) | 500K |
| Order book depth | Given | ~1,000 price levels per side |
| Market data updates / sec | Given (about 5 updates per order event) | 5M |
| WAL write throughput | 1M events/sec × ~200 bytes | 200 MB/sec |
| Daily trade records | Given | ~500M |
Critical Path Latency Budget
| Step | Latency |
|---|---|
| FIX parse + normalize | ~1-2 μs |
| Risk checks (shared memory) | ~0.5 μs |
| Sequencer (assign + NVMe WAL write) | ~1-2 μs |
| Ring buffer publish + consume | ~0.1 μs |
| Matching engine (lookup + match) | ~0.5-1 μs |
| Execution report publish | ~0.5-1 μs |
| FIX response serialize + send | ~1-2 μs |
| Total | ~5-9 μs |
Order-to-Trade Flow
1. Matching Algorithm: Price-Time Priority
Order book is split into Bid and Ask sides. Within the same price level, priority is strictly determined by arrival time.
Order Book for AAPL:
| Bid price | Bid qty | Bid time | Ask price | Ask qty | Ask time |
|---|---|---|---|---|---|
| $150.10 | 100 | 09:30:01.001 | $150.15 | 200 | 09:30:01.002 |
| $150.10 | 50 | 09:30:01.005 | $150.20 | 300 | 09:30:01.003 |
Data structure:
- Bids: sorted map (price DESC → doubly-linked list of orders, FIFO)
- Asks: sorted map (price ASC → doubly-linked list of orders, FIFO)
- orders: HashMap<OrderId, pointer-to-node-in-linked-list>
- Cancel by order_id: O(1), which is critical since algos cancel 90% of orders
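The structure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production book: the class and method names (`OrderBook`, `submit`, `cancel`) are hypothetical, each price level is an `OrderedDict` standing in for the doubly-linked list, and the best price is found with `max`/`min` over the level keys for brevity where a real engine would use a sorted map or array.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class Order:
    order_id: str
    side: str    # "buy" or "sell"
    price: int   # integer ticks, e.g. 15010 = $150.10
    qty: int


class OrderBook:
    def __init__(self):
        # side -> price -> FIFO of resting orders (OrderedDict keeps arrival order)
        self.levels = {"buy": {}, "sell": {}}
        self.index = {}  # order_id -> Order, so cancel never scans the book

    def _best(self, side):
        prices = self.levels[side]
        if not prices:
            return None
        # Simplification: linear scan; a sorted map would give the best price directly.
        return max(prices) if side == "buy" else min(prices)

    def cancel(self, order_id):
        order = self.index.pop(order_id, None)
        if order is None:
            return False  # unknown, already filled, or already cancelled
        level = self.levels[order.side][order.price]
        del level[order_id]
        if not level:
            del self.levels[order.side][order.price]
        return True

    def submit(self, order):
        """Match a limit order; rest any remainder. Returns [(resting_id, price, qty)]."""
        trades = []
        opposite = "sell" if order.side == "buy" else "buy"
        while order.qty > 0:
            best = self._best(opposite)
            if best is None:
                break
            crosses = order.price >= best if order.side == "buy" else order.price <= best
            if not crosses:
                break
            resting = next(iter(self.levels[opposite][best].values()))  # oldest at this price
            fill = min(order.qty, resting.qty)
            trades.append((resting.order_id, best, fill))
            order.qty -= fill
            resting.qty -= fill
            if resting.qty == 0:
                self.cancel(resting.order_id)  # fully filled: remove from level and index
        if order.qty > 0:
            self.levels[order.side].setdefault(order.price, OrderedDict())[order.order_id] = order
            self.index[order.order_id] = order
        return trades


book = OrderBook()
book.submit(Order("b1", "buy", 15010, 100))   # arrived 09:30:01.001
book.submit(Order("b2", "buy", 15010, 50))    # arrived 09:30:01.005
trades = book.submit(Order("x1", "sell", 15010, 120))
```

With the two AAPL bids from the table resting at $150.10, the incoming sell for 120 fills the earlier bid completely before touching the later one, which is the FIFO rule within a price level.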
2. Sequencer: Total Ordering of Events
Fairness is paramount. The order that reaches the sequencer first must be processed first. Single-threaded assignment ensures exact sequencing.
Why total ordering? Fairness. The order that arrives first MUST be processed first.
Implementation:
All gateways → Sequencer (single thread, single machine per partition)
For each incoming order:
1. Assign seq_num = atomic_counter++
2. Write {seq_num, order_data} to WAL on local NVMe SSD (~1μs)
3. Replicate WAL entry to standby server
4. Publish to ring buffer for matching engine consumption
Throughput: ~10M seq assignments/sec
Latency: ~1-2 μs per assignment
3. Why Single-Threaded Matching (Not Multi-Threaded)
Rather than complex locking or CAS operations inside the hot path, single-threading per partition provides unmatched performance.
1. No locking overhead: Single-threaded = zero lock contention
2. Deterministic replay: Same input → same output, always
3. CPU cache efficiency: Hot order book data fits in L1/L2 cache
4. Good enough throughput: Single core processes 500K+ ops/sec per instrument
Per-instrument parallelism:
- 10,000 instruments across 64 partitions
- Each partition on dedicated hardware
- Core pinning: Thread for AAPL always runs on core 4 (no context switches)
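One way to keep every instrument on the same partition is a stable hash of the symbol. The sketch below is an assumption for illustration (the function name `partition_for` is hypothetical); it uses `zlib.crc32` because Python's built-in `hash()` is randomized per process and would break determinism. An exchange may instead use a static assignment table so that hot symbols can be balanced by hand.

```python
import zlib

NUM_PARTITIONS = 64  # from the text: 10,000 instruments across 64 partitions


def partition_for(symbol: str) -> int:
    """Stable symbol -> partition mapping; identical in every process and on every restart."""
    return zlib.crc32(symbol.encode("ascii")) % NUM_PARTITIONS
```

Because the mapping depends only on the symbol, all events for one instrument reach the same single-threaded engine, which is what keeps per-instrument ordering intact.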
4. Pre-Trade Risk Engine: Sub-Microsecond Checks
Shared memory allows non-blocking, sub-100ns checks across gateway threads.
- Fat-Finger Detection: if order.price > 2× last_traded_price → REJECT
- Position Limit: net exposure tracker updated on every fill, validated on new orders
- Order Rate Throttle: token bucket per firm to prevent DDoS spikes
- Self-Trade Prevention (STP): automatically skips or cancels matched same-firm orders
- Kill Switch: operational switch to halt all trading for an individual firm
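The checks listed above can be sketched as a single gate function. This is a simplified, single-process illustration: `FirmRisk` and `check_order` are hypothetical names, the clock is passed in by the caller, the position check looks only at filled position (not open orders), and nothing here is lock-free or in shared memory. Self-trade prevention is left out because it happens at match time.

```python
from dataclasses import dataclass


@dataclass
class FirmRisk:
    max_position: int          # configured net share limit
    max_orders_per_sec: int    # token bucket capacity and refill rate
    position: int = 0          # net shares, updated on every fill
    kill_switch: bool = False
    tokens: float = 0.0
    last_refill: float = 0.0   # seconds on the caller-supplied clock

    def __post_init__(self):
        self.tokens = float(self.max_orders_per_sec)  # start with a full bucket

    def take_token(self, now: float) -> bool:
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.max_orders_per_sec,
                          self.tokens + elapsed * self.max_orders_per_sec)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def check_order(firm, side, price, qty, last_traded_price, now):
    """Return None if the order passes, otherwise a reject reason."""
    if firm.kill_switch:
        return "KILL_SWITCH"
    if not firm.take_token(now):
        return "RATE_LIMIT"
    if price > 2 * last_traded_price:       # fat-finger rule from the text
        return "FAT_FINGER"
    signed = qty if side == "buy" else -qty
    if abs(firm.position + signed) > firm.max_position:
        return "POSITION_LIMIT"
    return None
```

The cheapest checks run first, so a halted or throttled firm is rejected before any price or position arithmetic.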
5. Deterministic Order Book Recovery
If the engine crashes, we reboot from the last snapshot and replay the sequenced WAL. Pure in-memory replay hits up to 10M events/second.
Matching engine crashes at 10:30:15 AM. How to recover?
1. Sequencer WAL contains every order, cancel, and amend event with its seq_num. The WAL is persisted to local NVMe SSD and replicated to the standby server.
2. Recovery process:
   a. Start fresh matching engine with empty order books
   b. Replay ALL events from WAL starting from market open (9:30:00 AM)
   c. Each event replayed in strict seq_num order
   d. Same events → same state (deterministic guarantee)
   e. Order book fully reconstructed to exact pre-crash state
3. Recovery time:
   - 1 hour of events ≈ 3.6M events (at 1K/sec per instrument)
   - Replay speed: 10M events/sec (pure in-memory, no I/O, no network)
   - Recovery time: < 1 second
4. Gap detection: After recovery, matching engine announces its last processed seq_num. If any client's last seen seq_num > engine's → client detects gap → reconnects.
5. Snapshot optimization: Periodically (every 5 minutes), snapshot entire order book state to disk. On recovery, load snapshot + replay only events SINCE snapshot. Reduces replay from 6.5 hours (full day) to < 5 minutes of events.
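The recovery procedure can be illustrated with a toy replay loop. This is a sketch under assumptions: the event shape (`seq_num`, `type`, `order_id`, ...) and the function names are hypothetical, and `apply_event` only tracks resting orders as a stand-in for the real matching logic. The point it demonstrates is that snapshot plus tail replay yields the same state as a full replay.

```python
import copy


def apply_event(book, event):
    """Apply one sequenced event to a toy book: order_id -> (side, price, qty)."""
    if event["type"] == "new":
        book[event["order_id"]] = (event["side"], event["price"], event["qty"])
    elif event["type"] == "cancel":
        book.pop(event["order_id"], None)


def replay(wal, snapshot=None, snapshot_seq=0):
    """Rebuild state from an optional snapshot plus every WAL event after it."""
    book = copy.deepcopy(snapshot) if snapshot is not None else {}
    last_seq = snapshot_seq
    for event in wal:
        if event["seq_num"] <= snapshot_seq:
            continue  # already reflected in the snapshot
        if event["seq_num"] != last_seq + 1:
            raise ValueError(f"gap in WAL after seq_num {last_seq}")
        apply_event(book, event)
        last_seq = event["seq_num"]
    return book, last_seq


wal = [
    {"seq_num": 1, "type": "new", "order_id": "a", "side": "buy", "price": 15010, "qty": 100},
    {"seq_num": 2, "type": "new", "order_id": "b", "side": "sell", "price": 15015, "qty": 200},
    {"seq_num": 3, "type": "cancel", "order_id": "a"},
    {"seq_num": 4, "type": "new", "order_id": "c", "side": "buy", "price": 15005, "qty": 50},
]
full_state, full_seq = replay(wal)                      # replay from market open
snap, snap_seq = replay(wal[:2])                        # snapshot taken after seq_num 2
fast_state, fast_seq = replay(wal, snap, snap_seq)      # snapshot + tail replay
```

Refusing to continue past a missing seq_num matters: replaying around a hole would produce a state that looks valid but never existed.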
Event Bus Design (Kafka)
Topic: stock_exchange_matching_engine-events
Partitions: 64 (scale consumers horizontally)
Partition key: entity_id (user_id / order_id; preserves per-entity ordering)
Retention: 7 days (compliance) or 24h (high-volume telemetry)
Replication factor: 3, min.insync.replicas: 2
Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "stock_exchange_matching_engine-processors"
- At-least-once delivery + idempotent handlers (dedup by event_id)
- DLQ topic: stock_exchange_matching_engine-events-dlq (poison messages after 3 retries)
- Lag alert: consumer lag > 60s → scale workers
Design a Stock Exchange Matching Engine: async side effects MUST NOT block the synchronous API response.
Sync path: validate → persist source of truth → publish event → return 201
Async path: consumers update caches, indexes, notifications, aggregates
FIX Protocol (Institutional / Algo)
The standard protocol for low-latency institutional trading. Text-tag value pairs parsed at the gateway, or optimized via Simple Binary Encoding (SBE).
FIX 4.4 / 5.0: ASCII tag-value protocol; SBE binary encoding for sub-millisecond communication.
- New Order: MsgType=D, ClOrdID, Symbol, Side, OrdType, Price, OrderQty, TimeInForce
- Cancel: MsgType=F, OrigClOrdID
- Execution Report: MsgType=8, ExecType=0 (New), F (Fill), 4 (Cancelled), 8 (Rejected)
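To make the tag-value format concrete, here is a minimal parsing sketch for the New Order fields listed above. It is illustrative only: `parse_fix` is a hypothetical helper, the standard header and trailer fields (BeginString, BodyLength, CheckSum) are omitted from the sample message, and a real gateway would validate them and avoid allocating strings on the hot path.

```python
SOH = "\x01"  # FIX field delimiter

# Standard FIX tag numbers for the fields named in the text.
TAGS = {
    "35": "MsgType", "11": "ClOrdID", "55": "Symbol", "54": "Side",
    "40": "OrdType", "44": "Price", "38": "OrderQty", "59": "TimeInForce",
}


def parse_fix(raw: str) -> dict:
    """Split a tag=value message on SOH and map known tags to field names."""
    fields = {}
    for pair in raw.strip(SOH).split(SOH):
        tag, _, value = pair.partition("=")
        fields[TAGS.get(tag, tag)] = value  # unknown tags keep their number as the key
    return fields


# New Order Single: buy (54=1) 100 AAPL, limit (40=2) at 150.10, day order (59=0).
message = SOH.join(
    ["35=D", "11=ord-1", "55=AAPL", "54=1", "40=2", "44=150.10", "38=100", "59=0"]
) + SOH
order = parse_fix(message)
```

Values stay as strings here; converting `Price` to integer ticks is a separate normalization step.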
REST/WebSocket (Retail)
Gateways transform JSON/HTTP requests into the internal binary format before sending to the sequencer.
POST /api/v1/orders
{ "instrument": "AAPL", "side": "buy", "type": "limit", "price": 150.10, "quantity": 100 }
→ 201 { "order_id": "ord-uuid", "status": "new", "seq_num": 4523781 }
DELETE /api/v1/orders/{order_id}
→ 200 { "status": "cancelled", "remaining_qty": 50 }
GET /api/v1/orderbook/AAPL?depth=10
→ { "bids": [...], "asks": [...], "last_trade": {"price":150.10,"qty":100} }
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
1. In-Memory Order Book (per instrument, hot path)
Must be fully pre-allocated to avoid Garbage Collection (GC) pauses or memory allocation overhead on the hot path.
OrderBook {
bids: TreeMap<PriceTicks, DoublyLinkedList<Order>> // price DESC
asks: TreeMap<PriceTicks, DoublyLinkedList<Order>> // price ASC
best_bid / best_ask: PriceTicks // cached
orders: HashMap<OrderId, NodePointer> // O(1) cancel
stop_buys / stop_sells: TreeMap<TriggerPrice, List<StopOrder>>
last_trade_price: PriceTicks
state: ENUM(pre_open, auction, continuous, halted, closed)
}
Order {
order_id: UUID, firm_id: String, side: BUY | SELL
price: PriceTicks (integer, e.g. 15010 = $150.10)
remaining_qty: int, original_qty: int
time_in_force: DAY | GTC | IOC | FOK
seq_num: long, timestamp_ns: long
}
2. PostgreSQL: Durable Store (async write-behind)
PostgreSQL partitions the trades and orders tables daily to handle high-write volumes without index fragmentation.
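-- PostgreSQL has no inline ENUM(...) column syntax; enum types are created first.
CREATE TYPE order_side AS ENUM ('buy', 'sell');
CREATE TYPE order_kind AS ENUM ('limit', 'market', 'stop', 'stop_limit', 'ioc', 'fok');
CREATE TYPE order_status AS ENUM ('new', 'partial', 'filled', 'cancelled', 'expired', 'rejected');
CREATE TABLE orders (
  order_id UUID NOT NULL, instrument VARCHAR(10),
  firm_id VARCHAR(20), side order_side,
  order_type order_kind,
  price BIGINT, original_qty INT, filled_qty INT,
  status order_status,
  seq_num BIGINT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL,
  -- On a partitioned table, PRIMARY KEY and UNIQUE must include the partition key.
  PRIMARY KEY (order_id, created_at),
  UNIQUE (seq_num, created_at)
) PARTITION BY RANGE (created_at);
CREATE TABLE trades (
  trade_id UUID NOT NULL, order_id_buy UUID NOT NULL, order_id_sell UUID NOT NULL,
  instrument VARCHAR(10), price BIGINT, quantity INT,
  seq_num BIGINT NOT NULL, created_at TIMESTAMPTZ NOT NULL,
  PRIMARY KEY (trade_id, created_at),
  UNIQUE (seq_num, created_at)
) PARTITION BY RANGE (created_at);
3. Shared Memory: Risk Engine State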
Ensures Gateway threads access current position thresholds in lock-free, sub-100ns operations.
struct FirmRiskState {
char firm_id[20];
std::atomic<long> position[10000]; // net shares per instrument (+buy, -sell)
std::atomic<long> order_count_window; // order counter in current second
std::atomic<bool> kill_switch; // emergency halt flag
long max_position; // configured net threshold
int max_order_rate; // max orders/sec limit
};
// Shared memory segment mapped into every Gateway process.
// Updated on every fill (atomic add). Checked on new orders (atomic read).
4. Kafka: Async Downstream Fan-out
Used strictly for non-blocking downstream processes. Kafka is never on the critical transaction path.
Topic: market-data-raw (raw order book changes, trades; analytics)
Topic: trade-reports (executed fills; clearing, settlement, drop copy)
Topic: audit-trail (comprehensive event records; SEC Rule 613 compliance)
Retention: 7 years. Tiered storage: recent in Kafka, archived in S3.
| Concern | Solution |
|---|---|
| Matching engine crash | Replay from sequencer WAL; rebuild order book deterministically; recovery < 1 second |
| Sequencer failure | Hot standby with replicated WAL; automatic failover in < 1 sec; bounded event loss (~10μs) |
| Network partition | Reject orders during partition (safety over availability during market hours): CP, not AP |
| Duplicate orders | Seq_num dedup; each order processed exactly once; ClOrdID dedup at gateway level |
| Market data loss | Clients request snapshot + subscribe to stream; gap detection via seq_num; automatic re-sync |
| Split-brain | Fencing token: new sequencer epoch_id embedded in seq_num; old sequencer's messages rejected |
| Clock skew | PTP (Precision Time Protocol) for microsecond accuracy; GPS-synchronized clocks on all servers |
| Data center failover | Active-passive DC pair; manual failover by ops during market hours; DR site replays WAL |
| Algo gone rogue | Kill switch per firm; circuit breaker per instrument (LULD); order rate throttle |
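The split-brain row can be made concrete with a small fencing sketch. The bit layout is an assumption for illustration (upper 16 bits for the sequencer epoch, lower 48 for the counter), and `make_seq` and `FencedConsumer` are hypothetical names; the source only states that the epoch_id is embedded in the seq_num.

```python
EPOCH_SHIFT = 48  # assumption: 16-bit epoch in the high bits, 48-bit counter below


def make_seq(epoch: int, counter: int) -> int:
    """Pack the sequencer epoch and its local counter into one sortable integer."""
    return (epoch << EPOCH_SHIFT) | counter


def epoch_of(seq_num: int) -> int:
    return seq_num >> EPOCH_SHIFT


class FencedConsumer:
    """Accepts events only from the newest sequencer epoch, in increasing order."""

    def __init__(self):
        self.current_epoch = 0
        self.last_seq = -1

    def accept(self, seq_num: int) -> bool:
        epoch = epoch_of(seq_num)
        if epoch < self.current_epoch:
            return False  # message from a deposed sequencer: fenced off
        if seq_num <= self.last_seq:
            return False  # duplicate or replayed message
        self.current_epoch = epoch
        self.last_seq = seq_num
        return True
```

Because the epoch occupies the high bits, any seq_num from a newer sequencer sorts above every seq_num from an older one, so ordering and fencing use the same comparison.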
Specific recovery detail: Deterministic Replay Guarantee
Why deterministic replay is non-negotiable:
- Regulatory requirement: Exchanges must prove any historical state can be reconstructed.
- Auditability: "What was the exact order book state at 10:15:23.456789 AM?" can be answered by replaying the WAL to that exact seq_num.
- SEC Rule 613 (Consolidated Audit Trail) requires a complete, time-sequenced record of order events; the sequenced WAL is what makes that record reproducible.
Interview Walkthrough
- Lead with the latency requirement: order-to-ACK in single-digit microseconds — this dictates single-threaded, lock-free architecture.
- Walk through the critical path: gateway → pre-trade risk checks → sequencer (assign seq_num, WAL append) → single-threaded matching engine → ACK to client.
- Explain price-time priority matching: best price first, then FIFO at each level — implemented as a sorted book per instrument.
- Cover the WAL as the source of truth: every event gets a monotonic seq_num for deterministic replay and regulatory audit (SEC CAT).
- Mention why Kafka is wrong for the hot path (2-5 ms p99) but right for async downstream market data and clearing feeds.
- Discuss pre-trade risk gates (15c3-5): max order size, fat-finger collars, and self-trade prevention before the order hits the book.
- Common pitfall: multi-threaded matching with locks — cancel/fill race conditions produce ambiguous state that regulators cannot reconstruct.
1. Regulatory Compliance
- SEC Rule 613 (CAT): Every order event must be recorded with a timestamp of at least millisecond granularity and the identity of the participant; the engine's sequence numbers keep that record unambiguous.
- Reg NMS: Orders must not trade through a better-priced protected quote; they execute at the National Best Bid/Offer (NBBO) or better, or are rejected or routed away.
- MiFID II (EU): Requires clocks synchronized to within 100 μs of UTC for high-frequency trading systems.
- Market Access Rule (15c3-5): Brokers must have financial and regulatory pre-trade risk gates in place before orders reach the book.
2. Co-Location and Fairness
High-Frequency Trading (HFT) firms co-locate their servers inside the exchange’s physical data center (e.g., Mahwah for NYSE). The exchange provides equal-length fiber cables to all racks to neutralize the physical speed-of-light advantage. Some venues (like IEX) introduce a 350-microsecond coiled fiber "speed bump" to neutralize latency arbitrage.
3. Dark Pools
Off-exchange venues with zero pre-trade transparency. Matching operates via a midpoint priority model (matching bids and asks at the mid of the public NBBO) to prevent market impact for massive block trades.
4. Multi-Asset Class Support
Equities run on standard Price-Time priority. Options include expiry, strike price, put/call tags, and complex multi-leg combinations. Futures require real-time margin adjustments and daily mark-to-market. The infrastructure keeps gateways and sequencers uniform, but runs dedicated, specialized matching logic containers per asset class.
5. Market Making and Liquidity Programs
Designated Market Makers (DMMs) receive trading fee rebates (~$0.002 per share) in exchange for maintaining continuous, two-sided bid/ask quotes within spread limits. Takers pay a fee (~$0.003 per share) to offset this rebate.
6. Monitoring & Alerting
Systems continuously track latency percentiles (p50, p99, p99.99) for order-to-ACK times. CPU core usage is pinned and kept under 30% to prevent dynamic frequency/thermal throttling, which would introduce microsecond jitter.
7. End-of-Day and Corporate Actions
At 4:00 PM, all resting DAY orders are systematically cancelled. Overnight corporate actions (stock splits, ex-dividend price reductions) are processed while matching engines are offline between the closing and opening auctions.
1. The LMAX Disruptor Pattern
The industry benchmark for trading architecture. Rather than multi-threaded queues with locks, LMAX utilizes a lock-free, single-threaded consumer spinning over a pre-allocated ring buffer.
Pre-allocated ring buffer: circular array of size 2^N (allows fast bitwise masking)
Slot index = seq_num & (buffer_size - 1) // Bitwise masking instead of modulo
Consumer features:
- Cache-line padding: prevents false sharing between thread cores
- Zero-allocation hot path: pre-instantiated event objects, no runtime GC
- Busy-spin wait: never yields to the OS scheduler, avoiding context switch lag
- Sub-1μs delivery time in Java/C++
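The indexing scheme can be shown with a small single-producer, single-consumer ring buffer. This sketch illustrates only the power-of-two masking and the sequence bookkeeping; `RingBuffer` is a hypothetical class, it runs in one thread, and it has none of the memory barriers, cache-line padding, or busy-spin waiting that the real Disruptor relies on.

```python
class RingBuffer:
    def __init__(self, size: int):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.mask = size - 1
        self.slots = [None] * size  # allocated once, reused forever
        self.published = -1         # last sequence written by the producer
        self.consumed = -1          # last sequence read by the consumer

    def publish(self, event) -> bool:
        if self.published - self.consumed == self.mask + 1:
            return False            # full: producer would overwrite unread slots
        seq = self.published + 1
        self.slots[seq & self.mask] = event  # bitwise mask instead of modulo
        self.published = seq
        return True

    def consume(self):
        if self.consumed == self.published:
            return None             # nothing new yet
        seq = self.consumed + 1
        event = self.slots[seq & self.mask]
        self.consumed = seq
        return event
```

Sequence numbers grow without bound while the slot index wraps, so sequence 4 in a 4-slot buffer lands back in slot 0 once that slot has been consumed.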
2. Why Not Kafka for Order Sequencing?
Kafka is designed for high throughput and durability, but its 2-5 ms p99 latency is hundreds of times too slow for a matching engine critical path that must stay under 10 μs.
Kafka Jitter Sources:
1. TCP round-trip and broker batching delay.
2. ISR (in-sync replicas) sync acknowledgments over WAN.
3. OS page cache flush cycles causing microsecond stalls.
Strategy: Use custom Sequencer with NVMe storage for hot path. Use Kafka only for async downstream.
3. Circuit Breakers: LULD Mechanics
A rogue algorithm or flash crash can wipe out billions. Limit Up-Limit Down (LULD) price bands protect market stability.
# Pseudocode: reference_price, band_width, set_instrument_state and start_timer
# are assumed to be supplied by the engine.
def after_trade(trade):
    # Band width: e.g. 5% for large-cap stocks
    upper_band = reference_price * (1 + band_width)
    lower_band = reference_price * (1 - band_width)
    if trade.price > upper_band or trade.price < lower_band:
        set_instrument_state(LIMIT_STATE)
        start_timer(seconds=15)  # Halt if no clearance in 15 seconds

def on_timer_expire():
    if still_in_limit_state:
        set_instrument_state(HALTED)
        start_timer(seconds=300)  # 5-minute cooling-off halt
4. Market Order Edge Case: Sweeping the Book
A thin order book combined with an aggressive market order can lead to disastrous execution prices (e.g. paying $200/share instead of $150).
- Market Order Collar: Execution halts if the filled price deviates by more than 5% from the NBBO midpoint.
- Synthetic Limit Conversion: The exchange converts market orders internally to aggressive limit orders capped at the collar boundary.
- Notional Value Cap: Restricts market orders exceeding $1M, requiring explicit limit parameters.
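The collar and the synthetic limit conversion amount to the same mechanism, sketched below for a market buy. The function name and the basis-point parameter are assumptions for illustration; prices are integer ticks and the collar is computed with integer arithmetic to avoid float rounding at the boundary.

```python
def sweep_with_collar(asks, qty, nbbo_mid, collar_bps=500):
    """Fill a market buy against asks [(price_ticks, qty), ...] sorted by price ascending.

    The order behaves like a limit order priced at the collar boundary.
    Returns (fills, unfilled_qty).
    """
    limit = nbbo_mid * (10_000 + collar_bps) // 10_000  # 500 bps = 5% above the midpoint
    fills = []
    for price, available in asks:
        if qty == 0 or price > limit:
            break  # done, or the next level is outside the collar
        take = min(qty, available)
        fills.append((price, take))
        qty -= take
    return fills, qty


# Thin book: two reasonable levels, then a stray offer at $200.00.
asks = [(15010, 100), (15500, 100), (20000, 1000)]
fills, unfilled = sweep_with_collar(asks, qty=500, nbbo_mid=15000)
```

With a $150.00 midpoint the collar sits at $157.50, so the order takes the first two levels and leaves 300 shares unfilled rather than paying $200.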
5. Order Cancel and Amend Race Conditions
In a multi-threaded system, an incoming Cancel request and a matching execution Fill trigger complex race conditions. Single-threaded serialization resolves this cleanly.
Race 1: Cancel arrives while Fill is executing
- Because sequencer enforces seq_num total ordering, whichever event receives the seq_num first wins.
- If Fill seq_num < Cancel seq_num: order is matched, and Cancel gets a NACK (too late to cancel).
- If Cancel seq_num < Fill seq_num: order is canceled, and incoming trade matches against the next level.
Amend Price:
- Amending price always results in: Cancel old order + insert new order at new price.
- This loses time priority.
- Amending quantity down maintains the existing queue priority.
6. Self-Trade Prevention (STP) Modes
Algorithmic firms run hundreds of strategies that quote both sides of the book. Letting same-firm orders match against each other can constitute wash trading, so the engine must prevent it.
- Cancel Newest (CN): The incoming order is instantly cancelled.
- Cancel Oldest (CO): The resting order in the book is cancelled.
- Decrement and Cancel (DC): Reduces both order quantities by the overlapping size, preserving priority on the remainder.
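The three modes reduce to a small decision function over the two quantities. `apply_stp` is a hypothetical helper for illustration; it returns what remains of each order after the would-be self-trade is resolved, and no trade is generated in any mode.

```python
def apply_stp(mode: str, incoming_qty: int, resting_qty: int):
    """Resolve a same-firm match. Returns (incoming_remaining, resting_remaining)."""
    if mode == "CN":   # Cancel Newest: drop the incoming order
        return 0, resting_qty
    if mode == "CO":   # Cancel Oldest: drop the resting order
        return incoming_qty, 0
    if mode == "DC":   # Decrement and Cancel: remove the overlap from both sides
        overlap = min(incoming_qty, resting_qty)
        return incoming_qty - overlap, resting_qty - overlap
    raise ValueError(f"unknown STP mode: {mode}")
```

For an incoming 100 against a resting 60, CN leaves (0, 60), CO leaves (100, 0), and DC leaves (40, 0), with the 40-share remainder continuing to match against other firms.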
7. Market Data Distribution: Multicast UDP vs TCP
Distributing 5M updates/second to 10K clients requires optimized network routing.
TCP Unicast: 5M updates × 10,000 clients = 50B packets/sec (overwhelming, doesn't scale).
Multicast UDP:
- Exchange publishes 1 packet to a multicast group (e.g. 224.1.1.1 per partition).
- Network switches replicate the packet at the hardware layer.
- Clients detect gaps using sequential ITCH/SBE message numbers.
- Missed packets are recovered out-of-band via TCP retransmission servers.
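Client-side gap detection on the multicast feed can be sketched as a sequence tracker. `FeedHandler` is a hypothetical name, and the sketch is deliberately simplified: it only records the missing ranges that would be requested from the TCP retransmission server, and it drops late or duplicate packets instead of buffering them.

```python
class FeedHandler:
    def __init__(self):
        self.expected = 1   # next message number we should see
        self.gaps = []      # (first_missing, last_missing) ranges to re-request

    def on_packet(self, seq_num: int, payload):
        if seq_num < self.expected:
            return None     # duplicate or late packet; recovery handles it out-of-band
        if seq_num > self.expected:
            self.gaps.append((self.expected, seq_num - 1))
        self.expected = seq_num + 1
        return payload
```

A client that receives messages 1, 2, and then 5 records the gap (3, 4) and keeps processing the live stream while the missing range is fetched separately.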
8. Why Java/C++ (Not Go/Rust/Python) for the Engine
- Python: GIL + dynamic memory allocation + interpreter overhead makes it ~1000x too slow.
- Go: Mandatory Garbage Collector pauses (even if sub-500μs) introduce unacceptable tail-latency jitter.
- Rust: Offers microsecond speed and memory safety, but suffers from a smaller pool of battle-tested financial libraries and HFT engineering talent.
- C++: The absolute standard. Provides kernel bypass (DPDK, Solarflare OpenOnload), manual struct cache alignment, and placement-new memory pools.
- Java: Utilized in custom environments (LMAX) with the Zing/Graal VM, off-heap buffers, and zero-allocation codebases to eliminate GC pauses entirely.
9. Clearing and Settlement: T+1
Clearing aggregates all executed trades from Kafka asynchronously, calculating net settlement positions at the end of the day. Because clearing occurs overnight, it resides entirely outside the microsecond critical path.
10. Order Book Data Structure: TreeMap vs Array
- TreeMap (Red-Black Tree): Offers O(log P) inserts for new price levels. Poor L1/L2 cache locality due to pointer chasing. Best for general instruments.
- Direct Array: Direct index mapping (`Array[price_in_cents]`). Provides absolute O(1) performance and contiguous cache alignment, but wastes gigabytes of memory on sparse order books.
11. Storage Choice: NVMe WAL vs BBRAM vs PMEM
- SATA SSD fsync: 50-200 μs (Unusable).
- NVMe SSD fsync: 5-20 μs (Intel Optane: 7-10 μs). Good baseline.
- Battery-Backed RAM (BBRAM): 0.1-0.5 μs (Ideal, ultra-low latency, expensive).
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core stock exchange matching engine flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.