Design a Stock Exchange Matching Engine – System Design Walkthrough

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Show domain depth beyond the baseline: async pipelines, consistency semantics, and operational SLOs.
Arch 75	Staff angles: partition behavior, cost drivers, and MVP → production evolution with clear triggers.

Interview Prompt

Design a Stock Exchange Matching Engine.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Order book data structure (price-time priority), Matching algorithm, Lock-free queues?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Order book data structure (price-time priority)
Matching algorithm
Lock-free queues
Microsecond latency (single-digit μs order-to-ACK)
Market data dissemination
Regulatory audit trail

Out of scope (state explicitly)

Retail brokerage mobile app
Regulatory reporting to SEC/FINRA
Blockchain settlement layer

Assumptions

Clarify scale (DAU, QPS, data volume) for stock exchange matching engine in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks off the hot path; the matching path itself runs on dedicated hardware

Functional Requirements

Order types: Limit orders, market orders, stop orders, stop-limit orders, IOC, FOK, GTC
Order matching: Match buy and sell orders by price-time priority (FIFO within same price)
Order book: Maintain real-time order book per instrument (bids and asks sorted by price)
Trade execution: Execute matched orders; generate trade records with unique trade IDs
Order lifecycle: New, acknowledged, partially filled, filled, cancelled, expired, rejected
Market data: Publish real-time L1 (BBO), L2 (depth), L3 (full order book)
Order modification: Amend quantity (reduce only) or price (loses time priority on price change)
Cancel: Cancel resting orders; cancel-on-disconnect for algo traders
Opening/closing auctions: Batch matching at market open and close (call auction)
Pre-trade risk checks: Position limits, order rate limits, fat-finger checks

Capacity Estimates

Metric	Calculation	Value
Instruments	Given	10,000
Orders / sec (peak)	10,000 instruments × 100 orders/sec average	1M across all instruments
Orders per instrument / sec	Given	100 avg, 10K for hot stocks
Trades / sec (peak)	Given	100K
Cancels / sec (peak)	Given	500K
Order book depth	Given	~1,000 price levels per side
Market data updates / sec (peak)	Given	5M
WAL write throughput	1M events/sec × ~200 bytes	200 MB/sec
Daily trade records	Given	~500M
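
The arithmetic behind the table can be checked with a few lines. This is a back-of-envelope sketch: the inputs are the table's own assumptions, the 6.5-hour session length is the regular trading day, and the daily WAL figure is an upper bound that assumes the peak rate is sustained all session.

```python
# Back-of-envelope check of the capacity table (inputs are the stated assumptions).
INSTRUMENTS = 10_000
AVG_ORDERS_PER_INSTRUMENT_PER_SEC = 100
EVENT_BYTES = 200                    # approximate WAL record size
TRADING_SECONDS = int(6.5 * 3600)    # 6.5-hour session = 23,400 s

peak_orders_per_sec = INSTRUMENTS * AVG_ORDERS_PER_INSTRUMENT_PER_SEC
wal_bytes_per_sec = peak_orders_per_sec * EVENT_BYTES
# Upper bound: assumes the peak rate holds for the whole session.
wal_bytes_per_session = wal_bytes_per_sec * TRADING_SECONDS

print(f"peak orders/sec: {peak_orders_per_sec:,}")
print(f"WAL throughput:  {wal_bytes_per_sec / 1e6:.0f} MB/sec")
print(f"WAL per session: {wal_bytes_per_session / 1e12:.2f} TB (upper bound)")
```

Stating the upper bound out loud is useful in an interview: it sizes the NVMe device and the standby replication link in one step.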

Critical Path Latency Budget

Step	Latency
FIX parse + normalize	~1-2 μs
Risk checks (shared memory)	~0.5 μs
Sequencer (assign + NVMe WAL write)	~1-2 μs
Ring buffer publish + consume	~0.1 μs
Matching engine (lookup + match)	~0.5-1 μs
Execution report publish	~0.5-1 μs
FIX response serialize + send	~1-2 μs
Total	~5-9 μs

Order-to-Trade Flow

1. Matching Algorithm: Price-Time Priority

Order book is split into Bid and Ask sides. Within the same price level, priority is strictly determined by arrival time.

Order Book for AAPL:
BUY (Bids)                          SELL (Asks)
Price    Qty    Time                 Price    Qty    Time
$150.10  100    09:30:01.001        $150.15  200    09:30:01.002
$150.10  50     09:30:01.005        $150.20  300    09:30:01.003

Data structure:
  Bids: sorted map (price DESC → doubly-linked list of orders FIFO)
  Asks: sorted map (price ASC → doubly-linked list of orders FIFO)
  orders: HashMap<OrderId, pointer-to-node-in-linked-list>
  
  Cancel by order_id: O(1) — critical since algos cancel 90% of orders
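
A minimal sketch of the structure above, assuming limit orders only and integer tick prices. The class and method names are illustrative. An OrderedDict per price level stands in for the doubly-linked list, and a min/max scan stands in for the sorted map, so best-price lookup here is O(P) rather than O(log P).

```python
from collections import OrderedDict
from dataclasses import dataclass

@dataclass
class Order:
    order_id: int
    side: str     # "buy" or "sell"
    price: int    # integer ticks, e.g. 15010 = $150.10
    qty: int

class OrderBook:
    def __init__(self):
        # side -> price -> OrderedDict(order_id -> Order); insertion order is FIFO
        self.levels = {"buy": {}, "sell": {}}
        self.index = {}   # order_id -> Order, gives O(1) cancel

    def best_price(self, side):
        prices = self.levels[side]
        if not prices:
            return None
        return max(prices) if side == "buy" else min(prices)

    def cancel(self, order_id):
        order = self.index.pop(order_id, None)
        if order is None:
            return False              # unknown or already filled
        level = self.levels[order.side][order.price]
        del level[order_id]
        if not level:                 # drop empty price levels
            del self.levels[order.side][order.price]
        return True

    def submit(self, order):
        """Match an incoming limit order, then rest any remainder.
        Returns a list of (resting_order_id, price, qty) fills."""
        fills = []
        opposite = "sell" if order.side == "buy" else "buy"
        while order.qty > 0:
            best = self.best_price(opposite)
            if best is None:
                break
            crosses = order.price >= best if order.side == "buy" else order.price <= best
            if not crosses:
                break
            resting = next(iter(self.levels[opposite][best].values()))  # oldest at level
            qty = min(order.qty, resting.qty)
            fills.append((resting.order_id, best, qty))   # trade at the resting price
            order.qty -= qty
            resting.qty -= qty
            if resting.qty == 0:
                self.cancel(resting.order_id)             # remove fully filled order
        if order.qty > 0:
            level = self.levels[order.side].setdefault(order.price, OrderedDict())
            level[order.order_id] = order
            self.index[order.order_id] = order
        return fills

book = OrderBook()
book.submit(Order(1, "buy", 15010, 100))
book.submit(Order(2, "buy", 15010, 50))
print(book.submit(Order(3, "sell", 15010, 120)))  # fills order 1 first, then part of order 2
```

The earlier bid at $150.10 is filled in full before the later one is touched, which is the price-time priority rule in action.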

2. Sequencer: Total Ordering of Events

Fairness is paramount: the order that reaches the sequencer first must be processed first. Single-threaded assignment guarantees a total order.

Implementation:
  All gateways → Sequencer (single thread, single machine per partition)
  For each incoming order:
    1. Assign seq_num = atomic_counter++
    2. Write {seq_num, order_data} to WAL on local NVMe SSD (~1μs)
    3. Replicate WAL entry to standby server
    4. Publish to ring buffer for matching engine consumption

Throughput: ~10M seq assignments/sec
Latency: ~1-2 μs per assignment
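
The four steps can be sketched as a single-writer loop. This is an illustration only: the Sequencer class is hypothetical, a text stream stands in for the NVMe WAL, a queue.Queue stands in for the ring buffer, and standby replication is left as a comment.

```python
import io
import json
import queue

class Sequencer:
    """Single-threaded sequencer sketch: stamp, persist, then publish."""

    def __init__(self, wal_file, out_queue):
        self.next_seq = 1        # plain int: one writer thread, so no atomics needed
        self.wal = wal_file
        self.out = out_queue

    def on_order(self, order_data):
        seq_num = self.next_seq
        self.next_seq += 1
        record = {"seq_num": seq_num, "order": order_data}
        # Persist before publishing, so the engine never sees an unlogged event.
        self.wal.write(json.dumps(record) + "\n")
        self.wal.flush()         # a real WAL would fsync and replicate to the standby here
        self.out.put(record)     # stand-in for the ring buffer publish
        return seq_num

wal = io.StringIO()
events = queue.Queue()
sequencer = Sequencer(wal, events)
sequencer.on_order({"symbol": "AAPL", "side": "buy", "price": 15010, "qty": 100})
sequencer.on_order({"symbol": "AAPL", "side": "sell", "price": 15015, "qty": 200})
print(wal.getvalue())
```

The ordering of the two side effects is the point: write to the log first, publish second, so a replay of the WAL always covers everything the matching engine has seen.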

3. Why Single-Threaded Matching (Not Multi-Threaded)

Rather than complex locking or CAS operations inside the hot path, single-threading per partition removes contention entirely and keeps matching deterministic.

1. No locking overhead: Single-threaded = zero lock contention
2. Deterministic replay: Same input → same output, always
3. CPU cache efficiency: Hot order book data fits in L1/L2 cache
4. Good enough throughput: Single core processes 500K+ ops/sec per instrument

Per-instrument parallelism:
  10,000 instruments across 64 partitions
  Each partition on dedicated hardware
  Core pinning: Thread for AAPL always runs on core 4 (no context switches)

4. Pre-Trade Risk Engine: Sub-Microsecond Checks

Shared memory allows non-blocking, sub-100ns checks across gateway threads.

Fat-Finger Detection: if order.price > 2× last_traded_price → REJECT
Position Limit: net exposure tracker updated on every fill, validated on new orders
Order Rate Throttle: token bucket per firm to prevent DDoS spikes
Self-Trade Prevention (STP): automatically skips or cancels matched same-firm orders
Kill Switch: operational switch to halt all trading for an individual firm
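
A sketch of how these gates might be chained for one incoming order. Everything here is illustrative: the firm dictionary, the reject codes, and the check order are assumptions, and a production gateway would read this state from shared memory rather than a Python dict.

```python
import time

class TokenBucket:
    """Per-firm order rate throttle: refills at rate_per_sec, up to burst tokens."""

    def __init__(self, rate_per_sec, burst, now=None):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def check_order(order, firm, last_traded_price, now=None):
    """Return None if the order passes, otherwise a reject reason."""
    if firm["kill_switch"]:
        return "KILL_SWITCH"
    if not firm["bucket"].allow(now):
        return "RATE_LIMIT"
    if order["price"] > 2 * last_traded_price:       # fat-finger rule from the list above
        return "FAT_FINGER"
    signed_qty = order["qty"] if order["side"] == "buy" else -order["qty"]
    if abs(firm["position"] + signed_qty) > firm["max_position"]:
        return "POSITION_LIMIT"
    return None

firm = {"kill_switch": False, "bucket": TokenBucket(1000, 1000),
        "position": 0, "max_position": 10_000}
print(check_order({"side": "buy", "price": 15010, "qty": 100}, firm, 15000))
```

Passing `now` explicitly keeps the throttle deterministic under replay and in tests; the default falls back to the monotonic clock.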

5. Deterministic Order Book Recovery

If the engine crashes, we reboot from the last snapshot and replay the sequenced WAL. Pure in-memory replay hits up to 10M events/second.

Matching engine crashes at 10:30:15 AM. How to recover?

1. Sequencer WAL contains: every order, cancel, and amend event with seq_num
   WAL persisted to local NVMe SSD + replicated to standby server.

2. Recovery process:
   a. Start fresh matching engine with empty order books
   b. Replay ALL events from WAL starting from market open (9:30:00 AM)
   c. Each event replayed in strict seq_num order
   d. Same events → same state (deterministic guarantee)
   e. Order book fully reconstructed to exact pre-crash state

3. Recovery time:
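   1 hour of events ≈ 3.6M events for a single hot instrument (at 1K events/sec)
   Replay speed: 10M events/sec (pure in-memory, no I/O, no network)
   Recovery time: < 1 second per instrument; a partition hosting many instruments scales proportionally, which the snapshot optimization in step 5 bounds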

4. Gap detection:
   After recovery, matching engine announces its last processed seq_num.
   If any client's last seen seq_num > engine's → client detects gap → reconnects.

5. Snapshot optimization:
   Periodically (every 5 minutes): snapshot entire order book state to disk.
   On recovery: load snapshot + replay only events SINCE snapshot.
   Reduces replay from 6.5 hours (full day) to < 5 minutes of events.
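
The recovery steps can be sketched with a toy state machine. The event shapes and function names are hypothetical and matching itself is omitted. The sketch shows only the property that matters: a snapshot plus the WAL tail rebuilds exactly the same state as a full replay.

```python
import copy

def apply_event(state, event):
    """Deterministic transition: no clocks, no randomness, no I/O."""
    orders = state["orders"]
    if event["type"] == "new":
        orders[event["order_id"]] = {"price": event["price"], "qty": event["qty"]}
    elif event["type"] == "cancel":
        orders.pop(event["order_id"], None)
    elif event["type"] == "amend_qty" and event["order_id"] in orders:
        orders[event["order_id"]]["qty"] = event["qty"]
    state["last_seq"] = event["seq_num"]

def replay(wal, snapshot=None):
    """Rebuild state from an optional snapshot plus WAL events in seq_num order."""
    if snapshot is None:
        state = {"orders": {}, "last_seq": 0}
    else:
        state = copy.deepcopy(snapshot)      # never mutate the stored snapshot
    for event in wal:
        if event["seq_num"] <= state["last_seq"]:
            continue                         # already reflected in the snapshot
        if event["seq_num"] != state["last_seq"] + 1:
            raise ValueError(f"WAL gap after seq_num {state['last_seq']}")
        apply_event(state, event)
    return state

wal = [
    {"seq_num": 1, "type": "new", "order_id": "a", "price": 15010, "qty": 100},
    {"seq_num": 2, "type": "new", "order_id": "b", "price": 15015, "qty": 200},
    {"seq_num": 3, "type": "amend_qty", "order_id": "a", "qty": 40},
    {"seq_num": 4, "type": "cancel", "order_id": "b"},
]
snapshot = replay(wal[:2])                   # state as of seq_num 2
print(replay(wal, snapshot) == replay(wal))
```

The gap check is the replay-side counterpart of step 4: a missing seq_num stops recovery instead of silently producing a different book.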

Event Bus Design (Kafka)

Topic: stock_exchange_matching_engine-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (user_id / order_id — preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "stock_exchange_matching_engine-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: stock_exchange_matching_engine-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

For the matching engine, async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates

FIX Protocol (Institutional / Algo)

The standard protocol for low-latency institutional trading. Text-tag value pairs parsed at the gateway, or optimized via Simple Binary Encoding (SBE).

FIX 4.4 / 5.0 — ASCII tag-value protocol; SBE binary encoding for sub-millisecond communication.
  New Order: MsgType=D, ClOrdID, Symbol, Side, OrdType, Price, OrderQty, TimeInForce
  Cancel: MsgType=F, OrigClOrdID
  Execution Report: MsgType=8, ExecType=0 (New), F (Fill), 4 (Cancelled), 8 (Rejected)
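
A sketch of tag-value framing for the New Order message above. The tag numbers are standard FIX (35 MsgType, 11 ClOrdID, 55 Symbol, 54 Side, 40 OrdType, 44 Price, 38 OrderQty, 59 TimeInForce), but the helper names are illustrative and the message omits session-level header fields (sender, target, sequence number, sending time) that a real session requires. Repeating groups are not handled.

```python
SOH = "\x01"  # FIX field delimiter

def fix_checksum(data: str) -> str:
    """CheckSum (tag 10): byte sum modulo 256, rendered as three digits."""
    return f"{sum(data.encode('ascii')) % 256:03d}"

def build_fix(fields):
    """Frame (tag, value) pairs with BeginString, BodyLength and CheckSum."""
    body = "".join(f"{tag}={value}{SOH}" for tag, value in fields)
    head = f"8=FIX.4.4{SOH}9={len(body)}{SOH}"
    return f"{head}{body}10={fix_checksum(head + body)}{SOH}"

def parse_fix(message: str) -> dict:
    """Split into {tag: value} and verify the trailing CheckSum."""
    payload, sep, _ = message.rpartition(f"{SOH}10=")
    if not sep:
        raise ValueError("missing CheckSum field")
    fields = {}
    for field in message.split(SOH):
        if field:
            tag, _, value = field.partition("=")
            fields[int(tag)] = value
    # The checksum covers every byte up to and including the SOH before tag 10.
    if fields[10] != fix_checksum(payload + SOH):
        raise ValueError("CheckSum mismatch")
    return fields

new_order = build_fix([
    (35, "D"), (11, "ord-1"), (55, "AAPL"), (54, "1"),   # 54=1 is Buy
    (40, "2"), (44, "150.10"), (38, "100"), (59, "0"),   # 40=2 is Limit, 59=0 is Day
])
print(parse_fix(new_order)[55])
```

A production gateway would parse bytes in place without building a dict; the sketch only shows the framing rules.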

REST/WebSocket (Retail)

Gateways transform JSON/HTTP requests into the internal binary format before sending to the sequencer.

HTTP

POST /api/v1/orders
{ "instrument": "AAPL", "side": "buy", "type": "limit", "price": 150.10, "quantity": 100 }
→ 201 { "order_id": "ord-uuid", "status": "new", "seq_num": 4523781 }

DELETE /api/v1/orders/{order_id}
→ 200 { "status": "cancelled", "remaining_qty": 50 }

GET /api/v1/orderbook/AAPL?depth=10
→ { "bids": [...], "asks": [...], "last_trade": {"price":150.10,"qty":100} }

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff

1. In-Memory Order Book (per instrument, hot path)

Must be fully pre-allocated to avoid Garbage Collection (GC) pauses or memory allocation overhead on the hot path.

OrderBook {
  bids: TreeMap<PriceTicks, DoublyLinkedList<Order>>   // price DESC
  asks: TreeMap<PriceTicks, DoublyLinkedList<Order>>    // price ASC
  best_bid / best_ask: PriceTicks   // cached
  orders: HashMap<OrderId, NodePointer>   // O(1) cancel
  stop_buys / stop_sells: TreeMap<TriggerPrice, List<StopOrder>>
  last_trade_price: PriceTicks
  state: ENUM(pre_open, auction, continuous, halted, closed)
}

Order {
  order_id: UUID, firm_id: String, side: BUY | SELL
  price: PriceTicks (integer, e.g. 15010 = $150.10)
  remaining_qty: int, original_qty: int
  time_in_force: DAY | GTC | IOC | FOK
  seq_num: long, timestamp_ns: long
}

2. PostgreSQL: Durable Store (async write-behind)

PostgreSQL partitions the trades and orders tables daily to handle high-write volumes without index fragmentation.

SQL

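-- PostgreSQL has no inline ENUM(...) column syntax; enum types are created first.
CREATE TYPE order_side_t AS ENUM ('buy', 'sell');
CREATE TYPE order_type_t AS ENUM ('limit', 'market', 'stop', 'stop_limit', 'ioc', 'fok');
CREATE TYPE order_status_t AS ENUM ('new', 'partial', 'filled', 'cancelled', 'expired', 'rejected');

-- Prices are integer ticks (e.g. 15010 = $150.10).
-- On a partitioned table, every PRIMARY KEY / UNIQUE constraint must include the partition key.
CREATE TABLE orders (
    order_id UUID NOT NULL, instrument VARCHAR(10),
    firm_id VARCHAR(20), side order_side_t,
    order_type order_type_t,
    price BIGINT, original_qty INT, filled_qty INT,
    status order_status_t,
    seq_num BIGINT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL,
    PRIMARY KEY (order_id, created_at),
    UNIQUE (seq_num, created_at)
) PARTITION BY RANGE (created_at);

CREATE TABLE trades (
    trade_id UUID NOT NULL, order_id_buy UUID NOT NULL, order_id_sell UUID NOT NULL,
    instrument VARCHAR(10), price BIGINT, quantity INT,
    seq_num BIGINT NOT NULL, created_at TIMESTAMPTZ NOT NULL,
    PRIMARY KEY (trade_id, created_at),
    UNIQUE (seq_num, created_at)
) PARTITION BY RANGE (created_at);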

3. Shared Memory: Risk Engine State

Ensures Gateway threads access current position thresholds in lock-free, sub-100ns operations.

CPP

#include <atomic>
#include <cstdint>

struct FirmRiskState {
  char firm_id[20];
  std::atomic<std::int64_t> position[10000];     // net shares per instrument (+buy, -sell)
  std::atomic<std::int64_t> order_count_window;  // order counter in current second
  std::atomic<bool> kill_switch;                 // emergency halt flag
  std::int64_t max_position;                     // configured net threshold
  std::int32_t max_order_rate;                   // max orders/sec limit
};
// Atomics shared across processes must be lock-free (C++17 check).
static_assert(std::atomic<std::int64_t>::is_always_lock_free, "need lock-free 64-bit atomics");
// Shared memory segment mapped into every gateway process.
// Updated on every fill (atomic add). Checked on new orders (atomic read).

4. Kafka: Async Downstream Fan-out

Used strictly for non-blocking downstream processes. Kafka is never on the critical transaction path.

Topic: market-data-raw       (raw order book changes, trades — analytics)
Topic: trade-reports         (executed fills — clearing, settlement, drop copy)
Topic: audit-trail           (comprehensive event records — SEC Rule 613 compliance)
  Retention: 7 years. Tiered storage: recent in Kafka, archived in S3.

Concern	Solution
Matching engine crash	Replay from sequencer WAL; rebuild order book deterministically; recovery < 1 second
Sequencer failure	Hot standby with replicated WAL; automatic failover in < 1 sec; bounded event loss (~10μs)
Network partition	Reject orders during partition (safety over availability during market hours): CP, not AP
Duplicate orders	Seq_num dedup; each order processed exactly once; ClOrdID dedup at gateway level
Market data loss	Clients request snapshot + subscribe to stream; gap detection via seq_num; automatic re-sync
Split-brain	Fencing token: new sequencer epoch_id embedded in seq_num; old sequencer's messages rejected
Clock skew	PTP (Precision Time Protocol) for microsecond accuracy; GPS-synchronized clocks on all servers
Data center failover	Active-passive DC pair; manual failover by ops during market hours; DR site replays WAL
Algo gone rogue	Kill switch per firm; circuit breaker per instrument (LULD); order rate throttle
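
The split-brain row can be sketched as follows. The bit layout is an assumption made for illustration (top 16 bits for the epoch, low 48 bits for the counter), and the class name is hypothetical; the table does not specify either.

```python
EPOCH_SHIFT = 48                       # assumed layout: 16-bit epoch, 48-bit counter
COUNTER_MASK = (1 << EPOCH_SHIFT) - 1

def make_seq(epoch, counter):
    """Pack the sequencer epoch into the high bits of seq_num."""
    return (epoch << EPOCH_SHIFT) | (counter & COUNTER_MASK)

def split_seq(seq_num):
    return seq_num >> EPOCH_SHIFT, seq_num & COUNTER_MASK

class FencedConsumer:
    """Rejects events stamped by a sequencer from an older epoch."""

    def __init__(self):
        self.current_epoch = 0

    def accept(self, seq_num):
        epoch, _ = split_seq(seq_num)
        if epoch < self.current_epoch:
            return False               # stale primary still sending after failover
        self.current_epoch = epoch     # a newer epoch fences off all older ones
        return True

consumer = FencedConsumer()
print(consumer.accept(make_seq(1, 100)))   # old primary, epoch 1
print(consumer.accept(make_seq(2, 1)))     # standby promoted, epoch 2
print(consumer.accept(make_seq(1, 101)))   # old primary wakes up late
```

Because the epoch sits in the high bits, seq_num values from a newer epoch always compare greater than any value from an older one, so ordering by seq_num stays total across a failover.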

Specific recovery detail: Deterministic Replay Guarantee

Why deterministic replay is non-negotiable:

Regulatory requirement: Exchanges must prove any historical state can be reconstructed.
Auditability: "What was the exact order book state at 10:15:23.456789 AM?" can be resolved by replaying the WAL to that exact seq_num.
SEC Rule 613 (Consolidated Audit Trail) requires a complete, time-sequenced record of order events; deterministic replay is a direct way to produce and verify it.

Interview Walkthrough

Lead with the latency requirement: order-to-ACK in single-digit microseconds — this dictates single-threaded, lock-free architecture.
Walk through the hot path: gateway → pre-trade risk checks → sequencer (seq_num + WAL append) → ring buffer → single-threaded matching engine → execution report / ACK to client.
Explain price-time priority matching: best price first, then FIFO at each level — implemented as a sorted book per instrument.
Cover the WAL as the source of truth: every event gets a monotonic seq_num for deterministic replay and regulatory audit (SEC CAT).
Mention why Kafka is wrong for the hot path (2-5 ms p99) but right for async downstream market data and clearing feeds.
Discuss pre-trade risk gates (15c3-5): max order size, fat-finger collars, and self-trade prevention before the order hits the book.
Common pitfall: multi-threaded matching with locks — cancel/fill race conditions produce ambiguous state that regulators cannot reconstruct.

1. Regulatory Compliance

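SEC Rule 613 (CAT): Every reportable order event must be logged with a synchronized timestamp, participant identifiers, and sequence information. CAT requires at least millisecond granularity; exchanges typically record nanoseconds internally.
Reg NMS (Order Protection Rule): Orders must not trade through a better-priced protected quote on another venue; they are routed to the venue displaying the National Best Bid/Offer (NBBO) or rejected.
MiFID II (EU): RTS 25 requires clocks at high-frequency trading venues to stay within 100 μs of UTC, with timestamp granularity of 1 μs or better.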
Market Access Rule (15c3-5): Brokers must have financial and regulatory pre-trade risk gates in place before orders reach the book.

2. Co-Location and Fairness

High-Frequency Trading (HFT) firms co-locate their servers inside the exchange’s physical data center (e.g., Mahwah for NYSE). The exchange provides equal-length fiber cables to all racks to neutralize the physical speed-of-light advantage. Some venues (like IEX) introduce a 350-microsecond coiled fiber "speed bump" to neutralize latency arbitrage.

3. Dark Pools

Off-exchange venues with zero pre-trade transparency. Matching operates via a midpoint priority model (matching bids and asks at the mid of the public NBBO) to prevent market impact for massive block trades.

4. Multi-Asset Class Support

Equities run on standard Price-Time priority. Options include expiry, strike price, put/call tags, and complex multi-leg combinations. Futures require real-time margin adjustments and daily mark-to-market. The infrastructure keeps gateways and sequencers uniform, but runs dedicated, specialized matching logic containers per asset class.

5. Market Making and Liquidity Programs

Designated Market Makers (DMMs) receive trading fee rebates (~$0.002 per share) in exchange for maintaining continuous, two-sided bid/ask quotes within spread limits. Takers pay a fee (~$0.003 per share) to offset this rebate.

6. Monitoring & Alerting

Systems continuously track latency percentiles (p50, p99, p99.99) for order-to-ACK times. CPU core usage is pinned and kept under 30% to prevent dynamic frequency/thermal throttling, which would introduce microsecond jitter.

7. End-of-Day and Corporate Actions

At 4:00 PM, all resting DAY orders are systematically cancelled. Overnight corporate actions (stock splits, ex-dividend price reductions) are processed while matching engines are offline between the closing and opening auctions.

1. The LMAX Disruptor Pattern

The industry benchmark for trading architecture. Rather than multi-threaded queues with locks, LMAX utilizes a lock-free, single-threaded consumer spinning over a pre-allocated ring buffer.

Pre-allocated ring buffer: Circular array of size 2^N (allows fast bitwise masking)
  Slot index = seq_num & (buffer_size - 1)  // Bitwise masking instead of modulo

Consumer features:
  - Cache-line padding: prevents false sharing between thread cores
  - Zero-allocation hot path: pre-instantiated event objects, no runtime GC
  - Busy-spin wait: never yields to the OS scheduler, avoiding context switch lag
  - sub-1μs delivery time in Java/C++
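
A single-producer, single-consumer sketch of the masking trick. This is not the Disruptor's actual API: Python cannot express cache-line padding, memory barriers, or busy-spin behavior, so the class below only illustrates pre-allocation, power-of-two sizing, and index wrap-around.

```python
class RingBuffer:
    """Fixed-size ring for one producer and one consumer."""

    def __init__(self, size):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.mask = size - 1
        self.slots = [None] * size   # allocated once, reused forever
        self.write_seq = 0           # next sequence the producer will write
        self.read_seq = 0            # next sequence the consumer will read

    def publish(self, event):
        if self.write_seq - self.read_seq > self.mask:
            return False             # full: producer must not overwrite unread slots
        self.slots[self.write_seq & self.mask] = event   # bitwise mask instead of modulo
        self.write_seq += 1
        return True

    def poll(self):
        if self.read_seq == self.write_seq:
            return None              # empty
        event = self.slots[self.read_seq & self.mask]
        self.read_seq += 1
        return event

ring = RingBuffer(4)
for seq in range(4):
    ring.publish({"seq_num": seq})
print(ring.publish({"seq_num": 4}))  # buffer is full until the consumer catches up
print(ring.poll())
```

The sequence counters only ever grow; wrap-around happens in the mask, which is why the size has to be a power of two.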

2. Why Not Kafka for Order Sequencing?

Kafka is designed for high throughput and durability, but its 2-5 ms p99 latency profile is hundreds of times too slow for a matching engine critical path requiring <10 μs bounds.

Kafka Jitter Sources:
  1. TCP round-trip and broker batching delay.
  2. ISR (in-sync replicas) sync acknowledgments across brokers.
  3. OS page cache flush cycles causing microsecond stalls.

Strategy: Use custom Sequencer with NVMe storage for hot path. Use Kafka only for async downstream.

3. Circuit Breakers: LULD Mechanics

A rogue algorithm or flash crash can wipe out billions. Limit Up-Limit Down (LULD) price bands protect market stability.

PYTHON

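# `book` is a hypothetical per-instrument object exposing .state, set_state(name)
# and start_timer(seconds, callback); the callback is invoked with the book.
LIMIT_STATE_SECONDS = 15    # halt if the limit state does not clear within 15 seconds
HALT_SECONDS = 5 * 60       # 5-minute cooling-off halt

def after_trade(book, trade_price, reference_price, band_width=0.05):
    # Band width: e.g. 5% for large-cap stocks
    upper_band = reference_price * (1 + band_width)
    lower_band = reference_price * (1 - band_width)

    if trade_price > upper_band or trade_price < lower_band:
        book.set_state("limit_state")
        book.start_timer(LIMIT_STATE_SECONDS, on_timer_expire)

def on_timer_expire(book):
    if book.state == "limit_state":
        book.set_state("halted")
        book.start_timer(HALT_SECONDS, resume_trading)

def resume_trading(book):
    book.set_state("continuous")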

4. Market Order Edge Case: Sweeping the Book

A thin order book combined with an aggressive market order can lead to disastrous execution prices (e.g. paying $200/share instead of $150).

Market Order Collar: Execution halts if the filled price deviates by more than 5% from the NBBO midpoint.
Synthetic Limit Conversion: The exchange converts market orders internally to aggressive limit orders capped at the collar boundary.
Notional Value Cap: Restricts market orders exceeding $1M, requiring explicit limit parameters.
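
A sketch of the synthetic limit conversion and the notional cap, assuming integer tick prices (cents) and a collar expressed in basis points. The function names and the rounding direction are choices made here, not something the text specifies: buys round down and sells round up so the converted limit never lands outside the collar.

```python
def collar_limit_price(side, nbbo_bid, nbbo_ask, collar_bps=500):
    """Convert a market order into a limit price capped at the collar boundary.
    Prices are integer ticks; collar_bps=500 means 5% from the NBBO midpoint."""
    twice_mid = nbbo_bid + nbbo_ask          # keep the midpoint in integer arithmetic
    if side == "buy":
        # Floor division: the cap never exceeds midpoint * (1 + collar).
        return (twice_mid * (10_000 + collar_bps)) // 20_000
    # Ceiling division: the floor never drops below midpoint * (1 - collar).
    return -((-twice_mid * (10_000 - collar_bps)) // 20_000)

def exceeds_notional_cap(qty, reference_price, cap=100_000_000):
    """True if qty * price is above the cap (default $1M expressed in cents)."""
    return qty * reference_price > cap

bid, ask = 15010, 15015                      # $150.10 / $150.15
print(collar_limit_price("buy", bid, ask))
print(collar_limit_price("sell", bid, ask))
print(exceeds_notional_cap(10_000, ask))
```

Once converted, the order flows through the normal limit-order path, so the matching loop needs no special case for market orders.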

5. Order Cancel and Amend Race Conditions

In a multi-threaded system, an incoming Cancel request and a matching execution Fill trigger complex race conditions. Single-threaded serialization resolves this cleanly.

Race 1: Cancel arrives while Fill is executing
  - Because the sequencer enforces total ordering, whichever event receives the lower seq_num wins.
  - If the aggressing order's seq_num < Cancel seq_num: the resting order is matched, and the Cancel gets a NACK (too late to cancel).
  - If Cancel seq_num < the aggressing order's seq_num: the resting order is cancelled, and the incoming order matches against the next resting order.

Amend Price:
  - Amending price always results in: Cancel old order + insert new order at new price.
  - This loses time priority.
  - Amending quantity down maintains the existing queue priority.

6. Self-Trade Prevention (STP) Modes

Algorithmic firms run hundreds of strategies whose bids and asks can cross each other. Left unchecked, matching same-firm orders can constitute illegal wash trading.

Cancel Newest (CN): The incoming order is instantly cancelled.
Cancel Oldest (CO): The resting order in the book is cancelled.
Decrement and Cancel (DC): Reduces both order quantities by the overlapping size, preserving priority on the remainder.
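
The three modes reduce to a small function over the two quantities. This is a sketch with a hypothetical signature; it would be called by the matching loop at the moment an incoming order is about to trade against a resting order from the same firm.

```python
def apply_stp(mode, incoming_qty, resting_qty):
    """Return (incoming_remaining, resting_remaining) after self-trade prevention.
    A remaining quantity of 0 means that order is cancelled. No trade is printed."""
    if mode == "CN":                  # Cancel Newest: drop the incoming order
        return 0, resting_qty
    if mode == "CO":                  # Cancel Oldest: drop the resting order
        return incoming_qty, 0
    if mode == "DC":                  # Decrement and Cancel: remove the overlap from both
        overlap = min(incoming_qty, resting_qty)
        return incoming_qty - overlap, resting_qty - overlap
    raise ValueError(f"unknown STP mode: {mode}")

for mode in ("CN", "CO", "DC"):
    print(mode, apply_stp(mode, incoming_qty=120, resting_qty=100))
```

Under CO and DC the incoming order may keep a remainder, which then continues matching against the next resting order in the queue.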

7. Market Data Distribution: Multicast UDP vs TCP

Distributing 5M updates/second to 10K clients requires optimized network routing.

TCP Unicast:
  5M updates × 10,000 clients = 50B packets/sec (overwhelming, doesn't scale).

Multicast UDP:
  - Exchange publishes 1 packet to a multicast group (e.g. 224.1.1.1 per partition).
  - Network switches replicate the packet at the hardware layer.
  - Clients detect gaps using sequential ITCH/SBE message numbers.
  - Missed packets are recovered out-of-band via TCP retransmission servers.
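
Client-side gap detection can be sketched in a few lines. The class is hypothetical and assumes one monotonically increasing message number per multicast channel; the returned range is what the client would request from the TCP retransmission server.

```python
class GapDetector:
    """Tracks the next expected message number on one multicast channel."""

    def __init__(self, first_expected=1):
        self.expected = first_expected

    def on_packet(self, seq_num):
        """Return (first_missing, last_missing) to re-request, or None."""
        if seq_num < self.expected:
            return None                       # duplicate or late packet: ignore
        gap = None
        if seq_num > self.expected:
            gap = (self.expected, seq_num - 1)
        self.expected = seq_num + 1
        return gap

detector = GapDetector()
for seq in (1, 2, 5, 4, 6):
    print(seq, detector.on_packet(seq))
```

UDP gives no delivery guarantee, so the sequence number is the only signal of loss; the feed stays fast because recovery happens off the multicast path.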

8. Why Java/C++ (Not Go/Rust/Python) for the Engine

Python: GIL + dynamic memory allocation + interpreter overhead makes it ~1000x too slow.
Go: Mandatory Garbage Collector pauses (even if sub-500μs) introduce unacceptable tail-latency jitter.
Rust: Offers microsecond speed and memory safety, but suffers from a smaller pool of battle-tested financial libraries and HFT engineering talent.
C++: The absolute standard. Provides kernel bypass (DPDK, Solarflare OpenOnload), manual struct cache alignment, and placement-new memory pools.
Java: Utilized in custom environments (LMAX) with the Zing/Graal VM, off-heap buffers, and zero-allocation codebases to eliminate GC pauses entirely.

9. Clearing and Settlement: T+1

Clearing aggregates all executed trades from Kafka asynchronously, calculating net settlement positions at the end of the day. Because clearing occurs overnight, it resides entirely outside the microsecond critical path.

10. Order Book Data Structure: TreeMap vs Array

TreeMap (Red-Black Tree): Offers O(log P) inserts for new price levels. Poor L1/L2 cache locality due to pointer chasing. Best for general instruments.
Direct Array: Direct index mapping (Array[price_in_cents]). Provides absolute O(1) performance and contiguous cache alignment, but wastes gigabytes of memory on sparse order books.

11. Storage Choice: NVMe WAL vs BBRAM vs PMEM

SATA SSD fsync: 50-200 μs (Unusable).
NVMe SSD fsync: 5-20 μs (Intel Optane: 7-10 μs) — Good baseline.
Battery-Backed RAM (BBRAM): 0.1-0.5 μs (Ideal, ultra-low latency, expensive).

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.