Interview Prompt
Design Cryptocurrency Exchange.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Variant of stock exchange + wallet, Hot/cold wallet, Blockchain confirmation? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Variant of stock exchange + wallet
- Hot/cold wallet
- Blockchain confirmation
- Order matching
- Capacity estimation with shown math
Out of scope (state explicitly)
- Fraud ML model training (#75) — rules engine is enough unless asked
- Merchant onboarding / KYC workflows
- Building a PSP or bank from scratch
Assumptions
- Strong consistency required on money/inventory paths — clarify idempotency early
- External PSP or bank APIs exist; design integration boundaries only
- 99.99% availability target for the commit/authorize path
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Order placement: limit, market, stop-loss orders for crypto pairs
- Order matching engine: price-time priority
- Wallet management: deposit, withdraw crypto (on-chain) and fiat
- Hot wallet (online, for fast withdrawals) and cold wallet (offline, for security)
- Real-time order book and trade feed (WebSocket)
- Portfolio: view balances, P&L, transaction history
- KYC/AML: identity verification, transaction monitoring
- Multi-factor authentication (2FA, hardware keys)
- Trading fees with tiered pricing (maker/taker)
- Staking and lending features
- Low Latency: Order matching < 1ms; API response < 50ms
- High Availability: 99.99% (downtime during volatility = massive losses)
- Security: Cold wallet for 95% of assets; HSM for key management
- Consistency: Account balances must NEVER be wrong
- Auditability: Every transaction traceable for regulatory compliance
- Throughput: 100K orders/sec peak (during market events)
| Metric | Calculation | Value |
|---|---|---|
| Registered users | Given | 50M |
| DAU | Given | 5M |
| Trading pairs | Given | 500 |
| Orders / sec (peak) | Derived from daily volume ÷ 86400 (+ peak factor) | 100K |
| Trades / sec (peak) | Derived from daily volume ÷ 86400 (+ peak factor) | 50K |
| WebSocket connections | Given | 2M concurrent |
| Wallet transactions / day | 1M ÷ 86400 | 1M |
| Hot wallet balance | Given | 5% of total assets |
Order Lifecycle
1. User submits order: POST /api/orders {pair:BTC_USDT, side:buy, price:50000, qty:0.5}
2. Order Service:
a. Validate: user authenticated, pair exists, qty > min
b. Reserve funds: Lua script in Redis → USDT_available -= 25000
c. If Redis reserve fails → reject immediately
d. Persist order (status=open) to PostgreSQL
e. Route to matching engine for this pair
3. Matching Engine (single-threaded, in-memory per pair):
a. Insert order into order book (TreeMap of price levels)
b. Match against opposite side: buy $50K vs best ask $49,950 → MATCH at $49,950
c. Generate trade event. Write to WAL before returning.
d. Publish trade event to Kafka
4. Post-Trade Settlement (async from Kafka):
a. Update balances, deduct fees (maker 0.1%, taker 0.15%)
b. All inside PostgreSQL transaction
5. WebSocket broadcast: order book update, trade feed, user eventsMatching Engine Deep Dive
Data structure: Two TreeMaps (Red-Black Trees) per pair Bids: sorted by price DESC (highest first), then time ASC Asks: sorted by price ASC (lowest first), then time ASC Price-Time Priority: same price → earlier order fills first (FIFO) Single-threaded: ONE thread per trading pair No locks needed → predictable µs latency 100 pairs ? 100 independent threads WAL: Every match written to append-only file BEFORE publish On crash → replay WAL → rebuild order book state Performance: 100K+ matches/sec per pair (in-memory, single-threaded)
Deposit & Withdrawal Flow
DEPOSIT: 1. Generate HD wallet address (unique per user per chain) 2. Blockchain Watcher detects incoming transaction 3. Wait for N confirmations (BTC:3, ETH:12, SOL:32) 4. Credit user: UPDATE balances SET available = available + amount 5. Edge case: chain reorg → reverse credit if tx disappears WITHDRAWAL: 1. User submits withdrawal (2FA required) 2. Deduct from available balance (PG transaction) 3. Hot wallet service signs transaction (HSM multi-sig 2-of-3) 4. Broadcast to blockchain 5. Monitor for confirmation 6. If hot wallet balance < withdrawal amount: queue + trigger cold?hot replenishment
Race Conditions
1. Double-Spend: User has 1 BTC, submits two sell orders simultaneously Solution: Redis Lua atomic check-and-decrement (single-threaded, no race) PostgreSQL: authoritative backup with SELECT FOR UPDATE 2. Order Cancel During Matching: Solution: Single-threaded engine → cancel queued behind match If order partially filled before cancel ? fill matched, cancel remainder 3. Withdrawal After Balance Used in Trade: Solution: Withdrawal deducts from available, trade from reserved ? different pool PostgreSQL transaction ensures atomicity
Blockchain Reorg Handling
Wait for N confirmations before crediting (BTC:3 ˜ 30min). Blockchain Watcher compares block hashes continuously. If reorg detected deeper than N blocks ? reverse pending credits. NEVER let users trade unconfirmed deposits.
Event Bus Design (Kafka)
Topic: cryptocurrency_exchange-events Partitions: 64 (scale consumers horizontally) Partition key: entity_id (user_id / order_id — preserves per-entity ordering) Retention: 7 days (compliance) or 24h (high-volume telemetry) Replication factor: 3, min.insync.replicas: 2 Producer: idempotent producer enabled (enable.idempotence=true) Consumer: consumer group "cryptocurrency_exchange-processors" - At-least-once delivery + idempotent handlers (dedup by event_id) - DLQ topic: cryptocurrency_exchange-events-dlq (poison messages after 3 retries) - Lag alert: consumer lag > 60s → scale workers Design a Cryptocurrency Exchange: async side effects MUST NOT block the synchronous API response. Sync path: validate → persist source of truth → publish event → return 201 Async path: consumers update caches, indexes, notifications, aggregates
POST /api/orders → Place order {pair, side, type, quantity, price}
DELETE /api/orders/{id} → Cancel order
GET /api/orders?status=open → Open orders
GET /api/orderbook/{pair} → Order book snapshot
GET /api/trades/{pair} → Recent trades
GET /api/wallet/balances → Portfolio balances
POST /api/wallet/withdraw → Initiate withdrawal (2FA required)
# WebSocket streams
WS /ws/orderbook/{pair} → Real-time order book
WS /ws/trades/{pair} → Real-time trades
WS /ws/user → User order updates, balance changesCommon Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff
PostgreSQL (Ledger: Source of Truth)
ACID for financial data. DECIMAL(28,18) handles crypto precision (18 decimals for ETH). SELECT FOR UPDATE prevents double-spend.
CREATE TABLE balances (
user_id UUID, currency TEXT,
available DECIMAL(28,18) NOT NULL DEFAULT 0 CHECK (available >= 0),
reserved DECIMAL(28,18) NOT NULL DEFAULT 0 CHECK (reserved >= 0),
PRIMARY KEY (user_id, currency)
);
CREATE TABLE orders (
order_id UUID PRIMARY KEY, user_id UUID NOT NULL,
pair TEXT NOT NULL, side TEXT NOT NULL,
order_type TEXT NOT NULL, price DECIMAL(28,18),
quantity DECIMAL(28,18) NOT NULL,
filled_qty DECIMAL(28,18) DEFAULT 0,
status TEXT DEFAULT 'open',
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE trades (
trade_id UUID PRIMARY KEY, pair TEXT NOT NULL,
buyer_id UUID NOT NULL, seller_id UUID NOT NULL,
price DECIMAL(28,18) NOT NULL,
quantity DECIMAL(28,18) NOT NULL,
executed_at TIMESTAMPTZ DEFAULT NOW()
);Redis (Hot Balances + Order Book Cache)
HSET balance:{user_id} BTC_available "1.234" BTC_reserved "0.5"
# Atomic reservation via Lua script:
local avail = tonumber(redis.call('HGET', KEYS[1], ARGV[1]))
if avail >= tonumber(ARGV[2]) then
redis.call('HINCRBYFLOAT', KEYS[1], ARGV[1], -tonumber(ARGV[2]))
redis.call('HINCRBYFLOAT', KEYS[1], ARGV[3], tonumber(ARGV[2]))
return 1
end
return 0
# Order book snapshot cache
SET orderbook:BTC_USDT '{bids:[...],asks:[...]}' EX 1Exchange Hack Prevention
1. Hot wallet = 5% of assets → limits blast radius 2. Multi-sig: hot=2-of-3, cold=3-of-5 → no single key compromise 3. HSM (Hardware Security Module) → keys never in software memory 4. Withdrawal anomaly detection → auto-freeze + human review 5. IP whitelisting for withdrawal addresses (user-configurable) 6. 24h withdrawal lock after password change 7. Circuit breaker: unusual trading volume → halt market 8. Proof of Reserves: Merkle tree of all balances → public audit
Proof of Reserves
Merkle tree: each leaf = hash(user_id, balance). Root published monthly. Users can verify their inclusion without revealing other users' balances. Third-party auditor verifies total on-chain assets = tree total.
Market Manipulation Detection
Wash trading: same entity buying from themselves ? detect via IP/device/timing correlation. Spoofing: large order placed and cancelled quickly ? detect via order lifetime analysis. Circuit breaker: price moves > 10% in 5min ? halt trading for 5min cooldown.
vs Stock Exchange (#76)
Stock exchange: regulated, T+2 settlement, central clearing house. Crypto exchange: self-custody, instant settlement, on-chain withdrawal. Stock: assets held by DTCC ? no custody risk. Crypto: exchange holds private keys ? hack risk ? hot/cold wallet architecture. Crypto: multi-currency, 24/7/365 operation.
Interview Walkthrough
- Split hot wallet (online trading) from cold wallet (majority of funds offline) — security architecture first.
- Explain order matching engine as single-threaded per trading pair with in-memory order book.
- Cover deposit/withdrawal pipeline with blockchain confirmation thresholds and reconciliation.
- Discuss KYC/AML as gating before account activation, not inline on every trade.
- Mention idempotent order placement and balance locking during open orders.
- Common pitfall: keeping all funds in a hot wallet — one breach drains everything; cold storage is mandatory.
Matching Engine: In-Memory vs Database-Backed
Option 1: Database-backed (naive) Order book in PostgreSQL: SELECT top bid, ask WHERE bid >= ask ✓ Durable (ACID) ✗ Throughput: 10K TPS max (DB lock contention) ✗ Latency: 5-20ms per match Option 2: In-memory matching engine ? (Binance, Kraken) Bids: max-heap by price DESC. Asks: min-heap by price ASC. ✓ Microsecond latency, 1M+ orders/sec → State lost on crash Mitigation: WAL to Kafka + periodic snapshots to S3 Option 3: Hybrid (production-grade) In-memory for active order book PostgreSQL for trade records, order history Kafka as durable event log (source of truth for recovery)
Order Types Implementation
Market Order: Match immediately against best available asks.
Limit Order: If matching ask exists, execute; otherwise add to book.
Stop-Loss: Stored in separate stop order book; when triggered, converted to market order.
Regulatory Compliance: KYC/AML Architecture
KYC levels determine trading limits. AML: rules-based (structuring, pass-through, velocity) + ML layer (transaction graph analysis, pattern detection, network clustering). Suspicious Activity Report (SAR) filed with FinCEN. Travel Rule (FATF): share sender/receiver KYC for transfers > $3,000.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core cryptocurrency exchange flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.