Design a Distributed Banking Ledger System

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Show domain depth beyond the baseline: async pipelines, consistency semantics, and operational SLOs.
Arch 75	Staff angles: partition behavior, cost drivers, and MVP → production evolution with clear triggers.

Interview Prompt

Design a Distributed Banking Ledger System.

Clarifying Questions (ask before designing)

Question	Why it matters
Authorize-only vs capture later? Refunds and chargebacks in scope?	Sets idempotency, ledger, and reconciliation boundaries.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Double-entry bookkeeping
Immutable append-only ledger
ACID at scale
Balance snapshot optimization
Regulatory audit
Cross-ledger settlements

Out of scope (state explicitly)

Fraud ML model training (#75) — rules engine is enough unless asked
Merchant onboarding / KYC workflows
Building a PSP or bank from scratch

Assumptions

Strong consistency required on money/inventory paths — clarify idempotency early
External PSP or bank APIs exist; design integration boundaries only
99.99% availability target for the commit/authorize path

Functional Requirements

Double-entry bookkeeping: Every transaction creates debit and credit entries whose amounts net to zero
Account management: Create accounts (checking, savings, loan, revenue, expense)
Post transactions: Record financial transactions atomically
Balance inquiry: Real-time balance for any account
Statement generation: Account statement for any date range
Reconciliation: Verify all entries balance (total debits = total credits)
Multi-currency: Support transactions in multiple currencies with exchange rates
Immutable audit trail: Entries cannot be modified or deleted, only corrected via reversals
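
The double-entry and reversal requirements above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Entry` type and the `is_balanced` and `reversal_of` helpers are hypothetical names, and it assumes balancing is checked per currency.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)  # frozen: an entry cannot be mutated after creation
class Entry:
    account_id: str
    entry_type: str  # "debit" or "credit"
    amount: Decimal
    currency: str


def is_balanced(entries):
    """True when, within every currency, total debits equal total credits."""
    net = defaultdict(Decimal)  # currency -> debits minus credits
    for e in entries:
        if e.entry_type not in ("debit", "credit") or e.amount <= 0:
            return False
        sign = 1 if e.entry_type == "debit" else -1
        net[e.currency] += sign * e.amount
    return len(entries) >= 2 and all(v == 0 for v in net.values())


def reversal_of(entries):
    """Build correcting entries by swapping debit and credit; originals stay untouched."""
    flip = {"debit": "credit", "credit": "debit"}
    return [Entry(e.account_id, flip[e.entry_type], e.amount, e.currency) for e in entries]


transfer = [
    Entry("acct-A", "debit", Decimal("500.00"), "USD"),
    Entry("acct-B", "credit", Decimal("500.00"), "USD"),
]
```

Posting `reversal_of(transfer)` as a new transaction cancels the original without editing or deleting it, which is how the immutable audit trail stays intact.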

Metric	Calculation	Value
Accounts	Given	100M
Ledger entries / day	Given	1B
Balance queries / sec	Given (assumed peak read load; not derivable from the write volume)	100K
Posting requests / sec	1B entries/day ÷ 86,400 s ≈ 11.6K entry writes/sec on average; two-entry postings are about half that, and peak load is higher	~12K
Storage / day	1B entries × ~500 bytes	500 GB
Storage / year	500 GB/day × 365	~180 TB
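
The derived rows in the table can be checked with a few lines of arithmetic. This sketch uses only the given figures (1B entries per day, roughly 500 bytes per entry); the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400
entries_per_day = 1_000_000_000
bytes_per_entry = 500  # approximate size, per the table

# Average write rate, before any peak factor is applied.
avg_entries_per_sec = entries_per_day / SECONDS_PER_DAY

# Decimal units: 1 GB = 1e9 bytes, 1 TB = 1000 GB.
storage_per_day_gb = entries_per_day * bytes_per_entry / 1e9
storage_per_year_tb = storage_per_day_gb * 365 / 1000

print(round(avg_entries_per_sec))  # 11574
print(storage_per_day_gb)          # 500.0
print(storage_per_year_tb)         # 182.5
```

The yearly figure covers raw entries only; indexes and replicas would multiply it.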


HTTP

POST /api/v1/ledger/post
Idempotency-Key: "txn-uuid-123"
{
  "transaction_id": "txn-123",
  "description": "Transfer A to B",
  "entries": [
    { "account_id": "acct-A", "type": "debit", "amount": 500.00, "currency": "USD" },
    { "account_id": "acct-B", "type": "credit", "amount": 500.00, "currency": "USD" }
  ]
}
-> 200 { "transaction_id": "txn-123", "posted_at": "...", "status": "posted" }

GET /api/v1/accounts/{account_id}/balance
-> { "account_id": "acct-A", "balance": 2450.00, "currency": "USD", "as_of": "..." }

GET /api/v1/accounts/{account_id}/statement?from=2026-03-01&to=2026-03-14
-> { "entries": [...], "opening_balance": 2950.00, "closing_balance": 2450.00 }
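
To make the role of the Idempotency-Key header concrete, here is a toy in-memory sketch of the posting path. `InMemoryLedger` is a hypothetical stand-in; a real service would do the same steps inside one database transaction. The sign convention (debits decrease a balance, credits increase it) is an assumption chosen to match the example, where acct-A's balance falls from 2950.00 to 2450.00.

```python
from decimal import Decimal


class InMemoryLedger:
    """Toy posting service: idempotent, all-or-nothing, append-only."""

    def __init__(self):
        self.entries = []    # append-only list of posted entries
        self.balances = {}   # account_id -> Decimal
        self.responses = {}  # idempotency key -> first response returned

    def post(self, idempotency_key, transaction_id, entries):
        # Replay of a known key: return the stored response, post nothing.
        if idempotency_key in self.responses:
            return self.responses[idempotency_key]

        # Validate before mutating anything so a rejected request leaves no trace.
        debits = sum(Decimal(e["amount"]) for e in entries if e["type"] == "debit")
        credits = sum(Decimal(e["amount"]) for e in entries if e["type"] == "credit")
        if debits != credits:
            raise ValueError("entries do not balance")

        for e in entries:
            amount = Decimal(e["amount"])
            delta = amount if e["type"] == "credit" else -amount
            current = self.balances.get(e["account_id"], Decimal("0"))
            self.balances[e["account_id"]] = current + delta
            self.entries.append({**e, "transaction_id": transaction_id})

        response = {"transaction_id": transaction_id, "status": "posted"}
        self.responses[idempotency_key] = response
        return response
```

Calling `post` twice with the same key returns the same response and leaves only one set of entries in the ledger, which is the behavior a client retrying after a timeout depends on.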

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
402 Payment Required: insufficient funds
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with the same idempotency key
422 Unprocessable Entity: valid syntax but fails business rules (for example, entries that do not balance)
429 Too Many Requests: rate limit exceeded; honor the Retry-After header
500 Internal Server Error: unexpected server fault; retry with the same idempotency key
502 Bad Gateway: payment provider timeout; poll status endpoint
503 Service Unavailable: dependency down or overloaded; use exponential backoff
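
A client can turn the list above into a small decision table. The sketch below is one possible mapping, not a prescribed policy: the action names, `client_action`, and `backoff_delay` are hypothetical, and the backoff uses full jitter with assumed base and cap values.

```python
import random

# Client action per status code, following the list above.
ACTIONS = {
    409: "retry_same_key",
    429: "wait_retry_after",
    500: "retry_same_key",
    502: "poll_status",
    503: "backoff_retry",
}


def client_action(status: int) -> str:
    """Map an HTTP status to what the caller should do next."""
    if status < 400:
        return "ok"
    # Remaining codes (400, 401, 402, 403, 404, 422) need a changed request.
    return ACTIONS.get(status, "fail")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Every retrying branch assumes the request carries the same idempotency key as the first attempt; otherwise a retry after a lost response could post twice.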

PostgreSQL: Ledger (Partitioned by Month)

SQL

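-- PostgreSQL has no inline ENUM(...) column syntax; define the type first.
CREATE TYPE ledger_entry_type AS ENUM ('debit', 'credit');

CREATE TABLE ledger_entries (
    entry_id        UUID NOT NULL,
    transaction_id  UUID NOT NULL,
    account_id      UUID NOT NULL,
    entry_type      ledger_entry_type NOT NULL,
    amount          DECIMAL(18,2) NOT NULL CHECK (amount > 0),
    currency        CHAR(3) NOT NULL DEFAULT 'USD',
    balance_after   DECIMAL(18,2) NOT NULL,
    description     TEXT,
    reverses_entry  UUID,          -- entry_id of the entry being reversed, if any
    -- A UNIQUE constraint here would also have to include posted_at, so
    -- uniqueness of the key alone must be enforced in an unpartitioned table.
    idempotency_key VARCHAR(64),
    posted_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    -- Primary keys on a partitioned table must include the partition key.
    PRIMARY KEY (entry_id, posted_at)
) PARTITION BY RANGE (posted_at);

CREATE TABLE ledger_entries_2026_03 PARTITION OF ledger_entries
  FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');

-- Current balance per account, updated in the same transaction as the entries.
CREATE TABLE account_balances (
    account_id UUID PRIMARY KEY,
    balance    DECIMAL(18,2) NOT NULL DEFAULT 0,
    currency   CHAR(3) NOT NULL DEFAULT 'USD',
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);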
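
Monthly partitions have to exist before rows for that month arrive, so they are usually created ahead of time by a scheduled job. The helper below is a hypothetical sketch that only builds the DDL string in the same shape as the `ledger_entries_2026_03` statement above; `month_partition_ddl` is an illustrative name, and running the statement is left to whatever migration tooling is in use.

```python
from datetime import date


def month_partition_ddl(year: int, month: int, parent: str = "ledger_entries") -> str:
    """Return CREATE TABLE ... PARTITION OF DDL for one calendar month."""
    start = date(year, month, 1)
    # The upper bound is exclusive: the first day of the following month.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"{parent}_{start:%Y_%m}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent}\n"
        f"  FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )


print(month_partition_ddl(2026, 4))
```

Generating the next few months on every run keeps the job idempotent, since `IF NOT EXISTS` turns repeated statements into no-ops.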

Concern	Solution
Partial posting	All entries in single DB transaction; all-or-nothing
Duplicate posting	Idempotency key with UNIQUE constraint
Balance drift	Nightly reconciliation: recompute balances from entries and compare against account_balances
Data loss	Synchronous replication; WAL archiving; point-in-time recovery
Immutability violation	No UPDATE/DELETE on ledger_entries; the application DB role has INSERT and SELECT privileges only; corrections are posted as reversals
Regulatory audit	7-year retention; partitioned tables with cold storage
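
The balance-drift row can be illustrated with a small reconciliation sketch. It assumes entries arrive as `(account_id, entry_type, amount)` tuples and that credits increase a balance while debits decrease it; the function names are hypothetical, and a production job would stream rows per partition rather than hold them in memory.

```python
from collections import defaultdict
from decimal import Decimal

ZERO = Decimal("0")


def recompute_balances(entries):
    """Rebuild per-account balances from the raw ledger entries."""
    balances = defaultdict(Decimal)
    for account_id, entry_type, amount in entries:
        balances[account_id] += amount if entry_type == "credit" else -amount
    return dict(balances)


def ledger_is_balanced(entries):
    """Global invariant: total debits equal total credits."""
    debits = sum(a for _, t, a in entries if t == "debit")
    credits = sum(a for _, t, a in entries if t == "credit")
    return debits == credits


def find_drift(entries, stored):
    """Return {account_id: (stored, recomputed)} for every mismatch."""
    recomputed = recompute_balances(entries)
    drift = {}
    for account_id in set(recomputed) | set(stored):
        have = stored.get(account_id, ZERO)
        want = recomputed.get(account_id, ZERO)
        if have != want:
            drift[account_id] = (have, want)
    return drift
```

An empty result from `find_drift` together with a true `ledger_is_balanced` is the nightly pass condition; anything else should raise an alert rather than be auto-corrected.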

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
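
An availability target is easier to reason about once it is converted into an error budget. The sketch below does that conversion for a 30-day window, which is an assumed convention; `downtime_budget_minutes` is an illustrative helper name.

```python
def downtime_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given target."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


for target in (99.95, 99.99):
    print(target, round(downtime_budget_minutes(target), 2))
```

Over 30 days, 99.95% leaves about 21.6 minutes of budget and 99.99% about 4.3 minutes, so a single slow failover can consume most of a four-nines budget.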

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
