This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 75 | Staff level: multi-region, cost at scale, migration path, and production metrics. |
Interview Prompt
Design Digital Wallet System.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Authorize-only vs capture later? Refunds and chargebacks in scope? | Sets idempotency, ledger, and reconciliation boundaries. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Balance consistency (serializable transactions)
- Top-up/withdraw flows
- P2P transfer
- Fraud checks
- KYC integration
- Capacity estimation with shown math
Out of scope (state explicitly)
- Fraud ML model training (#75) — rules engine is enough unless asked
- Merchant onboarding / KYC workflows
- Building a PSP or bank from scratch
Assumptions
- Strong consistency required on money/inventory paths — clarify idempotency early
- External PSP or bank APIs exist; design integration boundaries only
- 99.99% availability target for the commit/authorize path
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Wallet balance: Maintain a monetary balance per user
- Top-up: Add funds via bank transfer, credit card, or external payment
- Send money (P2P): Transfer funds between wallet users instantly
- Pay merchants: Pay at checkout using wallet balance
- Transaction history: View all credits, debits, and transfers with details
- Withdraw: Cash out wallet balance to bank account
- Multi-currency: Hold balances in multiple currencies with conversion
- Rewards/cashback: Credit rewards directly to wallet
- Spending limits: Configurable daily/monthly transaction limits
- Strong Consistency (ACID): Balance must NEVER go negative; no double-spend
- Effectively-once posting: At-least-once delivery with idempotency keys: duplicate requests return the same ledger entry
- Low Latency: P2P transfer completes in < 500 ms
- High Availability: 99.999%: wallet is the user's money
- Auditability: Every balance change has an immutable audit trail (double-entry ledger)
- Security: PCI DSS, encryption at rest/transit, fraud detection
- Scale: 100M+ wallets, 50M+ transactions/day
| Metric | Calculation | Value |
|---|---|---|
| Total wallets | Given | 100M |
| Active wallets (monthly) | Given | 30M |
| Transactions / day | Given (assumption documented in value) | 50M |
| Transactions / sec (peak) | Derived from daily volume ÷ 86400 (+ peak factor) | ~600 (peak 5K) |
| Ledger entries / day | 50M txns × 2 (double-entry) | 100M (double-entry) |
| Ledger storage / year | Given | ~18 TB |
| Redis keys (active) | Given | ~65M |
Double-Entry Ledger: The Foundation
Every transaction creates TWO ledger entries: Debit from one account, Credit to another. Sum of all entries = 0. P2P Transfer: Alice sends $50 to Bob Entry 1: Alice's wallet DEBIT -$50 Entry 2: Bob's wallet CREDIT +$50 Sum: 0 ? Top-up: Alice adds $100 from bank Entry 1: Alice's wallet CREDIT +$100 Entry 2: External_bank DEBIT -$100 Sum: 0 ? Merchant Payment: Alice pays $25 to Merchant M Entry 1: Alice's wallet DEBIT -$25 Entry 2: Merchant M's wallet CREDIT +$24.25 Entry 3: Platform fee account CREDIT +$0.75 (3% commission) Sum: 0 ?
Why double-entry? Self-validating (SUM=0), complete audit trail, regulatory requirement, reconciliation.
Atomic Balance Update: Preventing Double-Spend
Critical: Two concurrent requests to spend Alice's last $50
Solution: PostgreSQL serializable transaction + row lock:
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT balance FROM wallets WHERE user_id = 'alice' FOR UPDATE;
IF balance >= 50 THEN
UPDATE wallets SET balance = balance - 50 WHERE user_id = 'alice';
INSERT INTO ledger_entries (wallet_id, type, amount, ...) VALUES (...);
ELSE
RAISE 'Insufficient funds';
END IF;
COMMIT;Idempotent Transactions
User clicks "Send $50" but network timeout. Client retries.
Every API call includes idempotency_key (generated client-side per action):
POST /api/v1/wallet/transfer
Idempotency-Key: "txn-uuid-abc-123"
{ "to_user_id": "bob", "amount": 50.00 }
Server:
1. Check: SELECT * FROM transactions WHERE idempotency_key = 'txn-uuid-abc-123'
2. If not exists -> process transfer -> store result with idempotency_key
3. Redis: SET idempotency:{key} {result} EX 86400Three layers of defense: (1) Redis fast-path < 1ms, (2) PostgreSQL UNIQUE constraint, (3) Application-level FOR UPDATE within transaction.
Multi-Currency Wallet
User can hold balances in multiple currencies (USD, EUR, GBP, INR). wallet_balances table has ONE row per (user_id, currency) pair. Cross-currency transfers use FX rate service cached in Redis with 30s TTL.
Settlement and Reconciliation
Daily batch process: Net position calculation, bank settlement via ACH/SEPA, reconciliation matching bank confirmations against ledger, merchant settlement weekly/daily.
Event Bus Design (Kafka)
Topic: digital_wallet_system-events Partitions: 64 (scale consumers horizontally) Partition key: entity_id (user_id / order_id — preserves per-entity ordering) Retention: 7 days (compliance) or 24h (high-volume telemetry) Replication factor: 3, min.insync.replicas: 2 Producer: idempotent producer enabled (enable.idempotence=true) Consumer: consumer group "digital_wallet_system-processors" - At-least-once delivery + idempotent handlers (dedup by event_id) - DLQ topic: digital_wallet_system-events-dlq (poison messages after 3 retries) - Lag alert: consumer lag > 60s → scale workers Design a Digital Wallet System: async side effects MUST NOT block the synchronous API response. Sync path: validate → persist source of truth → publish event → return 201 Async path: consumers update caches, indexes, notifications, aggregates
POST /api/v1/wallet/top-up
Idempotency-Key: "topup-uuid"
{ "amount": 100.00, "currency": "USD", "source": "bank_transfer", "source_ref": "bank-txn-id" }
? 200 { "transaction_id": "txn-uuid", "new_balance": 150.00, "status": "pending" }
POST /api/v1/wallet/transfer
Idempotency-Key: "transfer-uuid"
{ "to_user_id": "bob-uuid", "amount": 50.00, "currency": "USD", "note": "Lunch" }
? 200 { "transaction_id": "txn-uuid", "new_balance": 100.00 }
POST /api/v1/wallet/withdraw
Idempotency-Key: "withdraw-uuid"
{ "amount": 200.00, "currency": "USD", "bank_account_id": "ba-uuid" }
? 200 { "transaction_id": "txn-uuid", "new_balance": 75.00, "status": "pending" }
GET /api/v1/wallet/balance
? { "balances": [{"currency": "USD", "available": 75.00, "pending": 0.00}] }Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff 402 Payment Required: insufficient funds 502 Bad Gateway: payment provider timeout; poll status endpoint
PostgreSQL: Source of Truth
CREATE TABLE wallets (
wallet_id UUID PRIMARY KEY,
user_id UUID UNIQUE NOT NULL,
status ENUM('active','frozen','closed','pending_kyc') NOT NULL DEFAULT 'active',
kyc_level ENUM('none','basic','verified','enhanced') DEFAULT 'none',
daily_limit DECIMAL(10,2) DEFAULT 5000.00,
monthly_limit DECIMAL(12,2) DEFAULT 50000.00,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE wallet_balances (
wallet_id UUID NOT NULL REFERENCES wallets,
currency CHAR(3) NOT NULL,
balance DECIMAL(15,2) NOT NULL DEFAULT 0 CHECK (balance >= 0),
pending_in DECIMAL(15,2) DEFAULT 0,
pending_out DECIMAL(15,2) DEFAULT 0,
updated_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (wallet_id, currency)
);
CREATE TABLE ledger_entries (
entry_id BIGSERIAL PRIMARY KEY,
transaction_id UUID NOT NULL,
wallet_id UUID NOT NULL,
currency CHAR(3) NOT NULL,
entry_type ENUM('debit','credit') NOT NULL,
amount DECIMAL(15,2) NOT NULL,
balance_after DECIMAL(15,2) NOT NULL,
description TEXT,
created_at TIMESTAMPTZ DEFAULT NOW(),
INDEX idx_wallet (wallet_id, created_at DESC),
INDEX idx_transaction (transaction_id)
) PARTITION BY RANGE (created_at);
CREATE TABLE transactions (
transaction_id UUID PRIMARY KEY,
idempotency_key VARCHAR(64) UNIQUE,
type ENUM('topup','transfer','payment','withdrawal','reward','refund') NOT NULL,
from_wallet_id UUID,
to_wallet_id UUID,
amount DECIMAL(15,2) NOT NULL,
currency CHAR(3) NOT NULL,
status ENUM('pending','completed','failed','reversed') NOT NULL,
metadata JSONB,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE outbox (
outbox_id BIGSERIAL PRIMARY KEY,
aggregate_type VARCHAR(50) NOT NULL,
aggregate_id UUID NOT NULL,
event_type VARCHAR(50) NOT NULL,
payload JSONB NOT NULL,
published BOOLEAN DEFAULT FALSE,
created_at TIMESTAMPTZ DEFAULT NOW(),
INDEX idx_unpublished (published, created_at) WHERE published = FALSE
);Redis
balance:{user_id}:{currency} → DECIMAL (cached balance), TTL 60s
idempotency:{key} → JSON (cached response), TTL 86400s
daily_spent:{user_id} → DECIMAL (INCRBY on each spend), TTL resets at midnight
fraud:velocity:{user_id} → ZSET (timestamps of recent txns), TTL 300s
fraud:device:{user_id} → SET of known device fingerprints| Concern | Solution |
|---|---|
| Double-spend | SELECT FOR UPDATE row lock; CHECK(balance >= 0) constraint as safety net |
| Duplicate transaction | Idempotency key with UNIQUE constraint + Redis fast-path check |
| Ledger inconsistency | Double-entry: SUM(all entries) must = 0; hourly automated reconciliation |
| DB failover | Synchronous replication to standby; zero data loss on failover |
| Kafka publish failure | Transactional outbox pattern: event stored in DB atomically with txn |
| Cross-shard P2P transfer | Saga with compensating transaction (debit → credit; if credit fails → reverse debit) |
Partial Failure in P2P Transfer
SAME-SHARD (Alice and Bob on same PostgreSQL shard):
PostgreSQL transaction wraps BOTH operations:
BEGIN; UPDATE alice balance; INSERT ledger; UPDATE bob balance; INSERT ledger; COMMIT;
If any step fails → entire transaction rolls back.
CROSS-SHARD (Alice on shard 1, Bob on shard 2):
Cannot use single DB transaction across shards. Use saga pattern:
Step 1: Debit Alice (shard 1)
BEGIN; UPDATE alice balance -= 50; INSERT ledger; INSERT txn status='debit_done'; COMMIT;
Step 2: Credit Bob (shard 2)
BEGIN; UPDATE bob balance += 50; INSERT ledger; UPDATE txn status='completed'; COMMIT;
If Step 2 fails:
Compensating transaction on shard 1:
BEGIN; UPDATE alice balance += 50; INSERT reversal ledger; UPDATE txn status='reversed'; COMMIT;Transactional Outbox vs Direct Kafka Publish
Direct Kafka publish (naive approach):
1. BEGIN; UPDATE balance; INSERT ledger; COMMIT;
2. kafka.send("txn-events", event) // ? this can fail!
✗ No atomicity between DB commit and event publish.
Transactional Outbox ? (recommended):
1. BEGIN;
UPDATE balance; INSERT ledger; INSERT INTO outbox (event_payload);
COMMIT; // all-or-nothing, including the outbox row
2. Background poller: SELECT * FROM outbox WHERE published = FALSE ORDER BY created_at;
3. For each: kafka.send(event) → on success: UPDATE outbox SET published = TRUE;
✓ Atomicity: outbox row committed with the transaction
✓ At-least-once delivery: poller retries until published = TRUEInterview Walkthrough
- Lead with correctness over speed: a double-charge is worse than 500 ms latency — idempotency is the most critical pattern.
- Walk through P2P transfer as a cross-shard saga: debit sender → credit suspense → debit suspense → credit recipient with compensating rollback.
- Explain three-layer idempotency: Redis fast-path check, PostgreSQL UNIQUE on idempotency_key, application-level FOR UPDATE lock.
- Cover KYC tier limits enforced at transfer time — Level 1 ($500/day) through Level 3 ($50K/day) based on verification depth.
- Mention transactional outbox for Kafka events — never publish balance-change notifications outside the DB transaction.
- Discuss real-time fraud tier (velocity, geo-anomaly) blocking transfers before ledger write, with batch graph analysis nightly.
- Common pitfall: updating balance with read-modify-write without row locking — concurrent withdrawals on a $500 balance both succeed at $400 each.
Regulatory Compliance (KYC / AML / CTR)
- KYC: Level 1 (email+phone ? $500/day), Level 2 (govt ID ? $5K/day), Level 3 (address+income ? $50K/day)
- AML: Transaction monitoring, SAR filing, sanctions screening (OFAC/EU), PEP screening
- CTR: Required for cash transactions > $10,000/day
- Data retention: 7 years minimum (BSA requirement)
- Money transmitter license required per US state; e-money license for EU (EMD2)
Fraud Detection: Real-Time + Batch
Tier 1 (Real-time): Rule-based checks: velocity, amount anomaly, geo-anomaly, device fingerprint, recipient risk, time anomaly. Redis-backed with sliding window counters.
Tier 2 (Batch): ML model + graph analysis: network analysis (mule accounts), circular transfer detection, behavioral profiling. Spark + Kafka + ClickHouse.
Why PostgreSQL (Not DynamoDB/Cassandra) for Wallets
PostgreSQL ?: - ACID with serializable isolation for critical paths - CHECK(balance >= 0): database rejects any update that would violate this - Multi-row atomic: debit Alice + credit Bob + 2 ledger entries + outbox = 1 commit - Rich query support for compliance and reconciliation - Partitioning: ledger_entries partitioned by month for fast queries DynamoDB: - Has transactions (up to 100 items in TransactWriteItems) - No CHECK constraints → application must enforce balance >= 0 - No JOINs → reconciliation queries become application-level nightmares Cassandra ? (critically inappropriate): - No multi-row transactions → CANNOT atomically debit + credit - Eventual consistency → two reads might show different balances → Financial regulators would reject this architecture
Sharding Strategy
Shard key: user_id (hash-based). With 64 shards, ~98.5% of P2P transfers are cross-shard ? saga pattern is the common case. Saga adds ~50ms latency: well within 500ms SLA.
Idempotency: The Most Critical Pattern for Wallets
At 50M txns/day × 2% retry rate = 1M retries/day. Without idempotency: 1M potential double-charges per day. Three layers: Redis fast-path, PostgreSQL UNIQUE, application FOR UPDATE.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core digital wallet system flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.