This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 25 | Money path — interviewers probe idempotency keys, authorize→capture flow, double-charge prevention, and reconciliation. Never store raw PAN; know PCI scope boundaries. |
| Arch 50 | Add multi-PSP routing, partial failure handling (auth succeeds, capture fails), ledger double-entry, and webhook idempotency. |
| Arch 75 | Staff: discuss settlement vs authorization timing, chargeback lifecycle, multi-currency FX hedging, and audit trail for regulators. |
Interview Prompt
Design a payment gateway that processes card payments for merchants. Support authorization, capture, refunds, and voids. Integrate with external payment service providers (PSPs) like Stripe or Adyen. Ensure no double-charging and provide reconciliation with PSP settlement files.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Auth-only or auth+capture in one step? | Two-phase (auth→capture) is standard for e-commerce (ship later). Single-step for digital goods. Drives state machine complexity. |
| What's the transaction volume and peak TPS? | 10K TPS peak drives async webhook processing, idempotency store sharding, and PSP rate limit handling. |
| Do we hold funds or pass through to merchant accounts? | Platform model (marketplace) requires ledger + split payouts. Pass-through is simpler — PSP handles settlement. |
| Which PCI scope are we targeting? | SAQ A (redirect/tokenization) vs SAQ D (store card data). Determines whether we ever touch PAN. |
Scope
In scope
- Authorize, capture, void, refund APIs
- Idempotency key handling
- PSP adapter layer (Stripe/Adyen)
- Internal ledger (double-entry)
- Reconciliation with PSP settlement files
- Webhook ingestion from PSPs
Out of scope (state explicitly)
- Fraud ML model training (mention rules engine + 3DS)
- Merchant onboarding/KYC
- Chargeback dispute workflow (mention as async process)
- Building a PSP from scratch
Assumptions
- 50M transactions/month, 500 TPS peak
- Tokenized cards only — never store raw PAN (PCI SAQ A)
- 99.99% availability for payment API; reconciliation can lag 24h
- USD primary; multi-currency via PSP FX
Functional Requirements
- Process payments: Accept payments via credit/debit cards, bank transfers, digital wallets (PayPal, Apple Pay, Google Pay)
- Authorize & Capture: Two-phase payment (authorize to hold funds, then capture to charge), or single-step direct charge
- Refunds: Full and partial refunds
- Recurring payments: Subscriptions, auto-debit
- Multi-currency: Accept and settle in multiple currencies
- Tokenization: Store card details securely as tokens (PCI DSS compliance)
- Webhooks: Notify merchants of payment status changes asynchronously
- Retry failed payments: Automatic retry for transient failures
- Ledger: Double-entry bookkeeping for every transaction
- Merchant dashboard: View transactions, settlements, chargebacks, analytics
Non-Functional Requirements
- ACID + effectively-once charges: Strong consistency per payment; idempotency keys ensure duplicate API calls never double-charge
- High Availability: 99.99% (four nines): downtime = lost revenue for all merchants
- Low Latency: Payment authorization in < 2 seconds
- Security: PCI DSS Level 1 compliance, encryption at rest and in transit
- Idempotency: Same payment request submitted multiple times results in only one charge
- Auditability: Every state change logged immutably for regulatory compliance
- Scalability: Process 10,000+ transactions per second
- Fault Tolerant: Survive datacenter failures without data loss
Capacity Estimation
| Metric | Calculation | Value |
|---|---|---|
| Transactions / day | Given | 500M |
| Transactions / sec | 500M / 86,400 s ≈ 5,800 average | ~6,000 (peak 20K) |
| Avg transaction size | Given | 1 KB (metadata) |
| Ledger entries / day | 500M txn × 2 entries (debit + credit) | 1B |
| Storage / day | 1B entries × 500 bytes | 500 GB |
| Storage / year | 500 GB × 365 days | ~180 TB |
| Active merchants | Given (5M merchants) | 5M |
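The arithmetic behind the table can be checked with a few lines. This is only a back-of-envelope sketch; the variable names are illustrative, and the inputs are the "given" values from the table.

```python
SECONDS_PER_DAY = 86_400

txns_per_day = 500_000_000          # given
ledger_entry_bytes = 500            # given: bytes per ledger entry

avg_tps = txns_per_day / SECONDS_PER_DAY                  # average, not peak
ledger_entries_per_day = txns_per_day * 2                 # debit + credit
storage_per_day_gb = ledger_entries_per_day * ledger_entry_bytes / 1e9
storage_per_year_tb = storage_per_day_gb * 365 / 1e3

print(f"{avg_tps:,.0f} TPS avg, {storage_per_day_gb:,.0f} GB/day, "
      f"{storage_per_year_tb:,.1f} TB/year")
```

The average rate comes out a little under 6,000 TPS, so the 20K peak figure is roughly a 3.5× headroom factor over the average.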
Payment Orchestrator: The Core State Machine
A payment goes through a well-defined state machine. Each state transition is a database transaction with an audit log entry. Idempotency: Every API call includes an idempotency_key. Before processing, check if this key was already processed → return cached result.
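A minimal sketch of such a state machine follows. The state names come from the status column used later in this document, plus an assumed "voided" state for the void API; the transition table and the names `ALLOWED`, `transition`, and `InvalidTransition` are illustrative assumptions, not a definitive model.

```python
from enum import Enum


class PaymentStatus(str, Enum):
    CREATED = "created"
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    VOIDED = "voided"        # assumed state for the void API
    REFUNDED = "refunded"
    FAILED = "failed"


# Assumed transition table; a real gateway may add states such as
# "pending" or "partially_refunded".
ALLOWED = {
    PaymentStatus.CREATED: {PaymentStatus.AUTHORIZED, PaymentStatus.CAPTURED,
                            PaymentStatus.FAILED},
    PaymentStatus.AUTHORIZED: {PaymentStatus.CAPTURED, PaymentStatus.VOIDED},
    PaymentStatus.CAPTURED: {PaymentStatus.REFUNDED},
    PaymentStatus.VOIDED: set(),
    PaymentStatus.REFUNDED: set(),
    PaymentStatus.FAILED: set(),
}


class InvalidTransition(Exception):
    pass


def transition(payment, new_status, actor, audit_log):
    """Apply one state change and append the matching audit entry.

    In a real service both writes would share one database transaction.
    """
    old = PaymentStatus(payment["status"])
    if new_status not in ALLOWED[old]:
        raise InvalidTransition(f"{old.value} -> {new_status.value}")
    payment["status"] = new_status.value
    audit_log.append({
        "payment_id": payment["payment_id"],
        "old_status": old.value,
        "new_status": new_status.value,
        "actor": actor,
    })
```

Rejecting any transition that is not in the table is what makes late or duplicate PSP notifications harmless: a second "captured" event for an already captured payment simply fails the check.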
Idempotency: The Most Critical Design Decision
Payment Request Flow:
1. Receive request with idempotency_key
2. BEGIN TRANSACTION
3. SELECT * FROM idempotency_keys WHERE key = ? FOR UPDATE
4. If exists → return cached_response (already processed)
5. If not exists → INSERT into idempotency_keys (the primary key makes a concurrent duplicate INSERT block, then fail)
6. Process payment
7. UPDATE idempotency_keys SET response = ?, status = 'completed' WHERE key = ?
8. COMMIT TRANSACTION
Why this matters: If a merchant's server crashes after sending the payment request but before receiving the response, it will retry. Without idempotency, the customer gets double-charged.
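The check-then-process flow can be sketched in memory as below. This is an illustration under stated assumptions: a `threading.Lock` stands in for the database row lock, and `IdempotencyStore`, `IdempotencyConflict`, and the request-hash comparison are hypothetical names and choices rather than a prescribed implementation.

```python
import hashlib
import json
import threading


class IdempotencyConflict(Exception):
    """The same key was reused with a different request body."""


class IdempotencyStore:
    def __init__(self):
        self._lock = threading.Lock()   # stands in for SELECT ... FOR UPDATE
        self._records = {}              # key -> (request_hash, response)

    def execute(self, key, request, process):
        request_hash = hashlib.sha256(
            json.dumps(request, sort_keys=True).encode()
        ).hexdigest()
        with self._lock:
            if key in self._records:
                stored_hash, response = self._records[key]
                if stored_hash != request_hash:
                    raise IdempotencyConflict(key)
                return response         # replay: no second charge
            response = process(request)
            self._records[key] = (request_hash, response)
            return response
```

A retry with the same key and body gets the stored response back, while a retry that changes the body under the same key is rejected instead of silently charging a different amount. The single global lock is only for brevity; a database would lock per key.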
Tokenization Service (Card Vault)
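- Purpose: Store the card number (PAN) in an isolated, PCI-compliant vault. CVV may be used for the authorization itself but must never be stored afterwards (PCI DSS prohibits it)
- The checkout client sends raw card data directly to the Tokenization Service, so merchant servers never see it. The service encrypts the PAN with AES-256 under keys held in an HSM, stores the ciphertext in the vault, and returns an opaque token such as tok_4242424242424242. The token is random, not derived from the PAN, and cannot be reversed without a vault lookup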
- PCI DSS: Only the vault touches raw card data. All other services only see tokens
- HSM (Hardware Security Module): Physical device that stores encryption keys: keys never leave the HSM
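The token/vault relationship can be illustrated with a toy sketch. `CardVault` and `mask` are hypothetical names, and the in-memory dictionary is a stand-in only: a real vault stores ciphertext, keeps keys in the HSM, and never holds plaintext like this.

```python
import secrets


class CardVault:
    """Toy stand-in for the isolated vault. NOT secure storage."""

    def __init__(self):
        self._pan_by_token = {}

    def tokenize(self, pan):
        # Random token: it carries no information about the PAN and can
        # only be resolved by a lookup inside the vault.
        token = "tok_" + secrets.token_urlsafe(16)
        self._pan_by_token[token] = pan
        return token

    def detokenize(self, token):
        # Only the vault (inside PCI scope) calls this, when talking to the PSP.
        return self._pan_by_token[token]


def mask(pan):
    """Display form other services may keep: last four digits only."""
    return "*" * (len(pan) - 4) + pan[-4:]
```

Because tokens are random, tokenizing the same card twice yields two different tokens in this sketch; a vault that needs one stable token per card would add a lookup keyed on a keyed hash of the PAN.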
Fraud Detection Engine
- Real-time checks (< 100 ms): Velocity checks (> 5 txns in 1 minute), geolocation mismatch, amount anomaly, known fraudulent BINs/IPs/devices
- ML model: Trained on historical fraud data with features: amount, merchant category, time of day, device fingerprint, card age, transaction frequency
- Rules engine: Configurable rules per merchant (e.g., "block transactions > $10,000 without 3DS")
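Two of the checks above are simple enough to sketch directly: the velocity check (more than 5 transactions in 1 minute) and the example merchant rule. The class and function names are illustrative, and timestamps are passed in explicitly so the logic stays deterministic.

```python
from collections import defaultdict, deque


class VelocityCheck:
    """Sliding-window counter per card token."""

    def __init__(self, max_txns=5, window_seconds=60):
        self.max_txns = max_txns
        self.window_seconds = window_seconds
        self._seen = defaultdict(deque)     # card token -> recent timestamps

    def is_suspicious(self, card_token, now):
        recent = self._seen[card_token]
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()                # drop events outside the window
        recent.append(now)
        return len(recent) > self.max_txns


def requires_block(amount_cents, has_3ds, limit_cents=1_000_000):
    """Example rule: block transactions > $10,000 without 3DS."""
    return amount_cents > limit_cents and not has_3ds
```

In a multi-instance deployment the per-card window would live in a shared store rather than process memory; the in-process dictionary here is an assumption made to keep the sketch self-contained.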
Payment Processor Connector (PSP Router)
- Smart routing: Choose the optimal PSP for each transaction based on card network, geography, cost, success rate
- Failover: If PSP A is down → automatically route to PSP B
- Retry logic: Soft decline (insufficient funds) → retry after 24 hours. Hard decline (stolen card) → do not retry
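A sketch of failover combined with decline classification is shown below. The reason codes in `HARD_DECLINES`, the `PSPUnavailable` exception, and the shape of the result dictionary are assumptions for illustration; real PSPs each have their own decline codes. Note the sketch fails over only on a definite outage: a decline is an answer from the issuer, and a timeout is ambiguous and should go to reconciliation rather than to a second PSP.

```python
class PSPUnavailable(Exception):
    """The PSP definitely did not process the request (e.g. connection refused)."""


# Assumed reason codes; hard declines must never be retried.
HARD_DECLINES = {"stolen_card", "lost_card", "invalid_card"}


def authorize_with_failover(psps, request):
    """Try each (name, authorize_fn) in order; move on only when a PSP is down."""
    for name, authorize in psps:
        try:
            result = dict(authorize(request))
        except PSPUnavailable:
            continue                        # PSP A is down -> try PSP B
        result["psp_name"] = name
        if result["status"] == "declined":
            result["retryable"] = result.get("reason") not in HARD_DECLINES
        return result
    return {"status": "failed", "reason": "all_psps_unavailable", "retryable": True}
```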
Ledger Service: Double-Entry Bookkeeping
Every financial transaction creates TWO ledger entries (debits and credits must balance). Example: Payment of $100 creates DEBIT customer_account $100 and CREDIT merchant_account $100, with platform commission adjustments. Ledger is append-only (immutable). Daily reconciliation job verifies sum of all debits = sum of all credits.
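The balancing rule can be sketched as a posting function that refuses unbalanced groups of entries. The `Entry` and `Ledger` names, the account names, and the 2.9% commission in the example are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    payment_id: str
    account_id: str
    entry_type: str   # "debit" or "credit"
    amount: int       # minor units (cents), always positive


class UnbalancedPosting(Exception):
    pass


class Ledger:
    def __init__(self):
        self._entries = []   # append-only: no update or delete methods

    def post(self, entries):
        """Append a group of entries only if debits equal credits."""
        debits = sum(e.amount for e in entries if e.entry_type == "debit")
        credits = sum(e.amount for e in entries if e.entry_type == "credit")
        if debits != credits:
            raise UnbalancedPosting(f"debits={debits} credits={credits}")
        self._entries.extend(entries)

    def balance(self, account_id):
        """Credits minus debits for one account."""
        return sum(
            e.amount if e.entry_type == "credit" else -e.amount
            for e in self._entries
            if e.account_id == account_id
        )


ledger = Ledger()
ledger.post([
    Entry("pay_1", "customer_account", "debit", 10_000),
    Entry("pay_1", "merchant_account", "credit", 9_710),
    Entry("pay_1", "platform_fees", "credit", 290),   # assumed 2.9% commission
])
```

With the commission as its own credit line, platform revenue is just the balance of one account, and a refund would be posted as a new reversing group rather than an edit to these rows.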
Settlement Service
- T+1 or T+2 settlement: Aggregate captured payments per merchant → batch payout
- Net settlement: payout = sum(captures) - sum(refunds) - sum(fees)
- Daily Spark job computes settlement amounts → initiates bank transfers via ACH/SWIFT
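The net settlement formula reduces to a signed sum per merchant. The sketch below uses plain Python rather than Spark, and the event tuple shape `(merchant_id, kind, amount_cents)` is an assumption.

```python
from collections import defaultdict

# Captures add to the payout; refunds and fees subtract from it.
SIGN = {"capture": 1, "refund": -1, "fee": -1}


def net_settlement(events):
    """events: iterable of (merchant_id, kind, amount_cents)."""
    payouts = defaultdict(int)
    for merchant_id, kind, amount_cents in events:
        payouts[merchant_id] += SIGN[kind] * amount_cents
    return dict(payouts)


batch = [
    ("m1", "capture", 10_000),
    ("m1", "refund", 2_000),
    ("m1", "fee", 290),
    ("m2", "capture", 5_000),
]
print(net_settlement(batch))
```

A negative result for a merchant (refunds exceeding captures in the window) is a case the payout step has to handle explicitly, for example by carrying the balance forward.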
Event Bus Design (Kafka)
Topic: payment_gateway-events
Partitions: 64 (scale consumers horizontally)
Partition key: entity_id (user_id / order_id; preserves per-entity ordering)
Retention: 7 days (compliance) or 24h (high-volume telemetry)
Replication factor: 3, min.insync.replicas: 2
Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "payment_gateway-processors"
- At-least-once delivery + idempotent handlers (dedup by event_id)
- DLQ topic: payment_gateway-events-dlq (poison messages after 3 retries)
- Lag alert: consumer lag > 60s → scale workers
Async side effects MUST NOT block the synchronous API response.
Sync path: validate → persist source of truth → publish event → return 201
Async path: consumers update caches, indexes, notifications, aggregates
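The consumer-side contract (at-least-once delivery, dedup by event_id, DLQ after 3 retries) can be sketched without a Kafka client. Everything here is an in-memory stand-in: `consume`, the `processed_ids` set, and the `dlq` list are hypothetical, and a real consumer would persist the processed IDs in the same transaction as the handler's side effects.

```python
MAX_RETRIES = 3


def consume(events, handler, processed_ids, dlq):
    """Process events at-least-once while skipping duplicates by event_id."""
    for event in events:
        if event["event_id"] in processed_ids:
            continue                          # redelivery of a handled event
        for attempt in range(MAX_RETRIES + 1):    # first try + 3 retries
            try:
                handler(event)
            except Exception:
                if attempt == MAX_RETRIES:
                    dlq.append(event)         # poison message: park it, keep going
                continue
            processed_ids.add(event["event_id"])
            break
```

Parking the poison message instead of blocking keeps the partition moving, at the cost of breaking strict ordering for that entity until the DLQ entry is resolved.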
Create Payment
POST /api/v1/payments
Idempotency-Key: idem-uuid-12345
Authorization: Bearer <merchant_api_key>
{
"amount": 10000, // in smallest currency unit (cents)
"currency": "USD",
"payment_method": "tok_4242424242424242",
"capture": true, // false for auth-only
"description": "Order #12345",
"metadata": {"order_id": "ORD-12345"},
"return_url": "https://merchant.com/payment/complete"
}
Response: 200 OK
{
"payment_id": "pay_uuid",
"status": "captured", // or "authorized"
"amount": 10000,
"currency": "USD",
"payment_method": "tok_4242...",
"created_at": "2026-03-13T10:00:00Z"
}
Capture (for auth-only payments)
POST /api/v1/payments/{payment_id}/capture
Idempotency-Key: capture-idem-uuid
{ "amount": 10000 } // can capture less than authorized (partial capture)
Refund
POST /api/v1/payments/{payment_id}/refund
Idempotency-Key: refund-idem-uuid
{ "amount": 5000, "reason": "customer_request" }
Webhook
POST https://merchant.com/webhooks/payment
{
"event": "payment.captured",
"payment_id": "pay_uuid",
"amount": 10000,
"currency": "USD",
"timestamp": "2026-03-13T10:00:01Z"
}
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 402 Payment Required: insufficient funds
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 502 Bad Gateway: payment provider timeout; poll status endpoint
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
PostgreSQL: Payments
CREATE TABLE payments (
payment_id UUID PRIMARY KEY,
merchant_id UUID NOT NULL,
amount BIGINT NOT NULL CHECK (amount > 0), -- in cents
currency VARCHAR(3) NOT NULL,
status VARCHAR(20) NOT NULL, -- created, authorized, captured, refunded, failed
payment_method VARCHAR(64), -- token reference
capture_method VARCHAR(20), -- 'automatic' or 'manual'
description TEXT,
metadata JSONB,
psp_reference VARCHAR(128), -- PSP's transaction ID
psp_name VARCHAR(64), -- which PSP processed it
failure_reason TEXT,
idempotency_key VARCHAR(128),
created_at TIMESTAMPTZ NOT NULL,
updated_at TIMESTAMPTZ NOT NULL,
UNIQUE (merchant_id, idempotency_key) -- keys are scoped per merchant
);
-- PostgreSQL has no inline INDEX clause; indexes are created separately
CREATE INDEX idx_payments_merchant ON payments (merchant_id, created_at DESC);
CREATE INDEX idx_payments_status ON payments (status);
PostgreSQL: Idempotency Keys
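CREATE TABLE idempotency_keys (
key VARCHAR(128) NOT NULL,
merchant_id UUID NOT NULL,
request_hash VARCHAR(64), -- hash of request body, to reject key reuse with a different payload
status VARCHAR(20) NOT NULL DEFAULT 'processing', -- 'processing' or 'completed'
response_code INT,
response_body JSONB,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (merchant_id, key)
);
CREATE INDEX idx_idempotency_created ON idempotency_keys (created_at); -- for cleanup of old keys
-- TTL: Background job deletes keys older than 24 hours
PostgreSQL: Ledger (Append-Only, Immutable)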
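CREATE TABLE ledger_entries (
entry_id BIGSERIAL PRIMARY KEY,
payment_id UUID NOT NULL,
account_id UUID NOT NULL,
entry_type VARCHAR(6) NOT NULL CHECK (entry_type IN ('debit', 'credit')), -- PostgreSQL has no inline ENUM(...)
amount BIGINT NOT NULL CHECK (amount > 0),
currency VARCHAR(3) NOT NULL,
balance_after BIGINT, -- running balance
description TEXT,
created_at TIMESTAMPTZ NOT NULL
);
CREATE INDEX idx_ledger_account ON ledger_entries (account_id, created_at);
CREATE INDEX idx_ledger_payment ON ledger_entries (payment_id);
-- Invariant: sum of debits = sum of credits per payment_id
-- (a row-level CHECK cannot express this; enforce it in the posting code and verify it in reconciliation)
PostgreSQL: Audit Log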
CREATE TABLE payment_audit_log (
log_id BIGSERIAL PRIMARY KEY,
payment_id UUID NOT NULL,
old_status VARCHAR(20),
new_status VARCHAR(20),
actor VARCHAR(64), -- 'system', 'merchant', 'psp'
details JSONB,
created_at TIMESTAMP NOT NULL
);
Kafka Topics
Topic: payment-events (status changes for downstream consumers)
Topic: webhook-delivery (outbound webhook messages)
Topic: settlement-events (captured payments for settlement processing)
ACID Guarantees
| Property | How Ensured |
|---|---|
| Atomicity | PostgreSQL transactions: all or nothing |
| Consistency | DB constraints (CHECK, UNIQUE, FK) + application-level invariants |
| Isolation | Read Committed plus explicit row locks (SELECT ... FOR UPDATE) on the payment row, or Serializable with retries on serialization failures |
| Durability | WAL (Write-Ahead Log) + synchronous replication to standby |
Fault Tolerance Scenarios
| Concern | Solution |
|---|---|
| Network failure with PSP | Mark payment as pending. Background reconciliation job queries PSP every minute. Never assume failure = declined |
| Service crash after PSP approval | PSP sends webhook with auth result. Webhook handler creates payment record if missing. Reconciliation catches anything missed |
| Double charge prevention | Idempotency key. Second request returns cached response from first successful attempt |
| Partial failure (auth succeeded, capture failed) | Auth has expiry (7-30 days). Background job detects stuck payments and alerts |
| Database failover | Synchronous replication to standby, so every committed transaction survives failover (RPO = 0) |
| Webhook delivery failure | Retry with exponential backoff (1s, 2s, 4s, ... up to 24h). Merchants can poll via API |
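The webhook retry schedule in the last row can be generated as follows. This is a sketch that reads "up to 24h" as a cap on the total retry window, which is an assumption; the function name and parameters are illustrative, and production schedulers usually add random jitter to each delay.

```python
def backoff_schedule(base=1.0, factor=2.0, max_total=24 * 3600):
    """Delays of 1s, 2s, 4s, ... while the cumulative wait stays within max_total."""
    delays, total, delay = [], 0.0, base
    while total + delay <= max_total:
        delays.append(delay)
        total += delay
        delay *= factor
    return delays


schedule = backoff_schedule()
print(len(schedule), schedule[:4], schedule[-1])
```

Under these assumptions the schedule has 16 attempts and the last delay is a little over 9 hours, after which the merchant has to fall back to polling the API.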
PCI DSS Compliance
- Cardholder data (PAN) only in the tokenization vault; CVV is never stored after authorization
- All data encrypted at rest (AES-256) and in transit (TLS 1.3)
- Network segmentation: vault is in an isolated network
- Quarterly vulnerability scans, annual penetration testing
Reconciliation
- Internal reconciliation: Verify ledger matches payment records (every hour)
- External reconciliation: Match records with PSP settlement files (daily)
- Bank reconciliation: Match bank statement with expected settlements (daily)
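External reconciliation is essentially a keyed comparison of two record sets. The sketch below assumes both sides have already been reduced to a mapping of `psp_reference` to amount in cents; parsing a real settlement file and the bucket names are outside what the source specifies.

```python
def reconcile(internal, settlement):
    """Compare internal records with a PSP settlement file.

    Both arguments map psp_reference -> amount in cents.
    """
    ours, theirs = set(internal), set(settlement)
    mismatched = sorted(r for r in ours & theirs if internal[r] != settlement[r])
    return {
        "matched": len(ours & theirs) - len(mismatched),
        "amount_mismatch": mismatched,
        "missing_at_psp": sorted(ours - theirs),        # we recorded it, PSP did not settle it
        "missing_internally": sorted(theirs - ours),    # PSP settled it, we have no record
    }


report = reconcile(
    {"ref_a": 100, "ref_b": 200, "ref_c": 300},
    {"ref_a": 100, "ref_b": 250, "ref_d": 400},
)
```

The "missing_internally" bucket is the dangerous one: it means money moved without a payment record, which is the crash-after-PSP-approval case described in the fault tolerance table.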
Multi-Currency
- Accept payment in customer's currency (presentment currency)
- Settle to merchant in their currency (settlement currency)
- FX conversion at time of capture using real-time exchange rates
- FX markup: 1-3% fee on cross-currency transactions
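A conversion with markup can be sketched with `Decimal` to avoid float rounding on money. The 2% markup, the half-up rounding, and the assumption that both currencies use two decimal places are illustrative choices, not rules stated in the source.

```python
from decimal import Decimal, ROUND_HALF_UP


def convert(amount_minor, rate, markup=Decimal("0.02")):
    """Convert presentment minor units into settlement minor units.

    rate is settlement currency per 1 unit of presentment currency.
    Assumes both currencies have two decimal places.
    """
    effective_rate = rate * (Decimal(1) - markup)    # merchant receives less
    settled = Decimal(amount_minor) * effective_rate
    return int(settled.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


# EUR 100.00 captured at an assumed 1.10 USD/EUR with a 2% markup
print(convert(10_000, Decimal("1.10")))
```

Storing the rate snapshot used at capture alongside the ledger entries lets a later refund or audit reproduce the exact converted amount.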
3D Secure (3DS) Authentication
- Additional cardholder verification (OTP, biometric) for fraud prevention
- 3DS 2.0: Frictionless flow for low-risk transactions (no redirect)
- Required by regulation in EU (PSD2 Strong Customer Authentication)
Monitoring & Alerting
- Payment success rate per PSP (alert if drops below 95%)
- Average authorization latency (alert if > 3s)
- Fraud rate (alert if > 0.1% of transactions)
- Reconciliation discrepancy rate
Interview Walkthrough
- State the golden rule immediately: payments must never double-charge — idempotency keys on every create/capture request are non-negotiable.
- Model the flow as a Saga pattern: authorize → capture → settle, with explicit compensation (void/refund) on each failure step.
- Insist on ACID transactions in PostgreSQL for the ledger — CAP Theorem trade-offs that favor availability over consistency are unacceptable here.
- Minimize PCI scope: card data never touches your servers; tokenize at the edge and store only vault references.
- Design webhook ingestion as idempotent event processing — PSPs retry notifications and out-of-order delivery is normal.
- Cover reconciliation: hourly internal ledger checks, daily PSP settlement matching, and alerting on discrepancy rates.
- Discuss multi-PSP routing with circuit breakers, retries, and bulkheads, so traffic fails over when one provider degrades.
- Common pitfall: treating payment status as eventually consistent without a durable state machine and audit trail.
Why PostgreSQL (Not Cassandra/DynamoDB) for Payments?
Payments demand ACID transactions (atomic debit + credit together), strong consistency (never "eventually"), complex SQL queries for finance/compliance, foreign keys, CHECK constraints, and audit requirements. PostgreSQL provides all of these. Cassandra offers only lightweight single-partition transactions, with no multi-row ACID, no constraints, no JOINs, and eventual consistency by default, which makes a balanced ledger hard to defend in a financial audit. DynamoDB supports transactions but caps them at 100 items each, and its query flexibility is limited.
Authorize-Then-Capture vs Direct Charge
Direct Charge moves money immediately: simple, but undoing it requires a refund, which costs money, and high refund rates draw card network penalties. Authorize-Then-Capture (⭐) holds funds in Step 1 and only moves money in Step 2 after business confirmation. It offers zero-cost cancellation before capture, partial capture support, and an additional fraud-review window between auth and capture.
Idempotency: The Most Critical Design Pattern in Payments
Three layers of protection: Layer 1 (Idempotency Key in API with SELECT FOR UPDATE), Layer 2 (PSP-Side idempotency using payment_id as their idempotency key), Layer 3 (Database UNIQUE constraint on merchant_id + idempotency_key as last line of defense).
Synchronous vs Asynchronous Payment Processing
Synchronous gives immediate yes/no but holds HTTP connections. Asynchronous frees the connection immediately but has more complex UX. Hybrid (⭐ Stripe's approach): Try synchronous with 10-second timeout. If PSP responds within 10s → return result immediately. If timeout → return "pending" → continue async → notify via webhook.
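The hybrid approach can be sketched with a thread pool and a bounded wait. `charge_hybrid` and the shared executor are hypothetical names; the sketch only shows the timeout decision and leaves out the webhook notification that would report the late result.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

_executor = ThreadPoolExecutor(max_workers=4)


def charge_hybrid(call_psp, request, timeout_s=10.0):
    """Wait up to timeout_s for the PSP; otherwise report 'pending'."""
    future = _executor.submit(call_psp, request)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        # The PSP call keeps running in the background. Its outcome must be
        # persisted and delivered by webhook; never treat a timeout as a decline.
        return {"status": "pending"}
```

Returning "pending" rather than an error matters for idempotency: the merchant can safely retry with the same key and will eventually receive the single real outcome.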
Double-Entry Bookkeeping: Why It's Non-Negotiable
Without double-entry, calculating platform revenue requires scanning all payments and calculating commissions: error-prone. With double-entry, every payment creates balanced DEBIT/CREDIT entries. Invariant: sum of all DEBITs = sum of all CREDITs (ALWAYS). If the invariant is ever violated → CRITICAL ALERT. Ledger is append-only: corrections are NEW entries (reversals), not updates.
Smart Routing: Choosing the Right PSP Per Transaction
Score each PSP based on success rate, cost, availability, and latency. Route to the highest-scoring PSP. Fallback cascade: PSP A fails → retry on PSP B → retry on PSP C → give up → alert. This optimization can improve overall authorization rates by 2-5%, which at scale means millions in additional revenue.
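One way to turn those four signals into a ranking is a weighted score, sketched below. The weights, the normalisation ceilings (500 bps, 2000 ms), and the `PSPStats` fields are all assumptions chosen for illustration; a real router would tune them per card network and region.

```python
from dataclasses import dataclass


@dataclass
class PSPStats:
    name: str
    success_rate: float      # 0..1, recent authorization success
    cost_bps: float          # fee in basis points
    p95_latency_ms: float
    available: bool


def score(p, w_success=0.6, w_cost=0.25, w_latency=0.15):
    """Higher is better; unavailable PSPs always rank last."""
    if not p.available:
        return float("-inf")
    cost_score = 1 - min(p.cost_bps / 500, 1)            # cheaper -> closer to 1
    latency_score = 1 - min(p.p95_latency_ms / 2000, 1)  # faster -> closer to 1
    return w_success * p.success_rate + w_cost * cost_score + w_latency * latency_score


def rank(psps):
    """Names of available PSPs, best first: the fallback cascade order."""
    return [p.name for p in sorted(psps, key=score, reverse=True) if p.available]


candidates = [
    PSPStats("psp_b", 0.90, 200, 400, True),
    PSPStats("psp_a", 0.97, 250, 400, True),
    PSPStats("psp_c", 0.99, 100, 100, False),
]
print(rank(candidates))
```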
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1 — Single PSP, synchronous flow
Monolith + PostgreSQL ledger. Stripe-only adapter. Synchronous auth/capture. Idempotency keys in Redis. Webhooks processed inline.
Key components: Monolith · PostgreSQL ledger · Redis idempotency · Stripe adapter · Webhook endpoint
Move to next phase when: Webhook backlog during peak; need async processing and second PSP for failover
Phase 2 — Multi-PSP, async pipeline
Payment service + ledger service split. Kafka for PaymentEvents. Multi-PSP routing with circuit breakers. Nightly reconciliation batch. Outbox for PSP calls.
Key components: Payment service · Ledger service · Kafka · PSP router · Reconciliation batch · Outbox pattern
Move to next phase when: Regulatory audit requires immutable audit log; settlement latency blocks merchant payouts
Phase 3 — Global marketplace platform
Multi-currency ledger with FX snapshots. Split payouts (marketplace). Real-time reconciliation streaming. Chargeback reserve accounts. Multi-region active-passive with sticky merchant routing.
Key components: Multi-currency ledger · Split payout engine · Streaming reconciliation · Chargeback reserve · Multi-region failover
Move to next phase when: Cross-border volume exceeds 30%; single PSP settlement currency becomes bottleneck
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Payment API availability | 99.99% | Revenue-blocking; about 4.3 min downtime/month max (about 52 min/year) |
| Auth p99 latency | < 500ms | Includes PSP round-trip; user waits at checkout |
| Idempotency correctness | 100% | Zero double-charges — non-negotiable |
| Reconciliation auto-match rate | > 99.9% | Manual exceptions don't scale past 50M txns/month |
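An availability target translates into an error budget by simple arithmetic. The helper below is a sketch with an illustrative name, assuming a 30-day month and a 365-day year.

```python
def downtime_budget_minutes(availability, days):
    """Allowed downtime in minutes over a period of the given length."""
    return (1 - availability) * days * 24 * 60


per_month = downtime_budget_minutes(0.9999, 30)    # about 4.32 minutes
per_year = downtime_budget_minutes(0.9999, 365)    # about 52.56 minutes
print(f"{per_month:.2f} min/month, {per_year:.2f} min/year")
```

At 99.99% a single 10-minute PSP-side incident that the gateway cannot route around spends more than two months of budget, which is part of the argument for multi-PSP failover.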
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| PSP outage during Black Friday peak | Circuit breaker opens; auth error rate > 5%; PSP status page confirms | Auto-failover to secondary PSP for new auths; queue captures for retry; extend auth hold window via PSP support ticket; display 'payment temporarily unavailable' rather than double-charge |
| Duplicate charges reported by merchants | Support ticket spike; reconciliation finds same idempotency key with two ledger entries | Halt payouts; identify root cause (idempotency store split-brain?); auto-refund duplicates; post-mortem on Redis failover config |
| Webhook delivery storm after PSP maintenance | Kafka consumer lag > 1M; stale payment statuses in merchant dashboard | Scale webhook consumers horizontally; dedupe by PSP event_id; prioritize CAPTURED/FAILED over informational events; backfill via reconciliation if lag exceeds 4h |
Cost Drivers (Staff lens)
- PSP interchange + scheme fees (~2-3% of GMV — largest cost, not infra)
- PostgreSQL ledger storage: append-only grows ~500 GB/year at 50M txns
- Reconciliation compute: batch jobs dwarf real-time API infra cost
Multi-Region & DR
Active-passive initially: primary region handles writes, secondary for read-only merchant dashboards. Payment state must not split across regions (no dual-write to PSP). Failover: DNS flip + replay unprocessed webhooks from PSP. Merchant payout batches run in primary only.