Design a Fraud Detection System

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Show domain depth beyond the baseline: async pipelines, consistency semantics, and operational SLOs.
Arch 75	Staff angles: partition behavior, cost drivers, and MVP → production evolution with clear triggers.

Interview Prompt

Design a fraud detection system.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Real-time feature computation, Rule engine + ML model ensemble, Graph-based fraud rings?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Real-time feature computation
Rule engine + ML model ensemble
Graph-based fraud rings
False positive management
Model retraining pipeline
PII handling

Out of scope (state explicitly)

Full payment gateway design (#24)
Regulatory reporting and SAR filing workflows
Graph neural network training infrastructure

Assumptions

Clarify scale (DAU, QPS, data volume) for the fraud detection system in the first 5 minutes
Standard reliability target of 99.9%–99.99% unless the problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Real-time scoring: Score every transaction/event for fraud risk in < 100 ms
Rule engine: Configurable rules (velocity checks, amount limits, geo-anomalies)
ML models: Machine learning models for pattern detection (supervised + unsupervised)
Case management: Queue suspicious events for human analyst review
Block/allow decisions: Auto-block high-risk, auto-allow low-risk, manual review for medium
Feature store: Real-time and historical features (user behavior, device fingerprint, transaction patterns)
Feedback loop: Analyst decisions feed back into ML model training
Multi-channel: Detect fraud across payments, account creation, login, promo abuse

Metric	Calculation	Value
Transactions scored / sec	Derived from daily volume ÷ 86400 (+ peak factor)	50K
ML features per transaction	Given	~200
Feature computation latency	Given	< 30 ms
Model inference latency	Given	< 20 ms
Rule evaluation latency	Given	< 10 ms
Total fraud decision latency	Given	< 100 ms
Fraud rate	Given	~0.5% of transactions
Manual review queue	Given	~50K cases/day
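A quick sanity check of the numbers in the table can be sketched as below. It assumes the 50K/s figure is a sustained rate and that the three pipeline stages run sequentially; both are assumptions for illustration, not statements from the table.

```python
SECONDS_PER_DAY = 86_400

tps = 50_000                    # transactions scored per second (table value)
fraud_rate = 0.005              # ~0.5% of transactions (table value)
review_cases_per_day = 50_000   # manual review queue (table value)

# Assumption: 50K/s is sustained all day, which gives an upper bound on volume.
daily_txns = tps * SECONDS_PER_DAY
fraud_txns_per_sec = tps * fraud_rate
review_share = review_cases_per_day / daily_txns

# Per-stage latency budgets from the table, in milliseconds.
stage_budget_ms = {"features": 30, "model": 20, "rules": 10}
total_budget_ms = 100
# Assumption: stages run back to back; the remainder covers network hops,
# serialization, and queuing.
headroom_ms = total_budget_ms - sum(stage_budget_ms.values())

print(f"{daily_txns:,} transactions/day")
print(f"{fraud_txns_per_sec:.0f} fraudulent transactions/sec")
print(f"{review_share:.6%} of transactions reach manual review")
print(f"{headroom_ms} ms of latency headroom")
```

Under these assumptions only a tiny fraction of traffic can reach analysts, so the REVIEW band has to stay narrow or the queue grows without bound.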


Feature Computation: Real-Time + Historical

200 features per transaction, computed in < 30 ms:

Real-time features (Redis, < 5 ms): transaction_count_last_1h, transaction_amount_last_24h, unique_merchants_last_7d, device_fingerprint_seen_before, ip_address_country
Historical features (pre-computed, ClickHouse → Redis cache): avg_transaction_amount_30d, typical_transaction_hour, account_age_days
Derived features (computed at scoring time): amount_deviation, geo_velocity (distance_from_last_txn / time_since_last_txn), is_new_device, is_new_merchant

Feature store architecture: Flink consumes transaction events → updates real-time features in Redis. Spark (nightly) computes historical features. At scoring time: Feature Service reads from Redis → constructs 200-dim feature vector.
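The derived features can be sketched in a few lines. This is a minimal illustration: the `Location` type, the haversine distance, and the relative-deviation formula for `amount_deviation` are assumptions, since the text only names the features.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0


@dataclass
class Location:
    lat: float        # degrees
    lng: float        # degrees
    timestamp: float  # Unix seconds


def haversine_km(a: Location, b: Location) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lat2 = math.radians(a.lat), math.radians(b.lat)
    dlat = lat2 - lat1
    dlng = math.radians(b.lng - a.lng)
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def geo_velocity_kmh(last: Location, current: Location) -> float:
    """distance_from_last_txn / time_since_last_txn, in km/h."""
    distance = haversine_km(last, current)
    hours = (current.timestamp - last.timestamp) / 3600.0
    if hours <= 0:
        # Same or out-of-order timestamps: any movement counts as impossible travel.
        return float("inf") if distance > 0 else 0.0
    return distance / hours


def amount_deviation(amount: float, avg_amount_30d: float) -> float:
    """Relative deviation from the 30-day average (assumed definition)."""
    if avg_amount_30d <= 0:
        return 0.0
    return (amount - avg_amount_30d) / avg_amount_30d
```

Both functions need only the current event plus values already held in Redis (`last_location`, `avg_transaction_amount_30d`), which is why they can be computed at scoring time.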

ML Model Architecture: Ensemble

1. XGBoost (primary, supervised):
   Trained on labeled data with all 200 features
   Output: P(fraud) 0.0-1.0, fast inference (< 5 ms)

2. Autoencoder (anomaly detection, unsupervised):
   Trained on legitimate transactions only
   High reconstruction error = anomaly = potential fraud
   Catches NEW fraud patterns not in labeled training data

3. Graph Neural Network (network analysis):
   Detect fraud rings: cluster of accounts sharing devices/IPs
   Run offline, flag suspicious clusters for enhanced scrutiny

Scoring:
  final_score = 0.5 * xgboost_score + 0.3 * autoencoder_anomaly + 0.2 * graph_risk
  score < 0.3: ALLOW,  0.3 <= score <= 0.7: REVIEW,  score > 0.7: BLOCK
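The weighted ensemble and thresholds above might look like the following sketch. It assumes each component score is already normalized to [0, 1]; in particular, the autoencoder's reconstruction error would need rescaling first. The function names are illustrative.

```python
WEIGHTS = {"xgboost": 0.5, "autoencoder": 0.3, "graph": 0.2}
ALLOW_BELOW = 0.3
BLOCK_ABOVE = 0.7


def final_score(xgboost_score: float, autoencoder_anomaly: float, graph_risk: float) -> float:
    """Weighted sum of the three model outputs, each assumed to lie in [0, 1]."""
    for name, value in (("xgboost_score", xgboost_score),
                        ("autoencoder_anomaly", autoencoder_anomaly),
                        ("graph_risk", graph_risk)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (WEIGHTS["xgboost"] * xgboost_score
            + WEIGHTS["autoencoder"] * autoencoder_anomaly
            + WEIGHTS["graph"] * graph_risk)


def decide(score: float) -> str:
    """Map a score to ALLOW / REVIEW / BLOCK; boundary values go to REVIEW."""
    if score < ALLOW_BELOW:
        return "ALLOW"
    if score > BLOCK_ABOVE:
        return "BLOCK"
    return "REVIEW"


score = final_score(xgboost_score=0.2, autoencoder_anomaly=0.1, graph_risk=0.0)
print(decide(score))  # ALLOW
```

Because the weights sum to 1.0, the final score stays in [0, 1] and the thresholds keep their meaning when one component is retrained.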

Rule Engine: Fast, Configurable

JSON

[
  {
    "rule_id": "R001", "name": "high_amount_new_account",
    "condition": "transaction.amount > 500 AND user.account_age_days < 7",
    "action": "review", "priority": 10
  },
  {
    "rule_id": "R002", "name": "impossible_travel",
    "condition": "geo_velocity_kmh > 1000",
    "action": "block", "priority": 1
  },
  {
    "rule_id": "R003", "name": "velocity_breach",
    "condition": "transaction_count_last_1h > 20",
    "action": "block", "priority": 2
  }
]
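A minimal evaluator for rules of this shape could look like the sketch below. It handles only `field op number` clauses joined by `AND`, treats a lower `priority` number as higher precedence, and treats a missing field as "clause does not fire"; all three are assumptions, since the rule format above does not specify them.

```python
import operator
import re

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
       "<=": operator.le, "==": operator.eq}
# One clause: a (possibly dotted) field name, a comparison operator, a number.
CLAUSE = re.compile(r"^\s*([\w.]+)\s*(>=|<=|==|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")

RULES = [
    {"rule_id": "R001", "condition": "transaction.amount > 500 AND user.account_age_days < 7",
     "action": "review", "priority": 10},
    {"rule_id": "R002", "condition": "geo_velocity_kmh > 1000",
     "action": "block", "priority": 1},
    {"rule_id": "R003", "condition": "transaction_count_last_1h > 20",
     "action": "block", "priority": 2},
]


def clause_holds(clause: str, facts: dict) -> bool:
    match = CLAUSE.match(clause)
    if match is None:
        raise ValueError(f"unsupported clause: {clause!r}")
    field, op, literal = match.groups()
    if field not in facts:
        return False  # assumption: a missing feature never fires a rule
    return OPS[op](facts[field], float(literal))


def rule_fires(rule: dict, facts: dict) -> bool:
    return all(clause_holds(c, facts) for c in rule["condition"].split(" AND "))


def evaluate(rules: list, facts: dict):
    """Return (action of the highest-precedence fired rule, fired rule ids)."""
    fired = sorted((r for r in rules if rule_fires(r, facts)),
                   key=lambda r: r["priority"])
    action = fired[0]["action"] if fired else None
    return action, [r["rule_id"] for r in fired]


facts = {"transaction.amount": 600, "user.account_age_days": 3,
         "geo_velocity_kmh": 50, "transaction_count_last_1h": 25}
print(evaluate(RULES, facts))  # ('block', ['R003', 'R001'])
```

Parsing a restricted grammar instead of calling `eval` keeps analyst-authored rules from executing arbitrary code, which matters when rules ship within minutes.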

Score Transaction

HTTP

POST /api/v1/fraud/score
{
  "event_type": "payment",
  "transaction_id": "txn-uuid",
  "user_id": "user-uuid",
  "amount": 599.99,
  "merchant_id": "m-uuid",
  "device_fingerprint": "fp-abc",
  "ip_address": "203.0.113.42",
  "timestamp": "2026-03-14T11:00:00Z"
}
Response: 200 OK (< 100 ms)
{
  "decision": "allow",
  "score": 0.15,
  "risk_factors": ["new_device"],
  "rule_triggers": []
}

Analyst Feedback

HTTP

POST /api/v1/fraud/feedback
{ "case_id": "case-uuid", "decision": "fraud_confirmed", "analyst_id": "..." }

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff

Redis: Real-Time Features

txn_count_1h:{user_id}          -> INT, TTL 3600
txn_amount_24h:{user_id}        -> Sorted Set, TTL 86400
device_history:{user_id}        -> SET of fingerprints
last_location:{user_id}         -> Hash { lat, lng, timestamp }
ip_reputation:{ip}              -> FLOAT (risk score)
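The sorted-set key supports a true sliding window, whereas a plain counter with a TTL is closer to a fixed window. The sketch below is an in-memory stand-in for `txn_amount_24h:{user_id}`, not Redis client code; the class name is hypothetical, and timestamps play the role of sorted-set scores.

```python
import bisect


class SlidingWindowAmounts:
    """In-memory stand-in for a per-user sorted set scored by timestamp."""

    def __init__(self, window_seconds: float = 86_400):
        self.window_seconds = window_seconds
        self._entries = []  # list of (timestamp, amount), kept sorted

    def add(self, timestamp: float, amount: float) -> None:
        bisect.insort(self._entries, (timestamp, amount))

    def total(self, now: float) -> float:
        """Sum of amounts with timestamp strictly after now - window."""
        cutoff = now - self.window_seconds
        # Trim expired entries; the Redis analogue is ZREMRANGEBYSCORE.
        first_live = bisect.bisect_right(self._entries, (cutoff, float("inf")))
        del self._entries[:first_live]
        return sum(amount for _, amount in self._entries)


window = SlidingWindowAmounts(window_seconds=100)
window.add(0, 10.0)
window.add(50, 20.0)
window.add(120, 5.0)
print(window.total(now=130))  # 25.0
```

Trimming on read keeps memory bounded per user; the key-level TTL then only has to clean up users who stop transacting.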

PostgreSQL: Rules & Cases

SQL

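-- PostgreSQL has no inline ENUM(...) column type; enums are declared
-- as named types first. The type names below are illustrative.
CREATE TYPE rule_action AS ENUM ('allow', 'review', 'block');
CREATE TYPE case_decision AS ENUM ('pending', 'fraud_confirmed', 'legitimate', 'escalated');

CREATE TABLE fraud_rules (
    rule_id VARCHAR(20) PRIMARY KEY, name VARCHAR(100),
    condition TEXT NOT NULL, action rule_action,
    priority INT, active BOOLEAN DEFAULT TRUE
);

CREATE TABLE fraud_cases (
    case_id UUID PRIMARY KEY, transaction_id UUID,
    user_id UUID, score DECIMAL(4,3),
    decision case_decision,
    auto_decision VARCHAR(10), analyst_id UUID,
    created_at TIMESTAMPTZ DEFAULT NOW()
);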

Concern	Solution
Fraud service down	Fail-open for low-value txns (allow with async review); fail-close for high-value (decline)
Feature store lag	Use stale features with reduced confidence; increase review threshold
ML model error	Rule engine as safety net; circuit breaker on model; fallback to rules-only
False positive spike	Monitor auto-block rate; alert if > 2x normal; auto-switch to review mode
Feedback loop delay	Weekly model retrain; interim: update rules for new patterns within hours
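The "circuit breaker on model; fallback to rules-only" row can be sketched as follows. The thresholds, class name, and result shape are assumptions for illustration; a production breaker would usually also track timeouts and limit trial calls.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def allow_call(self) -> bool:
        if self._opened_at is None:
            return True
        # Half-open: after the cooldown, let a trial call through.
        return self._clock() - self._opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def score_with_fallback(breaker, model_fn, rules_fn, features):
    """Use the model when the breaker allows it; otherwise decide on rules alone."""
    if breaker.allow_call():
        try:
            score = model_fn(features)
            breaker.record_success()
            return {"source": "model+rules", "score": score,
                    "rule_action": rules_fn(features)}
        except Exception:
            breaker.record_failure()
    return {"source": "rules_only", "score": None,
            "rule_action": rules_fn(features)}
```

The rule engine runs on both paths, so an open breaker costs detection of novel patterns but never leaves a transaction undecided.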

Fail-Open vs Fail-Close

Hybrid (recommended):
  if transaction.amount < 50 AND user.account_age > 90 days:
    allow (low risk, fail-open)
  elif transaction.amount > 500:
    decline (high risk, fail-close)
  else:
    allow + queue for async review (medium risk)

Interview Walkthrough

Frame as a latency-budget problem: scoring must complete in < 100 ms, with rules first (deterministic, explainable) and ML second (novel patterns).
Walk through the tiered pipeline: rule engine blocks known patterns instantly → feature store lookup → ML model score → decision (allow/review/decline).
Explain feature freshness: Flink-updated velocity counters in Redis with staleness detection that biases toward REVIEW when features lag.
Cover hybrid fail-open/fail-close: auto-allow low-amount trusted users, auto-decline high-amount when scorer is down, queue medium for async review.
Mention offline GNN batch pipeline for fraud rings — nightly community detection writes cluster risk scores back to Redis for O(1) lookup at scoring time.
Discuss champion/challenger model deployment with shadow scoring logged to ClickHouse before promoting a new model version.
Common pitfall: pure fail-close when the ML service hiccups — blocking every transaction costs more revenue than the fraud it prevents.
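Champion/challenger shadow scoring from the walkthrough can be illustrated with the sketch below. The function name and log record shape are hypothetical, and a plain list stands in for the ClickHouse sink; in practice the challenger call would likely run off the critical path.

```python
def score_with_shadow(features, champion, challenger, shadow_log):
    """Decide with the champion; record the challenger's score for offline comparison."""
    decision_score = champion(features)
    record = {"champion": decision_score}
    try:
        record["challenger"] = challenger(features)
    except Exception as exc:
        # A failing challenger must never affect the live decision.
        record["challenger_error"] = repr(exc)
    shadow_log.append(record)
    return decision_score


log = []
live = score_with_shadow({"amount": 42.0},
                         champion=lambda f: 0.2,
                         challenger=lambda f: 0.9,
                         shadow_log=log)
print(live, log)
```

Only the champion's score is returned, so promoting the challenger is a configuration change backed by the logged comparison rather than a leap of faith.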

Why Both Rules AND ML

Rules: deterministic, explainable, instant deployment for known patterns.
  "Block all transactions from sanctioned countries" -> rule, not ML.

ML: catches novel patterns, handles complex feature interactions.
  "User's spending pattern changed subtly over 3 weeks" -> ML, not rules.

Together: Rules catch known fraud immediately. ML catches new fraud patterns.
  Defense in depth: if ML misses it, rules may catch it (and vice versa).

Feature Freshness & Staleness Handling

Detection: Every feature hash includes an _updated_at timestamp
  At scoring time: feature_age = now() - features._updated_at
  
Strategy 1: Conservative bias
  If velocity features are stale → add 0.15 to fraud score
  Rationale: "I can't see recent activity → assume higher risk"
  
Strategy 2: Default substitution
  Replace stale feature with population median
  
Strategy 3: Feature importance gating
  If top-5 most important features are stale → route to REVIEW (skip ML)
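The three strategies can be combined in a small sketch. The 60-second freshness limit, the function names, and the reading of Strategy 3 as "all of the top features are stale" are assumptions; only the 0.15 penalty comes from the text.

```python
STALE_AFTER_S = 60.0   # assumed freshness limit, not from the source
STALE_PENALTY = 0.15   # Strategy 1 penalty from the text


def stale_features(feature_ages_s: dict) -> set:
    """Names of features whose age (now - _updated_at) exceeds the limit."""
    return {name for name, age in feature_ages_s.items() if age > STALE_AFTER_S}


def substitute_medians(features: dict, stale: set, medians: dict) -> dict:
    """Strategy 2: replace stale values with population medians where known."""
    return {name: (medians[name] if name in stale and name in medians else value)
            for name, value in features.items()}


def adjust_for_staleness(score, stale, velocity_features, top_features):
    """Return (score, forced_decision); forced_decision is None or 'REVIEW'."""
    # Strategy 3: if every top feature is stale, skip the model and route to review.
    if top_features and set(top_features) <= stale:
        return None, "REVIEW"
    # Strategy 1: stale velocity features bias the score upward, capped at 1.0.
    if stale & set(velocity_features):
        return min(1.0, score + STALE_PENALTY), None
    return score, None


stale = stale_features({"txn_count_1h": 120.0, "avg_amount_30d": 5.0})
print(adjust_for_staleness(0.2, stale, ["txn_count_1h"], ["avg_amount_30d"]))
```

Checking the gating rule first avoids scoring on a feature vector the model cannot be trusted with.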

Weekly vs Daily Model Retraining

Hybrid approach (recommended):
  Base model: retrained weekly (stable foundation)
  Rule engine: updated within HOURS for known new patterns
  → Analyst spots new pattern → creates rule → deployed in 30 min
  → Rule catches new fraud immediately while model takes a week to learn

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
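To make the targets concrete, the error budget arithmetic can be sketched as below. The 30-day window is an assumption, and the 50K requests/sec figure is reused from the capacity estimate as a sustained rate.

```python
def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Minutes of allowed unavailability over the window."""
    return (1.0 - availability) * days * 24 * 60


def max_failed_per_second(requests_per_second: float, max_error_rate: float) -> float:
    """Failed requests per second that still meet the error-rate target."""
    return requests_per_second * max_error_rate


print(round(downtime_budget_minutes(0.9995), 1))  # 21.6
print(max_failed_per_second(50_000, 0.001))
```

A 99.95% target leaves about 21.6 minutes per 30 days, which is why the bad-deploy mitigation of automated rollback within 5 minutes matters: a single slow rollback can consume a quarter of the monthly budget.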

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.