Interview Prompt
Design a fraud detection system.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Real-time feature computation, Rule engine + ML model ensemble, Graph-based fraud rings? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Real-time feature computation
- Rule engine + ML model ensemble
- Graph-based fraud rings
- False positive management
- Model retraining pipeline
- PII handling
Out of scope (state explicitly)
- Full payment gateway design (#24)
- Regulatory reporting and SAR filing workflows
- Graph neural network training infrastructure
Assumptions
- Clarify scale (DAU, QPS, data volume) for the fraud detection system in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Real-time scoring: Score every transaction/event for fraud risk in < 100 ms
- Rule engine: Configurable rules (velocity checks, amount limits, geo-anomalies)
- ML models: Machine learning models for pattern detection (supervised + unsupervised)
- Case management: Queue suspicious events for human analyst review
- Block/allow decisions: Auto-block high-risk, auto-allow low-risk, manual review for medium
- Feature store: Real-time and historical features (user behavior, device fingerprint, transaction patterns)
- Feedback loop: Analyst decisions feed back into ML model training
- Multi-channel: Detect fraud across payments, account creation, login, promo abuse
- Low Latency: Fraud decision in < 100 ms (in transaction critical path)
- High Recall: Catch > 95% of fraud (false negatives are costly)
- Low false positive rate: < 5% of legitimate transactions flagged (blocking legitimate users is costly too)
- Scale: 50K+ transactions/sec scoring
- Availability: 99.99%, because a fraud system outage forces a choice between blocking all and allowing all transactions
- Adaptability: New fraud patterns detected and rules deployed within hours
| Metric | Calculation | Value |
|---|---|---|
| Transactions scored / sec | Derived from daily volume ÷ 86400 (+ peak factor) | 50K |
| ML features per transaction | Given | ~200 |
| Feature computation latency | Given | < 30 ms |
| Model inference latency | Given | < 20 ms |
| Rule evaluation latency | Given | < 10 ms |
| Total fraud decision latency | Given | < 100 ms |
| Fraud rate | Given | ~0.5% of transactions |
| Manual review queue | Given | ~50K cases/day |
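The latency rows in the table above can be checked against the total budget with a few lines of arithmetic. This is a minimal sketch; the constant and function names are illustrative, and the daily figure assumes the 50K/sec rate were sustained all day, which overstates real volume if 50K is a peak.

```python
# Stage budgets in milliseconds, copied from the capacity table.
STAGE_BUDGET_MS = {
    "feature_computation": 30,
    "model_inference": 20,
    "rule_evaluation": 10,
}
TOTAL_BUDGET_MS = 100
SCORING_RATE_TPS = 50_000


def latency_headroom_ms(stages=STAGE_BUDGET_MS, total=TOTAL_BUDGET_MS):
    """Milliseconds left for network hops, serialization, and queueing."""
    return total - sum(stages.values())


def daily_volume(tps=SCORING_RATE_TPS):
    """Upper bound on transactions scored per day at a sustained rate."""
    return tps * 86_400


print(latency_headroom_ms())  # 40
print(daily_volume())         # 4320000000
```

The three stages sum to 60 ms, so roughly 40 ms remains for everything that is not feature lookup, inference, or rule evaluation.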
Feature Computation: Real-Time + Historical
200 features per transaction, computed in < 30 ms:
- Real-time features (Redis, < 5 ms): transaction_count_last_1h, transaction_amount_last_24h, unique_merchants_last_7d, device_fingerprint_seen_before, ip_address_country
- Historical features (pre-computed, ClickHouse → Redis cache): avg_transaction_amount_30d, typical_transaction_hour, account_age_days
- Derived features (computed at scoring time): amount_deviation, geo_velocity (distance_from_last_txn / time_since_last_txn), is_new_device, is_new_merchant
Feature store architecture: Flink consumes transaction events → updates real-time features in Redis. Spark (nightly) computes historical features. At scoring time: Feature Service reads from Redis → constructs 200-dim feature vector.
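The derived features listed above can be sketched as pure functions over values already read from Redis. This is an illustrative sketch, not the reference implementation: the `Location` type, the haversine distance, and the ratio-based definition of `amount_deviation` are assumptions, since the source names the features without defining them precisely.

```python
import math
from dataclasses import dataclass


@dataclass
class Location:
    lat: float        # degrees
    lng: float        # degrees
    timestamp: float  # Unix seconds


def haversine_km(a: Location, b: Location) -> float:
    """Great-circle distance in km, using a mean Earth radius of 6371 km."""
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlmb = math.radians(b.lng - a.lng)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(h)))


def geo_velocity_kmh(last: Location, current: Location) -> float:
    """distance_from_last_txn / time_since_last_txn, in km/h."""
    distance = haversine_km(last, current)
    elapsed_h = (current.timestamp - last.timestamp) / 3600.0
    if elapsed_h <= 0:
        # Same instant (or clock skew): any movement is treated as impossible travel.
        return float("inf") if distance > 0 else 0.0
    return distance / elapsed_h


def amount_deviation(amount: float, avg_amount_30d: float) -> float:
    """Relative deviation from the 30-day average (assumed definition)."""
    if avg_amount_30d <= 0:
        return 0.0
    return (amount - avg_amount_30d) / avg_amount_30d


def is_new_device(fingerprint: str, device_history: set) -> bool:
    """True when the fingerprint is absent from device_history:{user_id}."""
    return fingerprint not in device_history
```

Keeping these functions free of I/O lets the Feature Service fetch all Redis keys in one round trip and compute derived values in memory, which helps stay inside the 30 ms feature budget.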
ML Model Architecture: Ensemble
1. XGBoost (primary, supervised)
   - Trained on labeled data with all 200 features
   - Output: P(fraud) in 0.0-1.0, fast inference (< 5 ms)
2. Autoencoder (anomaly detection, unsupervised)
   - Trained on legitimate transactions only
   - High reconstruction error = anomaly = potential fraud
   - Catches new fraud patterns not in labeled training data
3. Graph Neural Network (network analysis)
   - Detects fraud rings: clusters of accounts sharing devices/IPs
   - Runs offline and flags suspicious clusters for enhanced scrutiny

Scoring: final_score = 0.5 * xgboost_score + 0.3 * autoencoder_anomaly + 0.2 * graph_risk
- score < 0.3: ALLOW
- score 0.3-0.7: REVIEW
- score > 0.7: BLOCK
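The weighted score and decision thresholds above can be written as two small functions. This is a sketch that assumes all three inputs are already normalized to the 0.0-1.0 range; how the autoencoder's reconstruction error is mapped into that range is not specified in the source. Scores of exactly 0.3 or 0.7 are treated as REVIEW, which is one reading of the stated bands.

```python
# Ensemble weights as given in the scoring formula.
WEIGHTS = {"xgboost": 0.5, "autoencoder": 0.3, "graph": 0.2}


def final_score(xgboost_score: float, autoencoder_anomaly: float, graph_risk: float) -> float:
    """Weighted sum of the three model outputs (each assumed to be in [0, 1])."""
    return (
        WEIGHTS["xgboost"] * xgboost_score
        + WEIGHTS["autoencoder"] * autoencoder_anomaly
        + WEIGHTS["graph"] * graph_risk
    )


def decide(score: float) -> str:
    """Map a score to ALLOW / REVIEW / BLOCK using the 0.3 and 0.7 thresholds."""
    if score < 0.3:
        return "ALLOW"
    if score > 0.7:
        return "BLOCK"
    return "REVIEW"


score = final_score(xgboost_score=0.9, autoencoder_anomaly=0.8, graph_risk=0.85)
print(decide(score))  # BLOCK
```

Because the weights sum to 1.0, the combined score stays in the same 0.0-1.0 range as its inputs, so the thresholds remain meaningful when a weight is retuned.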
Rule Engine: Fast, Configurable
[
  {
    "rule_id": "R001", "name": "high_amount_new_account",
    "condition": "transaction.amount > 500 AND user.account_age_days < 7",
    "action": "review", "priority": 10
  },
  {
    "rule_id": "R002", "name": "impossible_travel",
    "condition": "geo_velocity_kmh > 1000",
    "action": "block", "priority": 1
  },
  {
    "rule_id": "R003", "name": "velocity_breach",
    "condition": "transaction_count_last_1h > 20",
    "action": "block", "priority": 2
  }
]
Score Transaction
POST /api/v1/fraud/score
{
"event_type": "payment",
"transaction_id": "txn-uuid",
"user_id": "user-uuid",
"amount": 599.99,
"merchant_id": "m-uuid",
"device_fingerprint": "fp-abc",
"ip_address": "203.0.113.42",
"timestamp": "2026-03-14T11:00:00Z"
}
Response: 200 OK (< 100 ms)
{
"decision": "allow",
"score": 0.15,
"risk_factors": ["new_device"],
"rule_triggers": []
}
Analyst Feedback
POST /api/v1/fraud/feedback
{ "case_id": "case-uuid", "decision": "fraud_confirmed", "analyst_id": "..." }
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
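A caller of the scoring API might turn the retry guidance above into a delay calculation like the following. This is a sketch under assumptions: the function name, the base delay, and the cap are illustrative, `Retry-After` is assumed to arrive in its delta-seconds form (the HTTP-date form is not handled), and 409 is left out because a conflict usually needs inspection before resending.

```python
import random

# Status codes this sketch treats as safe to retry (assumed set).
RETRYABLE_STATUSES = {429, 500, 503}


def retry_delay_seconds(status, attempt, retry_after=None,
                        base=0.1, cap=5.0, rng=random.random):
    """Return seconds to wait before the next attempt, or None if not retryable.

    attempt is zero-based: 0 for the first retry, 1 for the second, and so on.
    """
    if status not in RETRYABLE_STATUSES:
        return None
    if status == 429 and retry_after is not None:
        # Honor the server's hint instead of guessing.
        return float(retry_after)
    # Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)).
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


print(retry_delay_seconds(400, 0))                   # None
print(retry_delay_seconds(429, 0, retry_after="2"))  # 2.0
```

Since scoring sits on the transaction critical path, a caller would typically allow very few retries and fall back to the fail-open/fail-close policy once the latency budget is spent.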
Redis: Real-Time Features
txn_count_1h:{user_id} -> INT, TTL 3600
txn_amount_24h:{user_id} -> Sorted Set, TTL 86400
device_history:{user_id} -> SET of fingerprints
last_location:{user_id} -> Hash { lat, lng, timestamp }
ip_reputation:{ip} -> FLOAT (risk score)
PostgreSQL: Rules & Cases
CREATE TABLE fraud_rules (
  rule_id VARCHAR(20) PRIMARY KEY, name VARCHAR(100),
  condition TEXT NOT NULL,
  -- PostgreSQL has no inline ENUM(...) column type; a CHECK constraint restricts the values instead
  action VARCHAR(10) CHECK (action IN ('allow','review','block')),
  priority INT, active BOOLEAN DEFAULT TRUE
);
CREATE TABLE fraud_cases (
  case_id UUID PRIMARY KEY, transaction_id UUID,
  user_id UUID, score DECIMAL(4,3),
  decision VARCHAR(20) CHECK (decision IN ('pending','fraud_confirmed','legitimate','escalated')),
  auto_decision VARCHAR(10), analyst_id UUID,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

| Concern | Solution |
|---|---|
| Fraud service down | Fail-open for low-value txns (allow with async review); fail-close for high-value (decline) |
| Feature store lag | Use stale features with reduced confidence; increase review threshold |
| ML model error | Rule engine as safety net; circuit breaker on model; fallback to rules-only |
| False positive spike | Monitor auto-block rate; alert if > 2x normal; auto-switch to review mode |
| Feedback loop delay | Weekly model retrain; interim: update rules for new patterns within hours |
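The "false positive spike" row in the table above can be sketched as a rolling monitor that downgrades blocks to reviews when the auto-block rate exceeds twice its baseline. The class name, window size, and the choice to wait for a full window before reacting are assumptions made for illustration.

```python
from collections import deque


class BlockRateMonitor:
    """Tracks the share of 'block' decisions over the most recent N decisions."""

    def __init__(self, baseline_rate, window=1000, spike_factor=2.0):
        self.baseline_rate = baseline_rate   # normal auto-block rate, e.g. 0.01
        self.spike_factor = spike_factor     # "alert if > 2x normal"
        self.recent = deque(maxlen=window)   # True for block, False otherwise

    def record(self, decision):
        self.recent.append(decision == "block")

    def block_rate(self):
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def review_mode(self):
        # Only judge once the window is full, to avoid reacting to tiny samples.
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and self.block_rate() > self.spike_factor * self.baseline_rate


def enforce(decision, monitor):
    """Record the raw decision, then downgrade block to review during a spike."""
    monitor.record(decision)
    if decision == "block" and monitor.review_mode():
        return "review"
    return decision
```

In a multi-instance deployment the counts would more likely live in a shared store or metrics system than in process memory; the in-memory deque only keeps the sketch self-contained.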
Fail-Open vs Fail-Close
Hybrid (recommended):
if transaction.amount < 50 AND user.account_age > 90 days:
    allow (low risk, fail-open)
elif transaction.amount > 500:
    decline (high risk, fail-close)
else:
    allow + queue for async review (medium risk)
Interview Walkthrough
- Frame as a latency-budget problem: scoring must complete in < 100 ms, so run rules first (deterministic, explainable) and ML second (novel patterns).
- Walk through the tiered pipeline: rule engine blocks known patterns instantly → feature store lookup → ML model score → decision (allow/review/decline).
- Explain feature freshness: Flink-updated velocity counters in Redis with staleness detection that biases toward REVIEW when features lag.
- Cover hybrid fail-open/fail-close: auto-allow low-amount trusted users, auto-decline high-amount when scorer is down, queue medium for async review.
- Mention offline GNN batch pipeline for fraud rings — nightly community detection writes cluster risk scores back to Redis for O(1) lookup at scoring time.
- Discuss champion/challenger model deployment with shadow scoring logged to ClickHouse before promoting a new model version.
- Common pitfall: pure fail-close when the ML service hiccups — blocking every transaction costs more revenue than the fraud it prevents.
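The hybrid fail-open/fail-close policy mentioned in the walkthrough can be sketched as a fallback wrapped around the scorer call. The function names and the `score_fn` callable are hypothetical; the $50, $500, and 90-day thresholds are the ones given in the hybrid rule earlier in this sheet.

```python
def failover_decision(amount, account_age_days):
    """Decision used only when the fraud scorer is unavailable.

    Returns (decision, queue_for_async_review).
    """
    if amount < 50 and account_age_days > 90:
        return ("allow", False)    # low risk: fail-open
    if amount > 500:
        return ("decline", False)  # high risk: fail-close
    return ("allow", True)         # medium risk: allow now, review later


def decide_with_failover(score_fn, txn):
    """Call the scorer; on any failure fall back to the threshold policy."""
    try:
        return (score_fn(txn), False)
    except Exception:
        # A real caller would also enforce a timeout and emit a metric here.
        return failover_decision(txn["amount"], txn["account_age_days"])


def broken_scorer(txn):
    raise TimeoutError("scorer unavailable")


print(decide_with_failover(broken_scorer, {"amount": 900.0, "account_age_days": 400}))
# ('decline', False)
```

Note that a small transaction from a new account falls into the medium branch, because the fail-open branch requires both a low amount and an established account.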
Why Both Rules AND ML
- Rules: deterministic, explainable, instant deployment for known patterns. "Block all transactions from sanctioned countries" -> rule, not ML.
- ML: catches novel patterns, handles complex feature interactions. "User's spending pattern changed subtly over 3 weeks" -> ML, not rules.
- Together: Rules catch known fraud immediately. ML catches new fraud patterns. Defense in depth: if ML misses it, rules may catch it (and vice versa).
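The defense-in-depth idea above can be sketched by evaluating rules in priority order and letting the stricter of the rule action and the ML decision win. Several things here are assumptions: the rule conditions are written as Python callables rather than parsed from the stored condition strings, a lower `priority` number is taken to mean higher precedence, and "stricter wins" is one possible combination policy rather than the only one.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Ordering used to pick the stricter of two decisions.
SEVERITY = {"allow": 0, "review": 1, "block": 2}


@dataclass
class Rule:
    rule_id: str
    condition: Callable[[dict], bool]
    action: str
    priority: int  # assumed: lower number = evaluated first


RULES = [
    Rule("R001", lambda f: f["amount"] > 500 and f["account_age_days"] < 7, "review", 10),
    Rule("R002", lambda f: f["geo_velocity_kmh"] > 1000, "block", 1),
    Rule("R003", lambda f: f["transaction_count_last_1h"] > 20, "block", 2),
]


def first_matching_rule(features: dict, rules=RULES) -> Optional[Rule]:
    """Return the highest-precedence rule whose condition holds, if any."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.condition(features):
            return rule
    return None


def combined_decision(features: dict, ml_decision: str) -> str:
    """Stricter of the matching rule's action and the ML decision."""
    rule = first_matching_rule(features)
    if rule is None:
        return ml_decision
    return max(rule.action, ml_decision, key=SEVERITY.__getitem__)
```

With this policy a rule can escalate an ML "allow" to "review" or "block", but it never weakens an ML "block", so each layer can only add protection.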
Feature Freshness & Staleness Handling
Detection: every feature hash includes an _updated_at timestamp. At scoring time: feature_age = now() - features._updated_at
- Strategy 1: Conservative bias. If velocity features are stale → add 0.15 to fraud score. Rationale: "I can't see recent activity → assume higher risk"
- Strategy 2: Default substitution. Replace stale feature with population median
- Strategy 3: Feature importance gating. If top-5 most important features are stale → route to REVIEW (skip ML)
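Strategies 1 and 3 above can be sketched as follows. The 60-second staleness cutoff is an assumption (the source gives no threshold), as is the reading that any single stale top-5 feature is enough to trigger the REVIEW route. The 0.15 penalty comes from the text.

```python
import time

STALE_AFTER_SECONDS = 60.0  # assumed cutoff; tune per feature in practice
STALE_PENALTY = 0.15        # from Strategy 1


def feature_age_seconds(features, now=None):
    """feature_age = now() - features._updated_at"""
    now = time.time() if now is None else now
    return now - features["_updated_at"]


def apply_conservative_bias(score, features, now=None, max_age=STALE_AFTER_SECONDS):
    """Strategy 1: add a fixed penalty when features are stale, capped at 1.0."""
    if feature_age_seconds(features, now) > max_age:
        return min(1.0, score + STALE_PENALTY)
    return score


def should_route_to_review(stale_feature_names, features_by_importance):
    """Strategy 3: skip ML when any of the five most important features is stale."""
    return any(name in stale_feature_names for name in features_by_importance[:5])


stale = {"_updated_at": 0.0, "txn_count_1h": 3}
print(round(apply_conservative_bias(0.2, stale, now=120.0), 2))  # 0.35
```

Passing `now` explicitly keeps the functions deterministic in tests; production code would rely on the default wall-clock value.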
Weekly vs Daily Model Retraining
Hybrid approach (recommended):
- Base model: retrained weekly (stable foundation)
- Rule engine: updated within hours for known new patterns
  → Analyst spots new pattern → creates rule → deployed in 30 min
  → Rule catches new fraud immediately while model takes a week to learn
Model Version Management During Canary Deploys
Two models always loaded in memory:
- Champion: current production model (v12)
- Challenger: candidate model (v13)
Dual scoring: every transaction scored by BOTH models
- Only champion's decision is enforced
- Both scores logged to ClickHouse for offline comparison
Canary deployment stages:
- Week 1: Shadow mode (100% champion enforced, challenger logged only)
- Week 2: 5% canary (challenger enforced on 5% traffic)
- Week 3: 50% split
- Week 4: 100% promotion
Rollback: Config flag in Redis, flip from "v13" to "v12" → 1 second
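The dual-scoring and canary stages above can be sketched with a deterministic hash bucket, so the same transaction always lands on the same side of the split. The bucketing scheme, function names, and the idea of passing models as plain callables are assumptions for illustration; `canary_percent` stands in for the Redis config flag.

```python
import hashlib


def bucket(transaction_id: str) -> int:
    """Stable bucket in [0, 100) derived from the transaction ID."""
    digest = hashlib.sha256(transaction_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def enforced_version(transaction_id, canary_percent, champion="v12", challenger="v13"):
    """Challenger is enforced for roughly canary_percent of traffic; 0 means shadow mode."""
    return challenger if bucket(transaction_id) < canary_percent else champion


def score_transaction(transaction_id, features, models, canary_percent,
                      champion="v12", challenger="v13"):
    """Score with every loaded model; enforce one; return all scores for logging."""
    scores = {version: model(features) for version, model in models.items()}
    chosen = enforced_version(transaction_id, canary_percent, champion, challenger)
    return {
        "enforced_version": chosen,
        "enforced_score": scores[chosen],
        "all_scores": scores,  # what would be written to ClickHouse
    }


models = {"v12": lambda f: 0.10, "v13": lambda f: 0.90}
print(score_transaction("txn-1", {}, models, canary_percent=0)["enforced_version"])  # v12
```

Rollback then amounts to setting `canary_percent` back to 0, or swapping the champion label, without reloading either model.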
Graph Neural Network (GNN): Production Architecture
GNN runs as OFFLINE batch pipeline (not in real-time scoring path):
Nightly pipeline (Spark):
1. Build transaction graph: 500M nodes, 2B edges
2. Community detection (Louvain algorithm): detect fraud rings
3. Score clusters: if cluster_fraud_rate > 10% → all members HIGH risk
4. Write to Redis: HSET graph:risk:{user_id} score 0.85
At scoring time: graph_risk = HGET graph:risk:{user_id} score → <1ms lookup
Cost: ~4 hours nightly compute (Spark cluster)
Value: catches fraud rings that individual-transaction models miss
Fail-Open vs Fail-Close: Threshold-Based Decision
Revenue impact analysis (5 min downtime):
Average fraud service downtime: 5 min/month
During 5 min downtime with threshold-based failover:
~1,500 transactions processed
~750 low-risk auto-allowed → 0 expected fraud
~250 high-risk auto-declined → $12,500 lost revenue
~500 medium-risk auto-allowed → ~$250 fraud loss
Compare: fail-close (block all) → $75,000 lost revenue
Compare: fail-open (allow all) → $3,750 fraud loss
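Threshold-based: ~$12,750 combined impact ($12,500 lost revenue + ~$250 fraud loss), far below fail-close while keeping fraud loss well under fail-open
Staff interviews expect you to articulate how the system evolves under real growth, not jump straight to the final architecture.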
Phase 1: MVP (0 to 100K users)
Monolith or minimal services that prove the core fraud detection flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
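The availability target in the table above translates into a concrete downtime allowance. The helper below is a minimal sketch; the function name and the 30-day window are assumptions, since the source does not state the measurement window.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


for target in (0.999, 0.9995, 0.9999):
    print(f"{target:.2%} -> {error_budget_minutes(target):.1f} min per 30 days")
```

At 99.95% the budget is about 21.6 minutes per 30 days, and at 99.99% it shrinks to about 4.3 minutes, which is less than the 5 min/month of fraud-service downtime assumed in the revenue impact analysis earlier in this sheet.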
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.