This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 75 | Staff level: multi-region, cost at scale, migration path, and production metrics. |
Interview Prompt
Design Ad Click Prediction System.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Feature engineering at serving time, Model serving latency, Click-through rate prediction? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Feature engineering at serving time
- Model serving latency
- Click-through rate prediction
- Bid optimization
- Feedback loop
- Online learning
Out of scope (state explicitly)
- GPU cluster training and hyperparameter tuning
- Content moderation of recommended items
- Ad auction / sponsored placement ranking
Assumptions
- Clarify scale (DAU, QPS, data volume) for ad click prediction in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Predict probability a user will click an ad (CTR prediction) in < 10ms
- Feature engineering from user profile, ad creative, context (page, time, device)
- Model training pipeline: daily retraining on latest click data
- Online learning: model adapts to recent patterns within hours
- A/B testing: compare model versions on live traffic
- Feature store: consistent features between training and serving
- Calibration: predicted probabilities must match actual click rates
- Ultra-Low Latency: < 10ms p99 for prediction
- High Throughput: 1M+ predictions/sec
- Freshness: Model reflects behavior from last 24h
- Accuracy: 0.1% CTR improvement = millions in revenue
| Metric | Calculation | Value |
|---|---|---|
| Predictions / sec | Derived from daily volume ÷ 86400 (+ peak factor) | 50M |
| Feature lookup latency budget | Given | < 2ms |
| Model inference latency budget | Given | < 5ms |
| Features per prediction | Given | 100–200 |
| Training data / day | ~10 TB ÷ 86400 | ~10 TB |
| Model size | Given | 100 MB – 2 GB |
| Feature store (online) | Given | ~2 TB (Redis cluster) |
| Training data (offline) | Given | ~100 TB (S3/Parquet) |
Model Choice
LightGBM (gradient boosted trees): ✅ Fast inference (< 1ms), interpretable, handles sparse features ❌ Can't learn complex interactions automatically Deep & Cross Network (DCN): ✅ Automatically learns feature interactions (crossing layers) ❌ Slower inference (~5ms), needs GPU for serving Practice: Two-stage Stage 1: LightGBM for candidate scoring (fast, high recall) Stage 2: DNN for final ranking (slow but accurate, only top 50 candidates)
Calibration (Critical for Bidding)
Why: If model says P(click)=0.05 but actual CTR is 0.03 -> overbid by 67% How: Isotonic regression or Platt scaling maps raw scores to calibrated probabilities Monitoring: Expected calibration error (ECE) < 0.01 Bucket predictions into deciles -> compare predicted vs actual CTR
Position Bias Correction
Problem: Ads in position 1 get clicked more regardless of relevance Solution: Train on "position-aware" features, but serve WITHOUT position Training: include position as feature -> model learns position bias Serving: set position=1 for all -> model predicts "click if shown in position 1" Alternative: IPW (Inverse Propensity Weighting) Weight each sample by 1/P(position), reducing position bias
Event Bus Design (Kafka)
Topic: ad_click_prediction-events Partitions: 64 (scale consumers horizontally) Partition key: entity_id (user_id / order_id — preserves per-entity ordering) Retention: 7 days (compliance) or 24h (high-volume telemetry) Replication factor: 3, min.insync.replicas: 2 Producer: idempotent producer enabled (enable.idempotence=true) Consumer: consumer group "ad_click_prediction-processors" - At-least-once delivery + idempotent handlers (dedup by event_id) - DLQ topic: ad_click_prediction-events-dlq (poison messages after 3 retries) - Lag alert: consumer lag > 60s → scale workers Design an Ad Click Prediction System: async side effects MUST NOT block the synchronous API response. Sync path: validate → persist source of truth → publish event → return 201 Async path: consumers update caches, indexes, notifications, aggregates
# Real-time prediction (called by ad exchange in auction path)
POST /api/predict
{
"user_id": "u_abc123",
"ad_candidates": ["ad_001", "ad_002", "ad_003"],
"context": { "page_url": "...", "device": "mobile", "geo": "US" }
}
-> {
"predictions": [
{"ad_id": "ad_001", "p_click": 0.032, "calibrated": true},
{"ad_id": "ad_002", "p_click": 0.018, "calibrated": true}
],
"latency_ms": 8
}
# Model management
POST /api/models/deploy -> Deploy new model version (canary)
GET /api/models/active -> Current model version + metrics
POST /api/models/rollback -> Revert to previous modelCommon Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff
Redis (Online Feature Store)
HSET user:features:{user_id}
avg_ctr_7d "0.032"
click_count_30d "145"
top_category "electronics"
session_depth "3"
last_click_hours "2.5"
_updated_at "2026-03-14T10:05:00Z"
EXPIRE user:features:{user_id} 86400
HSET ad:features:{ad_id}
historical_ctr "0.025"
category "travel"
creative_size "300x250"Kafka Topics
Topic: ad-impressions (partitioned by user_id, RF=3)
{ "user_id": "u_abc", "ad_id": "ad_001", "position": 2, "p_click": 0.032, "clicked": false }
Topic: ad-clicks (partitioned by user_id, RF=3)
{ "user_id": "u_abc", "ad_id": "ad_001", "click_timestamp": "..." }
-> Flink joins impressions + clicks -> training labels| Concern | Solution |
|---|---|
| Model serving failure | Fall back to simpler model (logistic regression) or last-known-good |
| Feature store unavailable | Use default features (population medians); reduces accuracy, not availability |
| Stale features | TTL enforcement; degrade gracefully with confidence reduction |
| Training failure | Don't deploy; keep serving current model; alert ML team |
| Canary deployment | New model serves 5% traffic; monitor AUC, calibration -> promote or rollback |
| Redis shard failure | Redis Cluster auto-failover; missing features for ~1% of users during failover |
Model Redundancy
Two models always warm in memory on every serving host: Champion (current production model, v47) Challenger (candidate model, v48, or last-known-good v46) Routing: config flag determines active model Normal: champion serves 100% Canary: champion 95%, challenger 5% Rollback: instant config flip -> challenger active, <60 seconds Both models score EVERY request (dual scoring): Active model's score -> used for auction Inactive model's score -> logged for offline comparison
Redis Cluster Architecture
Scale: 1B users x 2 KB features = 2 TB online feature store Redis Cluster: 32 primary shards + 32 replicas = 64 nodes Sharding: CRC16(user_id) % 16384 -> slot -> shard Read path (per prediction): 1. Hash user_id -> HGETALL user:features (0.5ms) 2. Hash ad_id -> HGETALL ad:features (0.5ms) 3. Context features in-process (0.1ms) 4. Assemble 200-dim feature vector (0.2ms) Total: < 2ms (parallel Redis calls) Hot-key problem: viral user -> millions of impressions Solution: rate-limit feature updates per user If user_id seen > 100 times in last minute -> skip update
Race Conditions in Online Feature Updates
Race 1: Flink writes while prediction reads HSET and HGETALL are serialized (Redis single-threaded) For multi-field: use MULTI/EXEC for atomic updates Race 2: Two Flink workers updating same user Solution: Partition Kafka by user_id All events for same user -> same partition -> same Flink subtask -> single writer
Interview Walkthrough
- Separate online inference (p(click) in <10ms) from offline training (batch on historical logs).
- Explain feature store for consistent online/offline features — training-serving skew ruins model quality.
- Cover logistic regression or GBDT as baseline before jumping to deep learning — interviewers want pragmatic choices.
- Discuss exploration vs exploitation (multi-armed bandit) for new ads without click history.
- Mention model versioning and shadow deployment before promoting a new model to production traffic.
- Common pitfall: training on click labels without correcting for position bias — top slots get clicks regardless of relevance.
LightGBM vs Deep Neural Network
LightGBM: Inference: < 1ms (single CPU core) Training: 2 hours on 1B samples AUC: 0.785 Best for: Candidate scoring (fast, high recall) Deep & Cross Network (DCN-v2): Inference: 3-5ms (GPU-accelerated) Training: 12 hours on 1B samples (8 GPUs) AUC: 0.795 (+1% -> millions in revenue) Best for: Final ranking (accurate, only top 50) Two-stage: LightGBM scores 200 candidates (1ms) -> DCN re-ranks top 50 (5ms) = 6ms total
Negative Downsampling
Problem: CTR ~1% -> 99% negative samples Model sees 99 negatives per positive -> predicts 0 for everything Solution: Downsample negatives Keep ALL positives, keep 1-in-10 negatives (1:10 ratio) Calibration correction: p_calibrated = p_model x rate / (p_model x rate + (1-p_model)) Industry standard: 1:10 to 1:20 ratio
Online Learning vs Daily Batch Retraining
Daily Batch: stable but 24-48h behind Online Learning: adapts within hours but risk of catastrophic forgetting Hybrid (recommended): Base model: daily batch retrain (stable foundation) Delta model: hourly online updates (captures recent trends) Final score = 0.7 x base_score + 0.3 x delta_score Weekly: merge delta into base -> fresh start for delta If delta model degrades -> coefficient automatically reduces to 0 -> graceful fallback
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core ad click prediction flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.