Design an Ad Click Prediction System

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 75	Staff level: multi-region, cost at scale, migration path, and production metrics.

Interview Prompt

Design an Ad Click Prediction System.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Feature engineering at serving time, Model serving latency, Click-through rate prediction?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Feature engineering at serving time
Model serving latency
Click-through rate prediction
Bid optimization
Feedback loop
Online learning

Out of scope (state explicitly)

GPU cluster training and hyperparameter tuning
Content moderation of recommended items
Ad auction / sponsored placement ranking

Assumptions

Clarify scale (DAU, QPS, data volume) for ad click prediction in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Predict probability a user will click an ad (CTR prediction) in < 10ms
Feature engineering from user profile, ad creative, context (page, time, device)
Model training pipeline: daily retraining on latest click data
Online learning: model adapts to recent patterns within hours
A/B testing: compare model versions on live traffic
Feature store: consistent features between training and serving
Calibration: predicted probabilities must match actual click rates

Metric	Calculation	Value
Predictions / sec	Derived from daily volume ÷ 86400 (+ peak factor)	50M
Feature lookup latency budget	Given	< 2ms
Model inference latency budget	Given	< 5ms
Features per prediction	Given	100–200
Training data / day	Given (≈ 116 MB/s sustained ingest: 10 TB ÷ 86,400 s)	~10 TB
Model size	Given	100 MB – 2 GB
Feature store (online)	Given	~2 TB (Redis cluster)
Training data (offline)	Given	~100 TB (S3/Parquet)
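
The table's numbers can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: it takes the table values as given, and assumes the < 10ms prediction target is the end-to-end budget and that the offline store holds raw daily data without compaction.

```python
# Back-of-envelope checks for the capacity table (values copied from the table).
SECONDS_PER_DAY = 86_400

training_bytes_per_day = 10 * 10**12      # ~10 TB of training data per day
ingest_mb_per_sec = training_bytes_per_day / SECONDS_PER_DAY / 10**6

feature_lookup_ms = 2                     # feature lookup latency budget
inference_ms = 5                          # model inference latency budget
end_to_end_ms = 10                        # assumed overall target (< 10ms prediction)
headroom_ms = end_to_end_ms - (feature_lookup_ms + inference_ms)

offline_tb = 100                          # offline training store
days_of_history = offline_tb / 10         # assumes ~10 TB/day stored uncompacted

print(f"sustained training-data ingest: ~{ingest_mb_per_sec:.0f} MB/s")
print(f"headroom for network and serialization: {headroom_ms} ms")
print(f"offline store covers roughly {days_of_history:.0f} days")
```

The remaining 3 ms has to cover network hops, serialization, and queueing, which is why the feature lookup and inference budgets are stated separately.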


Model Choice

LightGBM (gradient boosted trees):
  ✅ Fast inference (< 1ms), interpretable, handles sparse features
  ❌ Interactions limited to what tree splits capture; weak on very high-cardinality ID features (no learned embeddings)

Deep & Cross Network (DCN):
  ✅ Automatically learns feature interactions (crossing layers)
  ❌ Slower inference (~5ms), needs GPU for serving

In practice: two-stage
  Stage 1: LightGBM for candidate scoring (fast, high recall)
  Stage 2: DNN for final ranking (slow but accurate, only top 50 candidates)
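
A minimal sketch of the two-stage flow, assuming each model is exposed as a plain scoring function. The stub scorers and the feature names `historical_ctr` and `user_affinity` are hypothetical stand-ins, not real LightGBM or DNN calls.

```python
from typing import Callable, Dict, List, Tuple

Ad = Dict[str, float]
Scorer = Callable[[Ad], float]


def two_stage_rank(
    candidates: List[Ad],
    cheap_score: Scorer,
    expensive_score: Scorer,
    top_k: int = 50,
) -> List[Tuple[Ad, float]]:
    """Score all candidates with the cheap model, re-score only the top_k."""
    # Stage 1: cheap model over every candidate, keep the best top_k.
    shortlist = sorted(candidates, key=cheap_score, reverse=True)[:top_k]
    # Stage 2: expensive model only on the shortlist, then final ordering.
    rescored = [(ad, expensive_score(ad)) for ad in shortlist]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-ins for the tree model and the DNN.
def gbdt_stub(ad: Ad) -> float:
    return ad["historical_ctr"]


def dnn_stub(ad: Ad) -> float:
    return 0.7 * ad["historical_ctr"] + 0.3 * ad["user_affinity"]


ads = [
    {"id": float(i), "historical_ctr": i / 1000, "user_affinity": (i * 37 % 100) / 100}
    for i in range(1000)
]
ranked = two_stage_rank(ads, gbdt_stub, dnn_stub, top_k=50)
print(len(ranked), "ads re-scored by the expensive model")
```

The expensive model runs 50 times instead of 1,000, which is the point of the split: its latency cost is bounded by `top_k` rather than by the candidate count.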

Calibration (Critical for Bidding)

Why: If model says P(click)=0.05 but actual CTR is 0.03 -> overbid by 67%

How: Isotonic regression or Platt scaling maps raw scores to calibrated probabilities

Monitoring: Expected calibration error (ECE) < 0.01
  Bucket predictions into deciles -> compare predicted vs actual CTR
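
The decile check and the overbid arithmetic can be sketched in plain Python. This is one common way to compute ECE (equal-count buckets, weighted absolute gap). The function names and the synthetic data are illustrative assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    preds: Sequence[float], clicks: Sequence[int], n_buckets: int = 10
) -> float:
    """Weighted mean |predicted CTR - actual CTR| over equal-count buckets."""
    order = sorted(range(len(preds)), key=lambda i: preds[i])
    n = len(order)
    ece = 0.0
    for b in range(n_buckets):
        idx = order[b * n // n_buckets : (b + 1) * n // n_buckets]
        if not idx:
            continue
        mean_pred = sum(preds[i] for i in idx) / len(idx)
        actual_ctr = sum(clicks[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(mean_pred - actual_ctr)
    return ece


def overbid_fraction(predicted_ctr: float, actual_ctr: float) -> float:
    """How much a bid proportional to predicted CTR exceeds the fair bid."""
    return predicted_ctr / actual_ctr - 1.0


# Synthetic example: model always predicts 0.05, true click rate is 0.03.
preds = [0.05] * 1000
clicks = [1 if i % 100 < 3 else 0 for i in range(1000)]

ece = expected_calibration_error(preds, clicks)
overbid = overbid_fraction(0.05, 0.03)
print(f"ECE = {ece:.3f}, overbid = {overbid:.0%}")
```

On this synthetic data the ECE is 0.02, twice the 0.01 alert threshold, and the overbid matches the 67% figure above.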

Position Bias Correction

Problem: Ads in position 1 get clicked more regardless of relevance

Solution: Train with position as a feature, but serve with position held constant
  Training: include position as a feature -> model learns position bias
  Serving: set position=1 for all -> model predicts "click if shown in position 1"

Alternative: IPW (Inverse Propensity Weighting)
  Weight each sample by 1/P(examined | position), the inverse of how likely an ad in that slot is seen at all, reducing position bias
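
Both approaches are small in code. The sketch below assumes a hypothetical table of per-position propensities (in practice these would be estimated, for example from randomized-position traffic) and adds weight clipping, which is not in the text above, to keep rare positions from dominating training.

```python
from typing import Dict

# Hypothetical propensities: probability that an ad in each position is seen.
POSITION_PROPENSITY: Dict[int, float] = {1: 1.0, 2: 0.6, 3: 0.4, 8: 0.05}


def serving_features(features: Dict[str, float]) -> Dict[str, float]:
    """Approach 1: at serving time, pin the position feature to 1."""
    pinned = dict(features)
    pinned["position"] = 1.0
    return pinned


def ipw_weight(position: int, max_weight: float = 10.0) -> float:
    """Approach 2: training-sample weight = 1 / propensity, clipped."""
    # Clipping is an added assumption to bound variance from rare positions.
    return min(1.0 / POSITION_PROPENSITY[position], max_weight)


logged = {"position": 3.0, "user_ctr_7d": 0.02}
print(serving_features(logged))
print({pos: round(ipw_weight(pos), 2) for pos in POSITION_PROPENSITY})
```

Pinning the position keeps scores comparable across ads at serving time. IPW instead removes the bias from the training signal itself.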

Event Bus Design (Kafka)

Topic: ad_click_prediction-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (user_id for this system; preserves per-user ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "ad_click_prediction-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: ad_click_prediction-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers
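
The consumer-side guarantees can be illustrated without a Kafka client. This in-memory sketch assumes a dedup set keyed by `event_id` and a list standing in for the DLQ topic. A real deployment would back the dedup set with a shared store with a TTL and publish failures to the DLQ topic.

```python
from typing import Callable, Dict, List, Set

MAX_RETRIES = 3  # retries after the first attempt, per the DLQ rule above


class IdempotentProcessor:
    """At-least-once consumer logic: dedup by event_id, DLQ on repeated failure."""

    def __init__(self, handler: Callable[[Dict], None]) -> None:
        self.handler = handler
        self.seen_event_ids: Set[str] = set()  # stand-in for a shared dedup store
        self.dlq: List[Dict] = []              # stand-in for the DLQ topic

    def process(self, event: Dict) -> str:
        event_id = event["event_id"]
        if event_id in self.seen_event_ids:
            return "duplicate"                 # redelivery: skip side effects
        for _attempt in range(1 + MAX_RETRIES):
            try:
                self.handler(event)
            except Exception:
                continue                       # retry; real code would back off
            self.seen_event_ids.add(event_id)
            return "processed"
        self.dlq.append(event)                 # poison message
        return "dead_lettered"


clicks_by_ad: Dict[str, int] = {}


def count_click(event: Dict) -> None:
    if "ad_id" not in event:
        raise ValueError("malformed event")
    clicks_by_ad[event["ad_id"]] = clicks_by_ad.get(event["ad_id"], 0) + 1


processor = IdempotentProcessor(count_click)
results = [
    processor.process(e)
    for e in (
        {"event_id": "e1", "ad_id": "ad-7"},
        {"event_id": "e1", "ad_id": "ad-7"},   # redelivered duplicate
        {"event_id": "e2"},                    # malformed, always fails
    )
]
print(results, clicks_by_ad)
```

The duplicate delivery of `e1` does not double-count the click, which is what makes at-least-once delivery safe for click aggregates.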

Rule: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates

Concern	Solution
Model serving failure	Fall back to simpler model (logistic regression) or last-known-good
Feature store unavailable	Use default features (population medians); reduces accuracy, not availability
Stale features	TTL enforcement; degrade gracefully with confidence reduction
Training failure	Don't deploy; keep serving current model; alert ML team
Canary deployment	New model serves 5% traffic; monitor AUC, calibration -> promote or rollback
Redis shard failure	Redis Cluster auto-failover; missing features for ~1% of users during failover
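
The first two rows of the table reduce to two small wrappers. This is a sketch under assumptions: the default feature values, the feature names, and the function names are hypothetical, and production code would catch narrower exception types and enforce timeouts.

```python
from typing import Callable, Dict, Optional, Tuple

Features = Dict[str, float]

# Hypothetical population medians used when the feature store is unreachable.
DEFAULT_FEATURES: Features = {"user_ctr_7d": 0.02, "ad_ctr_7d": 0.03}


def fetch_features_with_default(
    fetch: Callable[[str], Optional[Features]], user_id: str
) -> Tuple[Features, bool]:
    """Return (features, degraded). Never raises, so serving stays available."""
    try:
        found = fetch(user_id)
    except Exception:
        return dict(DEFAULT_FEATURES), True
    if found is None:
        return dict(DEFAULT_FEATURES), True
    # Fill any missing individual features with defaults.
    return {**DEFAULT_FEATURES, **found}, False


def predict_with_fallback(
    primary: Callable[[Features], float],
    fallback: Callable[[Features], float],
    features: Features,
) -> float:
    """Use the primary model; on failure, fall back to the simpler model."""
    try:
        return primary(features)
    except Exception:
        return fallback(features)


def broken_store(user_id: str) -> Optional[Features]:
    raise ConnectionError("feature store unavailable")


def broken_model(features: Features) -> float:
    raise RuntimeError("model server down")


features, degraded = fetch_features_with_default(broken_store, "user-1")
score = predict_with_fallback(broken_model, lambda f: f["ad_ctr_7d"], features)
print(features, degraded, score)
```

Returning the `degraded` flag lets the caller log the event and, if desired, lower bid confidence instead of failing the request.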

Model Redundancy

Two models always warm in memory on every serving host:
  Champion (current production model, v47)
  Challenger (candidate model, v48, or last-known-good v46)

Routing: config flag determines active model
  Normal: champion serves 100%
  Canary: champion 95%, challenger 5%
  Rollback: config flip makes the last-known-good model (held in the challenger slot) active; takes effect in < 60 seconds

Both models score EVERY request (dual scoring):
  Active model's score -> used for auction
  Inactive model's score -> logged for offline comparison
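
A sketch of the routing and dual-scoring logic, assuming each model is a plain scoring function and the config flag is a single percentage. The class name, the hash-based bucketing, and the log record fields are illustrative choices, not taken from the design above.

```python
import hashlib
from typing import Callable, Dict, Tuple

Scorer = Callable[[Dict[str, float]], float]


class ModelRouter:
    """Keeps two warm models, scores with both, and picks one per request."""

    def __init__(self, champion: Scorer, challenger: Scorer, challenger_pct: int = 0) -> None:
        self.models = {"champion": champion, "challenger": challenger}
        # 0 = normal, 5 = canary, 100 = flip to the challenger slot
        self.challenger_pct = challenger_pct

    @staticmethod
    def bucket(request_id: str) -> int:
        """Stable 0-99 bucket so a request always routes the same way."""
        digest = hashlib.sha256(request_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % 100

    def score(self, request_id: str, features: Dict[str, float]) -> Tuple[float, Dict[str, object]]:
        # Dual scoring: both models run on every request.
        scores = {name: model(features) for name, model in self.models.items()}
        active = "challenger" if self.bucket(request_id) < self.challenger_pct else "champion"
        shadow = "champion" if active == "challenger" else "challenger"
        log_record = {"request_id": request_id, "active": active, "shadow_score": scores[shadow]}
        return scores[active], log_record


router = ModelRouter(lambda f: 0.031, lambda f: 0.029, challenger_pct=5)
auction_score, record = router.score("req-123", {})
print(auction_score, record)
```

Because both models are already loaded and scoring, changing `challenger_pct` is the only step needed to canary, promote, or roll back. The cost is roughly double the inference compute per request.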

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
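
An availability target only becomes actionable once it is converted into a budget. A minimal sketch, assuming a 30-day window (the window length is an assumption, not stated in the table):

```python
def downtime_budget_minutes(availability: float, window_days: int = 30) -> float:
    """Minutes of full outage allowed in the window at the given availability."""
    return (1.0 - availability) * window_days * 24 * 60


def budget_remaining(availability: float, downtime_so_far_min: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means SLO breached)."""
    budget = downtime_budget_minutes(availability, window_days)
    return 1.0 - downtime_so_far_min / budget


for target in (0.999, 0.9995, 0.9999):
    print(f"{target:.2%} -> {downtime_budget_minutes(target):.2f} min / 30 days")
```

At 99.95% the budget is about 21.6 minutes per 30 days, so a single 5-minute automated rollback consumes close to a quarter of it.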

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
