Design a Real-Time Bidding System (Ad Tech)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 75	Staff level: multi-region, cost at scale, migration path, and production metrics.

Interview Prompt

Design a Real-Time Bidding System (Ad Tech).

Clarifying Questions (ask before designing)

Question	Why it matters
Which is the highest priority: the 100ms bid deadline, the bid request/response protocol (OpenRTB), or budget pacing?	Forces scope negotiation: senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

100ms bid deadline
Bid request/response protocol (OpenRTB)
Budget pacing
Frequency capping
Auction types (first-price vs second-price)
Capacity estimation with shown math

Out of scope (state explicitly)

Detailed frontend/UI pixel implementation
Org structure, staffing, and hiring plan

Assumptions

Clarify scale (DAU, QPS, data volume) for the real-time bidding system in the first 5 minutes
Standard reliability target of 99.9%–99.99% unless the problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Requirements

Real-time auction in < 100ms when a user loads a webpage with ad slots
Send bid requests to multiple DSPs simultaneously
Each DSP evaluates the user and returns a bid price + ad creative
Select the highest bidder, render their ad, record the impression
Support multiple ad formats: display, video, native, rich media
Frequency capping: limit how many times a user sees the same ad
Budget management: stop bidding when advertiser budget exhausted
Win notification: notify winning DSP so they can track spend
Click and conversion tracking with attribution
Fraud detection: filter bot traffic, invalid clicks, ad stacking

Capacity Estimation

Metric	Calculation	Value
Auctions / sec	Assumed peak (daily volume ÷ 86,400 × peak factor)	10M
DSPs per auction	Given	5–20
Bid requests / sec (outbound)	10M auctions/sec × ~10 DSPs per auction (assumed average)	100M
Bid response time SLA	Given (matches tmax in the bid request)	< 50ms
Win notifications / sec	10M auctions/sec × ~50% fill rate (implied by the 5M figure)	5M
Auctions / day	10M auctions/sec × 86,400 (upper bound: peak rate sustained all day)	864B
Impressions / day	5M wins/sec × 86,400	432B
Auction log size / day	864B auctions × ~1 KB	~864 TB/day
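
The arithmetic behind the table can be checked with a short script. This is a sketch only: the average of 10 DSPs per auction and the 50% fill rate are assumptions chosen to reproduce the table's figures, and 1 KB is treated as 1,000 bytes.

```python
SECONDS_PER_DAY = 86_400

auctions_per_sec = 10_000_000   # from the table (peak)
avg_dsps_per_auction = 10       # assumption: average within the 5-20 range
fill_rate = 0.5                 # assumption: implied by 5M wins/sec
log_bytes_per_auction = 1_000   # ~1 KB per auction record (decimal KB)

bid_requests_per_sec = auctions_per_sec * avg_dsps_per_auction
wins_per_sec = int(auctions_per_sec * fill_rate)
auctions_per_day = auctions_per_sec * SECONDS_PER_DAY
impressions_per_day = wins_per_sec * SECONDS_PER_DAY
log_tb_per_day = auctions_per_day * log_bytes_per_auction / 1e12

print(f"bid requests/sec: {bid_requests_per_sec:,}")
print(f"wins/sec:         {wins_per_sec:,}")
print(f"auctions/day:     {auctions_per_day:,}")
print(f"impressions/day:  {impressions_per_day:,}")
print(f"log volume/day:   {log_tb_per_day:,.0f} TB")
```

Because the daily figures multiply a peak rate by a full day, they are upper bounds; a real estimate would use the average rate.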


Auction Engine

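Second-Price: With bids of $3.10, $2.50, and $1.80, the $3.10 bidder wins and pays $2.51 (second-highest bid + $0.01). Incentive: bidding your true value is the best strategy.

First-Price (industry standard since 2019): The winner pays exactly what they bid. Challenge: bid shading, where bidders lower their bids below true value to avoid overpaying.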
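
The two pricing rules can be sketched as one function. This is a minimal illustration, not a production auction: the names `Bid` and `run_auction` are hypothetical, the $0.01 increment comes from the example above, and having a lone bidder pay the floor in a second-price auction is an assumption.

```python
from decimal import Decimal
from typing import NamedTuple, Optional, Sequence, Tuple


class Bid(NamedTuple):
    dsp: str
    price: Decimal


def run_auction(
    bids: Sequence[Bid],
    floor: Decimal,
    auction_type: str = "first",
    increment: Decimal = Decimal("0.01"),
) -> Optional[Tuple[str, Decimal]]:
    """Return (winning DSP, clearing price), or None if no bid meets the floor."""
    eligible = sorted(
        (b for b in bids if b.price >= floor),
        key=lambda b: b.price,
        reverse=True,
    )
    if not eligible:
        return None
    winner = eligible[0]
    if auction_type == "first":
        return winner.dsp, winner.price
    # Second-price: pay just above the runner-up, never more than the own bid.
    if len(eligible) > 1:
        clearing = min(winner.price, eligible[1].price + increment)
    else:
        clearing = floor  # assumption: a lone bidder pays the floor
    return winner.dsp, clearing


bids = [
    Bid("dsp_a", Decimal("3.10")),
    Bid("dsp_b", Decimal("2.50")),
    Bid("dsp_c", Decimal("1.80")),
]
print(run_auction(bids, Decimal("0.50"), "second"))
print(run_auction(bids, Decimal("0.50"), "first"))
```

`Decimal` is used instead of floats so that prices such as 2.50 + 0.01 compare exactly.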

Latency Budget Breakdown

Total budget: 100ms (page load to ad rendered)
  Ad request to SSP edge:     5ms
  User data lookup (Redis):   2ms
  Bid request to DSPs:        5ms (parallel HTTP/2)
  DSP processing + response: 30ms (DSP timeout: 50ms)
  Auction logic:              1ms
  Win notification (async):   0ms
  Ad markup to publisher:     5ms
  Ad render in browser:       ~50ms (client-side)
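  Server-side total: ~48ms; plus ~50ms client render = ~98ms -> within 100ms budget
  Worst case: a DSP that runs to the 50ms timeout adds 20ms (~68ms server-side) and overruns the budget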
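
The parallel fan-out with a hard deadline can be sketched with `asyncio`. This is an illustration under assumptions: `call_dsp` is a hypothetical stand-in that sleeps instead of making an HTTP call, and `collect_bids` is an invented name. The point is that the auction waits at most `tmax_s` in total, not per DSP.

```python
import asyncio
from typing import Any, Coroutine, Dict, List


async def call_dsp(name: str, latency_s: float, price: float) -> Dict[str, Any]:
    """Hypothetical stand-in for an HTTP bid request; sleeps to simulate latency."""
    await asyncio.sleep(latency_s)
    return {"dsp": name, "price": price}


async def collect_bids(
    calls: List[Coroutine[Any, Any, Dict[str, Any]]], tmax_s: float
) -> List[Dict[str, Any]]:
    """Run all DSP calls in parallel and keep only those that finish within tmax_s."""
    tasks = [asyncio.create_task(c) for c in calls]
    if not tasks:
        return []
    done, pending = await asyncio.wait(tasks, timeout=tmax_s)
    for task in pending:
        task.cancel()  # late DSPs are excluded from the auction
    await asyncio.gather(*pending, return_exceptions=True)
    bids = [t.result() for t in done if t.exception() is None]
    return sorted(bids, key=lambda b: b["price"], reverse=True)


if __name__ == "__main__":
    result = asyncio.run(
        collect_bids(
            [call_dsp("fast", 0.01, 3.10), call_dsp("slow", 0.50, 9.99)],
            tmax_s=0.05,
        )
    )
    print(result)
```

A DSP that would have bid higher but misses the deadline simply loses; that trade is what keeps the server-side path inside its budget.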

Budget Enforcement at Scale

Tier 1 (hot path, Redis):
  Each edge server gets a "budget slice" from central budget
  Check local slice only -> no cross-server coordination
  When slice exhausted -> request new slice from central

Tier 2 (near real-time, Flink):
  Aggregate all win notifications per campaign
  If spend approaches limit -> signal "stop bidding"
  Latency: ~5 seconds behind real-time

Tier 3 (daily reconciliation):
  ClickHouse aggregation -> reconcile against DSP reports

Overspend tolerance: ~2-5% (industry standard, contractually agreed)
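
The Tier 1 budget-slice idea can be sketched as follows. This is an in-memory illustration only: `CentralBudget` stands in for the central store (which would need an atomic decrement in practice), and the class names and slice size are hypothetical.

```python
from decimal import Decimal


class CentralBudget:
    """Stand-in for the central budget store; hands out slices until empty."""

    def __init__(self, daily_limit: str) -> None:
        self.remaining = Decimal(daily_limit)

    def allocate(self, slice_size: Decimal) -> Decimal:
        grant = min(self.remaining, slice_size)
        self.remaining -= grant
        return grant


class EdgeBudget:
    """Per-edge-server view: spends from a local slice, refills when it runs out."""

    def __init__(self, central: CentralBudget, slice_size: str) -> None:
        self.central = central
        self.slice_size = Decimal(slice_size)
        self.local = Decimal("0")

    def try_spend(self, price: str) -> bool:
        amount = Decimal(price)
        if self.local < amount:
            # Slice exhausted: the only point that touches the central store.
            self.local += self.central.allocate(self.slice_size)
        if self.local < amount:
            return False  # campaign budget exhausted -> do not bid
        self.local -= amount
        return True


central = CentralBudget("10.00")
edge_a = EdgeBudget(central, "4.00")
edge_b = EdgeBudget(central, "4.00")
wins = sum(
    edge.try_spend("1.00") for _ in range(20) for edge in (edge_a, edge_b)
)
print(wins, central.remaining)
```

Smaller slices reduce the budget stranded on idle edge servers but increase traffic to the central store, so slice size is a tuning knob.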

Event Bus Design (Kafka)

Topic: realtime_bidding_system-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (e.g. campaign_id or user_id; preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "realtime_bidding_system-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: realtime_bidding_system-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Rule for the bidding path: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates
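
The consumer-side guarantees above (at-least-once delivery, dedup by event_id, DLQ after 3 retries) can be sketched without a Kafka client. This is a simplified illustration: `process_batch` is a hypothetical name, the dedup set and DLQ are plain in-memory containers, and a real consumer would persist the seen IDs and commit offsets.

```python
from typing import Any, Callable, Dict, List, Set

MAX_RETRIES = 3  # from the design above: poison messages go to the DLQ after 3 retries

Event = Dict[str, Any]


def process_batch(
    events: List[Event],
    handler: Callable[[Event], None],
    seen_ids: Set[str],
    dlq: List[Event],
) -> None:
    """Apply handler once per event_id; route repeatedly failing events to the DLQ."""
    for event in events:
        event_id = event["event_id"]
        if event_id in seen_ids:
            continue  # duplicate delivery: already applied
        for _attempt in range(1 + MAX_RETRIES):  # first try + 3 retries
            try:
                handler(event)
            except Exception:
                continue
            seen_ids.add(event_id)
            break
        else:
            dlq.append(event)  # every attempt failed


spend_micros: Dict[str, int] = {}


def apply_win(event: Event) -> None:
    if event["price_micros"] < 0:
        raise ValueError("negative price")
    campaign = event["campaign_id"]
    spend_micros[campaign] = spend_micros.get(campaign, 0) + event["price_micros"]


seen: Set[str] = set()
dead_letters: List[Event] = []
process_batch(
    [
        {"event_id": "e1", "campaign_id": "c1", "price_micros": 2_000_000},
        {"event_id": "e1", "campaign_id": "c1", "price_micros": 2_000_000},
        {"event_id": "e2", "campaign_id": "c1", "price_micros": -1},
        {"event_id": "e3", "campaign_id": "c1", "price_micros": 3_000_000},
    ],
    apply_win,
    seen,
    dead_letters,
)
print(spend_micros, [e["event_id"] for e in dead_letters])
```

The duplicate `e1` is counted once, which is what keeps spend aggregates correct under redelivery.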

OpenRTB Bid Request

JSON

POST /bid (sent to each DSP, parallel)
{
  "id": "auction-uuid-123",
  "imp": [{
    "id": "1", "banner": {"w": 300, "h": 250}, "bidfloor": 0.50
  }],
  "site": { "domain": "example.com" },
  "user": { "id": "user-cookie-hash", "geo": {"country": "US"} },
  "device": { "ua": "Mozilla/5.0...", "ip": "203.0.113.42" },
  "tmax": 50
}

Bid Response

JSON

{
  "id": "auction-uuid-123",
  "seatbid": [{ "bid": [{
    "id": "bid-456", "impid": "1", "price": 3.10,
    "adm": "<div>...ad creative HTML...</div>",
    "nurl": "https://dsp.example.com/win?price=${AUCTION_PRICE}"
  }] }]
}
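
The win notification works by substituting the clearing price into the `${AUCTION_PRICE}` macro in `nurl`. A minimal sketch, assuming the response shape shown above; `build_win_url` is a hypothetical helper, and formatting the price to two decimals is an assumption.

```python
import json
from decimal import Decimal

AUCTION_PRICE_MACRO = "${AUCTION_PRICE}"


def build_win_url(bid_response_json: str, clearing_price: Decimal) -> str:
    """Pick the highest-priced bid and fill the price macro in its win URL."""
    response = json.loads(bid_response_json)
    all_bids = [
        bid
        for seat in response.get("seatbid", [])
        for bid in seat.get("bid", [])
    ]
    if not all_bids:
        raise ValueError("no bids in response")
    best = max(all_bids, key=lambda b: b["price"])
    return best["nurl"].replace(AUCTION_PRICE_MACRO, f"{clearing_price:.2f}")


sample_response = json.dumps(
    {
        "seatbid": [
            {
                "bid": [
                    {
                        "id": "bid-456",
                        "impid": "1",
                        "price": 3.10,
                        "adm": "<div>...ad creative HTML...</div>",
                        "nurl": "https://dsp.example.com/win?price=${AUCTION_PRICE}",
                    }
                ]
            }
        ]
    }
)
win_url = build_win_url(sample_response, Decimal("3.10"))
print(win_url)
```

In a second-price auction the substituted value would be the clearing price rather than the DSP's own bid, which is why the macro exists at all.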

Internal APIs

GET  /api/campaigns/{id}/budget    -> Current spend vs limit
POST /api/campaigns/{id}/pause     -> Pause bidding for campaign
GET  /api/analytics/spend?campaign=...&date=...  -> Spend analytics

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff

Redis (Hot Path)

# User profile (expires after 30 days)
HSET user:{cookie_hash} segments "sports,tech" age_range "25-34"
EXPIRE user:{cookie_hash} 2592000

# Frequency cap (counter expires after 24h)
INCR freq:{user_id}:{campaign_id}:{date}
EXPIRE freq:{user_id}:{campaign_id}:{date} 86400
  Check: GET < max_freq -> allow bid

# Budget tracking
HSET budget:{campaign_id} daily_spent 4523.50 daily_limit 5000.00
  If daily_spent >= daily_limit -> don't bid
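
The frequency-cap check can be sketched against an in-memory stand-in for Redis. `CounterStore` is a hypothetical class that mimics only GET, INCR, and key expiry; the injectable clock exists so the expiry behavior can be exercised without waiting.

```python
import time
from typing import Callable, Dict, Tuple


class CounterStore:
    """In-memory stand-in for Redis counters with a TTL (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[int, float]] = {}

    def get(self, key: str) -> int:
        count, expires_at = self._data.get(key, (0, 0.0))
        return count if expires_at > self._clock() else 0

    def incr(self, key: str, ttl_s: float) -> int:
        count = self.get(key) + 1
        self._data[key] = (count, self._clock() + ttl_s)
        return count


def freq_key(user_id: str, campaign_id: str, date: str) -> str:
    return f"freq:{user_id}:{campaign_id}:{date}"


def can_bid(store: CounterStore, user_id: str, campaign_id: str,
            date: str, max_freq: int) -> bool:
    """Bid-time check: GET < max_freq -> allow bid."""
    return store.get(freq_key(user_id, campaign_id, date)) < max_freq


def record_impression(store: CounterStore, user_id: str,
                      campaign_id: str, date: str) -> int:
    """Called when the ad is actually served; counter lives for 24h."""
    return store.incr(freq_key(user_id, campaign_id, date), ttl_s=86_400)


now = [0.0]
store = CounterStore(clock=lambda: now[0])
record_impression(store, "u1", "c1", "2024-01-01")
record_impression(store, "u1", "c1", "2024-01-01")
print(can_bid(store, "u1", "c1", "2024-01-01", max_freq=2))
```

Checking at bid time and incrementing at impression time means concurrent auctions for the same user can slightly exceed the cap, which is usually an accepted trade for keeping the hot path to a single read.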

ClickHouse (Analytics)

SQL

CREATE TABLE auction_logs (
    auction_id UUID, timestamp DateTime, publisher_id UInt64,
    ad_slot_size String, user_id String, country LowCardinality(String),
    device_type LowCardinality(String), num_bids UInt8,
    winning_dsp LowCardinality(String), winning_price Float64,
    second_price Float64, floor_price Float64, campaign_id UInt64,
    is_click UInt8, is_conversion UInt8
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (publisher_id, timestamp);

Technique	Application
DSP timeout (50ms)	If DSP doesn't respond -> excluded from auction
No-bid fallback	Serve house ad or empty slot
Kafka RF=3	Auction logs, impressions survive broker failure
Redis Cluster (with replicas)	User data + budget survive node failure
Edge redundancy	Multiple edge servers per region; LB health checks
Budget reconciliation	Daily reconcile catches any tracking drift

DSP Failure

If a DSP consistently times out, a circuit breaker stops sending it bid requests for 60s, then lets a probe request through before resuming normal traffic.
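
A per-DSP circuit breaker along these lines might look like the sketch below. The class name, the threshold of 5 consecutive timeouts, and the injectable clock are assumptions; only the 60s open period comes from the text.

```python
import time
from typing import Callable, Optional


class DspCircuitBreaker:
    """Opens after repeated timeouts, then allows probes once the cool-off passes."""

    def __init__(
        self,
        failure_threshold: int = 5,      # assumption: not specified in the design
        open_seconds: float = 60.0,      # from the design: stop sending for 60s
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: normal traffic
        # Half-open once the cool-off has elapsed: let a probe through.
        return self._clock() - self._opened_at >= self.open_seconds

    def record_success(self) -> None:
        self._consecutive_failures = 0
        self._opened_at = None  # close the breaker

    def record_timeout(self) -> None:
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed probe


fake_now = [0.0]
breaker = DspCircuitBreaker(failure_threshold=3, clock=lambda: fake_now[0])
for _ in range(3):
    breaker.record_timeout()
print(breaker.allow_request())
```

This simplified version lets every request through during the half-open window until a result is recorded; a stricter implementation would admit exactly one probe.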

Fraud Detection

Real-time (< 5ms): IP blocklist, bot UA detection, IVT score.

Near real-time (Flink): CTR anomalies, click timing (< 100ms = bot), geo anomalies.

Batch (daily): Click farm patterns, publisher traffic quality scoring.

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	< 100ms end to end, with a 50ms DSP timeout; state the target early and tie it to the capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
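
To make the targets concrete, the error budget can be converted into minutes and requests. A small sketch: the 30-day window is an assumption, and the 10M requests/sec figure reuses the peak auction rate from the capacity table.

```python
def downtime_budget_minutes(availability: float, window_days: int = 30) -> float:
    """Minutes of full outage allowed in the window at a given availability target."""
    return window_days * 24 * 60 * (1 - availability)


def allowed_failures_per_sec(requests_per_sec: float, max_error_rate: float) -> float:
    """Failed requests per second that still meet the error-rate target."""
    return requests_per_sec * max_error_rate


budget_min = downtime_budget_minutes(0.9995)           # 99.95% availability
failures = allowed_failures_per_sec(10_000_000, 0.001)  # < 0.1% 5xx at 10M req/sec
print(f"{budget_min:.1f} minutes per 30 days")
print(f"{failures:,.0f} failed requests/sec")
```

At this volume even a 0.1% error rate is thousands of failed auctions every second, so alerting on burn rate is more useful than alerting on absolute counts.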

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.