Interview Prompt
Design an A/B Testing and Experimentation Platform.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Experiment assignment (deterministic hashing), Metric pipeline, Statistical significance engine? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Experiment assignment (deterministic hashing)
- Metric pipeline
- Statistical significance engine
- Interaction effects
- Guardrail metrics
- Capacity estimation with shown math
Out of scope (state explicitly)
- Detailed frontend/UI pixel implementation
- Org structure, staffing, and hiring plan
Assumptions
- Clarify scale (DAU, QPS, data volume) for the A/B testing platform in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Create experiments: Define experiment with name, hypothesis, variants, and traffic allocation
- User assignment: Deterministically assign users to experiment variants
- Feature flags: Toggle features on/off; gradual rollout
- Metric tracking: Track conversion rates, revenue, engagement metrics per variant
- Statistical analysis: Compute p-value, confidence interval, statistical significance
- Mutual exclusion: Prevent conflicting experiments from overlapping
- Experiment lifecycle: Draft → Running → Paused → Completed → Archived
- Guardrail metrics: Auto-stop experiment if key metrics degrade
- Segmentation: Run experiments on specific user segments
- Low Latency: Variant assignment in < 5 ms
- Consistency: Same user always sees same variant
- Scale: 1000+ concurrent experiments, 500M+ users, 50B+ events/day
- Statistical Rigor: Correct p-values; account for multiple comparisons
- No Impact on Production: Experiment infrastructure must not slow down main services
- Availability: 99.99% for assignment; analytics can tolerate minutes of lag
| Metric | Calculation | Value |
|---|---|---|
| Concurrent experiments | Given | 1,000 |
| Users | Given | 500M |
| Variant assignment calls / sec | Derived from daily volume ÷ 86400 (+ peak factor) | 500K |
| Experiment events / day | Given | 50B |
| Event data storage / day | 50B events × ~100 B | 5 TB |
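The storage row can be checked with a few lines of arithmetic. This is a back-of-envelope sketch: the event count and the ~100 B event size come from the table, while the average event rate is derived here as an extra sanity check and is not a figure from the table.

```python
# Back-of-envelope check of the capacity table (decimal units: 1 TB = 10**12 B).
EVENTS_PER_DAY = 50_000_000_000   # given: 50B events/day
EVENT_SIZE_BYTES = 100            # the table's ~100 B per event
SECONDS_PER_DAY = 86_400

storage_tb_per_day = EVENTS_PER_DAY * EVENT_SIZE_BYTES / 10**12
avg_events_per_sec = EVENTS_PER_DAY / SECONDS_PER_DAY

print(f"storage/day: {storage_tb_per_day:.1f} TB")
print(f"average event rate: {avg_events_per_sec:,.0f} events/s")
```

The average rate is a floor, not a peak: any peak factor you state in the interview multiplies it.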
Deterministic User Assignment
User must ALWAYS see the same variant. No randomness per request.
Algorithm: hash-based assignment
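    import mmh3  # third-party MurmurHash3 binding (pip install mmh3)

    def murmurhash3(key: str) -> int:
        # Unsigned 32-bit MurmurHash3 of the key.
        return mmh3.hash(key, signed=False)

    def get_variant(user_id, experiment_id, variants, traffic_percent):
        # variants: list of (name, weight) pairs, weights in percent summing to 100.
        # traffic_percent: share of users enrolled in the experiment, 0-100.
        hash_input = f"{experiment_id}:{user_id}"
        # Gate: is this user in the experiment at all?
        if murmurhash3(hash_input) % 10000 >= traffic_percent * 100:
            return None
        # Independent hash for the variant split so it's uniform
        # across the in-experiment population (not just the gated range).
        variant_hash = murmurhash3(hash_input + ":variant") % 10000
        cumulative = 0
        for variant_name, weight in variants:
            cumulative += weight * 100
            if variant_hash < cumulative:
                return variant_name
        # Fallback if the weights sum to less than 100.
        return variants[-1][0]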
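The behavior of the assignment function can be checked empirically. The sketch below repeats the same logic but swaps in a SHA-256-based bucket from the standard library as a stand-in for murmurhash3, so it runs without third-party packages; the experiment name and user IDs are made up for illustration.

```python
import hashlib
from collections import Counter

def bucket(key: str) -> int:
    """Stand-in for murmurhash3(key) % 10000, using stdlib SHA-256 (an assumption)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10000

def get_variant(user_id, experiment_id, variants, traffic_percent):
    hash_input = f"{experiment_id}:{user_id}"
    if bucket(hash_input) >= traffic_percent * 100:
        return None  # user is outside the experiment's traffic allocation
    variant_bucket = bucket(hash_input + ":variant")
    cumulative = 0
    for name, weight in variants:
        cumulative += weight * 100
        if variant_bucket < cumulative:
            return name
    return variants[-1][0]

variants = [("control", 50), ("treatment", 50)]
users = [f"user-{i}" for i in range(100_000)]

# Two independent passes over the same users must agree exactly.
first = [get_variant(u, "checkout_button_color", variants, 20) for u in users]
second = [get_variant(u, "checkout_button_color", variants, 20) for u in users]
counts = Counter(first)

print("deterministic:", first == second)
print("enrolled share:", 1 - counts[None] / len(users))
print("control vs treatment:", counts["control"], counts["treatment"])
```

With 20% traffic and a 50/50 split, the enrolled share should land near 0.20 and the two variant counts should be close to each other.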
Properties:
- Deterministic: same user + experiment -> same variant
- Uniform: murmurhash3 is uniformly distributed
- Independent: adding/removing experiments doesn't change other assignments
- Fast: murmurhash3 is < 100 ns
Mutual Exclusion (Experiment Layers)
Problem: Experiment A tests checkout button color. Experiment B tests checkout page layout.
If same user is in both: which caused the conversion improvement?
Solution: Experiment Layers (Google's Overlapping Experiment Infrastructure)
Layer 1 (UI experiments):
  Experiment A: button color (control: blue, treatment: green)
  Experiment B: header layout (control: v1, treatment: v2)
  Within a layer: user is in AT MOST one experiment
Layer 2 (Backend experiments):
  Experiment C: recommendation algorithm (control: v1, treatment: v2)
  Independent of Layer 1
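A minimal sketch of how layers can enforce mutual exclusion, assuming each experiment owns a half-open bucket range inside its layer and that the layer ID is mixed into the hash so layers are independent of each other. The range boundaries, the `LAYERS` structure, and the SHA-256 stand-in hash are illustrative assumptions, not a prescribed implementation.

```python
import hashlib

def layer_bucket(user_id: str, layer_id: str, total_buckets: int = 10000) -> int:
    """Stand-in for hash(user_id + layer_id) % total_traffic, using stdlib SHA-256."""
    digest = hashlib.sha256(f"{layer_id}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % total_buckets

# Hypothetical layer config: each experiment owns a half-open [start, end) bucket range.
LAYERS = {
    "ui": [("experiment_a", 0, 2000), ("experiment_b", 2000, 4000)],
    "backend": [("experiment_c", 0, 5000)],
}

def experiment_in_layer(user_id: str, layer_id: str):
    """Return the single experiment this user falls into within a layer, or None."""
    b = layer_bucket(user_id, layer_id)
    for experiment_id, start, end in LAYERS[layer_id]:
        if start <= b < end:
            return experiment_id
    return None

def assignments(user_id: str) -> dict:
    """One lookup per layer: at most one experiment per layer, any mix across layers."""
    return {layer: experiment_in_layer(user_id, layer) for layer in LAYERS}

print(assignments("user-42"))
```

Because the ranges within a layer do not overlap, a user can never be in both Experiment A and Experiment B, but can be in Experiment A and Experiment C at the same time.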
Implementation:
  Each experiment belongs to a layer.
  Assignment: hash(user_id + layer_id) % total_traffic
  Each experiment "owns" a non-overlapping range of hash values.
Statistical Analysis Engine
1. Conversion rate per variant
2. Relative lift = (treatment_rate - control_rate) / control_rate
3. Statistical significance (two-proportion z-test):
   p_pooled = total_conversions / total_impressions
   se = sqrt(p_pooled * (1 - p_pooled) * (1/n_control + 1/n_treatment))
   z = (treatment_rate - control_rate) / se
   p_value = 2 * (1 - norm_cdf(abs(z)))
4. Confidence interval (95%): CI = diff ± 1.96 * se
5. Multiple comparison correction: Bonferroni or Benjamini-Hochberg
6. Guardrail metrics: auto-pause experiment if degraded > 2 std dev
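Steps 1 to 5 can be sketched with the standard library alone. The function below follows the formulas as written (pooled standard error, two-sided p-value) and adds a Bonferroni check; the function names are illustrative, the two extra p-values in the usage lines are made-up inputs, and the counts are the ones used in the sample results response.

```python
import math

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def two_proportion_z_test(conv_c, n_c, conv_t, n_t):
    """Two-proportion z-test with a pooled standard error."""
    rate_c, rate_t = conv_c / n_c, conv_t / n_t
    diff = rate_t - rate_c
    p_pooled = (conv_c + conv_t) / (n_c + n_t)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_c + 1 / n_t))
    z = diff / se
    p_value = 2 * (1 - norm_cdf(abs(z)))
    return {
        "lift": diff / rate_c,
        "z": z,
        "p_value": p_value,
        "ci_95": (diff - 1.96 * se, diff + 1.96 * se),  # CI on the absolute difference
    }

def bonferroni(p_values, alpha=0.05):
    """Reject H0 only where p < alpha / m, for m simultaneous comparisons."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

result = two_proportion_z_test(conv_c=2500, n_c=50_000, conv_t=2750, n_t=49_800)
print(result)
# Three metrics tested at once: the primary result plus two hypothetical p-values.
print(bonferroni([result["p_value"], 0.03, 0.20]))
```

The interval here reuses the pooled standard error to mirror step 4; an unpooled standard error is a common alternative for intervals, so treat this as one reasonable choice.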
Event Bus Design (Kafka)
Topic: ab_testing_platform-events
Partitions: 64 (scale consumers horizontally)
Partition key: entity_id (user_id / order_id; preserves per-entity ordering)
Retention: 7 days (compliance) or 24h (high-volume telemetry)
Replication factor: 3, min.insync.replicas: 2
Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "ab_testing_platform-processors"
- At-least-once delivery + idempotent handlers (dedup by event_id)
- DLQ topic: ab_testing_platform-events-dlq (poison messages after 3 retries)
- Lag alert: consumer lag > 60s → scale workers
For this platform, async side effects MUST NOT block the synchronous API response.
Sync path: validate → persist source of truth → publish event → return 201
Async path: consumers update caches, indexes, notifications, aggregates
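The consumer-side rules (at-least-once delivery, dedup by event_id, DLQ for poison messages) can be illustrated without a broker. This is an in-memory sketch: `EventProcessor` is a hypothetical class, the dedup set stands in for a shared store, and "after 3 retries" is read here as three failed attempts, which is an assumption.

```python
from collections import defaultdict

MAX_ATTEMPTS = 3  # assumed reading of "poison messages after 3 retries"

class EventProcessor:
    """In-memory stand-in for a consumer; a real one would poll Kafka."""

    def __init__(self, handler):
        self.handler = handler
        self.seen_event_ids = set()       # dedup store (a shared store in production)
        self.failures = defaultdict(int)  # failed attempts per event_id
        self.dlq = []                     # stand-in for the DLQ topic

    def process(self, event: dict) -> str:
        event_id = event["event_id"]
        if event_id in self.seen_event_ids:
            return "duplicate"            # redelivery: skip side effects
        try:
            self.handler(event)
        except Exception:
            self.failures[event_id] += 1
            if self.failures[event_id] >= MAX_ATTEMPTS:
                self.dlq.append(event)
                self.seen_event_ids.add(event_id)  # stop retrying a poison message
                return "dead-lettered"
            return "retry"
        self.seen_event_ids.add(event_id)
        return "processed"

conversions = defaultdict(int)

def count_conversion(event):
    if "variant" not in event:
        raise ValueError("malformed event")
    conversions[event["variant"]] += 1

processor = EventProcessor(count_conversion)
good = {"event_id": "e1", "variant": "treatment"}
bad = {"event_id": "e2"}
print(processor.process(good), processor.process(good))  # processed duplicate
print([processor.process(bad) for _ in range(3)])        # retry, retry, dead-lettered
print(dict(conversions))                                 # counted exactly once
```

The handler's side effect runs once per event_id even when the same event is delivered twice, which is what makes at-least-once delivery safe for metric counts.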
    POST /api/v1/experiments
    {
      "name": "checkout_button_color",
      "hypothesis": "Green button increases conversion by 5%",
      "layer": "checkout_ui",
      "traffic_percent": 20,
      "variants": [
        { "name": "control", "weight": 50, "config": { "button_color": "blue" } },
        { "name": "treatment", "weight": 50, "config": { "button_color": "green" } }
      ],
      "primary_metric": "checkout_conversion_rate",
      "guardrail_metrics": ["crash_rate", "p95_latency"]
    }
    GET /api/v1/experiments/{id}/assignment?user_id=user-uuid
    -> { "variant": "treatment", "config": { "button_color": "green" } }
    POST /api/v1/experiments/{id}/events
    { "user_id": "user-uuid", "event_type": "conversion", "value": 1 }
    GET /api/v1/experiments/{id}/results
    -> {
      "status": "running", "days_running": 7,
      "variants": {
        "control": { "impressions": 50000, "conversions": 2500, "rate": 0.0500 },
        "treatment": { "impressions": 49800, "conversions": 2750, "rate": 0.0552 }
      },
      "lift": 0.104, "p_value": 0.0002, "significant": true,
      "confidence_interval": [0.032, 0.176],
      "recommended_action": "Ship treatment (statistically significant improvement)"
    }
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
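The retry guidance for 429, 500, and 503 can be sketched as a small client-side loop. Everything here is illustrative: `send` is a hypothetical callable returning `(status_code, headers, body)`, `Retry-After` is assumed to carry a number of seconds, and retried writes are assumed to carry an idempotency key so a repeat is safe.

```python
import random
import time

RETRYABLE = {429, 500, 503}

def call_with_retries(send, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status not in RETRYABLE or attempt == max_attempts - 1:
            return status, body
        if status == 429 and "Retry-After" in headers:
            delay = float(headers["Retry-After"])   # server-specified wait, in seconds
        else:
            delay = base_delay * (2 ** attempt)     # exponential backoff
            delay += random.uniform(0, delay / 2)   # jitter so clients do not retry in lockstep
        sleep(delay)

# Scripted responses standing in for a real HTTP client.
responses = iter([
    (503, {}, None),
    (429, {"Retry-After": "2"}, None),
    (201, {}, {"id": "exp-1"}),
])
waits = []
status, body = call_with_retries(lambda: next(responses), sleep=waits.append)
print(status, body, waits)
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test; a real client would leave the default.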
PostgreSQL: Experiment Configuration
    -- PostgreSQL enums are named types; inline ENUM(...) is MySQL syntax.
    CREATE TYPE experiment_status AS ENUM
      ('draft','running','paused','completed','archived');
    CREATE TABLE experiments (
      experiment_id UUID PRIMARY KEY, name VARCHAR(100), hypothesis TEXT,
      layer VARCHAR(50), traffic_percent DECIMAL(5,2),
      primary_metric VARCHAR(100), guardrail_metrics JSONB,
      status experiment_status,
      started_at TIMESTAMPTZ, ended_at TIMESTAMPTZ,
      owner VARCHAR(100), created_at TIMESTAMPTZ DEFAULT NOW()
    );
    CREATE TABLE experiment_variants (
      variant_id UUID PRIMARY KEY,
      experiment_id UUID NOT NULL REFERENCES experiments(experiment_id),
      name VARCHAR(50), weight INT, config JSONB
    );
Redis: Fast Assignment
    exp_config:{experiment_id} -> JSON (experiment config)
      TTL: 60 s (refreshed from PostgreSQL)
    active_experiments:{layer} -> LIST of experiment_ids
      TTL: 60 s
ClickHouse: Event Analytics
    CREATE TABLE experiment_events (
      experiment_id UUID, variant String, user_id UUID,
      event_type String, value Float64,
      timestamp DateTime, date Date MATERIALIZED toDate(timestamp)
    ) ENGINE = MergeTree() PARTITION BY toYYYYMM(timestamp)
    ORDER BY (experiment_id, variant, timestamp);

| Concern | Solution |
|---|---|
| Assignment service down | SDK caches last assignment locally; fall back to control variant |
| Event pipeline lag | ClickHouse backfill from Kafka replay; results delayed but not lost |
| Experiment degrades metrics | Guardrail auto-pause; manual kill switch |
| Hash collision causing uneven split | Chi-squared test on assignment counts; alert if > 2% deviation |
| Novelty effect | Run experiments for minimum 2 weeks; track metrics over time |
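The first row of the table (SDK caches the last assignment and falls back to control) can be sketched as a thin client wrapper. `AssignmentClient` and `fetch_assignment` are hypothetical names, and catching every exception is a simplification; a real SDK would likely catch only timeouts and connection errors.

```python
class AssignmentClient:
    """Hypothetical SDK wrapper: last-known assignment first, control as last resort."""

    def __init__(self, fetch_assignment):
        self.fetch_assignment = fetch_assignment  # callable(experiment_id, user_id) -> variant
        self.cache = {}

    def get_variant(self, experiment_id: str, user_id: str) -> str:
        key = (experiment_id, user_id)
        try:
            variant = self.fetch_assignment(experiment_id, user_id)
            self.cache[key] = variant
            return variant
        except Exception:
            # Service unreachable: reuse the cached assignment, else serve control.
            return self.cache.get(key, "control")

state = {"up": True}  # toggled below to simulate an outage

def fake_fetch(experiment_id, user_id):
    if not state["up"]:
        raise ConnectionError("assignment service unreachable")
    return "treatment"

client = AssignmentClient(fake_fetch)
print(client.get_variant("exp-1", "user-1"))  # served by the service
state["up"] = False
print(client.get_variant("exp-1", "user-1"))  # served from the local cache
print(client.get_variant("exp-1", "user-2"))  # never seen before: falls back to control
```

Falling back to control keeps the product usable during an outage, at the cost of some users temporarily seeing a variant they were not assigned.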
Interview Walkthrough
- Split the problem into two paths: synchronous variant assignment (< 5 ms, must never block production) and asynchronous event analytics (50B+ events/day via Kafka → ClickHouse).
- Lead with hash-based deterministic assignment: murmurhash3(experiment_id:user_id) guarantees the same user always sees the same variant with no per-request randomness.
- Introduce experiment layers early to handle mutual exclusion: overlapping UI tests on checkout button color vs page layout must not share the same user pool.
- Walk through the statistical pipeline: conversion rate, relative lift, two-proportion z-test, and Bonferroni correction for multiple comparisons.
- Cover guardrail metrics with auto-pause — if crash_rate or p95_latency degrades beyond 2 std dev, stop the experiment before it damages production.
- Detect Sample Ratio Mismatch (SRM) with chi-squared tests on assignment counts — an uneven split invalidates the entire experiment.
- Common pitfall: checking p-values daily and stopping when significant — the peeking problem inflates false positives from 5% to 20–30%.
Peeking Problem
Problem: checking the p-value daily and stopping when p < 0.05. This inflates the false positive rate from 5% to 20-30%.
Why? Statistical tests assume you look at the data ONCE, at a predetermined sample size.
Solution:
1. Pre-determine sample size. Don't look until reached.
2. Sequential testing with adjusted boundaries (O'Brien-Fleming).
3. Bayesian approach: compute posterior probability continuously.
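A small simulation makes the inflation concrete. The sketch below runs A/A tests (no true effect) and compares stopping at the first significant daily look with a single look at the end. It assumes equal-sized days, models each day's contribution to the test statistic as an independent standard normal increment, and uses 14 daily looks; all three are simplifying assumptions.

```python
import math
import random

def simulate_aa_tests(n_experiments=5000, n_looks=14, z_crit=1.96, seed=7):
    """Estimate false positive rates for A/A tests with and without daily peeking.

    Model (assumption): after k equal-sized days, z_k = (x_1 + ... + x_k) / sqrt(k),
    where each x_i is an independent N(0, 1) daily increment.
    """
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_experiments):
        total = 0.0
        ever_significant = False
        for day in range(1, n_looks + 1):
            total += rng.gauss(0.0, 1.0)
            z = total / math.sqrt(day)
            if abs(z) > z_crit:
                ever_significant = True  # a daily peeker would have stopped here
        peeking_hits += ever_significant
        final_hits += abs(z) > z_crit    # single look on the planned end date
    return peeking_hits / n_experiments, final_hits / n_experiments

peek_rate, fixed_rate = simulate_aa_tests()
print(f"stop at first significant look: {peek_rate:.1%}")
print(f"single look at the end:         {fixed_rate:.1%}")
```

The single-look rate should sit near the nominal 5%, while the peeking rate should come out several times higher, even though no experiment has a real effect.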
Sample Ratio Mismatch (SRM) Detection
Experiment configured for a 50/50 split. After 1 week:
Control: 502,000 users. Treatment: 498,000 users.
Chi-squared test:
Expected: 500,000 each. Observed: 502,000 / 498,000.
chi2 = 16.0, p-value < 0.001 -> SIGNIFICANT MISMATCH
Causes: bot traffic filtered differently, treatment causes more crashes, redirect-based experiment with slower redirect, hash function not uniform.
Impact: SRM invalidates the experiment. Results cannot be trusted.
Action: investigate root cause. Fix. Re-run experiment.
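The chi-squared figure in this example can be reproduced with the standard library. The sketch handles the two-variant case only (1 degree of freedom), where the chi-squared tail probability has a closed form; the `srm_check` name and the 0.001 alert threshold are assumptions for illustration.

```python
import math

def srm_check(observed, expected_ratios, alpha=0.001):
    """Chi-squared goodness-of-fit test for a two-variant split (1 degree of freedom)."""
    total = sum(observed)
    chi2 = sum(
        (obs - total * ratio) ** 2 / (total * ratio)
        for obs, ratio in zip(observed, expected_ratios)
    )
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < alpha

chi2, p_value, mismatch = srm_check([502_000, 498_000], [0.5, 0.5])
print(f"chi2 = {chi2:.1f}, p = {p_value:.1e}, SRM detected: {mismatch}")
```

A 0.4 percentage point imbalance looks harmless, but at a million users it is far outside what random assignment would produce.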
Network Effects: When User Independence Breaks
A/B tests assume user A's behavior is independent of user B's assignment. This breaks with network effects (social features).
Solutions:
1. Cluster randomization: assign entire friend clusters to the same variant
2. Geo-based randomization: assign entire cities to variants
3. Time-based (switchback): alternate treatment by time period
For social platforms: cluster or geo randomization is necessary.
For independent features (checkout UI): standard user-level randomization is fine.
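Cluster randomization reuses the same hashing idea with a different randomization unit. In this sketch the hash input is a cluster ID instead of a user ID, so everyone in a cluster lands in the same variant. The cluster map, the experiment name, and the SHA-256 stand-in hash are illustrative assumptions; how clusters are computed is out of scope here.

```python
import hashlib

def bucket(key: str, buckets: int = 10000) -> int:
    """Stable bucket in [0, buckets) from stdlib SHA-256 (stand-in for murmurhash3)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets

def cluster_variant(user_id: str, experiment_id: str, user_to_cluster: dict) -> str:
    """Assign by cluster ID so connected users share a variant (50/50 split)."""
    # Users without a known cluster fall back to plain user-level assignment.
    unit = user_to_cluster.get(user_id, user_id)
    return "treatment" if bucket(f"{experiment_id}:{unit}") < 5000 else "control"

# Hypothetical friend clusters, e.g. produced offline by a graph-partitioning job.
user_to_cluster = {"alice": "c1", "bob": "c1", "carol": "c2", "dave": "c2"}

for user in ["alice", "bob", "carol", "dave", "erin"]:
    print(user, cluster_variant(user, "feed_ranking_v2", user_to_cluster))
```

The analysis side changes too: the effective sample size is the number of clusters, not the number of users, so these experiments need more traffic or longer runs.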
Feature Flags vs A/B Tests
Feature flag: binary (on/off). Used for gradual rollout and kill switches. No statistical analysis needed, just monitoring.
A/B test: controlled experiment with statistical rigor. Requires a control group, sufficient sample size, and statistical analysis.
In practice, the same infrastructure serves both:
- Feature flag = experiment with 1 variant + 100% traffic + no metrics.
- A/B test = experiment with 2+ variants + metrics + statistical analysis.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core A/B testing platform flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
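An availability target only becomes actionable once it is converted into an error budget. The sketch below does that conversion for a 30-day window; the window length and the function name are assumptions, and the targets are the ones mentioned in this sheet.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes

def error_budget_minutes(availability_pct: float,
                         window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Allowed downtime in the window for a given availability target."""
    return window_minutes * (1 - availability_pct / 100)

for target in (99.9, 99.95, 99.99):
    print(f"{target}% -> {error_budget_minutes(target):.1f} min per 30 days")
```

Moving from 99.95% to 99.99% cuts the monthly budget from about 21.6 minutes to about 4.3 minutes, which is why the assignment path and the analytics path get different targets.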
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.