Design an A/B Testing and Experimentation Platform

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Show domain depth beyond the baseline: async pipelines, consistency semantics, and operational SLOs.
Arch 75	Staff angles: partition behavior, cost drivers, and MVP → production evolution with clear triggers.

Interview Prompt

Design an A/B Testing and Experimentation Platform.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Experiment assignment (deterministic hashing), Metric pipeline, Statistical significance engine?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Experiment assignment (deterministic hashing)
Metric pipeline
Statistical significance engine
Interaction effects
Guardrail metrics
Capacity estimation with shown math

Out of scope (state explicitly)

Detailed frontend/UI pixel implementation
Org structure, staffing, and hiring plan

Assumptions

Clarify scale (DAU, QPS, data volume) for the A/B testing platform in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Create experiments: Define experiment with name, hypothesis, variants, and traffic allocation
User assignment: Deterministically assign users to experiment variants
Feature flags: Toggle features on/off; gradual rollout
Metric tracking: Track conversion rates, revenue, engagement metrics per variant
Statistical analysis: Compute p-value, confidence interval, statistical significance
Mutual exclusion: Prevent conflicting experiments from overlapping
Experiment lifecycle: Draft → Running → Paused → Completed → Archived
Guardrail metrics: Auto-stop experiment if key metrics degrade
Segmentation: Run experiments on specific user segments

Metric	Calculation	Value
Concurrent experiments	Given	1,000
Users	Given	500M
Variant assignment calls / sec	Derived from daily volume ÷ 86400 (+ peak factor)	500K
Experiment events / day	Given	50B
Event data storage / day	50B events × ~100 B	5 TB
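
The table's arithmetic can be sanity-checked in a few lines. This is a minimal sketch that uses only the given figures (50B events per day at roughly 100 B each); the yearly projection is an added illustration and assumes no compression, replication, or TTL.

```python
SECONDS_PER_DAY = 86_400

events_per_day = 50_000_000_000   # given in the table
bytes_per_event = 100             # ~100 B per event, from the table

# Average ingest rate, before any peak factor is applied.
avg_events_per_sec = events_per_day / SECONDS_PER_DAY

# Raw storage per day in decimal terabytes (1 TB = 1e12 bytes).
storage_per_day_tb = events_per_day * bytes_per_event / 1e12

# Illustrative projection only: raw bytes, no compression or replication.
storage_per_year_pb = storage_per_day_tb * 365 / 1000

print(f"average events/sec: {avg_events_per_sec:,.0f}")
print(f"storage per day:    {storage_per_day_tb:.1f} TB")
print(f"storage per year:   {storage_per_year_pb:.3f} PB")
```

The average event rate comes out a little under 600K per second, which is the same order of magnitude as the 500K assignment calls per second in the table.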


Deterministic User Assignment

PYTHON

User must ALWAYS see the same variant. No randomness per request.

Algorithm: hash-based assignment

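  import mmh3  # third-party MurmurHash3 binding; any stable, well-mixed hash works

  def murmurhash3(key):
      # Unsigned 32-bit MurmurHash3 of the key string.
      return mmh3.hash(key, signed=False)

  def get_variant(user_id, experiment_id, variants, traffic_percent):
      # variants: list of (name, weight) pairs; weights are percentages summing to 100.
      # traffic_percent: share of users exposed to the experiment, 0-100.
      hash_input = f"{experiment_id}:{user_id}"
      # Gate: is this user in the experiment at all? Buckets are 0..9999 (0.01% each).
      if murmurhash3(hash_input) % 10000 >= traffic_percent * 100:
          return None

      # Independent hash for the variant split so it's uniform
      # across the in-experiment population (not just the gated range).
      variant_hash = murmurhash3(hash_input + ":variant") % 10000
      cumulative = 0
      for variant_name, weight in variants:
          cumulative += weight * 100
          if variant_hash < cumulative:
              return variant_name

      # Fallback if the weights sum to slightly less than 100.
      return variants[-1][0]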

Properties:
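  - Deterministic: same user + experiment -> same variant, with no stored state
  - Uniform: murmurhash3 output is close to uniformly distributed, so bucket sizes track the configured weights
  - Independent: experiment_id is part of the hash input, so adding/removing experiments doesn't change other assignments
  - Fast: murmurhash3 on a short key typically takes < 100 ns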
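
The determinism and uniformity properties can be checked empirically. The sketch below repeats the assignment logic with a standard-library hash so that it runs without extra packages; `stable_hash` (SHA-256 truncated to 32 bits) is a stand-in for murmurhash3 and is an assumption of this example, as are the user and experiment IDs.

```python
import hashlib
from collections import Counter


def stable_hash(key: str) -> int:
    """Stand-in for murmurhash3 (assumption): first 4 bytes of SHA-256 as an int."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:4], "big")


def get_variant(user_id, experiment_id, variants, traffic_percent):
    hash_input = f"{experiment_id}:{user_id}"
    # Gate: only traffic_percent% of users enter the experiment.
    if stable_hash(hash_input) % 10000 >= traffic_percent * 100:
        return None
    # Second, independent hash decides the variant.
    variant_hash = stable_hash(hash_input + ":variant") % 10000
    cumulative = 0
    for name, weight in variants:
        cumulative += weight * 100
        if variant_hash < cumulative:
            return name
    return variants[-1][0]


variants = [("control", 50), ("treatment", 50)]

# Assign 100,000 hypothetical users at 20% traffic and count the outcomes.
counts = Counter(
    get_variant(f"user-{i}", "exp-1", variants, 20) for i in range(100_000)
)
print(counts)
```

With 20% traffic and a 50/50 split, roughly 80% of users should get `None` and the remainder should divide about evenly between the two variants.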

Mutual Exclusion (Experiment Layers)

Problem: Experiment A tests checkout button color. Experiment B tests checkout page layout.
If same user is in both: which caused the conversion improvement?

Solution: Experiment Layers (Google's Overlapping Experiment Infrastructure)

  Layer 1 (UI experiments):
    Experiment A: button color (control: blue, treatment: green)
    Experiment B: checkout page layout (control: v1, treatment: v2)
    Within a layer: user is in AT MOST one experiment
    
  Layer 2 (Backend experiments):
    Experiment C: recommendation algorithm (control: v1, treatment: v2)
    Independent of Layer 1

Implementation:
  Each experiment belongs to a layer.
  Assignment: bucket = hash(layer_id + ":" + user_id) % num_buckets (e.g. 10,000 buckets per layer)
  Each experiment "owns" a non-overlapping range of buckets within its layer.
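
A minimal sketch of layer-based assignment, assuming 10,000 buckets per layer and hypothetical bucket ranges. The hash (truncated SHA-256), the layer names, and the `LAYERS` layout are illustrative choices, not part of the original design.

```python
import hashlib

NUM_BUCKETS = 10_000  # assumption: 0.01% granularity per layer

# layer -> list of (experiment, start_bucket, end_bucket_exclusive); hypothetical ranges
LAYERS = {
    "ui": [("A_button_color", 0, 2000), ("B_page_layout", 2000, 4000)],
    "backend": [("C_recs_algorithm", 0, 5000)],
}


def bucket_for(user_id: str, layer_id: str) -> int:
    """Hash the user into a bucket; layer_id in the key decorrelates layers."""
    digest = hashlib.sha256(f"{layer_id}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS


def ranges_are_disjoint(ranges) -> bool:
    """True if no two experiments in a layer share a bucket."""
    ordered = sorted(ranges, key=lambda r: r[1])
    return all(prev[2] <= nxt[1] for prev, nxt in zip(ordered, ordered[1:]))


def experiments_for(user_id: str) -> dict:
    """Return {layer: experiment} with at most one experiment per layer."""
    assigned = {}
    for layer_id, ranges in LAYERS.items():
        bucket = bucket_for(user_id, layer_id)
        for experiment, start, end in ranges:
            if start <= bucket < end:
                assigned[layer_id] = experiment
                break
    return assigned


print(experiments_for("user-123"))
```

Because each layer hashes with its own `layer_id`, a user's bucket in one layer says nothing about their bucket in another, which is what keeps the layers statistically independent. Disjoint ranges within a layer are what enforce mutual exclusion.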

Statistical Analysis Engine

1. Conversion rate per variant
2. Relative lift = (treatment_rate - control_rate) / control_rate
3. Statistical significance (two-proportion z-test):
   p_pooled = total_conversions / total_impressions
   se = sqrt(p_pooled * (1-p_pooled) * (1/n_control + 1/n_treatment))
   z = (treatment_rate - control_rate) / se
   p_value = 2 * (1 - norm_cdf(abs(z)))
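4. Confidence interval (95%): CI = diff ± 1.96 * se_diff, where diff = treatment_rate - control_rate and
   se_diff = sqrt(control_rate*(1-control_rate)/n_control + treatment_rate*(1-treatment_rate)/n_treatment)
   (unpooled; the pooled se in step 3 is valid only under the null hypothesis of equal rates)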
5. Multiple comparison correction: Bonferroni or Benjamini-Hochberg
6. Guardrail metrics: auto-pause the experiment if a guardrail metric in treatment is worse than control by more than 2 standard errors of the difference
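
The z-test and the multiple-comparison step can be sketched with the standard library alone. The counts in the usage lines are hypothetical. The interval uses an unpooled standard error, which is the usual choice for a confidence interval on the difference. The function names are illustrative.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def two_proportion_z_test(conv_c, n_c, conv_t, n_t, z_crit=1.96):
    """Two-sided z-test for a difference in conversion rates."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    # Pooled SE is used for the test statistic (rates are equal under the null).
    p_pooled = (conv_c + conv_t) / (n_c + n_t)
    se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_c + 1 / n_t))
    z = diff / se_pooled
    p_value = 2 * (1 - norm_cdf(abs(z)))
    # Unpooled SE is used for the interval on the absolute difference.
    se_diff = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    return {
        "lift": diff / p_c,
        "z": z,
        "p_value": p_value,
        "ci": (diff - z_crit * se_diff, diff + z_crit * se_diff),
    }


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the null is rejected at FDR alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_rank = rank  # largest rank that passes its threshold
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= max_rank
    return rejected


# Hypothetical counts: 5.0% vs 5.5% conversion on 20,000 users each.
result = two_proportion_z_test(1000, 20_000, 1100, 20_000)
print(result)

# Hypothetical p-values for four metrics tested on the same experiment.
print(benjamini_hochberg([0.04, 0.001, 0.30, 0.02]))
```

Benjamini-Hochberg controls the false discovery rate and is less conservative than Bonferroni: with these four p-values Bonferroni would reject only the 0.001 result (threshold 0.05 / 4 = 0.0125), while Benjamini-Hochberg also rejects 0.02.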

Event Bus Design (Kafka)

Topic: ab_testing_platform-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: user_id (preserves per-user event ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "ab_testing_platform-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: ab_testing_platform-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers
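
The at-least-once consumer contract (dedup by event_id, dead-letter after 3 retries) can be sketched without a Kafka client. Everything here is an in-memory stand-in: `EventProcessor`, the `seen_event_ids` set, and the `dlq` list are hypothetical names, and a real deployment would keep the dedup state in a shared store with a TTL.

```python
MAX_RETRIES = 3  # retries after the first attempt, matching the DLQ rule above


class EventProcessor:
    def __init__(self, handler):
        self.handler = handler
        self.seen_event_ids = set()  # in-memory stand-in for a shared dedup store
        self.dlq = []                # stand-in for the DLQ topic

    def process(self, event: dict) -> str:
        event_id = event["event_id"]
        # At-least-once delivery means redelivery is normal: skip known events.
        if event_id in self.seen_event_ids:
            return "duplicate"
        for _attempt in range(1 + MAX_RETRIES):
            try:
                self.handler(event)
            except Exception:
                continue  # retry; a real consumer would back off between attempts
            # Mark as seen only after the handler succeeded.
            self.seen_event_ids.add(event_id)
            return "processed"
        # Poison message: park it so the partition is not blocked.
        self.dlq.append(event)
        return "dead_lettered"


handled = []
processor = EventProcessor(handled.append)
event = {"event_id": "e-1", "user_id": "user-uuid", "event_type": "conversion"}
print(processor.process(event))  # first delivery
print(processor.process(event))  # redelivery of the same event
```

Recording the event_id only after the handler succeeds keeps a crash mid-handler from silently dropping the event. The cost is that the handler itself must tolerate being run twice.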

Rule: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates

HTTP

POST /api/v1/experiments
{
  "name": "checkout_button_color",
  "hypothesis": "Green button increases conversion by 5%",
  "layer": "checkout_ui",
  "traffic_percent": 20,
  "variants": [
    { "name": "control", "weight": 50, "config": { "button_color": "blue" } },
    { "name": "treatment", "weight": 50, "config": { "button_color": "green" } }
  ],
  "primary_metric": "checkout_conversion_rate",
  "guardrail_metrics": ["crash_rate", "p95_latency"]
}

GET /api/v1/experiments/{id}/assignment?user_id=user-uuid
-> { "variant": "treatment", "config": { "button_color": "green" } }

POST /api/v1/experiments/{id}/events
{ "user_id": "user-uuid", "event_type": "conversion", "value": 1 }

GET /api/v1/experiments/{id}/results
-> {
  "status": "running", "days_running": 7,
  "variants": {
    "control": { "impressions": 50000, "conversions": 2500, "rate": 0.0500 },
    "treatment": { "impressions": 49800, "conversions": 2750, "rate": 0.0552 }
  },
  "lift": 0.104, "p_value": 0.0002, "significant": true,
  "confidence_interval": [0.032, 0.176],
  "recommended_action": "Ship treatment (statistically significant improvement)"
}

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
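
A client-side retry policy consistent with the table might look like the sketch below. The `send` callable, its `(status, headers, body)` return shape, and the delay constants are assumptions made for illustration. The sketch treats `Retry-After` as a number of seconds and does not handle the HTTP-date form.

```python
import random
import time

RETRYABLE_STATUSES = {429, 500, 503}


def call_with_retries(send, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send() must return (status_code, headers_dict, body) and must reuse the
    same idempotency key on every attempt so retried writes are not duplicated.
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        is_last_attempt = attempt == max_attempts - 1
        if status not in RETRYABLE_STATUSES or is_last_attempt:
            return status, body
        retry_after = headers.get("Retry-After")
        if status == 429 and retry_after is not None:
            delay = float(retry_after)  # server-provided wait, in seconds
        else:
            # Exponential backoff with jitter to avoid synchronized retries.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.0)
        sleep(delay)


# Usage with a fake transport: two transient failures, then success.
fake_responses = iter([
    (503, {}, None),
    (429, {"Retry-After": "2"}, None),
    (201, {}, {"experiment_id": "exp-1"}),
])
recorded_sleeps = []
print(call_with_retries(lambda: next(fake_responses), sleep=recorded_sleeps.append))
```

4xx responses other than 429 are returned immediately, since retrying a 400, 401, 403, 404, or 422 without changing the request cannot succeed.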

PostgreSQL: Experiment Configuration

SQL

-- PostgreSQL has no inline ENUM column type; define the enum type first.
CREATE TYPE experiment_status AS ENUM ('draft','running','paused','completed','archived');

CREATE TABLE experiments (
    experiment_id UUID PRIMARY KEY, name VARCHAR(100), hypothesis TEXT,
    layer VARCHAR(50), traffic_percent DECIMAL(5,2),
    primary_metric VARCHAR(100), guardrail_metrics JSONB,
    status experiment_status NOT NULL DEFAULT 'draft',
    started_at TIMESTAMPTZ, ended_at TIMESTAMPTZ,
    owner VARCHAR(100), created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE experiment_variants (
    variant_id UUID PRIMARY KEY,
    experiment_id UUID NOT NULL REFERENCES experiments(experiment_id),
    name VARCHAR(50), weight INT, config JSONB
);

Redis: Fast Assignment

exp_config:{experiment_id}  -> JSON (experiment config)
TTL: 60 s (refreshed from PostgreSQL)

active_experiments:{layer}  -> LIST of experiment_ids
TTL: 60 s
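
The read-through pattern behind these keys can be illustrated with an in-process stand-in for Redis. `TTLCache` and the `loader` callback (which would query PostgreSQL) are hypothetical names. The injectable `clock` exists only to make expiry testable.

```python
import time


class TTLCache:
    """Read-through cache: serve from memory until the TTL lapses, then reload."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self.loader = loader      # e.g. a function that reads config from PostgreSQL
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}        # key -> (value, expires_at)

    def get(self, key):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]       # fresh hit
        value = self.loader(key)  # miss or expired: go to the source of truth
        self._entries[key] = (value, now + self.ttl)
        return value


def load_config(key):
    # Hypothetical loader; a real one would query the experiments table.
    return {"key": key, "traffic_percent": 20}


cache = TTLCache(load_config, ttl_seconds=60)
print(cache.get("exp_config:exp-1"))
```

A 60-second TTL means a paused or edited experiment can keep serving its old config for up to a minute. If that is too slow for a kill switch, an explicit invalidation path is needed in addition to the TTL.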

ClickHouse: Event Analytics

SQL

CREATE TABLE experiment_events (
    experiment_id UUID, variant String, user_id UUID,
    event_type String, value Float64,
    timestamp DateTime, date Date MATERIALIZED toDate(timestamp)
) ENGINE = MergeTree() PARTITION BY toYYYYMM(timestamp)
  ORDER BY (experiment_id, variant, timestamp);

Concern	Solution
Assignment service down	SDK caches last assignment locally; fall back to control variant
Event pipeline lag	ClickHouse backfill from Kafka replay; results delayed but not lost
Experiment degrades metrics	Guardrail auto-pause; manual kill switch
Uneven split (sample ratio mismatch)	Chi-squared test on assignment counts against the configured weights; alert if any variant deviates by > 2%
Novelty effect	Run experiments for minimum 2 weeks; track metrics over time
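
The chi-squared check on assignment counts can be sketched for the two-variant case, where the p-value has a closed form (one degree of freedom). The alerting threshold of 0.001 is an assumed convention for this sketch, used here instead of a fixed percentage deviation. The counts are hypothetical.

```python
import math


def srm_check(observed, expected_weights, alpha=0.001):
    """Chi-squared goodness-of-fit test for a two-variant split.

    Returns (chi2, p_value, mismatch_detected).
    """
    if len(observed) != 2 or len(expected_weights) != 2:
        raise ValueError("this sketch only covers two variants (1 degree of freedom)")
    total = sum(observed)
    weight_sum = sum(expected_weights)
    expected = [total * w / weight_sum for w in expected_weights]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < alpha


# A 50/50 experiment with a small, plausible imbalance.
print(srm_check([50_000, 49_800], [50, 50]))

# The same experiment with 1,500 users missing from treatment.
print(srm_check([50_000, 48_500], [50, 50]))
```

A significance test scales with sample size, whereas a fixed percentage threshold does not: a 0.4% imbalance is unremarkable at 1,000 users but strongly suggests a bug at 100,000.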

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
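
An availability target translates directly into a downtime budget, and the 5xx target into an allowed number of failed requests. The sketch below works through that arithmetic. The 30-day window and the request volume are assumptions chosen for illustration.

```python
def downtime_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage allowed in the window at the given availability SLO."""
    return (1.0 - slo) * window_days * 24 * 60


def allowed_errors(total_requests: int, max_error_rate: float) -> float:
    """Failed requests permitted before the error-rate target is breached."""
    return total_requests * max_error_rate


for slo in (0.999, 0.9995, 0.9999):
    print(f"{slo:.2%} availability -> {downtime_budget_minutes(slo):.1f} min / 30 days")

# Hypothetical volume: 1 billion requests in the window, 0.1% 5xx target.
print(f"allowed 5xx: {allowed_errors(1_000_000_000, 0.001):,.0f}")
```

At 99.95% the budget is about 21.6 minutes per 30 days, which is why the automated rollback target elsewhere in this design needs to be measured in minutes rather than hours.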

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
