Design a User Analytics Pipeline (like Google Analytics)

Interview Prompt

Design User Analytics Pipeline (like Google Analytics).

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Lambda vs Kappa architecture, Event schema evolution, Data warehouse star schema?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Lambda vs Kappa architecture
Event schema evolution
Data warehouse star schema
ETL vs ELT
Data lake partitioning
Capacity estimation with shown math

Out of scope (state explicitly)

Detailed frontend/UI pixel implementation
Org structure, staffing, and hiring plan

Assumptions

Clarify scale (DAU, QPS, data volume) for user analytics pipeline in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Event tracking: Capture client-side page views, link clicks, scroll events, and custom events across web and mobile.
Real-time dashboard: Show active current users, global events/sec, and top trending pages instantaneously.
Historical reports: Run aggregations (e.g. pageviews over 30 days grouped by country or device category).
Funnel conversion: Trace user conversion journeys (e.g. Landing Page → Registration → Check Out).
Cohort retention: Group users by registration cohorts to measure long-term user retention.
Custom dimensions: Allow developers to attach arbitrary key-value metadata to events.
Session management: Stitch independent event streams into discrete user sessions with a 30-minute inactivity threshold.

Metric	Calculation	Value
Events / day	Given (typical workload assumption)	100 Billion
Events / sec	From Events / day ÷ 86400 (+ peak factor in value)	~1.2 Million
Avg event size	Given (typical workload assumption)	500 bytes
Ingestion throughput	Derived	600 MB/s
Storage / day (raw)	Derived from upstream throughput × size	50 TB
Storage / 90 days	Derived	4.5 PB
Unique users tracked	Derived	1 Billion

I/O and Bandwidth Derivations:
- Global Ingestion rate: 100,000,000,000 events / 86,400 sec = ~1,157,400 events/sec average.
- Network Bandwidth: 1.2M events/sec × 500 bytes = 600 MB/s network ingestion rate.
- Storage Calculations (Raw Events):
  - 600 MB/s × 86,400 sec = 51.84 TB raw logs per day.
  - 90-Day Raw Storage Footprint: 51.84 TB × 90 = ~4.66 Petabytes.

Web and Mobile SDKs batch and forward events to a geo-distributed Edge Collector Cluster (which executes validation, GeoIP enrichment, and bot filtering). Enriched events pass through Kafka queues. Apache Flink aggregates real-time metrics (writing to Redis), while Spark jobs build session models and load columnar ClickHouse databases for historical OLAP queries.

Loading...

1. Edge Collector & Enrichment Architecture

The client SDK batches events locally before forwarding them to reduce network overhead. The stateless collectors enrich the incoming payload:

JSON

POST /collect
{
  "site_id": "site-123",
  "client_id": "anon-uuid",      // browser fingerprint / cookie
  "session_id": "sess-uuid",
  "events": [
    {"type": "pageview", "url": "/pricing", "ts": 1710320000, "referrer": "google.com"},
    {"type": "click", "element": "#signup-btn", "ts": 1710320005},
    {"type": "custom", "name": "video_play", "props": {"video_id": "v123"}, "ts": 1710320010}
  ]
}

GeoIP Lookup: Resolve client IP to country and city utilizing an in-memory MaxMind database.
User-Agent Resolution: Parse client browser, OS version, and device type.
Bot Filtering: Cross-reference signatures against known bot lists to filter noise.

2. Real-Time Processing (Apache Flink)

Flink consumes the enriched event stream from Kafka, executing windowed aggregations:

1. Active Users (1-minute tumbling window):
   Count distinct client_ids per site_id per minute
   Write to Redis: SETEX active_users:{site_id} 120 {count}

2. Events per second (10-second sliding window):
   Count events per site_id → write to Redis time-series database.

3. Top Pages (1-minute window):
   Count pageviews per URL per site_id → update Redis sorted set
   ZADD top_pages:{site_id} {count} {url}

3. Batch Processing & Session Stitching (Apache Spark)

Spark pulls raw records from S3 partitions, running intensive hourly processing jobs:

1. Read raw events from Kafka → S3 (stored as Parquet, partitioned by date + site_id)
2. Session reconstruction: group events by session_id, calculate total session duration and page depth.
3. Funnel analysis aggregation: compute retention steps across funnel definition matrices.
4. Materialized pre-aggregations: generate daily pageview rollups grouped by country × device type.
5. Archival compaction: compact small files into large Parquet blocks in cold S3 storage.

4. ClickHouse OLAP Columnar Store

Row-oriented relational databases (like PostgreSQL) fail when executing analytical aggregates across billions of rows. We leverage ClickHouse's columnar MergeTree engine:

SQL

CREATE TABLE events (
    event_date    Date,
    site_id       String,
    session_id    String,
    client_id     String,
    event_type    LowCardinality(String),
    url           String,
    referrer      String,
    country       LowCardinality(String),
    device_type   LowCardinality(String),
    browser       LowCardinality(String),
    custom_props  Map(String, String),
    timestamp     DateTime64(3)
) ENGINE = MergeTree()
PARTITION BY (event_date, site_id)
ORDER BY (site_id, event_date, session_id, timestamp);

Event Bus Design (Kafka)

Topic: user_analytics_pipeline-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (user_id / order_id — preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "user_analytics_pipeline-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: user_analytics_pipeline-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Design a User Analytics Pipeline (like Google Analytics): async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates

1. Collect Events (Ingestion API)

HTTP

POST /collect
Content-Type: application/json

{
  "site_id": "site-882",
  "client_id": "anon-uuid-2918",
  "session_id": "sess-uuid-9921",
  "events": [...]
}

Response: 204 No Content (Compatible with JS navigator.sendBeacon API)

2. Query Real-Time & Historical Reports

HTTP

GET /api/v1/analytics/site-882/realtime
Response: 200 OK
{
  "active_users": 1523,
  "events_per_sec": 342,
  "top_pages": [
    { "url": "/pricing", "views": 842 },
    { "url": "/docs", "views": 321 }
  ]
}

GET /api/v1/analytics/site-882/report?metric=pageviews&from=2026-03-01&to=2026-03-14&group_by=country
Response: 200 OK
{
  "data": [
    { "country": "US", "pageviews": 1523000 },
    { "country": "DE", "pageviews": 381000 }
  ]
}

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
202 Accepted: job queued; poll GET /jobs/{id} for status
408 Request Timeout: job still processing; continue polling

Raw Enriched Event Schema (Kafka / S3 Parquet)

JSON

{
  "event_id": "event-192a-3312",
  "site_id": "site-123",
  "session_id": "sess-abc",
  "client_id": "anon-xyz",
  "event_type": "pageview",
  "url": "/pricing",
  "referrer": "google.com",
  "country": "US",
  "device": "mobile",
  "browser": "Chrome",
  "timestamp": 1710320000000,
  "custom": {
    "plan": "pro",
    "source": "ad_campaign_123"
  }
}

Redis Real-Time Cache Topology

# Active users key (string counter)
active_users:{site_id}        → integer (TTL 120s)

# Events per second
events_per_sec:{site_id}      → time-series (last 60 data points)

# Top pages
top_pages:{site_id}           → sorted set (top 100 URLs)

Ingestion Component	System Failure Solution
Event loss	Kafka RF=3; collector returns 204 only after Kafka ACK
Flink failure	Flink checkpointing to S3 every 60s; resume from checkpoint
ClickHouse failure	Replicated tables (ReplicatedMergeTree); multi-AZ
Client offline	SDK buffers events in localStorage; sends on reconnect
Bot traffic	Filter known bots; rate limit per IP; CAPTCHA for suspicious patterns
Data skew	Partition Kafka by site_id; ClickHouse partitioned by date + site

1. Real-Time Session Stitching & Watermarks ⭐

Out-of-order events from unstable networks make it difficult to determine when a user's session ends. We solve this with Flink session windows:

Problem: Reassembling disjointed out-of-order events from unstable networks into continuous user sessions.
Session Definition: A sequence of user interactions containing no gap > 30 minutes.

Implementation (Flink Session Windows) ⭐:
- Group events by client_id using Flink session windowing with a 30-minute inactivity gap.
- Maintain a Watermark allowing up to 5 minutes of late-arriving mobile events before sealing a window.

Edge Case: Events arriving 6 minutes late (beyond watermark).
- Consequence: Triggers a new, split session (results in ~1% session count inflation).
- Mitigation: Spark hourly cron jobs re-stitch and correct sessions offline using absolute timestamps.

2. Browser Exit Loss & navigator.sendBeacon()

If a user closes a browser tab, pending click logs can be lost. We mitigate exit loss using background delivery methods:

Client browsers frequently close before XHR requests complete, dropping exit logs.

Implementation & Fallbacks:
- navigator.sendBeacon(): Uses browser background streams to guarantee POST delivery on close.
- Fallback 1 (Pixel Tracking): Get requests firing /pixel.gif?events=... bypass strict ad blocker policies.
- Fallback 2 (Local Storage Buffer): SDK saves logs locally, flushing them on the next page initialization.
- Threshold: Accept an industry-standard 2-5% analytics event loss to maintain low client overhead.

3. High-Scale Unique Counting (HyperLogLog) ⭐

Counting millions of unique active users in memory is extremely expensive. We implement HyperLogLog:

Counting distinct active users across 50,000 corporate websites in real-time creates massive memory bottlenecks.

Solution: Redis HyperLogLog (HLL) ⭐:
- PFADD active_users:{site_id}:{minute} {client_id}
- PFCOUNT active_users:{site_id}:{minute} → Returns unique counts with a tiny ~0.81% error rate.
- Memory: Each HLL consumes a maximum of 12 KB, regardless of cardinality.
  - 50,000 active websites × 12 KB = ~600 MB memory footprint globally (trivial for Redis).

4. Late-Arriving Event Windowing Accuracy ⭐

Handling events that arrive hours late without distorting real-time report lines requires a clear timing strategy:

Dashboard Query: SELECT COUNT(*) FROM events WHERE timestamp > NOW() - 1h

Case 1 (Processing Time): Count events as they land on the server.
- ✓ Highly stable (counts never shift retroactively).
- ✗ Inaccurate (delayed mobile events are counted in the wrong hour window).

Case 2 (Event Time) ⭐: Count by the timestamp recorded in the client event.
- ✓ Perfect historical accuracy.
- ✗ Dynamic shifting (counts for the last hour shift dynamically as delayed events roll in).
- Standard: Display a 5-minute "settling window" warning: "Numbers for the current hour are rolling. Final counts lock after 5 minutes."

1. GDPR Right to Erasure & De-identification

Under privacy legislation, users can request that all their analytics events be erased. This presents a challenge when dealing with immutable files in S3:

GDPR, CCPA, and Erasure Requirements:
1. Consent & Opt-Out: Suppress SDK logging for EU users prior to cookie banner consent.
2. Right to Erasure: Publish deletions to a "deletion-requests" Kafka stream.
3. Erasure Challenge in Immutable S3 Parquet:
   - Modifying a single client record in Parquet requires rewriting the entire daily partition.
   - Solution: Apply soft-deletes in ClickHouse in real-time, executing background weekly compaction tasks to clean cold S3 storage blocks.

Cookie-Less Tracking Strategies:
- First-Party Cookies: Set tracking IDs exclusively on the host domain (resilient, but deleted on browser swap).
- Server-Side Collection: Forward analytics through an active server backend rather than direct browser endpoints (bypasses ad blockers).
- Plausible/Fathom Aggregate Analytics: Track raw pageview volumes without collecting any unique client fingerprint identifiers (no consent required).

Interview Walkthrough

Separate the ingest path (client SDK → Kafka) from the query path (ClickHouse OLAP) — they have opposite optimization needs.
Use navigator.sendBeacon() and local-storage buffering so page-unload events are not lost when the browser closes.
Apply Lambda architecture: Redis speed layer for real-time counters, ClickHouse batch layer for historical funnel analysis.
Stitch sessions with Flink session windows (30-minute inactivity gap) and watermarks for late-arriving mobile events.
Choose event time over processing time for accuracy — display a 5-minute settling window so dashboard numbers don't shift retroactively.
Use HyperLogLog for unique visitor counts: 12 KB per site regardless of cardinality, with ~0.81% error.
Plan GDPR erasure with pseudonymization at ingest — deleting from immutable S3 parquet files requires tombstone markers, not row deletes.
Common pitfall: counting raw pageview events as unique users without deduplication — DAU inflates by 2–5× from repeat visits.

1. Columnar OLAP Database Analysis

Feature	ClickHouse ⭐	Druid	BigQuery	TimescaleDB
Query Speed	Sub-second on billions of rows	Sub-second (pre-aggregated)	Seconds (Serverless query scan)	Seconds on millions
Ingestion Rate	1M+ rows/sec	1M+ rows/sec	Streaming insert API	100K rows/sec
Storage Cost	Low (extreme compression)	Moderate	High (pay-per-query scan)	Low
Real-time Lag	Near real-time (sub-second)	Real-time	Minutes delay	Real-time
SQL Support	Full SQL dialect	Limited subset	Full Standard SQL	Full PostgreSQL compatibility
Operational Complexity	Moderate (clustered or single)	Very Complex	Zero Ops (Managed)	Easy (PostgreSQL)

2. Conversion Funnel Query Optimization ⭐

Calculating funnel conversion steps across billions of rows requires deep SQL optimizations. We leverage ClickHouse funnel functions:

SQL

Conversion Funnel Steps: Landing Page → Sign Up → Add Payment → First Purchase

ClickHouse windowFunnel() Optimized Analytics Engine ⭐:
SELECT 
  countIf(step >= 1) AS landing,
  countIf(step >= 2) AS signup,
  countIf(step >= 3) AS payment,
  countIf(step >= 4) AS purchase
FROM (
  SELECT client_id, 
    windowFunnel(86400)(timestamp, 
      url = '/landing',
      event = 'signup_complete',
      event = 'payment_added',
      event = 'purchase_complete'
    ) AS step
  FROM events
  WHERE timestamp > now() - INTERVAL 7 DAY
  GROUP BY client_id
);

- windowFunnel() operates natively on ClickHouse columnar arrays, computing conversion steps in sub-second timelines.

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.