Interview Prompt
Design User Analytics Pipeline (like Google Analytics).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Lambda vs Kappa architecture, Event schema evolution, Data warehouse star schema? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Lambda vs Kappa architecture
- Event schema evolution
- Data warehouse star schema
- ETL vs ELT
- Data lake partitioning
- Capacity estimation with shown math
Out of scope (state explicitly)
- Detailed frontend/UI pixel implementation
- Org structure, staffing, and hiring plan
Assumptions
- Clarify scale (DAU, QPS, data volume) for user analytics pipeline in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Event tracking: Capture client-side page views, link clicks, scroll events, and custom events across web and mobile.
- Real-time dashboard: Show active current users, global events/sec, and top trending pages instantaneously.
- Historical reports: Run aggregations (e.g. pageviews over 30 days grouped by country or device category).
- Funnel conversion: Trace user conversion journeys (e.g. Landing Page → Registration → Check Out).
- Cohort retention: Group users by registration cohorts to measure long-term user retention.
- Custom dimensions: Allow developers to attach arbitrary key-value metadata to events.
- Session management: Stitch independent event streams into discrete user sessions with a 30-minute inactivity threshold.
- High Scale Ingestion: Process 1M+ incoming events per second originating from millions of active client domains.
- Low Ingestion Latency: Enriched event data must reach historical analytics engines in < 5 seconds.
- Sub-Second Queries: Deliver high-performance dashboard aggregations in < 2 seconds.
- Petabyte Scale: Scale to process and index 100B+ events per day.
- Tiered Retention: Retain high-fidelity raw event data for 90 days and pre-aggregated rollups for 2 years.
- At-Least-Once Delivery: Enforce zero-event-loss ingestion pipelines utilizing replica confirmation loops.
| Metric | Calculation | Value |
|---|---|---|
| Events / day | Given (typical workload assumption) | 100 Billion |
| Events / sec | From Events / day ÷ 86400 (+ peak factor in value) | ~1.2 Million |
| Avg event size | Given (typical workload assumption) | 500 bytes |
| Ingestion throughput | Derived | 600 MB/s |
| Storage / day (raw) | Derived from upstream throughput × size | 50 TB |
| Storage / 90 days | Derived | 4.5 PB |
| Unique users tracked | Derived | 1 Billion |
I/O and Bandwidth Derivations: - Global Ingestion rate: 100,000,000,000 events / 86,400 sec = ~1,157,400 events/sec average. - Network Bandwidth: 1.2M events/sec × 500 bytes = 600 MB/s network ingestion rate. - Storage Calculations (Raw Events): - 600 MB/s × 86,400 sec = 51.84 TB raw logs per day. - 90-Day Raw Storage Footprint: 51.84 TB × 90 = ~4.66 Petabytes.
Web and Mobile SDKs batch and forward events to a geo-distributed Edge Collector Cluster (which executes validation, GeoIP enrichment, and bot filtering). Enriched events pass through Kafka queues. Apache Flink aggregates real-time metrics (writing to Redis), while Spark jobs build session models and load columnar ClickHouse databases for historical OLAP queries.
1. Edge Collector & Enrichment Architecture
The client SDK batches events locally before forwarding them to reduce network overhead. The stateless collectors enrich the incoming payload:
POST /collect
{
"site_id": "site-123",
"client_id": "anon-uuid", // browser fingerprint / cookie
"session_id": "sess-uuid",
"events": [
{"type": "pageview", "url": "/pricing", "ts": 1710320000, "referrer": "google.com"},
{"type": "click", "element": "#signup-btn", "ts": 1710320005},
{"type": "custom", "name": "video_play", "props": {"video_id": "v123"}, "ts": 1710320010}
]
}- GeoIP Lookup: Resolve client IP to country and city utilizing an in-memory MaxMind database.
- User-Agent Resolution: Parse client browser, OS version, and device type.
- Bot Filtering: Cross-reference signatures against known bot lists to filter noise.
2. Real-Time Processing (Apache Flink)
Flink consumes the enriched event stream from Kafka, executing windowed aggregations:
1. Active Users (1-minute tumbling window):
Count distinct client_ids per site_id per minute
Write to Redis: SETEX active_users:{site_id} 120 {count}
2. Events per second (10-second sliding window):
Count events per site_id → write to Redis time-series database.
3. Top Pages (1-minute window):
Count pageviews per URL per site_id → update Redis sorted set
ZADD top_pages:{site_id} {count} {url}3. Batch Processing & Session Stitching (Apache Spark)
Spark pulls raw records from S3 partitions, running intensive hourly processing jobs:
1. Read raw events from Kafka → S3 (stored as Parquet, partitioned by date + site_id) 2. Session reconstruction: group events by session_id, calculate total session duration and page depth. 3. Funnel analysis aggregation: compute retention steps across funnel definition matrices. 4. Materialized pre-aggregations: generate daily pageview rollups grouped by country × device type. 5. Archival compaction: compact small files into large Parquet blocks in cold S3 storage.
4. ClickHouse OLAP Columnar Store
Row-oriented relational databases (like PostgreSQL) fail when executing analytical aggregates across billions of rows. We leverage ClickHouse's columnar MergeTree engine:
CREATE TABLE events (
event_date Date,
site_id String,
session_id String,
client_id String,
event_type LowCardinality(String),
url String,
referrer String,
country LowCardinality(String),
device_type LowCardinality(String),
browser LowCardinality(String),
custom_props Map(String, String),
timestamp DateTime64(3)
) ENGINE = MergeTree()
PARTITION BY (event_date, site_id)
ORDER BY (site_id, event_date, session_id, timestamp);Event Bus Design (Kafka)
Topic: user_analytics_pipeline-events Partitions: 64 (scale consumers horizontally) Partition key: entity_id (user_id / order_id — preserves per-entity ordering) Retention: 7 days (compliance) or 24h (high-volume telemetry) Replication factor: 3, min.insync.replicas: 2 Producer: idempotent producer enabled (enable.idempotence=true) Consumer: consumer group "user_analytics_pipeline-processors" - At-least-once delivery + idempotent handlers (dedup by event_id) - DLQ topic: user_analytics_pipeline-events-dlq (poison messages after 3 retries) - Lag alert: consumer lag > 60s → scale workers Design a User Analytics Pipeline (like Google Analytics): async side effects MUST NOT block the synchronous API response. Sync path: validate → persist source of truth → publish event → return 201 Async path: consumers update caches, indexes, notifications, aggregates
1. Collect Events (Ingestion API)
POST /collect
Content-Type: application/json
{
"site_id": "site-882",
"client_id": "anon-uuid-2918",
"session_id": "sess-uuid-9921",
"events": [...]
}
Response: 204 No Content (Compatible with JS navigator.sendBeacon API)2. Query Real-Time & Historical Reports
GET /api/v1/analytics/site-882/realtime
Response: 200 OK
{
"active_users": 1523,
"events_per_sec": 342,
"top_pages": [
{ "url": "/pricing", "views": 842 },
{ "url": "/docs", "views": 321 }
]
}
GET /api/v1/analytics/site-882/report?metric=pageviews&from=2026-03-01&to=2026-03-14&group_by=country
Response: 200 OK
{
"data": [
{ "country": "US", "pageviews": 1523000 },
{ "country": "DE", "pageviews": 381000 }
]
}Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
202 Accepted: job queued; poll GET /jobs/{id} for status
408 Request Timeout: job still processing; continue pollingRaw Enriched Event Schema (Kafka / S3 Parquet)
{
"event_id": "event-192a-3312",
"site_id": "site-123",
"session_id": "sess-abc",
"client_id": "anon-xyz",
"event_type": "pageview",
"url": "/pricing",
"referrer": "google.com",
"country": "US",
"device": "mobile",
"browser": "Chrome",
"timestamp": 1710320000000,
"custom": {
"plan": "pro",
"source": "ad_campaign_123"
}
}Redis Real-Time Cache Topology
# Active users key (string counter)
active_users:{site_id} → integer (TTL 120s)
# Events per second
events_per_sec:{site_id} → time-series (last 60 data points)
# Top pages
top_pages:{site_id} → sorted set (top 100 URLs)| Ingestion Component | System Failure Solution |
|---|---|
| Event loss | Kafka RF=3; collector returns 204 only after Kafka ACK |
| Flink failure | Flink checkpointing to S3 every 60s; resume from checkpoint |
| ClickHouse failure | Replicated tables (ReplicatedMergeTree); multi-AZ |
| Client offline | SDK buffers events in localStorage; sends on reconnect |
| Bot traffic | Filter known bots; rate limit per IP; CAPTCHA for suspicious patterns |
| Data skew | Partition Kafka by site_id; ClickHouse partitioned by date + site |
1. Real-Time Session Stitching & Watermarks ⭐
Out-of-order events from unstable networks make it difficult to determine when a user's session ends. We solve this with Flink session windows:
Problem: Reassembling disjointed out-of-order events from unstable networks into continuous user sessions. Session Definition: A sequence of user interactions containing no gap > 30 minutes. Implementation (Flink Session Windows) ⭐: - Group events by client_id using Flink session windowing with a 30-minute inactivity gap. - Maintain a Watermark allowing up to 5 minutes of late-arriving mobile events before sealing a window. Edge Case: Events arriving 6 minutes late (beyond watermark). - Consequence: Triggers a new, split session (results in ~1% session count inflation). - Mitigation: Spark hourly cron jobs re-stitch and correct sessions offline using absolute timestamps.
2. Browser Exit Loss & navigator.sendBeacon()
If a user closes a browser tab, pending click logs can be lost. We mitigate exit loss using background delivery methods:
Client browsers frequently close before XHR requests complete, dropping exit logs. Implementation & Fallbacks: - navigator.sendBeacon(): Uses browser background streams to guarantee POST delivery on close. - Fallback 1 (Pixel Tracking): Get requests firing /pixel.gif?events=... bypass strict ad blocker policies. - Fallback 2 (Local Storage Buffer): SDK saves logs locally, flushing them on the next page initialization. - Threshold: Accept an industry-standard 2-5% analytics event loss to maintain low client overhead.
3. High-Scale Unique Counting (HyperLogLog) ⭐
Counting millions of unique active users in memory is extremely expensive. We implement HyperLogLog:
Counting distinct active users across 50,000 corporate websites in real-time creates massive memory bottlenecks.
Solution: Redis HyperLogLog (HLL) ⭐:
- PFADD active_users:{site_id}:{minute} {client_id}
- PFCOUNT active_users:{site_id}:{minute} → Returns unique counts with a tiny ~0.81% error rate.
- Memory: Each HLL consumes a maximum of 12 KB, regardless of cardinality.
- 50,000 active websites × 12 KB = ~600 MB memory footprint globally (trivial for Redis).4. Late-Arriving Event Windowing Accuracy ⭐
Handling events that arrive hours late without distorting real-time report lines requires a clear timing strategy:
Dashboard Query: SELECT COUNT(*) FROM events WHERE timestamp > NOW() - 1h Case 1 (Processing Time): Count events as they land on the server. - ✓ Highly stable (counts never shift retroactively). - ✗ Inaccurate (delayed mobile events are counted in the wrong hour window). Case 2 (Event Time) ⭐: Count by the timestamp recorded in the client event. - ✓ Perfect historical accuracy. - ✗ Dynamic shifting (counts for the last hour shift dynamically as delayed events roll in). - Standard: Display a 5-minute "settling window" warning: "Numbers for the current hour are rolling. Final counts lock after 5 minutes."
1. GDPR Right to Erasure & De-identification
Under privacy legislation, users can request that all their analytics events be erased. This presents a challenge when dealing with immutable files in S3:
GDPR, CCPA, and Erasure Requirements: 1. Consent & Opt-Out: Suppress SDK logging for EU users prior to cookie banner consent. 2. Right to Erasure: Publish deletions to a "deletion-requests" Kafka stream. 3. Erasure Challenge in Immutable S3 Parquet: - Modifying a single client record in Parquet requires rewriting the entire daily partition. - Solution: Apply soft-deletes in ClickHouse in real-time, executing background weekly compaction tasks to clean cold S3 storage blocks. Cookie-Less Tracking Strategies: - First-Party Cookies: Set tracking IDs exclusively on the host domain (resilient, but deleted on browser swap). - Server-Side Collection: Forward analytics through an active server backend rather than direct browser endpoints (bypasses ad blockers). - Plausible/Fathom Aggregate Analytics: Track raw pageview volumes without collecting any unique client fingerprint identifiers (no consent required).
Interview Walkthrough
- Separate the ingest path (client SDK → Kafka) from the query path (ClickHouse OLAP) — they have opposite optimization needs.
- Use
navigator.sendBeacon()and local-storage buffering so page-unload events are not lost when the browser closes. - Apply Lambda architecture: Redis speed layer for real-time counters, ClickHouse batch layer for historical funnel analysis.
- Stitch sessions with Flink session windows (30-minute inactivity gap) and watermarks for late-arriving mobile events.
- Choose event time over processing time for accuracy — display a 5-minute settling window so dashboard numbers don't shift retroactively.
- Use HyperLogLog for unique visitor counts: 12 KB per site regardless of cardinality, with ~0.81% error.
- Plan GDPR erasure with pseudonymization at ingest — deleting from immutable S3 parquet files requires tombstone markers, not row deletes.
- Common pitfall: counting raw pageview events as unique users without deduplication — DAU inflates by 2–5× from repeat visits.
1. Columnar OLAP Database Analysis
| Feature | ClickHouse ⭐ | Druid | BigQuery | TimescaleDB |
|---|---|---|---|---|
| Query Speed | Sub-second on billions of rows | Sub-second (pre-aggregated) | Seconds (Serverless query scan) | Seconds on millions |
| Ingestion Rate | 1M+ rows/sec | 1M+ rows/sec | Streaming insert API | 100K rows/sec |
| Storage Cost | Low (extreme compression) | Moderate | High (pay-per-query scan) | Low |
| Real-time Lag | Near real-time (sub-second) | Real-time | Minutes delay | Real-time |
| SQL Support | Full SQL dialect | Limited subset | Full Standard SQL | Full PostgreSQL compatibility |
| Operational Complexity | Moderate (clustered or single) | Very Complex | Zero Ops (Managed) | Easy (PostgreSQL) |
2. Conversion Funnel Query Optimization ⭐
Calculating funnel conversion steps across billions of rows requires deep SQL optimizations. We leverage ClickHouse funnel functions:
Conversion Funnel Steps: Landing Page → Sign Up → Add Payment → First Purchase
ClickHouse windowFunnel() Optimized Analytics Engine ⭐:
SELECT
countIf(step >= 1) AS landing,
countIf(step >= 2) AS signup,
countIf(step >= 3) AS payment,
countIf(step >= 4) AS purchase
FROM (
SELECT client_id,
windowFunnel(86400)(timestamp,
url = '/landing',
event = 'signup_complete',
event = 'payment_added',
event = 'purchase_complete'
) AS step
FROM events
WHERE timestamp > now() - INTERVAL 7 DAY
GROUP BY client_id
);
- windowFunnel() operates natively on ClickHouse columnar arrays, computing conversion steps in sub-second timelines.Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core user analytics pipeline flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.