Design a Distributed Metrics Aggregation System

Interview Prompt

Design a Distributed Metrics Aggregation System.

Clarifying Questions (ask before designing)

Question	Why it matters
Which is the highest priority: the StatsD-style push model, pre-aggregation at agents, or rollup storage?	Forces scope negotiation; senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

StatsD-style push model
Pre-aggregation at agents
Rollup storage
Percentile computation
Capacity estimation with shown math

Out of scope (state explicitly)

Application instrumentation SDK design
Full distributed tracing system (#33)
On-call paging and escalation policy (#37)

Assumptions

Clarify scale (DAU, QPS, data volume) for distributed metrics aggregation in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Functional Requirements

Ingest metrics: Receive time-series metrics from thousands of hosts
Aggregate: Sum, avg, p50/p95/p99, min, max across dimensions
Downsample: Auto-downsample old data (10s → 1m → 1h → 1d)
Query: "Average CPU across region=us-east for last 6 hours at 1-minute granularity"
Alerting integration: Feed metrics to alerting rules
Dashboarding: Low-latency queries for Grafana-style dashboards

Capacity Estimation

Metric	Calculation	Value
Hosts reporting	Assumption	500K
Metrics per host	Assumption	200
Total unique time series	500K hosts × 200 metrics/host	100M
Data points / sec	100M series ÷ 10 s reporting interval	10M
Data point size	Assumption (8-byte timestamp + 8-byte value)	16 bytes
Ingestion throughput	10M points/s × 16 bytes	160 MB/s
Raw storage / day	160 MB/s × 86,400 s	≈ 13.8 TB
Total retained storage with downsampling	Sum across resolution tiers (see Downsampling Strategy)	~500 TB
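
The arithmetic in the table can be checked with a few lines of Python. This is a back-of-envelope sketch: every constant is one of the assumptions above, and the 10-second reporting interval is taken from the raw tier in the Downsampling Strategy section. The function and constant names are illustrative only.

```python
# Back-of-envelope capacity math. All inputs are assumptions from the table.
HOSTS = 500_000
METRICS_PER_HOST = 200
REPORT_INTERVAL_S = 10    # assumed raw resolution (one point per series per 10 s)
BYTES_PER_POINT = 16      # assumed 8-byte timestamp + 8-byte value
SECONDS_PER_DAY = 86_400


def capacity() -> dict:
    """Derive series count, ingest rate, and raw daily volume from the inputs."""
    series = HOSTS * METRICS_PER_HOST
    points_per_sec = series // REPORT_INTERVAL_S
    bytes_per_sec = points_per_sec * BYTES_PER_POINT
    bytes_per_day = bytes_per_sec * SECONDS_PER_DAY
    return {
        "series": series,
        "points_per_sec": points_per_sec,
        "mb_per_sec": bytes_per_sec / 1e6,
        "tb_per_day": bytes_per_day / 1e12,
    }


if __name__ == "__main__":
    for name, value in capacity().items():
        print(f"{name}: {value:,.1f}")
```

With these inputs the function gives 100M series, 10M points/s, 160 MB/s, and about 13.8 TB of raw data per day, matching the table.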


Push vs Pull Ingestion

Model	How	Pros	Cons
Pull (Prometheus)	Central server scrapes targets every 15s	Server controls pace; discovers targets via service discovery	Doesn't scale to millions of targets; need federation
Push (StatsD/OTEL) ⭐	Agents push metrics to gateway	Scales to millions of agents; works for ephemeral workloads	Must handle ingestion spikes; agents need gateway address
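
Pre-aggregation at the agent is what keeps the push model affordable: the agent folds many raw events into one summary per metric per flush interval before anything crosses the network. A minimal sketch follows, assuming the common StatsD line format `<name>:<value>|<type>` with types `c` (counter), `g` (gauge), and `ms` (timer). The `AgentAggregator` class is hypothetical, and sample rates and tags are deliberately ignored.

```python
from collections import defaultdict


class AgentAggregator:
    """Pre-aggregates StatsD-style lines in memory between flushes."""

    def __init__(self):
        self.counters = defaultdict(float)   # name -> running sum
        self.gauges = {}                     # name -> last value seen
        self.timers = defaultdict(list)      # name -> raw samples (ms)

    def record(self, line: str) -> None:
        # Parse "<name>:<value>|<type>"; sample rates ("|@0.1") are ignored here.
        name, rest = line.split(":", 1)
        fields = rest.split("|")
        value, kind = float(fields[0]), fields[1]
        if kind == "c":
            self.counters[name] += value
        elif kind == "g":
            self.gauges[name] = value
        elif kind == "ms":
            self.timers[name].append(value)
        else:
            raise ValueError(f"unsupported metric type: {kind}")

    def flush(self) -> dict:
        """Return one summary per metric and reset counters and timers."""
        out = {}
        for name, total in self.counters.items():
            out[name] = {"sum": total}
        for name, last in self.gauges.items():
            out[name] = {"last": last}
        for name, samples in self.timers.items():
            out[name] = {"count": len(samples), "sum": sum(samples),
                         "min": min(samples), "max": max(samples)}
        self.counters.clear()
        self.timers.clear()
        return out  # gauges are kept, so they persist until overwritten
```

Timers are flushed as count/sum/min/max rather than raw samples; percentiles would need a mergeable structure such as a histogram, which this sketch leaves out.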

Time-Series Database: Why Specialized?

Regular DB (PostgreSQL):
  At 10M inserts/sec → exceeds what a single PostgreSQL node can sustain (per-row index maintenance and WAL overhead)
  Query "avg cpu for last 6 hours across 1000 hosts" → full table scan → minutes

Time-Series DB (VictoriaMetrics / InfluxDB / Mimir):
  1. High write throughput (append-only, LSM-tree / columnar)
  2. Time-range queries (data organized by time)
  3. Compression (delta-of-delta timestamps + XOR-encoded values → ~10× compression)
  4. Downsampling (automatic resolution reduction for old data)
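
To see why delta-of-delta and XOR compress so well, it helps to look at what they produce for regular data. The sketch below shows only the two transforms, not the variable-length bit packing that a real TSDB applies afterwards. The function names are illustrative and not taken from any particular database.

```python
import struct


def delta_of_delta_encode(timestamps):
    """First element is the raw timestamp; the rest are second-order deltas."""
    if not timestamps:
        return []
    out = [timestamps[0]]
    prev, prev_delta = timestamps[0], 0
    for ts in timestamps[1:]:
        delta = ts - prev
        out.append(delta - prev_delta)   # 0 whenever the interval is unchanged
        prev, prev_delta = ts, delta
    return out


def delta_of_delta_decode(encoded):
    """Inverse of delta_of_delta_encode."""
    if not encoded:
        return []
    out = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        out.append(out[-1] + delta)
    return out


def float_bits(value: float) -> int:
    """Reinterpret a float64 as an unsigned 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]


def xor_encode(values):
    """XOR each value's bits with the previous value's bits (0 when unchanged)."""
    out, prev = [], 0
    for v in values:
        bits = float_bits(v)
        out.append(bits ^ prev)
        prev = bits
    return out
```

For timestamps `[1000, 1010, 1020, 1030, 1041]` the encoder yields `[1000, 10, 0, 0, 1]`: a steady scrape interval turns into a run of zeros and near-zeros, which a bit-level encoder can store in very few bits. Repeated values do the same on the XOR side.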

Downsampling Strategy

Resolution tiers:
  Raw (10s):   retained for 30 days
  1-min avg:   retained for 6 months  → 6× reduction
  1-hour avg:  retained for 2 years   → 360× reduction
  1-day avg:   retained for 5 years   → 8,640× reduction

Total: ~485 TB (vs 10 PB without downsampling)
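
A rollup job for one tier can be sketched as a bucketed average. This is a minimal illustration assuming points arrive as `(unix_seconds, value)` pairs; `downsample_avg` is a hypothetical helper, not an API of any TSDB.

```python
from collections import defaultdict


def downsample_avg(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets.

    Returns a list of (bucket_start, average) sorted by bucket_start.
    """
    buckets = defaultdict(lambda: [0.0, 0])   # bucket_start -> [sum, count]
    for ts, value in points:
        start = ts - ts % bucket_seconds      # align to the bucket boundary
        acc = buckets[start]
        acc[0] += value
        acc[1] += 1
    return [(start, total / count)
            for start, (total, count) in sorted(buckets.items())]


if __name__ == "__main__":
    raw = [(0, 1.0), (10, 2.0), (20, 3.0), (30, 4.0), (40, 5.0), (50, 6.0), (60, 10.0)]
    print(downsample_avg(raw, 60))   # six 10 s points collapse into one 1 min point
```

One design note: if a tier stores only the average, re-averaging it into the next tier is biased whenever buckets hold different numbers of points. Storing sum and count (and min/max if needed) per bucket avoids that, at the cost of a few more bytes per rollup point.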

Cardinality Explosion: The #1 Operational Problem

If request_id is used as a label, each unique value creates a NEW time series.
10K requests/sec × 86,400 sec/day = 864M new time series per day

Impact: Ingester OOM, index grows unbounded, query timeout

Solutions:
  1. Label cardinality limits: Reject if > 10K unique values per label
  2. Relabeling at ingestion: Drop or hash high-cardinality labels
  3. Active series limit: Max 1M active series per tenant
  4. Monitoring: Dashboard showing top 10 metrics by cardinality
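
Solution 1 can be sketched as a small admission check at the ingestion gateway. This is an in-memory illustration under stated assumptions: the `CardinalityLimiter` class is hypothetical, the limit defaults to the 10K figure above, and a real deployment would track this per tenant and probably with an approximate structure rather than exact sets.

```python
class CardinalityLimiter:
    """Rejects samples that would push any label past its unique-value limit."""

    def __init__(self, max_values_per_label: int = 10_000):
        self.max_values = max_values_per_label
        self.seen = {}   # label name -> set of values admitted so far

    def admit(self, labels: dict) -> bool:
        # First pass: check every label without mutating state, so a rejected
        # sample does not partially register its other labels.
        for name, value in labels.items():
            values = self.seen.get(name, set())
            if value not in values and len(values) >= self.max_values:
                return False
        # Second pass: the sample is accepted, record its label values.
        for name, value in labels.items():
            self.seen.setdefault(name, set()).add(value)
        return True
```

Already-known label values keep flowing after the limit is hit; only new values are rejected, so existing dashboards do not go dark when one team ships a bad label.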

Event Bus Design (Kafka)

Topic: distributed_metrics_aggregation-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: series ID (hash of metric name + labels; preserves per-series ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "distributed_metrics_aggregation-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: distributed_metrics_aggregation-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Rule for this design: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates
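
At-least-once delivery means a consumer can see the same event twice, so the handler has to make the second delivery a no-op. A minimal sketch of dedup by event_id follows. It assumes each event is a dict with `event_id` and `value` keys (a hypothetical shape) and keeps the dedup window in process memory; in practice that state would live in a store shared by all consumers in the group.

```python
from collections import OrderedDict


class IdempotentHandler:
    """Applies each event at most once, remembering a bounded window of IDs."""

    def __init__(self, max_remembered: int = 100_000):
        self.max_remembered = max_remembered
        self.seen = OrderedDict()   # event_id -> True, oldest first
        self.total = 0.0            # stand-in for the real aggregate being updated

    def handle(self, event: dict) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        event_id = event["event_id"]
        if event_id in self.seen:
            return False
        self.total += event["value"]
        self.seen[event_id] = True
        if len(self.seen) > self.max_remembered:
            self.seen.popitem(last=False)   # evict the oldest remembered ID
        return True
```

The bounded window is a trade-off: a duplicate that arrives after its ID has been evicted would be applied again, so the window should comfortably exceed the redelivery horizon.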

Write Metrics (Prometheus Remote Write)

HTTP

POST /api/v1/write
Content-Type: application/x-protobuf
Content-Encoding: snappy
Body: WriteRequest { timeseries: [ TimeSeries { labels: [{name:"__name__", value:"cpu_usage"}, {name:"host", value:"h1"}], samples: [{timestamp: 1710320000000, value: 85.2}] } ] }  (timestamp in milliseconds)

Query (PromQL)

HTTP

GET /api/v1/query?query=avg(cpu_usage{region="us-east"})&time=1710320000
GET /api/v1/query_range?query=rate(http_requests_total[5m])&start=1710316400&end=1710320000&step=60
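
PromQL contains characters such as `[`, `{`, and `"` that must be percent-encoded in a URL, which is easy to get wrong by hand. A small sketch using only the standard library follows. The base URL is a placeholder and `query_range_url` is an illustrative helper, not part of any client library.

```python
from urllib.parse import urlencode


def query_range_url(base_url: str, promql: str, start: int, end: int, step: int) -> str:
    """Build a query_range URL with the PromQL expression safely encoded."""
    params = {"query": promql, "start": start, "end": end, "step": step}
    return f"{base_url}/api/v1/query_range?{urlencode(params)}"


def expected_points(start: int, end: int, step: int) -> int:
    """Number of evaluation timestamps per series between start and end inclusive."""
    return (end - start) // step + 1


if __name__ == "__main__":
    url = query_range_url(
        "http://tsdb.example.internal",   # placeholder host
        "rate(http_requests_total[5m])",
        start=1710316400, end=1710320000, step=60,
    )
    print(url)
    print(expected_points(1710316400, 1710320000, 60))
```

For the one-hour range in the example above at a 60-second step, each returned series carries up to 61 points, which is a useful sanity check when sizing dashboard queries.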

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed request body
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
504 Gateway Timeout: index shard slow; narrow query or retry
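
On the client side, the retry guidance above reduces to one decision: given a status code, wait how long, or give up? A sketch under stated assumptions: the retryable set, base delay, and cap are illustrative choices, and `retry_delay` is a hypothetical helper. It uses exponential backoff with full jitter and prefers the server's Retry-After value on 429.

```python
import random

RETRYABLE = {429, 500, 503, 504}   # assumed retryable set, per the list above


def retry_delay(status, attempt, retry_after=None,
                base=0.5, cap=30.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based), or None to stop.

    - 429 with a Retry-After value: honor the server's number.
    - Other retryable statuses: exponential backoff, capped, with full jitter.
    - Anything else (400, 401, 403, 404, 422, ...): do not retry.
    """
    if status == 429 and retry_after is not None:
        return float(retry_after)
    if status in RETRYABLE:
        return min(cap, base * 2 ** attempt) * rng()
    return None
```

Jitter matters at this scale: without it, hundreds of thousands of agents that failed together would all retry together.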

Concern	Solution
Ingestion spike	Kafka buffers; auto-scale ingestion workers
TSDB node failure	Replicated storage (Mimir uses object store + replicated ingesters)
Data loss	Write-ahead log on ingesters; replay on restart
High cardinality	Limit labels (reject metrics with >100K unique series per metric name)
Query overload	Query concurrency limits; query timeout (30s max); caching layer
Clock skew	Accept samples timestamped within ±5 minutes of server time; reject anything older or further in the future
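
The clock-skew row can be read as a simple window check at the ingestion gateway. The sketch below takes one interpretation, accepting samples up to five minutes either side of the gateway's own clock; the constant and the `check_timestamp` function are illustrative names.

```python
MAX_SKEW_SECONDS = 5 * 60   # assumed tolerance on either side of server time


def check_timestamp(sample_ts: float, now: float) -> str:
    """Classify a sample timestamp (unix seconds) against the gateway clock."""
    skew = sample_ts - now
    if skew > MAX_SKEW_SECONDS:
        return "reject: too far in the future"
    if skew < -MAX_SKEW_SECONDS:
        return "reject: too old"
    return "accept"
```

Passing `now` in explicitly, rather than calling the clock inside the function, keeps the check deterministic and easy to test.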

VictoriaMetrics vs Prometheus vs InfluxDB vs Mimir

Feature	Prometheus	VictoriaMetrics ⭐	InfluxDB	Mimir
Scalability	Single node	Clustered	Clustered (paid)	Horizontally scalable
Ingestion	Pull only	Push + Pull	Push	Push (remote write)
Storage	Local disk	Local disk (S3 for backups)	Local	Object store (S3)
Query language	PromQL	MetricsQL	InfluxQL / Flux	PromQL

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
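
An availability target becomes concrete once it is converted into minutes of allowed downtime. A short sketch follows, assuming a 30-day rolling window (the window length is an assumption, not stated in the table); both function names are illustrative.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO permits within the window."""
    return (1 - slo_percent / 100) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo_percent: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return observed_error_ratio / (1 - slo_percent / 100)


if __name__ == "__main__":
    print(f"{error_budget_minutes(99.95):.1f} min per 30 days")
    print(f"burn rate at 0.5% errors: {burn_rate(0.005, 99.95):.1f}x")
```

At 99.95% over 30 days the budget is about 21.6 minutes, and a sustained 0.5% error ratio burns it roughly ten times faster than the SLO allows.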

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
