Time-Series & Metrics Storage – System Design Core Concept

What:

A storage and query engine optimized for time-ordered numeric samples — each point is a metric name, value, timestamp, and optional label set (tags).

Primary purpose:

Ingest high-volume measurements cheaply, aggregate them over time windows, and alert when trends cross thresholds — without treating metrics like relational rows.

Usually used for:

Infrastructure monitoring, application APM, IoT telemetry, business KPI dashboards, and capacity planning — not general user profile storage.

Think in three dimensions:

⏱ Time

Every sample is anchored to a timestamp; queries are range scans, not primary-key lookups.

🏷 Labels

Dimensions like service=api, region=us-east — cardinality of label combinations drives memory cost.

📈 Aggregation

Raw points are rarely queried forever; rollups (sum, avg, max) at 1m/5m/1h tiers shrink storage and query cost.

Metric Type	Behavior	Typical Query
Counter	Monotonically increases (requests_total)	rate() over window → requests per second
Gauge	Up or down (queue_depth, CPU %)	Instant value or avg over window
Histogram	Distribution of values (latency buckets)	p50/p99 from cumulative bucket counts

Append-only writes — samples are inserted, rarely updated in place.
Compression — delta encoding and Gorilla-style algorithms exploit sequential timestamps.
Retention policies — raw 7–15 days, aggregated months/years at coarser resolution.
Pull vs push — Prometheus pulls from /metrics endpoints; Influx/Telegraf often push via HTTP.

Benefit	Cost
Write-optimized ingestion — millions of samples/sec via append-only paths and batch compression	Poor fit for OLTP — no arbitrary UPDATE/DELETE; point lookups by primary key are awkward
Cheap long retention — rollups and columnar cold storage keep years of trends affordable	Query flexibility limits — ad-hoc joins across high-cardinality labels explode memory

Problem	Usage
Service monitoring dashboard	Prometheus scrape + Grafana; 15-day hot retention, remote write to long-term store
IoT sensor fleet	Per-device timestamped readings; downsample raw 1s → 1min aggregates after 48 hours
Ad impression analytics	Counter + label dimensions (campaign, region); cardinality caps on user_id labels
SLO error budget tracking	Histogram of request latency; burn-rate alerts on 5m and 1h windows

Rollup Tier Design

Production systems rarely keep raw 1-second resolution forever. A typical tier ladder:

Raw (1s resolution)     → retain 7 days   → SSD hot tier
Rollup 1 (1min avg/max) → retain 90 days  → SSD warm tier
Rollup 2 (1h avg/max)   → retain 2 years  → object storage / columnar
Rollup 3 (1d avg/max)   → retain 5+ years → cold archive (Parquet)

Background compaction jobs merge raw blocks into coarser buckets. Queries auto-select the finest tier that satisfies the requested time range — dashboards for "last 6 months" hit hourly rollups, not raw samples.

Prometheus Pull Model vs Push Gateway

Prometheus scrapes HTTP /metrics endpoints on a fixed interval (typically 15s). This gives uniform sampling and service discovery integration. Short-lived batch jobs cannot be scraped — they push to a Pushgateway, which Prometheus then scrapes. Interview pitfall: pushing every sample from every pod creates a single point of failure; pull is preferred for long-running services.

Histograms and Percentile Estimation

Storing every request latency as a raw sample is expensive at billions of QPS. Histograms bucket values (le=0.1, le=0.5, le=1.0 seconds) and increment counters per bucket. Percentiles are estimated from cumulative bucket counts — not exact, but stable under load. Choose bucket boundaries to match SLO thresholds (e.g., 100ms, 300ms, 1s for API latency).

TSDB vs Columnar Warehouse vs Search Index

TSDB (Prometheus, InfluxDB, M3): low-latency range queries on metrics; weak at ad-hoc SQL joins.
ClickHouse / BigQuery: excellent for analytics over event tables with time partition; higher query latency, cheaper at petabyte scale.
Elasticsearch: full-text log search with time filters; not a replacement for numeric metric rollups at Prometheus scale.

In interviews, metrics dashboards → TSDB; business analytics over click events → Kafka + columnar warehouse; grep-able logs → search index. Many production stacks use all three with clear boundaries.

Capacity Sketch

10k pods × 500 metrics × 1 sample/15s ≈ 333k samples/sec. At ~2 bytes compressed per sample (optimistic), that is ~600 MB/min raw — retention and cardinality multiply this quickly. Always estimate active series count (unique metric+label combinations), not just sample rate.