What:
A storage and query engine optimized for time-ordered numeric samples — each point is a metric name, value, timestamp, and optional label set (tags).
Primary purpose:
Ingest high-volume measurements cheaply, aggregate them over time windows, and alert when trends cross thresholds — without treating metrics like relational rows.
Usually used for:
Infrastructure monitoring, application APM, IoT telemetry, business KPI dashboards, and capacity planning — not general user profile storage.
Think in three dimensions:
⏱ Time
Every sample is anchored to a timestamp; queries are range scans, not primary-key lookups.
🏷 Labels
Dimensions like service=api, region=us-east — cardinality of label combinations drives memory cost.
📈 Aggregation
Raw points are rarely queried forever; rollups (sum, avg, max) at 1m/5m/1h tiers shrink storage and query cost.
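The three dimensions can be sketched as a tiny in-memory store. This is an illustrative model only, not how any particular TSDB is implemented: `TinyTSDB`, `Series`, and `series_key` are hypothetical names, and the sketch assumes samples arrive in time order.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass, field


def series_key(metric: str, labels: dict[str, str]) -> tuple:
    """A series is identified by the metric name plus its sorted label pairs."""
    return (metric, tuple(sorted(labels.items())))


@dataclass
class Series:
    timestamps: list[int] = field(default_factory=list)  # seconds, ascending
    values: list[float] = field(default_factory=list)

    def append(self, ts: int, value: float) -> None:
        # Append-only: this sketch assumes samples arrive in time order.
        self.timestamps.append(ts)
        self.values.append(value)

    def range(self, start: int, end: int) -> list[tuple[int, float]]:
        """Return samples with start <= ts <= end, located by binary search."""
        lo = bisect_left(self.timestamps, start)
        hi = bisect_right(self.timestamps, end)
        return list(zip(self.timestamps[lo:hi], self.values[lo:hi]))


class TinyTSDB:
    def __init__(self) -> None:
        # One entry per unique metric+label combination (this is the cardinality).
        self.series: dict[tuple, Series] = {}

    def write(self, metric, labels, ts, value):
        self.series.setdefault(series_key(metric, labels), Series()).append(ts, value)

    def query_avg(self, metric, labels, start, end):
        """Aggregate over a time range instead of fetching a single row."""
        s = self.series.get(series_key(metric, labels))
        points = s.range(start, end) if s else []
        return sum(v for _, v in points) / len(points) if points else None
```

Each distinct label combination creates a new entry in `series`, which is why label cardinality, not sample rate alone, drives memory use.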
Needed When:
Problems mention dashboards, SLIs/SLOs, metrics pipelines, IoT sensors, or "billions of events per day" with time-range queries.
Avoids:
Storing every CPU sample in PostgreSQL rows — INSERT storms and index bloat kill the database.
Optimizes For:
Write throughput and time-range aggregation — not ACID transactions or complex JOINs.
Agents or services emit samples → hot TSDB serves real-time queries → background jobs downsample into warm/cold tiers.
| Metric Type | Behavior | Typical Query |
|---|---|---|
| Counter | Monotonically increases (requests_total) | rate() over window → requests per second |
| Gauge | Up or down (queue_depth, CPU %) | Instant value or avg over window |
| Histogram | Distribution of values (latency buckets) | p50/p99 from cumulative bucket counts |
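To make the counter row concrete, here is a minimal sketch of a rate calculation that tolerates counter resets. It is a simplification: `counter_rate` is a hypothetical helper, and unlike Prometheus's `rate()` it does not extrapolate to the window boundaries.

```python
def counter_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter over the span covered by samples.

    samples: (timestamp_seconds, counter_value) pairs in ascending time order.
    A drop in value is treated as a counter reset (for example a process
    restart), so the post-reset value is counted as new increase.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0
```

A gauge needs no such handling because its value is meaningful on its own; a counter is only useful once differentiated over a window.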
- Append-only writes — samples are inserted, rarely updated in place.
- Compression: delta-of-delta timestamp encoding and Gorilla-style XOR value encoding exploit regular sampling intervals and slowly changing values.
- Retention policies — raw 7–15 days, aggregated months/years at coarser resolution.
- Pull vs push — Prometheus pulls from /metrics endpoints; Influx/Telegraf often push via HTTP.
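The compression point can be illustrated with the timestamp half of the idea. This sketch shows only delta-of-delta encoding on integer timestamps; real Gorilla-style encoders additionally pack the results into variable-width bit fields and compress float values separately, which is omitted here.

```python
def delta_of_delta_encode(timestamps: list[int]) -> list[int]:
    """First timestamp raw, then the change in delta between consecutive samples."""
    if not timestamps:
        return []
    out = [timestamps[0]]
    prev_delta = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        out.append(delta - prev_delta)
        prev_delta = delta
    return out


def delta_of_delta_decode(encoded: list[int]) -> list[int]:
    """Inverse of delta_of_delta_encode."""
    if not encoded:
        return []
    out = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        out.append(out[-1] + delta)
    return out


# A 15s scrape interval with one second of jitter on the last sample:
# delta_of_delta_encode([1000, 1015, 1030, 1045, 1061]) -> [1000, 15, 0, 0, 1]
```

With a steady scrape interval most encoded values are zero, which is what lets a bit-packed encoding spend close to one bit per timestamp.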
| Benefit | Cost |
|---|---|
| Write-optimized ingestion — millions of samples/sec via append-only paths and batch compression | Poor fit for OLTP — no arbitrary UPDATE/DELETE; point lookups by primary key are awkward |
| Cheap long retention — rollups and columnar cold storage keep years of trends affordable | Query flexibility limits — ad-hoc joins across high-cardinality labels explode memory |
Problem: A developer adds user_id as a metric label on HTTP requests. One million users → one million active time series; TSDB memory exhausts.
Mitigation: Cap label dimensions in code review; use logs or traces for high-cardinality IDs; aggregate before export; drop or hash labels at collection agent.
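A rough way to reason about this failure is to multiply label cardinalities and to strip unbounded labels before export. The sketch below assumes a hypothetical allowlist (`ALLOWED_LABELS`); the label names and counts are illustrative, not taken from any real system.

```python
from math import prod

ALLOWED_LABELS = {"service", "region", "status"}  # hypothetical allowlist


def worst_case_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on active series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


def drop_unbounded_labels(labels: dict[str, str]) -> dict[str, str]:
    """Keep only allowlisted labels before the sample leaves the collection agent."""
    return {k: v for k, v in labels.items() if k in ALLOWED_LABELS}


bounded = {"service": 20, "region": 5, "status": 5}
print(worst_case_series(bounded))                            # 500
print(worst_case_series({**bounded, "user_id": 1_000_000}))  # 500000000
```

The product is a worst case, since not every combination occurs in practice, but it shows how one unbounded label dominates the series budget.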
Problem: Dashboard queries 90 days of 1-second CPU samples per pod — billions of points scanned per refresh.
Mitigation: Pre-aggregate to 1m/5m rollups; query hot tier for recent detail, warm tier for historical trends; cache dashboard query results.
Problem: IoT devices with wrong clocks write samples into the future or distant past, corrupting rollups.
Mitigation: Reject samples outside acceptable window; use ingestion-time bucketing; reconcile with NTP on edge gateways.
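The first two mitigations can be sketched as an ingestion-side check. The tolerances below are assumptions chosen for illustration, and `accept_sample` and `bucket_ts` are hypothetical helpers rather than part of any specific TSDB.

```python
MAX_FUTURE_SKEW_S = 60          # assumed tolerance for clocks running ahead
MAX_PAST_AGE_S = 2 * 60 * 60    # assumed tolerance for late-arriving samples


def accept_sample(sample_ts: int, now: int) -> bool:
    """True when the device timestamp falls inside the acceptable window."""
    return now - MAX_PAST_AGE_S <= sample_ts <= now + MAX_FUTURE_SKEW_S


def bucket_ts(sample_ts: int, now: int, bucket_s: int = 60) -> int:
    """Bucket by device time when plausible, otherwise fall back to ingestion time."""
    ts = sample_ts if accept_sample(sample_ts, now) else now
    return (ts // bucket_s) * bucket_s
```

Whether to reject an implausible sample outright or re-stamp it with ingestion time is a policy choice: rejection keeps rollups clean, while re-stamping keeps the reading at the cost of timing accuracy.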
| Problem | Usage |
|---|---|
| Service monitoring dashboard | Prometheus scrape + Grafana; 15-day hot retention, remote write to long-term store |
| IoT sensor fleet | Per-device timestamped readings; downsample raw 1s → 1min aggregates after 48 hours |
| Ad impression analytics | Counter + label dimensions (campaign, region); keep user_id out of labels to cap cardinality |
| SLO error budget tracking | Histogram of request latency; burn-rate alerts on 5m and 1h windows |
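The burn-rate alert in the last row can be sketched as follows. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The 99.9% target and the 14.4 threshold are assumptions for illustration (14.4 is a commonly cited paging threshold for a 1h/5m window pair); tune both to your own SLO.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def should_page(err_5m: float, err_1h: float, slo_target: float = 0.999,
                threshold: float = 14.4) -> bool:
    # Require both windows to exceed the threshold: the long window shows the
    # burn is sustained, the short window shows it is still happening now.
    return (burn_rate(err_5m, slo_target) > threshold
            and burn_rate(err_1h, slo_target) > threshold)
```

Pairing a short and a long window keeps the alert from firing on brief spikes and lets it clear soon after the problem stops.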
Use a TSDB when:
- Queries always include a time range (last 1h, last 30d).
- Data is immutable measurements, not mutable entity state.
- Aggregations (sum, rate, percentile) matter more than fetching one row by ID.
- Write volume exceeds what a relational DB index can sustain on timestamp columns.
Avoid a TSDB when:
- You need UPDATE/DELETE on individual business records.
- Queries join many entity types with arbitrary filters unrelated to time.
- Exact row-level audit trails matter: use event logs or OLTP, not downsampled metrics.
- Stream Processing Basics — real-time aggregation before samples land in TSDB.
- Kafka Architecture & Guarantees — durable buffer between emitters and metrics consumers.
- Observability & Distributed Tracing — metrics complement traces and structured logs in the three pillars.
- Back-of-the-Envelope Estimation — sizing ingestion QPS, retention TB, and cardinality budgets.
- Probabilistic Data Structures — HyperLogLog for approximate unique counts when exact TSDB cardinality is too expensive.
Rollup Tier Design
Production systems rarely keep raw 1-second resolution forever. A typical tier ladder:
| Tier | Resolution | Retention | Storage |
|---|---|---|---|
| Raw | 1s | 7 days | SSD hot tier |
| Rollup 1 | 1min avg/max | 90 days | SSD warm tier |
| Rollup 2 | 1h avg/max | 2 years | Object storage / columnar |
| Rollup 3 | 1d avg/max | 5+ years | Cold archive (Parquet) |
Background compaction jobs merge raw blocks into coarser buckets. Queries auto-select the finest tier whose retention covers the requested time range, so dashboards for "last 6 months" hit hourly rollups, not raw samples.
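Both halves of this design, rolling samples up and picking a tier at query time, can be sketched in a few lines. The `TIERS` table mirrors the ladder above, with "5+ years" taken as exactly five; `downsample` and `pick_tier` are hypothetical helpers, not a real engine's API.

```python
from collections import defaultdict

# (resolution_seconds, retention_seconds), finest first.
TIERS = [
    (1, 7 * 86400),
    (60, 90 * 86400),
    (3600, 2 * 365 * 86400),
    (86400, 5 * 365 * 86400),
]


def downsample(samples, bucket_s):
    """Roll (ts, value) samples up into {bucket_start: (avg, max)}."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % bucket_s].append(value)
    return {b: (sum(v) / len(v), max(v)) for b, v in sorted(buckets.items())}


def pick_tier(lookback_s):
    """Finest resolution whose retention still covers the requested lookback."""
    for resolution, retention in TIERS:
        if lookback_s <= retention:
            return resolution
    return TIERS[-1][0]  # older than every tier: fall back to the coarsest


print(pick_tier(180 * 86400))  # 3600: "last 6 months" is served from hourly rollups
```

Storing both avg and max per bucket matters because averaging alone hides short spikes once resolution drops.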
Prometheus Pull Model vs Push Gateway
Prometheus scrapes HTTP /metrics endpoints on a fixed interval (typically 15s). This gives uniform sampling and integrates with service discovery. Short-lived batch jobs may exit before a scrape reaches them, so they push to a Pushgateway, which Prometheus then scrapes. Interview pitfall: routing every sample from every pod through a Pushgateway turns it into a single point of failure and a bottleneck; pull is preferred for long-running services.
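What a scrape actually reads is plain text, one line per series. The sketch below renders samples in the style of the Prometheus text exposition format; it is deliberately minimal (no `# HELP` or `# TYPE` lines, no escaping of label values), and `render_metrics` is a hypothetical helper rather than a client-library function.

```python
def render_metrics(samples):
    """Render (name, labels, value) tuples as exposition-style text lines."""
    lines = []
    for name, labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}" if label_str else f"{name} {value}")
    return "\n".join(lines) + "\n"


print(render_metrics([
    ("http_requests_total", {"service": "api", "code": "200"}, 1027),
    ("queue_depth", {}, 3),
]), end="")
# http_requests_total{code="200",service="api"} 1027
# queue_depth 3
```

In the pull model a long-running service serves this text on /metrics and the scraper attaches the timestamp; in practice an official client library would generate it.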
Histograms and Percentile Estimation
Storing every request latency as a raw sample is expensive at high request rates. Histograms bucket values (le=0.1, le=0.5, le=1.0 seconds) and increment a counter per bucket. Percentiles are estimated from cumulative bucket counts: not exact, but storage cost stays fixed no matter how many requests arrive. Choose bucket boundaries to match SLO thresholds (e.g., 100ms, 300ms, 1s for API latency).
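The estimation step can be sketched with linear interpolation inside the bucket that contains the target rank, which is similar in spirit to what Prometheus's `histogram_quantile` does. This simplified version assumes finite upper bounds (no `+Inf` bucket) and a lower bound of 0 for the first bucket.

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and hold cumulative counts, like
    le="0.1", le="0.5", le="1.0". Interpolation assumes values are spread
    evenly inside each bucket, which is the source of the estimation error.
    """
    total = buckets[-1][1]
    if total == 0:
        return None
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]


# 100 requests: 50 finished within 0.1s, 90 within 0.5s, all within 1.0s.
buckets = [(0.1, 50), (0.5, 90), (1.0, 100)]
print(histogram_quantile(0.95, buckets))  # 0.75
```

The p95 lands halfway through the 0.5s to 1.0s bucket, so the estimate can be off by up to the bucket width. That is why boundaries should sit at the SLO thresholds you care about.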
TSDB vs Columnar Warehouse vs Search Index
- TSDB (Prometheus, InfluxDB, M3): low-latency range queries on metrics; weak at ad-hoc SQL joins.
- ClickHouse / BigQuery: excellent for analytics over event tables with time partition; higher query latency, cheaper at petabyte scale.
- Elasticsearch: full-text log search with time filters; not a replacement for numeric metric rollups at Prometheus scale.
In interviews, metrics dashboards → TSDB; business analytics over click events → Kafka + columnar warehouse; grep-able logs → search index. Many production stacks use all three with clear boundaries.
Capacity Sketch
10k pods × 500 metrics × 1 sample/15s ≈ 333k samples/sec across 5 million active series. At ~2 bytes compressed per sample (optimistic), that is ~0.67 MB/s, or about 40 MB/min and roughly 58 GB/day; retention and cardinality multiply this quickly. Always estimate active series count (unique metric+label combinations), not just sample rate.
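The same estimate as a small calculator, so the inputs can be varied. `capacity` is a hypothetical helper, and the bytes-per-sample figure is an assumption that depends heavily on the engine and the data.

```python
def capacity(pods, metrics_per_pod, scrape_interval_s, bytes_per_sample):
    """Return (active_series, samples_per_sec, bytes_per_day) for a fleet."""
    active_series = pods * metrics_per_pod
    samples_per_sec = active_series / scrape_interval_s
    bytes_per_day = samples_per_sec * bytes_per_sample * 86400
    return active_series, samples_per_sec, bytes_per_day


series, sps, per_day = capacity(10_000, 500, 15, 2)
print(f"{series:,} series, {sps:,.0f} samples/s, {per_day / 1e9:.1f} GB/day")
# 5,000,000 series, 333,333 samples/s, 57.6 GB/day
```

Multiplying the daily figure by the retention of each tier gives the storage budget, and the series count gives a first guess at index and memory pressure.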