Design a Time-Series Database

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Show domain depth beyond the baseline: async pipelines, consistency semantics, and operational SLOs.
Arch 75	Staff angles: partition behavior, cost drivers, and MVP → production evolution with clear triggers.

Interview Prompt

Design Time-Series Database.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Write-optimized storage (LSM/columnar), Downsampling, Rollup aggregation?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Write-optimized storage (LSM/columnar)
Downsampling
Rollup aggregation
Retention policies
Tag-based indexing
Out-of-order write handling

Out of scope (state explicitly)

Detailed frontend/UI pixel implementation
Org structure, staffing, and hiring plan

Assumptions

Clarify scale (DAU, QPS, data volume) for time series database in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Ingest time-stamped data points at high throughput (metrics, IoT, financial ticks)
Write format: (metric_name, tags, timestamp, value)
Query by metric name + tag filters + time range
Aggregation queries: avg, sum, min, max, percentile over time windows
Downsampling: automatically roll up high-resolution data for older data
Retention policies: auto-delete data older than configured retention
Tag-based indexing: efficiently filter by any combination of tags
Support rate, derivative, moving average, and other time-series functions
Alerting integration: trigger alerts based on threshold queries

Metric	Calculation	Value
Data points / sec (ingestion)	From Data points / day ÷ 86400 (+ peak factor in value)	10M
Unique time series (cardinality)	Given (assumption documented in value)	10M
Avg tags per series	Given (typical workload assumption)	5–10
Raw data point size	Given (assumption documented in value)	16 bytes (8B timestamp + 8B value)
Compressed data point	Given (assumption documented in value)	2–4 bytes (Gorilla compression)
Raw ingestion bandwidth	10M × 16B	160 MB/sec
Compressed storage / day	Given	~3.5 TB/day
Query rate	Given (assumption documented in value)	10K queries/sec

Time-series database architecture based on the Prometheus TSDB model: WAL → in-memory HEAD block → immutable blocks on disk → compaction → downsampling to object store.

Loading...

Comparison: TSDB vs General-Purpose DB

Aspect	PostgreSQL	TSDB
Storage	Row-based → 100+ bytes/point	Column/chunk → 2-4 bytes/point (50× less)
Write pattern	B-tree index → write amplification	Append-only → no index maintenance during write
Compression	No time-series patterns	Gorilla compression → 12×
Throughput	10K writes/sec max	10M writes/sec

When PostgreSQL IS appropriate: low cardinality + low volume (< 1K writes/sec), need joins with relational data. TimescaleDB extension bridges the gap.

Multi-Tenancy & Isolation

Strategies: Tenant ID as top-level label (simple but noisy neighbor), per-tenant ingester (better isolation, more overhead), per-tenant storage (best isolation, easy billing). Rate limiting per tenant: write rate, series cardinality, query rate and timeout.

Interview Walkthrough

Frame as fundamentally different from OLTP: append-only time-ordered writes, not random read/write on indexed rows — justify why PostgreSQL fails at 10M points/sec.
Explain Gorilla compression: delta-of-delta timestamps and XOR value encoding exploit temporal locality — ~12× compression vs raw storage.
Walk through block-based storage: in-memory HEAD block → flush immutable 2-hour blocks → background compaction merges adjacent blocks.
WAL for write durability; inverted index (label → series IDs) for query-time series filtering via bitwise AND on label matchers.
High cardinality is the silent killer — unbounded tags like user_id or request_id explode the index; enforce limits and route high-cardinality data to ClickHouse.
Query optimization: block pruning by time range, chunk LRU cache, parallel sub-queries, and step alignment to storage resolution.
Common pitfall: using a TSDB as a general-purpose database with joins — it excels at time-range scans, not relational queries across entities.

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.

Interview Prompt

Clarifying Questions (ask before designing)

Scope

In scope

Out of scope (state explicitly)

Assumptions

Gorilla Compression (Facebook's Time-Series Compression)

Block-Based Storage (Prometheus TSDB Model)

Event Bus Design (Kafka)

Write API

Query API (PromQL-style)

Admin APIs

Common Error Responses

On-Disk Block Format (Prometheus TSDB)

In-Memory (HEAD Block)

WAL (Write-Ahead Log)

High Cardinality Problem

Write Path Durability

Query Performance Optimization

Out-of-Order Writes

Comparison: TSDB vs General-Purpose DB

Multi-Tenancy & Isolation

Interview Walkthrough

Write Path Optimization: LSM-Tree vs B-Tree for Time-Series

Cardinality Explosion: The TSDB Killer

Downsampling and Retention: The Storage Cost Equation

Phase 1: MVP (0 to 100K users)

Phase 2: Growth (100K to 10M users)

Phase 3: Scale (10M+ users)

SLOs & Error Budgets

Incident Scenarios (2am reality)

Cost Drivers (Staff lens)

Multi-Region & DR