This problem appears in multiple sheets. Depth expectations increase as you progress:
Interview Prompt
Design Time-Series Database.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Write-optimized storage (LSM/columnar), Downsampling, Rollup aggregation? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Write-optimized storage (LSM/columnar)
- Downsampling
- Rollup aggregation
- Retention policies
- Tag-based indexing
- Out-of-order write handling
Out of scope (state explicitly)
- Detailed frontend/UI pixel implementation
- Org structure, staffing, and hiring plan
Assumptions
- Clarify scale (DAU, QPS, data volume) for time series database in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Ingest time-stamped data points at high throughput (metrics, IoT, financial ticks)
- Write format:
(metric_name, tags, timestamp, value) - Query by metric name + tag filters + time range
- Aggregation queries: avg, sum, min, max, percentile over time windows
- Downsampling: automatically roll up high-resolution data for older data
- Retention policies: auto-delete data older than configured retention
- Tag-based indexing: efficiently filter by any combination of tags
- Support rate, derivative, moving average, and other time-series functions
- Alerting integration: trigger alerts based on threshold queries
- High Write Throughput: 10M+ data points/sec ingestion
- Low Query Latency: < 100ms for recent data; < 5s for large historical queries
- Storage Efficiency: 2–4 bytes per data point after compression
- High Availability: 99.99% for writes
- Horizontal Scalability: add nodes for more throughput and storage
- Durability: no data loss for committed writes
- Write-Optimized: 95% writes, 5% reads (typical monitoring workload)
| Metric | Calculation | Value |
|---|---|---|
| Data points / sec (ingestion) | From Data points / day ÷ 86400 (+ peak factor in value) | 10M |
| Unique time series (cardinality) | Given (assumption documented in value) | 10M |
| Avg tags per series | Given (typical workload assumption) | 5–10 |
| Raw data point size | Given (assumption documented in value) | 16 bytes (8B timestamp + 8B value) |
| Compressed data point | Given (assumption documented in value) | 2–4 bytes (Gorilla compression) |
| Raw ingestion bandwidth | 10M × 16B | 160 MB/sec |
| Compressed storage / day | Given | ~3.5 TB/day |
| Query rate | Given (assumption documented in value) | 10K queries/sec |
Time-series database architecture based on the Prometheus TSDB model: WAL → in-memory HEAD block → immutable blocks on disk → compaction → downsampling to object store.
Gorilla Compression (Facebook's Time-Series Compression)
Key insight: consecutive data points in a time series are similar.
Timestamp compression (Delta-of-Delta): t0 stored full (64 bits), t1 stores delta (variable bits), t2 stores delta-of-delta (if regular interval → 1 bit). Most points: 1 bit per timestamp when scrape interval is regular.
Value compression (XOR): v0 stored full (64 bits), v1 stores XOR with v0 (only changed bits), v2 same as v1 → 1 bit. For slowly changing metrics: ~2 bits per value.
Result: 16 bytes raw → 1.37 bytes average (12× compression). This is why TSDBs can handle 10M points/sec on modest hardware.
Block-Based Storage (Prometheus TSDB Model)
Time axis is divided into 2-hour blocks. HEAD block: in-memory, accepts writes. After 2h: HEAD becomes immutable block on disk → new HEAD created. Compaction merges adjacent blocks up to max block duration (e.g., 48h). Benefits: writes go to single in-memory block (fast), immutable blocks are easy to cache/replicate/backup, deletion is entire-block delete, each block has its own index for parallel scan.
Event Bus Design (Kafka)
Topic: time_series_database-events Partitions: 64 (scale consumers horizontally) Partition key: entity_id (user_id / order_id — preserves per-entity ordering) Retention: 7 days (compliance) or 24h (high-volume telemetry) Replication factor: 3, min.insync.replicas: 2 Producer: idempotent producer enabled (enable.idempotence=true) Consumer: consumer group "time_series_database-processors" - At-least-once delivery + idempotent handlers (dedup by event_id) - DLQ topic: time_series_database-events-dlq (poison messages after 3 retries) - Lag alert: consumer lag > 60s → scale workers Design a Time-Series Database: async side effects MUST NOT block the synchronous API response. Sync path: validate → persist source of truth → publish event → return 201 Async path: consumers update caches, indexes, notifications, aggregates
Write API
POST /api/v1/write
Content-Type: application/x-protobuf
# Prometheus exposition format:
cpu_usage{host="web-01", dc="us-east", env="prod"} 72.5 1710410700
memory_used_bytes{host="web-01", dc="us-east"} 8589934592 1710410700
http_requests_total{method="GET", path="/api", status="200"} 15234 1710410700Query API (PromQL-style)
GET /api/v1/query_range
?query=avg(cpu_usage{dc="us-east"}) by (host)
&start=1710324300 &end=1710410700 &step=60
Response:
{
"status": "success",
"data": {
"resultType": "matrix",
"result": [
{"metric": {"host": "web-01"},
"values": [[1710324300, "72.5"], [1710324360, "73.1"], ...]}
]
}
}Admin APIs
POST /api/v1/admin/retention → Configure retention policies POST /api/v1/admin/downsample → Configure downsampling rules GET /api/v1/admin/status → Storage node health, disk usage POST /api/v1/admin/compact → Trigger manual compaction
Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff
On-Disk Block Format (Prometheus TSDB)
Each block is a directory (named by ULID timestamp) containing: meta.json (time range, sample counts), index (inverted index mapping labels to series IDs + series metadata), chunks/ (Gorilla-compressed data files, max 512 MB each), and tombstones (deletion markers). A 2-hour block for 100K series with 15-second intervals ≈ 48M samples ≈ 66 MB.
In-Memory (HEAD Block)
HashMap<SeriesID, HeadChunk>
HeadChunk:
labels: {__name__="cpu_usage", host="web-01", dc="us-east"}
samples: Gorilla-compressed buffer (appends only)
min_time, max_time, num_samples
Inverted Index (in-memory):
label_name=label_value → BitSet of series IDs
Fast intersection: host="web-01" AND dc="us-east" → bitwise ANDWAL (Write-Ahead Log)
WAL segments (sequential, append-only):
segment-00001:
[record_type=SERIES, series_id=1, labels={...}]
[record_type=SAMPLES, series_id=1, t=1710410700, v=72.5]
[record_type=SAMPLES, series_id=1, t=1710410715, v=73.1]
On restart: replay WAL → rebuild HEAD block
WAL truncated: after HEAD block is flushed to persistent blockHigh Cardinality Problem
Tags with unbounded values (user_id, request_id, IP) can create 100M unique series → 100 GB just for index. Solutions: cardinality limits (reject > 100K unique tag values), pre-aggregation at ingestion, separate high-cardinality data into ClickHouse/Druid, tag value allowlisting.
Write Path Durability
Client → Write Gateway → WAL (fsync) → ACK client → batch to storage node → storage node WAL + in-memory HEAD. Replication options: A) Write to N replicas independently (simple, slight gaps). B) Kafka buffer → multiple consumers (no data loss). C) Storage-level replication with S3 (Cortex/Thanos).
Query Performance Optimization
Block pruning (skip non-overlapping blocks), series filtering via inverted index, chunk LRU cache, query splitting into parallel sub-queries, step alignment to storage resolution, results caching in Redis, materialized views. Benchmark: 1M series, 6h query, 1m step → 30 seconds unoptimized → 200ms optimized.
Out-of-Order Writes
Options: Reject out-of-order (simple, loses data), accept with limited window (Prometheus v2.39+: within 1h), full out-of-order support (InfluxDB, VictoriaMetrics: higher memory). Recommendation: configurable out-of-order window (e.g., 30 min).
Comparison: TSDB vs General-Purpose DB
| Aspect | PostgreSQL | TSDB |
|---|---|---|
| Storage | Row-based → 100+ bytes/point | Column/chunk → 2-4 bytes/point (50× less) |
| Write pattern | B-tree index → write amplification | Append-only → no index maintenance during write |
| Compression | No time-series patterns | Gorilla compression → 12× |
| Throughput | 10K writes/sec max | 10M writes/sec |
When PostgreSQL IS appropriate: low cardinality + low volume (< 1K writes/sec), need joins with relational data. TimescaleDB extension bridges the gap.
Multi-Tenancy & Isolation
Strategies: Tenant ID as top-level label (simple but noisy neighbor), per-tenant ingester (better isolation, more overhead), per-tenant storage (best isolation, easy billing). Rate limiting per tenant: write rate, series cardinality, query rate and timeout.
Interview Walkthrough
- Frame as fundamentally different from OLTP: append-only time-ordered writes, not random read/write on indexed rows — justify why PostgreSQL fails at 10M points/sec.
- Explain Gorilla compression: delta-of-delta timestamps and XOR value encoding exploit temporal locality — ~12× compression vs raw storage.
- Walk through block-based storage: in-memory HEAD block → flush immutable 2-hour blocks → background compaction merges adjacent blocks.
- WAL for write durability; inverted index (label → series IDs) for query-time series filtering via bitwise AND on label matchers.
- High cardinality is the silent killer — unbounded tags like
user_idorrequest_idexplode the index; enforce limits and route high-cardinality data to ClickHouse. - Query optimization: block pruning by time range, chunk LRU cache, parallel sub-queries, and step alignment to storage resolution.
- Common pitfall: using a TSDB as a general-purpose database with joins — it excels at time-range scans, not relational queries across entities.
Write Path Optimization: LSM-Tree vs B-Tree for Time-Series
B-Tree (PostgreSQL, MySQL): Good for random reads, but random writes cause page splits (write amplification). Not optimized for sequential time-range writes. At 10M writes/sec: too slow.
LSM-Tree (Cassandra, InfluxDB, Prometheus): Write to in-memory MemTable (sorted by timestamp) → flush to immutable SSTable → background compaction merges and compresses. Sequential disk writes, perfect for ordered timestamps, high write throughput (in-memory buffer absorbs bursts). Gorilla compression achieves 1.37 bytes per data point. Read amplification is the trade-off (check multiple SSTables).
Cardinality Explosion: The TSDB Killer
Low cardinality: cpu_usage{host="web-01"} → 100 series for 100 hosts.
High cardinality (dangerous): http_requests{path="/api/users/12345"} → 500M unique paths. Cardinality bomb: adding user_id tag → 1M users × 100 metrics = 100M series → OOM crash.
Solutions: label value limits (reject > 10K unique values), label normalization (use cohort labels instead of user_id), hierarchical metrics (roll up to coarser granularity), Bloom filter on series existence (1 GB for 100M series, O(1) check). Rule of thumb: max 1M unique series per Prometheus instance.
Downsampling and Retention: The Storage Cost Equation
Tier 1 (0-7 days raw): Every data point. 24 TB total.
Tier 2 (7-90 days, 1-min aggregates): MAX, AVG, MIN, P95 per minute. 28 GB (1000x compression).
Tier 3 (90 days - 2 years, 1-hour aggregates): ~1 GB. Total: ~25 TB vs ~10 PB without downsampling (400x reduction). Implementation: Flink + Kafka for streaming aggregation, or TimescaleDB continuous aggregates.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core time series database flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.