Design a Distributed Tracing System (like Jaeger / Zipkin)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 25	Observability pillar — explain trace/span model, context propagation (W3C Trace Context), head vs tail sampling trade-offs, and how traces are assembled from spans.
Arch 50	Add storage backend choice (columnar vs indexed), cardinality management, and integration with metrics/logs (three pillars).
Arch 75	Staff: discuss sampling bias, PII in spans, tail sampling at 10M spans/sec, and trace-driven alerting.

Interview Prompt

Design a distributed tracing system (like Jaeger or Zipkin) that collects spans from microservices, assembles them into traces, supports head-based and tail-based sampling, and stores/query traces efficiently for debugging production issues.

Clarifying Questions (ask before designing)

Question	Why it matters
What scale — spans per second and retention?	10M spans/sec at 1% sampling still = 100K spans/sec stored — drives storage and indexing.
Head or tail sampling — or adaptive?	Head sampling is simple but misses errors; tail sampling captures all errors but needs buffering.
What's the query pattern — by trace ID, service, latency?	Trace ID lookup is O(1); service+latency range queries need indexed storage.
Integration with existing metrics and logs?	Exemplars link metrics→traces; trace ID in logs links logs→traces (three pillars).

Scope

In scope

Span and trace data model
Context propagation (W3C Trace Context)
Head-based and tail-based sampling
Trace assembly from distributed spans
Columnar storage for span data
Query by trace ID and service/latency

Out of scope (state explicitly)

Full APM agent instrumentation internals
Metrics and logging systems (mention integration)
Real-time anomaly detection on traces

Assumptions

10M spans/sec ingested, 1% head sampling default
7-day trace retention
100 microservices, avg 10 spans per trace
Avg span size: 500 bytes (tags + metadata)

Instrument services: Inject trace context (trace_id, span_id, parent_span_id) into all inter-service network calls.
Collect spans: Emit span records representing standard units of work (start time, duration, tags, metadata) from every service.
Correlate traces: Reconstruct the full request flow across multiple downstream microservices using a unified trace_id.
Visualize traces: Present a waterfall and timeline visualization of spans showing detailed latency breakdowns.
Search traces: Enable searching by trace_id, service name, operation, duration, error status, and tags.
Service dependency graph: Discover and visualize topological service-to-service dependencies dynamically.
Alerting: Alert on anomalies (p99 latency spikes, high error ratios) detected within the trace telemetry stream.

Metric	Calculation	Value
Microservices	Derived	2,000
Requests / sec (system-wide)	From Requests / day ÷ 86400 (+ peak factor in value)	500K
Avg spans per trace	Given (typical workload assumption)	10
Total spans / sec	500K req/s × 10 spans/request	5M (before sampling)
Sampling rate	Derived	10% (1 in 10 traces)
Sampled spans / sec	5M × 10% sampling	500K
Avg span size	Given (typical workload assumption)	500 bytes
Ingestion throughput	500K spans × 500 bytes	250 MB/s
Storage / day	250 MB/s × 86400 ≈ 21 TB	21 TB
Retention (14 days)	21 TB × 14 ≈ 300 TB	~300 TB

Bandwidth and Processing Calculations:
1. Ingestion Volume (Sampled):
   - 500,000 spans/sec × 500 bytes/span = 250 MB/s.
2. Daily Storage Volume:
   - 250 MB/s × 86,400 seconds = 21.6 TB per day (raw JSON before compression).
3. Columnar ClickHouse Compression:
   - Highly compressed ClickHouse schemas reduce this by ~5–10×, requiring only ~3 TB/day.

The distributed tracing architecture comprises three primary tiers: the Data Plane SDK (generates and propagates trace context), the Per-Host Agent Sidecar (batches, filters, and uploads spans), and the Centralized Collector & Ingestion Pipeline(aggregates spans, enforces tail-based sampling, and writes asynchronously to cold/hot storage backends).

Loading...

1. Trace Context Propagation

Distributed tracing relies on passing trace context across process boundaries. When Service A initiates an outbound call:

The SDK generates a globally unique trace_id and a local span_id.
Outbound calls inject headers following the W3C Trace Context standard:

W3C traceparent structure:
traceparent: {version}-{trace-id}-{parent-span-id}-{trace-flags}
Example:     00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

- version: 2 hex chars (currently 00).
- trace-id: 32 hex chars (unique across the system).
- parent-span-id: 16 hex chars (tracks the caller's span).
- trace-flags: 8-bit flags (01 means the trace is sampled).

Downstream Service B extracts the header, matches the trace_id, and spawns a child span with B's parent set to A's span_id.

2. Advanced Sampling Strategies ⭐

Emitting 5M spans/sec generates massive network/storage costs. We implement a hybrid model to balance cost and fidelity:

Strategy	How It Works	Pros	Cons
Head-based (probabilistic)	Decide at trace start (e.g. sample 10% randomly)	Simple, low application overhead	May completely miss rare errors or low-volume endpoints
Tail-based ⭐	Collect all spans, buffer them, decide AFTER trace completes	Can sample 100% of error or highly latent traces	Requires temporary in-memory collector buffering
Adaptive	Adjust rate dynamically per service/endpoint based on traffic	Ensures excellent coverage of low-traffic endpoints	Complex control plane logic to implement
Always-on for errors	Force sample 100% of any trace registering errors	Never miss critical system failures	Error-heavy services will spike storage footprint

Optimal Production Mix:Combine Head-Based 10% (for uniform coverage) + Always-on for Errors (via OTel collector buffering) + Tail-Based Sampling for slow traces (where p99 latency > 2× median).

3. Storage Engine Trade-offs

Choosing the right backend determines both query latency and operational costs:

Elasticsearch / OpenSearch: Fully indexes every field, allowing fast searches on dynamic trace tags and logs. However, it is highly resource-intensive and expensive at high scale.
Cassandra: High write throughput and built-in TTL support, but queries are heavily restricted. You can only retrieve traces if you already know the specific trace_id or primary index.
ClickHouse: Columnar storage with outstanding data compression (often 10× superior to Elasticsearch). Exceptional for OLAP metrics aggregation, though less mature for complex nested JSON trace searching.
Object Store + Index (Grafana Tempo) ⭐: Stores spans as compressed blocks in S3/GCS, creating a compact Redis/Bloom filter index mapping trace_id to object paths. Reduces costs by up to 100×, enabling infinite retention.

Event Bus Design (Kafka)

Topic: distributed_tracing-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (user_id / order_id — preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "distributed_tracing-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: distributed_tracing-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Design a Distributed Tracing System (like Jaeger / Zipkin): async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates

1. Ingest Spans (Collector API)

HTTP

POST /api/v1/spans
Content-Type: application/json
[
  {
    "trace_id": "abc12377a0bf",
    "span_id": "span-B78",
    "parent_span_id": "span-A21",
    "operation_name": "GET /api/users",
    "service_name": "user-service",
    "start_time": 1710320000000,
    "duration_us": 12500,
    "status": "OK",
    "tags": { "http.method": "GET", "http.status_code": 200, "db.type": "postgresql" },
    "logs": [{ "timestamp": 1710320000005, "message": "cache miss, querying DB" }]
  }
]

Response: 202 Accepted
{ "accepted": 1, "rejected": 0 }

2. Query Traces

HTTP

GET /api/v1/traces?service=user-service&operation=GET+/api/users&minDuration=100ms&maxDuration=5s&limit=20&start=1710320000&end=1710406400

GET /api/v1/traces/abc12377a0bf

3. Service Dependency Discovery

HTTP

GET /api/v1/dependencies?start=1710320000&end=1710406400
Response: 200 OK
[
  { "parent": "api-gateway", "child": "user-service", "call_count": 150000 },
  { "parent": "user-service", "child": "postgres-primary", "call_count": 120000 }
]

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
504 Gateway Timeout: index shard slow; narrow query or retry

Span Document Schema (JSON Format)

JSON

{
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "span_id": "b7ad6b7169203331",
  "parent_span_id": "3a2fb4a1b3c4d5e6",
  "operation_name": "GET /api/users/{id}",
  "service_name": "user-service",
  "service_version": "v2.3.1",
  "environment": "production",
  "start_time_us": 1710320000000000,
  "duration_us": 12500,
  "status_code": "OK",
  "tags": {
    "http.method": "GET",
    "http.status_code": 200,
    "http.url": "/api/users/123",
    "db.type": "postgresql",
    "db.statement": "SELECT * FROM users WHERE id = $1",
    "peer.service": "postgres-primary",
    "component": "spring-web"
  },
  "logs": [
    { "timestamp_us": 1710320000005000, "event": "cache_miss" },
    { "timestamp_us": 1710320000008000, "event": "db_query_complete", "rows": 1 }
  ],
  "process": {
    "hostname": "user-svc-pod-abc123",
    "ip": "10.0.5.42"
  }
}

Cassandra Schema Design

SQL

-- Span storage table
CREATE TABLE traces (
    trace_id        TEXT,
    span_id         TEXT,
    parent_span_id  TEXT,
    operation_name  TEXT,
    service_name    TEXT,
    start_time      TIMESTAMP,
    duration        BIGINT,
    tags            MAP<TEXT, TEXT>,
    PRIMARY KEY (trace_id, span_id)
) WITH default_time_to_live = 1209600;  -- Auto expire in 14 days

-- Secondary index table for query lookup by service name
CREATE TABLE service_operations (
    service_name    TEXT,
    start_time      TIMESTAMP,
    trace_id        TEXT,
    duration        BIGINT,
    PRIMARY KEY ((service_name), start_time, trace_id)
) WITH CLUSTERING ORDER BY (start_time DESC);

Scenario Concern	System Solution Design
Collector Overload	OTel Agents implement backpressure, buffering up to 500 spans locally before gracefully dropping non-sampled telemetry.
Storage Ingestion Failures	Stateless collector pool writes to a highly durable, partitioned Kafka cluster before committing to Elasticsearch/S3.
Application SDK Overhead	SDK operations run asynchronously; context propagation relies on UDP sidecars to avoid blocking the hot CPU execution path.
Unbounded Tag Cardinality	Centralized collectors strip dynamic user-specific tags (e.g. user_id) at ingestion, directing them to unindexed log strings.

1. Handling Clock Skew Across Services ⭐

Microservices run on virtual hosts with independent system clocks. A skew of 150ms could make a child span appear to start *before* its parent span began, breaking timeline visualizations.

UI Adjustment: Use absolute parent-child relationship keys (not absolute timestamps) to draw the timeline tree sequentially.
Relative Offsets: The parent SDK logs the exact duration of downstream HTTP requests. The collector estimates and normalizes network transport delays.
Clock Synchronization: Implement NTP synchronization across cluster nodes to keep skew under < 10ms.

2. Trace Assembly Windows

Because spans arrive out-of-order and asynchronously, the tail-sampling processor holds a sliding 30-second assembly window. If the root span completes, or if the 30-second TTL expires without new child activity, the collector commits the trace decision.

SLOs & Error Budgets

Metric	Target	Rationale
Trace ingestion availability	99.9%	Tracing loss during outage = blind to production issues
Trace query latency (by trace_id)	< 2s p99	On-call needs fast drill-down during incidents
Sampling coverage of errors	100%	Tail sampling must capture every error trace
Span ingestion lag	< 30s p99	Traces must appear in UI within seconds of request

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Collector OOM during traffic spike	Collector pods restarting; span drop rate > 1%; queue depth maxed	Scale collectors horizontally; increase memory limits; reduce head sample rate temporarily; enable drop policy on queue overflow
Storage cost doubles in one month	ClickHouse/S3 bill spike; cardinality alert on new high-cardinality tag	Identify new tag source (bad instrumentation deploy); add attribute processor to drop/scrub; reduce sample rate; shorten retention
Broken trace propagation after service migration	Orphan span rate > 20%; traces appear as disconnected fragments	Verify traceparent header forwarding through new proxy/LB; check gRPC metadata propagation; rollback instrumentation change

Cost Drivers (Staff lens)

Storage: 10M spans/sec × 1% × 500B × 7 days ≈ 3 TB (largest cost with columnar compression)
Collector compute: scales with ingestion rate — modest vs storage
Indexed query storage: service/latency indexes add 20-30% overhead

Multi-Region & DR

Collectors are regional — spans stay in-region (data residency). Cross-region trace lookup: federated query across regional backends (slow) or replicate trace_id index globally. Tail sampling state is per-collector — shard by trace_id for consistency. Don't centralize all spans in one region — latency and compliance.

Interview Prompt

Clarifying Questions (ask before designing)

Scope

In scope

Out of scope (state explicitly)

Assumptions

1. Trace Context Propagation

2. Advanced Sampling Strategies ⭐

3. Storage Engine Trade-offs

Event Bus Design (Kafka)

1. Ingest Spans (Collector API)

2. Query Traces

3. Service Dependency Discovery

Common Error Responses

Span Document Schema (JSON Format)

Cassandra Schema Design

1. Handling Clock Skew Across Services ⭐

2. Trace Assembly Windows

Observer Effect Mitigation

Interview Walkthrough

1. Why Introduce Kafka Durability Buffers?

2. Head-Based vs Tail-Based Sampling

Phase 1 — Basic tracing (Jaeger all-in-one)

Phase 2 — Production pipeline with tail sampling

Phase 3 — Hyperscale with tiered storage

SLOs & Error Budgets

Incident Scenarios (2am reality)

Cost Drivers (Staff lens)

Multi-Region & DR