This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 25 | Observability pillar — explain trace/span model, context propagation (W3C Trace Context), head vs tail sampling trade-offs, and how traces are assembled from spans. |
| Arch 50 | Add storage backend choice (columnar vs indexed), cardinality management, and integration with metrics/logs (three pillars). |
| Arch 75 | Staff: discuss sampling bias, PII in spans, tail sampling at 10M spans/sec, and trace-driven alerting. |
Interview Prompt
Design a distributed tracing system (like Jaeger or Zipkin) that collects spans from microservices, assembles them into traces, supports head-based and tail-based sampling, and stores and queries traces efficiently for debugging production issues.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| What scale — spans per second and retention? | 10M spans/sec at 1% sampling still = 100K spans/sec stored — drives storage and indexing. |
| Head or tail sampling — or adaptive? | Head sampling is simple but misses errors; tail sampling captures all errors but needs buffering. |
| What's the query pattern — by trace ID, service, latency? | Trace ID lookup is O(1); service+latency range queries need indexed storage. |
| Integration with existing metrics and logs? | Exemplars link metrics→traces; trace ID in logs links logs→traces (three pillars). |
Scope
In scope
- Span and trace data model
- Context propagation (W3C Trace Context)
- Head-based and tail-based sampling
- Trace assembly from distributed spans
- Columnar storage for span data
- Query by trace ID and service/latency
Out of scope (state explicitly)
- Full APM agent instrumentation internals
- Metrics and logging systems (mention integration)
- Real-time anomaly detection on traces
Assumptions
- 10M spans/sec ingested, 1% head sampling default
- 7-day trace retention
- 100 microservices, avg 10 spans per trace
- Avg span size: 500 bytes (tags + metadata)
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Instrument services: Inject trace context (trace_id, span_id, parent_span_id) into all inter-service network calls.
- Collect spans: Emit span records representing standard units of work (start time, duration, tags, metadata) from every service.
- Correlate traces: Reconstruct the full request flow across multiple downstream microservices using a unified trace_id.
- Visualize traces: Present a waterfall and timeline visualization of spans showing detailed latency breakdowns.
- Search traces: Enable searching by trace_id, service name, operation, duration, error status, and tags.
- Service dependency graph: Discover and visualize topological service-to-service dependencies dynamically.
- Alerting: Alert on anomalies (p99 latency spikes, high error ratios) detected within the trace telemetry stream.
- Low Overhead: Tracing MUST add < 1% CPU utilization and < 5% latency overhead to the hot execution path.
- High Throughput: Ingest 1M+ spans per second (every internal RPC or database lookup generates a span).
- Sampling Control: Support head and tail-based adaptive sampling to balance diagnostic visibility and storage cost.
- Scalability: Easily handle thousands of microservices and billions of telemetry spans daily.
- Durability: Retain trace records for 7–14 days for comprehensive debugging and analytics.
- Low Query Latency: Retrieve search results and specific trace IDs in < 1 second.
- Open Standards: Full compatibility with OpenTelemetry (OTel) conventions.
| Metric | Calculation | Value |
|---|---|---|
| Microservices | Derived | 2,000 |
| Requests / sec (system-wide) | From Requests / day ÷ 86400 (+ peak factor in value) | 500K |
| Avg spans per trace | Given (typical workload assumption) | 10 |
| Total spans / sec | 500K req/s × 10 spans/request | 5M (before sampling) |
| Sampling rate | Derived | 10% (1 in 10 traces) |
| Sampled spans / sec | 5M × 10% sampling | 500K |
| Avg span size | Given (typical workload assumption) | 500 bytes |
| Ingestion throughput | 500K spans × 500 bytes | 250 MB/s |
| Storage / day | 250 MB/s × 86,400 s ≈ 21.6 TB | ~21.6 TB |
| Retention (14 days) | 21.6 TB × 14 ≈ 302 TB | ~300 TB |
Bandwidth and Processing Calculations:
1. Ingestion volume (sampled): 500,000 spans/sec × 500 bytes/span = 250 MB/s.
2. Daily storage volume: 250 MB/s × 86,400 seconds = 21.6 TB per day (raw JSON before compression).
3. Columnar ClickHouse compression: a well-tuned ClickHouse schema typically compresses this by ~5–10×, leaving roughly 2–4 TB/day on disk.
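The arithmetic in the estimation table can be reproduced with a few lines of Python. This is a back-of-envelope sketch only: the function name and dictionary keys are illustrative, the inputs are the table's own figures, and decimal units (1 TB = 1e12 bytes) are assumed.

```python
def estimate_storage(requests_per_sec, spans_per_trace, sample_rate,
                     span_bytes, retention_days):
    """Back-of-envelope sizing in decimal units (1 MB = 1e6 B, 1 TB = 1e12 B)."""
    raw_spans_per_sec = requests_per_sec * spans_per_trace
    sampled_spans_per_sec = raw_spans_per_sec * sample_rate
    bytes_per_sec = sampled_spans_per_sec * span_bytes
    tb_per_day = bytes_per_sec * 86_400 / 1e12
    return {
        "raw_spans_per_sec": raw_spans_per_sec,
        "sampled_spans_per_sec": sampled_spans_per_sec,
        "ingest_mb_per_sec": bytes_per_sec / 1e6,
        "tb_per_day": tb_per_day,
        "tb_retained": tb_per_day * retention_days,
    }


if __name__ == "__main__":
    # Inputs taken from the table: 500K req/s, 10 spans/trace, 10% sampling,
    # 500-byte spans, 14-day retention.
    est = estimate_storage(500_000, 10, 0.10, 500, 14)
    for name, value in est.items():
        print(f"{name}: {value:,.1f}")
    # Assumed 5-10x columnar compression ratio (not a measured value).
    low, high = est["tb_per_day"] / 10, est["tb_per_day"] / 5
    print(f"compressed TB/day: {low:.1f} to {high:.1f}")
```

Changing `sample_rate` to 1.0 shows why retaining every span is impractical: the retained volume grows tenfold.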
The distributed tracing architecture comprises three primary tiers: the Data Plane SDK (generates and propagates trace context), the Per-Host Agent Sidecar (batches, filters, and uploads spans), and the Centralized Collector & Ingestion Pipeline (aggregates spans, enforces tail-based sampling, and writes asynchronously to cold/hot storage backends).
1. Trace Context Propagation
Distributed tracing relies on passing trace context across process boundaries. When Service A initiates an outbound call:
- The SDK generates a globally unique trace_id and a local span_id.
- Outbound calls inject headers following the W3C Trace Context standard:
W3C traceparent structure:
traceparent: {version}-{trace-id}-{parent-span-id}-{trace-flags}
Example: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
- version: 2 hex chars (currently 00).
- trace-id: 32 hex chars (unique across the system).
- parent-span-id: 16 hex chars (tracks the caller's span).
- trace-flags: 8-bit flags (01 means the trace is sampled).
Downstream Service B extracts the header, reuses the trace_id, and starts a child span whose parent_span_id is A's span_id.
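The extract-and-inject step can be sketched as follows. This is a minimal illustration, not an SDK implementation: `SpanContext`, `extract_child_context`, and `inject` are hypothetical names, and the sketch accepts only version `00` headers, whereas a real SDK would use an OpenTelemetry propagator and handle future versions and `tracestate`.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version 00 only: 32 hex trace-id, 16 hex parent span id, 2 hex flags
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    sampled: bool


def extract_child_context(header: str) -> Optional[SpanContext]:
    """Parse an incoming traceparent header and start a child span context."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero ids are invalid in W3C Trace Context.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    sampled = bool(int(flags, 16) & 0x01)
    # The child keeps the trace_id and gets a fresh 8-byte span_id.
    return SpanContext(trace_id, secrets.token_hex(8), parent_id, sampled)


def inject(ctx: SpanContext) -> str:
    """Build the traceparent header for an outbound call made by this span."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


if __name__ == "__main__":
    incoming = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
    ctx = extract_child_context(incoming)
    print(ctx)
    print(inject(ctx))
```

Service B's own outbound calls carry B's span_id in the parent position, which is how the parent-child chain extends across hops.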
2. Advanced Sampling Strategies ⭐
Emitting 5M spans/sec generates massive network/storage costs. We implement a hybrid model to balance cost and fidelity:
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Head-based (probabilistic) | Decide at trace start (e.g. sample 10% randomly) | Simple, low application overhead | May completely miss rare errors or low-volume endpoints |
| Tail-based ⭐ | Collect all spans, buffer them, decide AFTER trace completes | Can sample 100% of error or highly latent traces | Requires temporary in-memory collector buffering |
| Adaptive | Adjust rate dynamically per service/endpoint based on traffic | Ensures excellent coverage of low-traffic endpoints | Complex control plane logic to implement |
| Always-on for errors | Force sample 100% of any trace registering errors | Never miss critical system failures | Error-heavy services will spike storage footprint |
Optimal Production Mix: Combine Head-Based 10% (for uniform coverage) + Always-on for Errors (via OTel collector buffering) + Tail-Based Sampling for slow traces (where p99 latency > 2× median).
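The hybrid policy above can be sketched as a single decision function that runs once a trace has been assembled. This is a simplified illustration under stated assumptions: the `Span` shape, the function names, and the fixed 1-second slow threshold are hypothetical (a real deployment would derive the threshold from observed latency percentiles), and the head decision is made deterministic by hashing on the trace_id so every service agrees on it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    trace_id: str        # 32 hex chars
    duration_us: int
    is_error: bool = False


def head_sampled(trace_id: str, rate: float) -> bool:
    """Deterministic head decision: map the low 64 bits of the id to [0, 1)."""
    bucket = int(trace_id[-16:], 16) / 2**64
    return bucket < rate


def keep_trace(spans: List[Span], head_rate: float = 0.10,
               slow_threshold_us: int = 1_000_000) -> Optional[str]:
    """Return the reason a trace is kept, or None if it should be dropped."""
    if not spans:
        return None
    if any(s.is_error for s in spans):
        return "error"                      # always keep error traces
    if max(s.duration_us for s in spans) > slow_threshold_us:
        return "slow"                       # tail rule: keep slow traces
    if head_sampled(spans[0].trace_id, head_rate):
        return "head"                       # baseline probabilistic coverage
    return None


if __name__ == "__main__":
    tid = "0af7651916cd43dd8448eb211c80319c"
    print(keep_trace([Span(tid, 12_500, is_error=True)]))
    print(keep_trace([Span(tid, 2_000_000)]))
    print(keep_trace([Span(tid, 12_500)]))
```

Checking the error and slow rules before the head rule means the reason recorded on a kept trace reflects the strongest signal, which is useful when analysing sampling bias later.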
3. Storage Engine Trade-offs
Choosing the right backend determines both query latency and operational costs:
- Elasticsearch / OpenSearch: Fully indexes every field, allowing fast searches on dynamic trace tags and logs. However, it is highly resource-intensive and expensive at high scale.
- Cassandra: High write throughput and built-in TTL support, but queries are heavily restricted. You can only retrieve traces if you already know the specific trace_id or another primary-key column.
- ClickHouse: Columnar storage with strong data compression (often cited as up to 10× better than Elasticsearch). Exceptional for OLAP-style metrics aggregation, though less mature for searching complex nested JSON trace data.
- Object Store + Index (Grafana Tempo) ⭐: Stores spans as compressed blocks in S3/GCS and keeps a compact Redis/Bloom filter index mapping trace_id to object paths. Can reduce storage cost by up to 100×, which makes much longer retention affordable.
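To make the Bloom filter idea concrete, the sketch below keeps one small filter per stored block and uses it to shortlist the blocks that might contain a given trace_id. It is an illustration of the general technique only, not how any particular backend lays out its index: the class, the sizing defaults, and the block paths are all hypothetical.

```python
import hashlib
from typing import Dict, List


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str) -> List[int]:
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step
        return [(h1 + i * h2) % self.size_bits for i in range(self.num_hashes)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def candidate_blocks(trace_id: str, filters: Dict[str, BloomFilter]) -> List[str]:
    """Return the object paths whose filter does not rule out the trace_id."""
    return [path for path, bf in filters.items() if bf.might_contain(trace_id)]


if __name__ == "__main__":
    # Hypothetical object paths, one filter per block.
    filters = {"blocks/0001": BloomFilter(), "blocks/0002": BloomFilter()}
    filters["blocks/0001"].add("0af7651916cd43dd8448eb211c80319c")
    filters["blocks/0002"].add("ffffffffffffffffffffffffffffffff")
    print(candidate_blocks("0af7651916cd43dd8448eb211c80319c", filters))
```

A lookup then fetches only the shortlisted blocks from object storage; false positives cost an extra fetch but never a missed trace.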
Event Bus Design (Kafka)
Topic: distributed_tracing-events
Partitions: 64 (scale consumers horizontally)
Partition key: trace_id (keeps all spans of a trace on one partition, which tail sampling needs)
Retention: 7 days (compliance) or 24h (high-volume telemetry)
Replication factor: 3, min.insync.replicas: 2
Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "distributed_tracing-processors"
- At-least-once delivery + idempotent handlers (dedup by trace_id + span_id)
- DLQ topic: distributed_tracing-events-dlq (poison messages after 3 retries)
- Lag alert: consumer lag > 60s → scale workers
Async side effects MUST NOT block the synchronous API response.
Sync path: validate → publish spans to Kafka → return 202 Accepted
Async path: consumers update caches, indexes, notifications, aggregates
1. Ingest Spans (Collector API)
POST /api/v1/spans
Content-Type: application/json
[
{
"trace_id": "abc12377a0bf",
"span_id": "span-B78",
"parent_span_id": "span-A21",
"operation_name": "GET /api/users",
"service_name": "user-service",
"start_time": 1710320000000,
"duration_us": 12500,
"status": "OK",
"tags": { "http.method": "GET", "http.status_code": 200, "db.type": "postgresql" },
"logs": [{ "timestamp": 1710320000005, "message": "cache miss, querying DB" }]
}
]
Response: 202 Accepted
{ "accepted": 1, "rejected": 0 }
2. Query Traces
GET /api/v1/traces?service=user-service&operation=GET+/api/users&minDuration=100ms&maxDuration=5s&limit=20&start=1710320000&end=1710406400
GET /api/v1/traces/abc12377a0bf
3. Service Dependency Discovery
GET /api/v1/dependencies?start=1710320000&end=1710406400
Response: 200 OK
[
{ "parent": "api-gateway", "child": "user-service", "call_count": 150000 },
{ "parent": "user-service", "child": "postgres-primary", "call_count": 120000 }
]
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
- 504 Gateway Timeout: index shard slow; narrow query or retry
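The edges returned by the dependency discovery endpoint can be derived from spans by joining each span to its parent and counting cross-service calls. The sketch below assumes spans are dictionaries with the `span_id`, `parent_span_id`, and `service_name` fields used in the ingest payload; the function name is illustrative, and a production job would run this as a batch or streaming aggregation over a time window.

```python
from collections import Counter
from typing import Dict, List


def dependency_edges(spans: List[Dict]) -> List[Dict]:
    """Count parent-service -> child-service calls across a set of spans."""
    service_by_span = {s["span_id"]: s["service_name"] for s in spans}
    counts: Counter = Counter()
    for span in spans:
        parent_service = service_by_span.get(span.get("parent_span_id"))
        # Skip root spans, orphans, and in-process child spans.
        if parent_service and parent_service != span["service_name"]:
            counts[(parent_service, span["service_name"])] += 1
    return [
        {"parent": parent, "child": child, "call_count": n}
        for (parent, child), n in sorted(counts.items())
    ]


if __name__ == "__main__":
    spans = [
        {"span_id": "a", "parent_span_id": None, "service_name": "api-gateway"},
        {"span_id": "b", "parent_span_id": "a", "service_name": "user-service"},
        {"span_id": "c", "parent_span_id": "b", "service_name": "postgres-primary"},
    ]
    for edge in dependency_edges(spans):
        print(edge)
```

Because the join only needs span ids and service names, the aggregation can run on sampled data; call counts then need to be scaled by the sampling rate to approximate real traffic.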
Span Document Schema (JSON Format)
{
"trace_id": "0af7651916cd43dd8448eb211c80319c",
"span_id": "b7ad6b7169203331",
"parent_span_id": "3a2fb4a1b3c4d5e6",
"operation_name": "GET /api/users/{id}",
"service_name": "user-service",
"service_version": "v2.3.1",
"environment": "production",
"start_time_us": 1710320000000000,
"duration_us": 12500,
"status_code": "OK",
"tags": {
"http.method": "GET",
"http.status_code": 200,
"http.url": "/api/users/123",
"db.type": "postgresql",
"db.statement": "SELECT * FROM users WHERE id = $1",
"peer.service": "postgres-primary",
"component": "spring-web"
},
"logs": [
{ "timestamp_us": 1710320000005000, "event": "cache_miss" },
{ "timestamp_us": 1710320000008000, "event": "db_query_complete", "rows": 1 }
],
"process": {
"hostname": "user-svc-pod-abc123",
"ip": "10.0.5.42"
}
}
Cassandra Schema Design
-- Span storage table
CREATE TABLE traces (
trace_id TEXT,
span_id TEXT,
parent_span_id TEXT,
operation_name TEXT,
service_name TEXT,
start_time TIMESTAMP,
duration BIGINT,
tags MAP<TEXT, TEXT>,
PRIMARY KEY (trace_id, span_id)
) WITH default_time_to_live = 1209600; -- Auto expire in 14 days
-- Secondary index table for query lookup by service name
CREATE TABLE service_operations (
service_name TEXT,
start_time TIMESTAMP,
trace_id TEXT,
duration BIGINT,
PRIMARY KEY ((service_name), start_time, trace_id)
) WITH CLUSTERING ORDER BY (start_time DESC);
| Scenario Concern | System Solution Design |
|---|---|
| Collector Overload | OTel Agents implement backpressure, buffering up to 500 spans locally before gracefully dropping non-sampled telemetry. |
| Storage Ingestion Failures | Stateless collector pool writes to a highly durable, partitioned Kafka cluster before committing to Elasticsearch/S3. |
| Application SDK Overhead | SDK operations run asynchronously; finished spans are batched and exported to the local agent sidecar (e.g. over UDP) so that span reporting never blocks the hot CPU execution path. |
| Unbounded Tag Cardinality | Centralized collectors strip dynamic user-specific tags (e.g. user_id) at ingestion, directing them to unindexed log strings. |
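The tag-scrubbing rule in the last row can be sketched as a small function that splits a span's tags into an indexed set and an unindexed log string. The denylist contents and the function name are hypothetical; in practice the list would come from collector configuration.

```python
from typing import Dict, FrozenSet, Tuple

# Hypothetical denylist of high-cardinality tag keys.
HIGH_CARDINALITY_KEYS: FrozenSet[str] = frozenset(
    {"user_id", "session_id", "request_id"}
)


def split_tags(tags: Dict[str, object],
               denylist: FrozenSet[str] = HIGH_CARDINALITY_KEYS
               ) -> Tuple[Dict[str, object], str]:
    """Return (tags safe to index, unindexed 'k=v' string for the rest)."""
    indexed = {k: v for k, v in tags.items() if k not in denylist}
    dropped = {k: v for k, v in tags.items() if k in denylist}
    unindexed_log = " ".join(f"{k}={v}" for k, v in sorted(dropped.items()))
    return indexed, unindexed_log


if __name__ == "__main__":
    indexed, log_line = split_tags(
        {"http.method": "GET", "http.status_code": 200, "user_id": "42"}
    )
    print(indexed)
    print(log_line)
```

The values remain available when a trace is opened, but they no longer create one index entry per distinct user.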
1. Handling Clock Skew Across Services ⭐
Microservices run on virtual hosts with independent system clocks. A skew of 150ms could make a child span appear to start *before* its parent span began, breaking timeline visualizations.
- UI Adjustment: Use absolute parent-child relationship keys (not absolute timestamps) to draw the timeline tree sequentially.
- Relative Offsets: The parent SDK logs the exact duration of downstream HTTP requests. The collector estimates and normalizes network transport delays.
- Clock Synchronization: Implement NTP synchronization across cluster nodes to keep skew under 10 ms.
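One way to apply the relative-offset idea is to shift a child span so that it sits inside its parent's time range, splitting the unexplained time evenly between the request and response legs. This is a heuristic sketch, not a prescribed algorithm: the function name is illustrative, and the symmetric-network-delay assumption is a simplification that can be wrong for asymmetric paths.

```python
def skew_offset_us(parent_start_us: int, parent_duration_us: int,
                   child_start_us: int, child_duration_us: int) -> int:
    """Offset (in microseconds) to add to the child's timestamps."""
    parent_end = parent_start_us + parent_duration_us
    child_end = child_start_us + child_duration_us
    if child_start_us >= parent_start_us and child_end <= parent_end:
        return 0  # already consistent, leave untouched
    if child_duration_us > parent_duration_us:
        # Child cannot fit inside the parent (e.g. async work): align starts.
        return parent_start_us - child_start_us
    # Assume equal network delay before and after the child executes.
    one_way_delay = (parent_duration_us - child_duration_us) // 2
    return (parent_start_us + one_way_delay) - child_start_us


if __name__ == "__main__":
    # Child clock runs 150 ms behind: it appears to start before the parent.
    offset = skew_offset_us(1_000_000, 400_000, 850_000, 200_000)
    print(offset)  # 250000
```

The same offset should be applied to every span reported by that child's host within the trace, so that the child's own descendants stay aligned with it.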
2. Trace Assembly Windows
Because spans arrive out-of-order and asynchronously, the tail-sampling processor holds a sliding 30-second assembly window. If the root span completes, or if the 30-second TTL expires without new child activity, the collector commits the trace decision.
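The assembly window can be sketched as an in-memory buffer keyed by trace_id. This is a minimal single-process illustration: the class and field names are hypothetical, time is passed in explicitly to keep the logic testable, and a span with no `parent_span_id` is treated as the root. A real collector would also cap memory and might wait briefly after the root arrives to catch late children.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PendingTrace:
    spans: List[dict] = field(default_factory=list)
    last_seen: float = 0.0
    root_done: bool = False


class AssemblyBuffer:
    """Buffers spans per trace until the root finishes or the TTL expires."""

    def __init__(self, ttl_seconds: float = 30.0):
        self.ttl_seconds = ttl_seconds
        self.pending: Dict[str, PendingTrace] = {}

    def add_span(self, span: dict, now: float) -> None:
        entry = self.pending.setdefault(span["trace_id"], PendingTrace())
        entry.spans.append(span)
        entry.last_seen = now  # any new span restarts the idle timer
        if span.get("parent_span_id") is None:
            entry.root_done = True  # spans are reported once finished

    def flush_ready(self, now: float) -> List[List[dict]]:
        """Remove and return traces that are ready for a sampling decision."""
        ready = [
            trace_id for trace_id, entry in self.pending.items()
            if entry.root_done or now - entry.last_seen >= self.ttl_seconds
        ]
        return [self.pending.pop(trace_id).spans for trace_id in ready]


if __name__ == "__main__":
    buf = AssemblyBuffer(ttl_seconds=30.0)
    buf.add_span({"trace_id": "t1", "span_id": "b", "parent_span_id": "a"}, now=0.0)
    print(buf.flush_ready(now=10.0))   # [] (still waiting)
    buf.add_span({"trace_id": "t1", "span_id": "a", "parent_span_id": None}, now=12.0)
    print(len(buf.flush_ready(now=12.0)[0]))  # 2
```

Each flushed span list is then handed to the sampling policy, which decides whether the whole trace is written to storage or discarded.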
Observer Effect Mitigation
To ensure the tracing SDK itself never degrades production request performance, the SDK employs a No-Op Span Pool. If the head-based sampler decides a request is not to be sampled, the SDK returns a statically allocated, lightweight no-op span object. With 10% head sampling, this avoids heap allocation and serialization for the roughly 90% of requests that are not sampled.
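The no-op pattern can be illustrated with a shared singleton that accepts the same calls as a recording span but stores nothing. The class and function names here are hypothetical, and Python is used only to show the shape of the idea; the allocation savings matter most in SDKs for languages such as Java, Go, or C++.

```python
class RecordingSpan:
    """A real span that accumulates tags until it is finished."""

    def __init__(self, name: str):
        self.name = name
        self.tags = {}
        self.finished = False

    def set_tag(self, key, value):
        self.tags[key] = value
        return self

    def finish(self):
        self.finished = True


class _NoopSpan:
    """Shared do-nothing span: same interface, no per-request state."""

    __slots__ = ()  # no instance dictionary, nothing to allocate per call

    def set_tag(self, key, value):
        return self

    def finish(self):
        return None


NOOP_SPAN = _NoopSpan()  # allocated once at import time


def start_span(name: str, sampled: bool):
    """Return a recording span only when the head sampler said yes."""
    return RecordingSpan(name) if sampled else NOOP_SPAN


if __name__ == "__main__":
    span = start_span("GET /api/users", sampled=False)
    span.set_tag("http.method", "GET").finish()  # all calls are no-ops
    print(span is NOOP_SPAN)  # True
```

Application code calls `set_tag` and `finish` unconditionally, so instrumentation stays free of sampling checks.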
Interview Walkthrough
- Frame the scale problem first: 500K req/s × 10 spans = 5M spans/sec raw — storing everything is financially impossible without sampling.
- Describe the three tiers: SDK propagation → per-host agent batching → centralized collector with tail-based sampling.
- Explain W3C traceparent header injection so child spans link to parent spans across service boundaries.
- Propose hybrid sampling: 10% head-based for baseline coverage, 100% for errors and traces exceeding p99 latency thresholds.
- Insert Kafka between collectors and storage backends to absorb ingestion spikes without dropping spans during ES maintenance.
- Compare storage options: Elasticsearch (rich search) vs S3+compact index (Tempo-style, 100× cheaper at petabyte scale).
- Address clock skew: normalize waterfall rendering using parent-child relationships, not absolute timestamps alone.
- Common pitfall: proposing 100% span retention without sampling. Interviewers will push back on the storage bill: the estimate is already ~300 TB per 14 days at 10% sampling, so retaining every span would be roughly 3 PB.
1. Why Introduce Kafka Durability Buffers?
Writing directly from collectors to Elasticsearch exposes the cluster to thundering-herd write spikes that can stall indexing. Adding Kafka between collectors and span ingesters adds a small delay (on the order of 5–10ms) but decouples ingestion from storage health. During Elasticsearch maintenance, Kafka keeps spans on disk, within its retention window, until the ingesters catch up, so spans are not dropped.
2. Head-Based vs Tail-Based Sampling
Head-based sampling is trivial to implement and needs no collector-side memory buffering, so it scales with almost no extra infrastructure. However, a random 10% head decision discards about 90% of slow database calls and production application errors along with everything else. Tail-based sampling can store 100% of error traces, but it requires substantial in-memory buffering inside the collector tier to assemble the full trace before making the decision.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1 — Basic tracing (Jaeger all-in-one)
OpenTelemetry SDK in services. Jaeger agent + collector + Cassandra/Elasticsearch storage. 1% head sampling. Trace ID lookup only.
Key components: OTel SDK · Jaeger collector · Cassandra/ES storage · Head sampling 1% · Trace ID UI
Move to next phase when: Storage cost explodes at 10M spans/sec; need tail sampling and columnar storage
Phase 2 — Production pipeline with tail sampling
OTel Collector with tail sampling processor (keep all errors + p99 latency). ClickHouse for columnar storage. Agent DaemonSet per K8s node. Metrics exemplars linking to traces.
Key components: OTel Collector · Tail sampling · ClickHouse · K8s DaemonSet agents · Exemplars
Move to next phase when: Query patterns exceed trace_id lookup — need service/latency search
Phase 3 — Hyperscale with tiered storage
Tempo-style object storage (S3) for traces with hot cache (Redis) for recent trace_ids. Continuous aggregation for slow-request indexing. Adaptive sampling based on error rate. Trace-driven alerting.
Key components: S3 tiered storage · Redis hot cache · Adaptive sampling · Slow-request index · Trace alerting
Move to next phase when: 7-day retention at scale exceeds $100K/month — need S3 tiering
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Trace ingestion availability | 99.9% | Tracing loss during outage = blind to production issues |
| Trace query latency (by trace_id) | < 2s p99 | On-call needs fast drill-down during incidents |
| Sampling coverage of errors | 100% | Tail sampling must capture every error trace |
| Span ingestion lag | < 30s p99 | Traces must appear in UI within seconds of request |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Collector OOM during traffic spike | Collector pods restarting; span drop rate > 1%; queue depth maxed | Scale collectors horizontally; increase memory limits; reduce head sample rate temporarily; enable drop policy on queue overflow |
| Storage cost doubles in one month | ClickHouse/S3 bill spike; cardinality alert on new high-cardinality tag | Identify new tag source (bad instrumentation deploy); add attribute processor to drop/scrub; reduce sample rate; shorten retention |
| Broken trace propagation after service migration | Orphan span rate > 20%; traces appear as disconnected fragments | Verify traceparent header forwarding through new proxy/LB; check gRPC metadata propagation; rollback instrumentation change |
Cost Drivers (Staff lens)
- Storage: 10M spans/sec × 1% × 500B × 7 days ≈ 30 TB raw, or ≈ 3 TB after ~10× columnar compression (still the largest cost)
- Collector compute: scales with ingestion rate — modest vs storage
- Indexed query storage: service/latency indexes add 20-30% overhead
Multi-Region & DR
Collectors are regional — spans stay in-region (data residency). Cross-region trace lookup: federated query across regional backends (slow) or replicate trace_id index globally. Tail sampling state is per-collector — shard by trace_id for consistency. Don't centralize all spans in one region — latency and compliance.
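Sharding tail-sampling state by trace_id means every span of a trace must be routed to the same collector. One common way to do this is rendezvous (highest-random-weight) hashing, sketched below; this is a swapped-in illustration rather than something the design above prescribes, and the function and collector names are hypothetical.

```python
import hashlib
from typing import Sequence


def owner_collector(trace_id: str, collectors: Sequence[str]) -> str:
    """Pick the collector with the highest hash score for this trace_id."""
    if not collectors:
        raise ValueError("at least one collector is required")

    def score(collector: str) -> bytes:
        # Digest bytes compare lexicographically, which is enough for ranking.
        return hashlib.sha256(f"{collector}|{trace_id}".encode()).digest()

    return max(collectors, key=score)


if __name__ == "__main__":
    collectors = ["collector-0", "collector-1", "collector-2"]
    trace_id = "0af7651916cd43dd8448eb211c80319c"
    print(owner_collector(trace_id, collectors))
```

Every agent computes the same owner independently, and removing a collector only remaps the traces that collector owned, which limits how much buffered state is disrupted during scaling.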