System Design Problem

Design a Distributed Tracing System (like Jaeger / Zipkin)

Commonly Asked By: Uber, Google, Netflix, Microsoft

  • Instrument services: Inject trace context (trace_id, span_id, parent_span_id) into all inter-service network calls.
  • Collect spans: Emit span records from every service, each representing one unit of work (start time, duration, tags, metadata).
  • Correlate traces: Reconstruct the full request flow across multiple downstream microservices using a shared trace_id.
  • Visualize traces: Present a waterfall and timeline visualization of spans showing detailed latency breakdowns.
  • Search traces: Enable searching by trace_id, service name, operation, duration, error status, and tags.
  • Service dependency graph: Discover and visualize topological service-to-service dependencies dynamically.
  • Alerting: Alert on anomalies (p99 latency spikes, high error ratios) detected within the trace telemetry stream.
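
The first three requirements (propagating context, emitting spans, and correlating them by trace_id) can be sketched roughly as follows. This is a minimal illustration, not how Jaeger or Zipkin implement it: the header names `x-trace-id` and `x-span-id` and the helpers `start_span`, `inject`, and `build_tree` are hypothetical names chosen for this example. Real deployments typically use a standardized header format such as the W3C Trace Context `traceparent` header.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str                  # shared by every span in one request
    span_id: str                   # unique to this unit of work
    parent_span_id: Optional[str]  # None for the root span
    service: str
    operation: str
    start_time: float = field(default_factory=time.time)
    duration_ms: float = 0.0
    tags: dict = field(default_factory=dict)


def start_span(service, operation, headers=None):
    """Start a span, continuing the caller's trace if headers carry context."""
    headers = headers or {}
    # Reuse the incoming trace_id, or mint a new one at the edge of the system.
    trace_id = headers.get("x-trace-id") or secrets.token_hex(16)
    # The caller's span_id becomes this span's parent.
    parent = headers.get("x-span-id")
    return Span(trace_id, secrets.token_hex(8), parent, service, operation)


def inject(span, headers):
    """Copy the current span's context into outgoing request headers."""
    headers["x-trace-id"] = span.trace_id
    headers["x-span-id"] = span.span_id
    return headers


def build_tree(spans):
    """Index the spans of one trace as parent_span_id -> list of child spans."""
    children = {}
    for s in spans:
        children.setdefault(s.parent_span_id, []).append(s)
    return children


# Usage: a gateway calls a payments service and passes its context along.
root = start_span("gateway", "GET /checkout")
child = start_span("payments", "charge", inject(root, {}))
tree = build_tree([root, child])
```

In this sketch the root span is stored under the key `None`, so a waterfall view could be rendered by walking `tree` from `tree[None]` downward.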

The distributed tracing architecture comprises three primary tiers: the Data Plane SDK (generates and propagates trace context), the Per-Host Agent Sidecar (batches, filters, and uploads spans), and the Centralized Collector & Ingestion Pipeline (aggregates spans, enforces tail-based sampling, and writes asynchronously to hot and cold storage backends).
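
Tail-based sampling in the collector means deciding whether to keep a trace only after its spans have arrived, so that slow or failed requests are not discarded up front. A minimal sketch under stated assumptions: spans are plain dicts with `trace_id`, `duration_ms`, and an optional `error` flag, trace IDs are hex strings, and the class name `TailSampler`, the threshold, and the baseline rate are illustrative choices rather than values from any particular system.

```python
from collections import defaultdict


class TailSampler:
    def __init__(self, latency_threshold_ms=500.0, baseline_rate=0.01):
        self.latency_threshold_ms = latency_threshold_ms
        self.baseline_rate = baseline_rate
        self.buffer = defaultdict(list)  # trace_id -> spans seen so far

    def add(self, span):
        """Buffer a span until its trace is ready for a sampling decision."""
        self.buffer[span["trace_id"]].append(span)

    def decide(self, trace_id):
        """Return the spans to store, or [] to drop the trace.

        Intended to be called once the trace is considered complete,
        for example after no new spans have arrived for some timeout.
        """
        spans = self.buffer.pop(trace_id, [])
        if not spans:
            return []
        has_error = any(s.get("error") for s in spans)
        is_slow = any(
            s["duration_ms"] >= self.latency_threshold_ms for s in spans
        )
        # Deterministic baseline: map the first 8 hex digits of the trace_id
        # to [0, 1] and keep the traces that fall below the baseline rate.
        bucket = int(trace_id[:8], 16) / 0xFFFFFFFF
        keep = has_error or is_slow or bucket < self.baseline_rate
        return spans if keep else []
```

Deriving the baseline decision from the trace_id rather than a random draw means every collector instance would reach the same verdict for the same trace. The buffering cost is the main trade-off: all spans are held in memory until the decision is made.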
