System Design Problem

Design a Log Aggregation and Search System (like Splunk / ELK)

Commonly Asked By: Splunk, Elastic, Datadog, AWS

  • Collect logs: Ingest logs from thousands of services, servers, containers, and cloud resources
  • Structured and unstructured: Handle JSON, syslog, plain text, and multi-line stack traces
  • Search: Full-text search across all logs with < 5 second latency
  • Filter: By service, severity, time range, hostname, custom fields
  • Live tail: Stream new logs matching a filter in real-time
  • Dashboards: Visualize log volume, error rates, trends
  • Alerting: Alert when the error log rate exceeds a threshold or specific patterns appear
  • Retention: Configurable per log source (7 days to 1 year)
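
The alerting requirement above (error log rate exceeding a threshold) can be illustrated with a sliding-window counter. This is a minimal sketch, not a prescribed design: the class name ErrorRateAlert and the parameters window_seconds and threshold are hypothetical, and it assumes events arrive in non-decreasing timestamp order and carry a severity string such as "ERROR".

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ErrorRateAlert:
    """Fires when ERROR events in the trailing window exceed a threshold."""

    window_seconds: float  # hypothetical: length of the sliding window
    threshold: int         # hypothetical: max ERROR events tolerated per window
    _timestamps: deque = field(default_factory=deque, init=False, repr=False)

    def observe(self, timestamp: float, severity: str) -> bool:
        """Record one log event; return True if the alert condition holds."""
        if severity == "ERROR":
            self._timestamps.append(timestamp)
        # Evict error timestamps that have fallen out of the window.
        # Assumes timestamps are non-decreasing across calls.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold
```

One such object would be kept per alert rule (for example per service), and observe would be called for each event that matches the rule's filter. A production system would more likely evaluate alerts over pre-aggregated counts than over individual events.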

The system uses a pull-collect-push architecture: local agents tail log files, enrich and compress the events, and stream them to Kafka. Ingestors consume from Kafka, parse the raw payloads into structured documents, redact PII, and route the documents to tiered storage.
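
The ingestor stage described above (parse, redact, route) might look roughly like the following sketch. It leaves out the Kafka consumer and treats each message as a string. Everything specific here is an assumption for illustration: the function names, the email-only PII pattern, the "[REDACTED]" marker, and the rule that sends DEBUG logs to a cheaper tier are not taken from the design itself.

```python
import json
import re
from typing import Tuple

# Assumption: only email addresses are treated as PII in this sketch.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def parse(raw: str) -> dict:
    """Turn a raw payload into a document: JSON objects keep their fields,
    anything else is stored as a plain-text message."""
    try:
        doc = json.loads(raw)
        if isinstance(doc, dict):
            return doc
    except json.JSONDecodeError:
        pass
    return {"message": raw}


def redact(doc: dict) -> dict:
    """Mask email addresses in every top-level string field."""
    return {
        key: EMAIL_RE.sub("[REDACTED]", value) if isinstance(value, str) else value
        for key, value in doc.items()
    }


def route(doc: dict) -> str:
    """Pick a storage tier. Hypothetical rule: DEBUG logs go to cheaper storage."""
    return "cold" if doc.get("severity") == "DEBUG" else "hot"


def ingest(raw: str) -> Tuple[str, dict]:
    """Run one payload through parse -> redact -> route."""
    doc = redact(parse(raw))
    return route(doc), doc
```

For example, ingest('{"severity": "ERROR", "message": "login failed for bob@example.com"}') yields the tier "hot" and a document whose message has the address masked. Redacting before routing means no storage tier ever receives the unmasked value.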
