Design a Log Aggregation and Search System (like Splunk / ELK)

Interview Prompt

Design Log Aggregation and Search System (like Splunk / ELK).

Clarifying Questions (ask before designing)

Question	Why it matters
Query QPS vs document corpus size vs index freshness SLA?	Separates retrieval path from indexing pipeline design.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Agent → collector → indexer pipeline
Structured logging
Index rotation & retention
Full-text search at scale
Hot-warm-cold tiering
Log-based alerting

Out of scope (state explicitly)

Application instrumentation SDK design
Full distributed tracing system (#33)
On-call paging and escalation policy (#37)

Assumptions

Index staleness of minutes is acceptable unless real-time is stated
Clarify query QPS vs index update rate early
Managed search/stream stack (Elasticsearch, Kafka) is fine to propose

Collect logs: Ingest logs from thousands of services, servers, containers, cloud resources
Structured and unstructured: Handle JSON, syslog, plain text, multi-line stack traces
Search: Full-text search across all logs with < 5 second latency
Filter: By service, severity, time range, hostname, custom fields
Live tail: Stream new logs matching a filter in real-time
Dashboards: Visualize log volume, error rates, trends
Alerting: Alert when error log rate exceeds threshold or specific patterns appear
Retention: Configurable per log source (7 days to 1 year)

At scale, storage costs and physical input/output throughput dominate the design. Aggressive compression and smart data lifecycle tiering are required.

Metric	Calculation	Value
Log lines / sec	From Log lines / day ÷ 86400 (+ peak factor in value)	1M
Avg log line size	Given (typical workload assumption)	500 bytes
Ingestion throughput	Given (assumption documented in value)	500 MB/s
Storage / day (raw)	Derived from upstream throughput × size	43 TB
With compression (~10x)	Given (assumption documented in value)	4.3 TB/day
30-day retention	Given	~130 TB
1-year (with tiering)	Given	~400 TB

The system employs a pull-collect-push architecture: local agents tail log files, enrich and compress events, and stream them to Kafka. Ingestors consume from Kafka, parse raw payloads into structured documents, redact sensitive PII, and route them to tiered storage.

Loading...

Search Logs

Queries logs using a powerful query syntax, supporting wildcards, logical operators, and precise filters.

HTTP

POST /api/v1/logs/search
{
  "query": "level:ERROR AND service:payment-service AND message:*timeout*",
  "from": "2026-03-14T04:00:00Z",
  "to": "2026-03-14T10:00:00Z",
  "limit": 100,
  "sort": "timestamp:desc"
}

Live Tail (WebSocket)

Establishes a persistent bi-directional connection to stream matching logs to the user in real-time.

JSON

// Client sends subscription request:
{
  "type": "subscribe",
  "filter": "service=payment-service AND level=ERROR"
}

// Server pushes matching logs in real-time as they arrive in Kafka:
{
  "type": "log",
  "data": {
    "timestamp": "2026-03-14T10:05:00.123Z",
    "service": "payment-service",
    "level": "ERROR",
    "message": "Gateway timeout: Stripe API failed to respond"
  }
}

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
504 Gateway Timeout: index shard slow; narrow query or retry

Elasticsearch Index Mapping

Structured logs mapped to specific field data types to optimize space and query speeds. High-cardinality fields are marked as keyword (non-analyzed) to prevent indexing overhead.

JSON

{
  "mappings": {
    "properties": {
      "timestamp": {"type": "date"},
      "level": {"type": "keyword"},
      "service": {"type": "keyword"},
      "host": {"type": "keyword"},
      "env": {"type": "keyword"},
      "message": {"type": "text", "analyzer": "standard"},
      "trace_id": {"type": "keyword"},
      "custom_fields": {"type": "object", "dynamic": true}
    }
  },
  "settings": {
    "index.number_of_shards": 5,
    "index.number_of_replicas": 1,
    "index.lifecycle.name": "logs-policy"
  }
}

Index Lifecycle Management (ILM) Flow

Automates the migration of indices down the storage hierarchy:

Hot Phase:  0-2 days  → 5 shards, 1 replica, SSD, force merge at rollover
Warm Phase: 2-30 days → shrink to 1 shard, read-only, HDD, compressed
Cold Phase: 30-90 days → freeze index, S3 snapshot, 0 replicas
Delete Phase: > 90 days → delete index (or archive snapshot to Glacier)

The ingestion pipeline must handle physical bottlenecks, downstream system failures, and unexpected log volume surges.

Concern	Solution
Agent failure	Agent buffers to local disk; retry on restart
Kafka down	Agent buffers locally (up to 1 GB); retries with backoff
Elasticsearch overload	Kafka absorbs bursts; auto-scale indexer workers
Index corruption	Elasticsearch replicas; daily S3 snapshots
Query overload	Query timeout (30s); rate limiting per user
Disk full	ILM auto-deletes oldest indices; alert on disk usage > 80%

1. Log Volume Spike: Cascading Failure Floods Logs

The Problem: An application encounters an error (e.g., database connection timeout), logs it, retries immediately, and logs it again. In a cluster of 500 pods, this can suddenly flood the logging system with 50M log lines/sec (100x normal load). This causes Kafka topics to fill up, indexing queues to fall behind (hours of query lag), and node disks to run out of space.

Solutions:

Agent-level Rate Limiting: The local agent throttles log shipping. Configured to emit maximum 1,000 log lines/sec per container. Excess logs are sampled or dropped with a summary line: "Dropped 49,000 log entries from payment-service in the last minute".
Kafka Per-Service Quotas: Throttles producers sending too much data, introducing backpressure to the logging framework.
Circuit Breaker in Log Libraries: If logging starts blocking or failing, the application-level log library drops logs gracefully or falls back to basic counters.
Tiered Ingestion (Severity Shedding): During cluster spikes, indexers drop DEBUG and INFO logs to preserve ERROR and CRITICAL events.

2. Multi-Line Log Parsing: Stack Traces

The Problem: Multi-line runtime exceptions (e.g., Java stack traces) are printed on separate lines. If parsed naively, a single stack trace gets split into 10 separate log entries, making it impossible to search or debug.

Solutions:

Agent-level Regexp Assembly: The agent buffers multiline blocks. Configured rule: "Lines starting with whitespace or 'at' are continuations" (e.g., pattern: '^\s|^Caused by|^at ').
Flushing Timeouts: After 500ms without new continuation lines, the agent flushes the multiline buffer to prevent delay.
Structured JSON Logging: By forcing applications to serialize exception objects directly into structured JSON strings on a single stdout line, multi-line agent parsing is eliminated entirely.

3. Index Management: The Elasticsearch Scale Problem

The Problem: A cluster ingesting 4.3 TB/day of compressed logs across 50 distinct services will produce thousands of shards. If every service creates a daily index, the sheer count of active shards will overload the Elasticsearch cluster master node (leading to metadata heap bottlenecks).

Solutions:

Unified Data Streams: Combine small services into a single unified index stream, filtering by a service keyword field at query time.
Force Merge Rollovers: Shrink 5 shards into a single unified read-only shard during warm rollover phase, reducing active file handles.
Replica Pruning: Drop read replicas on indices older than 7 days, relying on automatic S3 snapshots for recovery.

4. Live Tail Scaling: Streaming Matching Logs to Users

The Problem: Users tailing logs expect real-time logs in their console. Polling Elasticsearch every second is expensive and causes severe search cluster load. Running a direct filter consumer on the main raw Kafka topic (1M lines/sec) per user session does not scale.

Solutions:

Pre-Filtered Event Topics: The log processor routes log events matching common high-frequency parameters (such as error alerts) to a secondary, pre-filtered low-volume Kafka topic.
Distributed Session Filtering: Restrict concurrent live tail sessions to 50 active users, fallback to short-poll indexing searches beyond this threshold to protect system integrity.

PII Redaction: Preventing Sensitive Data in Logs

Developers frequently log sensitive customer details by accident (such as emails, passwords, SSNs, credit cards). Leaving these in raw text violates security audits and data protection laws (GDPR, CCPA). Performing a retrospective deletion scan across petabytes of index logs is physically impossible.

Defense-in-Depth Solution:

Layer 1: Code Linting: Prevent PII commit via CI rules checking for typical sensitive parameter names.
Layer 2: Agent Redaction: Local agent executes high-speed regex sanitization matches:
- Email regex: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
- SSN regex: \d{3}-\d{2}-\d{4}
- Credit Card regex: \d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}
Layer 3: Tokenization Vaults: Replace identified PII parameters with reversible transaction tokens (e.g. tok_39d8c4). Only compliance auditors have decryption keys to reverse these tokens in secure vaults.

Structured vs Unstructured Logging: Why JSON Wins

Unstructured logs require complex, fragile Grok/Regex regex processing pipelines, which fail easily on format drift:

Unstructured raw log line:
2026-03-14 10:00:00 ERROR [payment-service] User 12345 payment failed: timeout after 5000ms

Structured JSON log line:
{"timestamp":"2026-03-14T10:00:00Z","level":"ERROR","service":"payment-service","user_id":"12345","message":"payment failed","error":"timeout","duration_ms":5000,"trace_id":"abc123"}

Why JSON wins:

Zero parsing overhead: Direct key-value injection into indexers, no Grok matches required.
Type preservation: Numeric values (e.g., duration_ms) remain queryable numbers rather than strings.
Compression friendly: JSON structures repeating identical field keys are highly deduplicated by compression algorithms (like LZ4/ZSTD).

Log-Trace-Metric Correlation: The Observability Triangle

Tracing a user request failure across multiple microservices is challenging without system-wide correlation.

Unified Troubleshooting Workflow:

The SDK auto-injects trace_id and span_id from OpenTelemetry contexts into every application log payload.
When an error log is flagged, click the trace_id link to load a Jaeger transaction waterfall trace.
Identify the slow DB call, then select "View metrics" to inspect the database engine dashboard instantly.

JSON

{"level":"ERROR","message":"payment timeout","trace_id":"abc123","span_id":"def456"}

Cost Optimization: Calculations & Levers

At 43 TB/day raw, naive SSD storage in Elasticsearch costs $1.57M / year. Applying an aggressive tiering lifecycle architecture shrinks this expenditure dramatically:

Annual Ingest Cost Breakdown with Tiering:
  - Hot tier (Elasticsearch, 2 days):   43 TB × 2 × $0.10 = $8,600
  - Warm tier (ES HDD, 28 days):        43 TB × 28 × $0.03 = $36,120
  - Cold tier (S3 Object, 335 days):    43 TB × 335 × $0.023 = $331,385
  -------------------------------------------------------------
  Annual Cost: ~$376,105 (vs $1,570,000 for pure hot storage).
  
  This represents a 76% cost reduction through architectural tiering.

Additional Cost Levers:

Intelligent Sampling: Retain 100% of ERROR and WARN logs, but sample only 10% of high-volume INFO and DEBUG lines, slashing log ingestion volume by 70%.
Heartbeat/Health Filtering: Filter out constant, redundant health check logs at the agent level before they ever hit the network.

Interview Walkthrough

Draw the pipeline: app → local agent (buffer + redact) → Kafka → indexer → hot/warm/cold storage tiers.
Mandate structured JSON logging — unstructured Grok parsing breaks on format drift and adds ingest latency.
Redact PII at the agent layer (regex + tokenization vault) before logs leave the host — retrospective deletion at petabyte scale is impossible.
Tier storage lifecycle: hot Elasticsearch (2 days) → warm HDD (28 days) → cold S3 (335 days) for 76% cost reduction.
Inject trace_id from OpenTelemetry into every log line to enable one-click jump from error log → trace waterfall → metrics.
Sample INFO/DEBUG at 10%, retain 100% of ERROR/WARN — cuts ingest volume ~70% with minimal observability loss.
Quantify scale: 500K logs/sec × 2 KB ≈ 1 GB/s ingest — agents must batch and compress before network upload.
Common pitfall: indexing all logs in hot Elasticsearch indefinitely — storage cost reaches ~$1.5M/year at 43 TB/day raw volume.

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.

Interview Prompt

Clarifying Questions (ask before designing)

Scope

In scope

Out of scope (state explicitly)

Assumptions

Event Bus Design (Kafka)

Search Logs

Live Tail (WebSocket)

Common Error Responses

Elasticsearch Index Mapping

Index Lifecycle Management (ILM) Flow

1. Log Volume Spike: Cascading Failure Floods Logs

2. Multi-Line Log Parsing: Stack Traces

3. Index Management: The Elasticsearch Scale Problem

4. Live Tail Scaling: Streaming Matching Logs to Users

PII Redaction: Preventing Sensitive Data in Logs

Structured vs Unstructured Logging: Why JSON Wins

Log-Trace-Metric Correlation: The Observability Triangle

Cost Optimization: Calculations & Levers

Interview Walkthrough

Indexing Everything vs Indexing Labels Only

Phase 1: MVP (0 to 100K users)

Phase 2: Growth (100K to 10M users)

Phase 3: Scale (10M+ users)

SLOs & Error Budgets

Incident Scenarios (2am reality)

Cost Drivers (Staff lens)

Multi-Region & DR