Interview Prompt
Design Log Aggregation and Search System (like Splunk / ELK).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Query QPS vs document corpus size vs index freshness SLA? | Separates retrieval path from indexing pipeline design. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Agent → collector → indexer pipeline
- Structured logging
- Index rotation & retention
- Full-text search at scale
- Hot-warm-cold tiering
- Log-based alerting
Out of scope (state explicitly)
- Application instrumentation SDK design
- Full distributed tracing system (#33)
- On-call paging and escalation policy (#37)
Assumptions
- Index staleness of minutes is acceptable unless real-time is stated
- Clarify query QPS vs index update rate early
- Managed search/stream stack (Elasticsearch, Kafka) is fine to propose
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Collect logs: Ingest logs from thousands of services, servers, containers, cloud resources
- Structured and unstructured: Handle JSON, syslog, plain text, multi-line stack traces
- Search: Full-text search across all logs with < 5 second latency
- Filter: By service, severity, time range, hostname, custom fields
- Live tail: Stream new logs matching a filter in real-time
- Dashboards: Visualize log volume, error rates, trends
- Alerting: Alert when error log rate exceeds threshold or specific patterns appear
- Retention: Configurable per log source (7 days to 1 year)
- High Throughput: Ingest 1M+ log lines/sec
- Low Latency Search: Query results in < 5 seconds for last 24 hours
- Scalability: Petabytes of logs, thousands of data sources
- Durability: Ingested logs must not be lost
- Cost Efficient: Storage costs are the dominant expense: compress and tier aggressively
- High Availability: 99.99%
At scale, storage costs and physical input/output throughput dominate the design. Aggressive compression and smart data lifecycle tiering are required.
| Metric | Calculation | Value |
|---|---|---|
| Log lines / sec | From Log lines / day ÷ 86400 (+ peak factor in value) | 1M |
| Avg log line size | Given (typical workload assumption) | 500 bytes |
| Ingestion throughput | Given (assumption documented in value) | 500 MB/s |
| Storage / day (raw) | Derived from upstream throughput × size | 43 TB |
| With compression (~10x) | Given (assumption documented in value) | 4.3 TB/day |
| 30-day retention | Given | ~130 TB |
| 1-year (with tiering) | Given | ~400 TB |
The system employs a pull-collect-push architecture: local agents tail log files, enrich and compress events, and stream them to Kafka. Ingestors consume from Kafka, parse raw payloads into structured documents, redact sensitive PII, and route them to tiered storage.
Event Bus Design (Kafka)
Topic: log_aggregation_search-events Partitions: 64 (scale consumers horizontally) Partition key: entity_id (user_id / order_id — preserves per-entity ordering) Retention: 7 days (compliance) or 24h (high-volume telemetry) Replication factor: 3, min.insync.replicas: 2 Producer: idempotent producer enabled (enable.idempotence=true) Consumer: consumer group "log_aggregation_search-processors" - At-least-once delivery + idempotent handlers (dedup by event_id) - DLQ topic: log_aggregation_search-events-dlq (poison messages after 3 retries) - Lag alert: consumer lag > 60s → scale workers Design a Log Aggregation and Search System (like Splunk / ELK): async side effects MUST NOT block the synchronous API response. Sync path: validate → persist source of truth → publish event → return 201 Async path: consumers update caches, indexes, notifications, aggregates
Search Logs
Queries logs using a powerful query syntax, supporting wildcards, logical operators, and precise filters.
POST /api/v1/logs/search
{
"query": "level:ERROR AND service:payment-service AND message:*timeout*",
"from": "2026-03-14T04:00:00Z",
"to": "2026-03-14T10:00:00Z",
"limit": 100,
"sort": "timestamp:desc"
}Live Tail (WebSocket)
Establishes a persistent bi-directional connection to stream matching logs to the user in real-time.
// Client sends subscription request:
{
"type": "subscribe",
"filter": "service=payment-service AND level=ERROR"
}
// Server pushes matching logs in real-time as they arrive in Kafka:
{
"type": "log",
"data": {
"timestamp": "2026-03-14T10:05:00.123Z",
"service": "payment-service",
"level": "ERROR",
"message": "Gateway timeout: Stripe API failed to respond"
}
}Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff 504 Gateway Timeout: index shard slow; narrow query or retry
Elasticsearch Index Mapping
Structured logs mapped to specific field data types to optimize space and query speeds. High-cardinality fields are marked as keyword (non-analyzed) to prevent indexing overhead.
{
"mappings": {
"properties": {
"timestamp": {"type": "date"},
"level": {"type": "keyword"},
"service": {"type": "keyword"},
"host": {"type": "keyword"},
"env": {"type": "keyword"},
"message": {"type": "text", "analyzer": "standard"},
"trace_id": {"type": "keyword"},
"custom_fields": {"type": "object", "dynamic": true}
}
},
"settings": {
"index.number_of_shards": 5,
"index.number_of_replicas": 1,
"index.lifecycle.name": "logs-policy"
}
}Index Lifecycle Management (ILM) Flow
Automates the migration of indices down the storage hierarchy:
Hot Phase: 0-2 days → 5 shards, 1 replica, SSD, force merge at rollover Warm Phase: 2-30 days → shrink to 1 shard, read-only, HDD, compressed Cold Phase: 30-90 days → freeze index, S3 snapshot, 0 replicas Delete Phase: > 90 days → delete index (or archive snapshot to Glacier)
The ingestion pipeline must handle physical bottlenecks, downstream system failures, and unexpected log volume surges.
| Concern | Solution |
|---|---|
| Agent failure | Agent buffers to local disk; retry on restart |
| Kafka down | Agent buffers locally (up to 1 GB); retries with backoff |
| Elasticsearch overload | Kafka absorbs bursts; auto-scale indexer workers |
| Index corruption | Elasticsearch replicas; daily S3 snapshots |
| Query overload | Query timeout (30s); rate limiting per user |
| Disk full | ILM auto-deletes oldest indices; alert on disk usage > 80% |
1. Log Volume Spike: Cascading Failure Floods Logs
The Problem: An application encounters an error (e.g., database connection timeout), logs it, retries immediately, and logs it again. In a cluster of 500 pods, this can suddenly flood the logging system with 50M log lines/sec (100x normal load). This causes Kafka topics to fill up, indexing queues to fall behind (hours of query lag), and node disks to run out of space.
Solutions:
- Agent-level Rate Limiting: The local agent throttles log shipping. Configured to emit maximum 1,000 log lines/sec per container. Excess logs are sampled or dropped with a summary line:
"Dropped 49,000 log entries from payment-service in the last minute". - Kafka Per-Service Quotas: Throttles producers sending too much data, introducing backpressure to the logging framework.
- Circuit Breaker in Log Libraries: If logging starts blocking or failing, the application-level log library drops logs gracefully or falls back to basic counters.
- Tiered Ingestion (Severity Shedding): During cluster spikes, indexers drop
DEBUGandINFOlogs to preserveERRORandCRITICALevents.
2. Multi-Line Log Parsing: Stack Traces
The Problem: Multi-line runtime exceptions (e.g., Java stack traces) are printed on separate lines. If parsed naively, a single stack trace gets split into 10 separate log entries, making it impossible to search or debug.
Solutions:
- Agent-level Regexp Assembly: The agent buffers multiline blocks. Configured rule:
"Lines starting with whitespace or 'at' are continuations"(e.g.,pattern: '^\s|^Caused by|^at '). - Flushing Timeouts: After 500ms without new continuation lines, the agent flushes the multiline buffer to prevent delay.
- Structured JSON Logging: By forcing applications to serialize exception objects directly into structured JSON strings on a single stdout line, multi-line agent parsing is eliminated entirely.
3. Index Management: The Elasticsearch Scale Problem
The Problem: A cluster ingesting 4.3 TB/day of compressed logs across 50 distinct services will produce thousands of shards. If every service creates a daily index, the sheer count of active shards will overload the Elasticsearch cluster master node (leading to metadata heap bottlenecks).
Solutions:
- Unified Data Streams: Combine small services into a single unified index stream, filtering by a
servicekeyword field at query time. - Force Merge Rollovers: Shrink 5 shards into a single unified read-only shard during warm rollover phase, reducing active file handles.
- Replica Pruning: Drop read replicas on indices older than 7 days, relying on automatic S3 snapshots for recovery.
4. Live Tail Scaling: Streaming Matching Logs to Users
The Problem: Users tailing logs expect real-time logs in their console. Polling Elasticsearch every second is expensive and causes severe search cluster load. Running a direct filter consumer on the main raw Kafka topic (1M lines/sec) per user session does not scale.
Solutions:
- Pre-Filtered Event Topics: The log processor routes log events matching common high-frequency parameters (such as error alerts) to a secondary, pre-filtered low-volume Kafka topic.
- Distributed Session Filtering: Restrict concurrent live tail sessions to 50 active users, fallback to short-poll indexing searches beyond this threshold to protect system integrity.
PII Redaction: Preventing Sensitive Data in Logs
Developers frequently log sensitive customer details by accident (such as emails, passwords, SSNs, credit cards). Leaving these in raw text violates security audits and data protection laws (GDPR, CCPA). Performing a retrospective deletion scan across petabytes of index logs is physically impossible.
Defense-in-Depth Solution:
- Layer 1: Code Linting: Prevent PII commit via CI rules checking for typical sensitive parameter names.
- Layer 2: Agent Redaction: Local agent executes high-speed regex sanitization matches:
- Email regex:
[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} - SSN regex:
\d{3}-\d{2}-\d{4} - Credit Card regex:
\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}
- Email regex:
- Layer 3: Tokenization Vaults: Replace identified PII parameters with reversible transaction tokens (e.g.
tok_39d8c4). Only compliance auditors have decryption keys to reverse these tokens in secure vaults.
Structured vs Unstructured Logging: Why JSON Wins
Unstructured logs require complex, fragile Grok/Regex regex processing pipelines, which fail easily on format drift:
Unstructured raw log line:
2026-03-14 10:00:00 ERROR [payment-service] User 12345 payment failed: timeout after 5000ms
Structured JSON log line:
{"timestamp":"2026-03-14T10:00:00Z","level":"ERROR","service":"payment-service","user_id":"12345","message":"payment failed","error":"timeout","duration_ms":5000,"trace_id":"abc123"}Why JSON wins:
- Zero parsing overhead: Direct key-value injection into indexers, no Grok matches required.
- Type preservation: Numeric values (e.g.,
duration_ms) remain queryable numbers rather than strings. - Compression friendly: JSON structures repeating identical field keys are highly deduplicated by compression algorithms (like LZ4/ZSTD).
Log-Trace-Metric Correlation: The Observability Triangle
Tracing a user request failure across multiple microservices is challenging without system-wide correlation.
Unified Troubleshooting Workflow:
- The SDK auto-injects
trace_idandspan_idfrom OpenTelemetry contexts into every application log payload. - When an error log is flagged, click the
trace_idlink to load a Jaeger transaction waterfall trace. - Identify the slow DB call, then select "View metrics" to inspect the database engine dashboard instantly.
{"level":"ERROR","message":"payment timeout","trace_id":"abc123","span_id":"def456"}Cost Optimization: Calculations & Levers
At 43 TB/day raw, naive SSD storage in Elasticsearch costs $1.57M / year. Applying an aggressive tiering lifecycle architecture shrinks this expenditure dramatically:
Annual Ingest Cost Breakdown with Tiering: - Hot tier (Elasticsearch, 2 days): 43 TB × 2 × $0.10 = $8,600 - Warm tier (ES HDD, 28 days): 43 TB × 28 × $0.03 = $36,120 - Cold tier (S3 Object, 335 days): 43 TB × 335 × $0.023 = $331,385 ------------------------------------------------------------- Annual Cost: ~$376,105 (vs $1,570,000 for pure hot storage). This represents a 76% cost reduction through architectural tiering.
Additional Cost Levers:
- Intelligent Sampling: Retain 100% of
ERRORandWARNlogs, but sample only 10% of high-volumeINFOandDEBUGlines, slashing log ingestion volume by 70%. - Heartbeat/Health Filtering: Filter out constant, redundant health check logs at the agent level before they ever hit the network.
Interview Walkthrough
- Draw the pipeline: app → local agent (buffer + redact) → Kafka → indexer → hot/warm/cold storage tiers.
- Mandate structured JSON logging — unstructured Grok parsing breaks on format drift and adds ingest latency.
- Redact PII at the agent layer (regex + tokenization vault) before logs leave the host — retrospective deletion at petabyte scale is impossible.
- Tier storage lifecycle: hot Elasticsearch (2 days) → warm HDD (28 days) → cold S3 (335 days) for 76% cost reduction.
- Inject
trace_idfrom OpenTelemetry into every log line to enable one-click jump from error log → trace waterfall → metrics. - Sample INFO/DEBUG at 10%, retain 100% of ERROR/WARN — cuts ingest volume ~70% with minimal observability loss.
- Quantify scale: 500K logs/sec × 2 KB ≈ 1 GB/s ingest — agents must batch and compress before network upload.
- Common pitfall: indexing all logs in hot Elasticsearch indefinitely — storage cost reaches ~$1.5M/year at 43 TB/day raw volume.
Indexing Everything vs Indexing Labels Only
How much index metadata do we write? Full-text index on all message lines vs metadata indexing.
Full indexing (Elasticsearch): ✓ Every word in every log line is indexed into an inverted index. ✓ Ad-hoc search queries (e.g., "timeout in database transaction") execute instantly. ✗ Ingestion CPU overhead is massive. ✗ Index files occupy 1.5x - 2x the raw log size, doubling storage costs. Label-only indexing (Grafana Loki): ✓ Only metadata tags are indexed (service, level, host, environment). ✓ Log message bodies are stored as raw compressed chunks without indices. ✓ 10x less storage, and 10x higher ingestion rates. ✗ Searching log messages requires brute-force grep scan across chunks (slow for large data ranges).
Decision Matrix: Use **Loki** if your operational queries are highly structured around known services (e.g., "logs for payment pod"). Use **Elasticsearch** if your workflow demands arbitrary search engine capabilities across all system components simultaneously.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core log aggregation search flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.