Design a Real-time Dashboard and Metrics System

Interview Prompt

Design a real-time dashboard and metrics system.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is the highest priority: WebSocket push, materialized view refresh, or time-range queries?	Forces scope negotiation; senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

WebSocket push
Materialized view refresh
Time-range queries
Downsampling for display
Capacity estimation with shown math

Out of scope (state explicitly)

Application instrumentation SDK design
Full distributed tracing system (#33)
On-call paging and escalation policy (#37)

Assumptions

Clarify scale (DAU, QPS, data volume) for the real-time dashboard in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Create dashboards: Users create dashboards with multiple panels
Widget types: Line charts, bar charts, pie charts, heatmaps, tables, single-stat, alerts
Data sources: Query from Prometheus, ClickHouse, Elasticsearch, PostgreSQL, custom APIs
Auto-refresh: Dashboards auto-refresh at configurable intervals
Templating: Dashboard variables (dropdown for service name, region)
Alerting: Define alert rules; trigger notifications when thresholds breached
Sharing: Share dashboards via link; embed in other tools
Annotations: Mark events on time-series charts

Metric	Calculation	Value
Active dashboards	Given	50K
Panels per dashboard (avg)	Given	10
Dashboard views / day	Given	2M
Queries / sec (peak)	50K active dashboards × 10 panels ÷ ~10 s refresh interval (assumed); initial loads alone are only 2M × 10 ÷ 86,400 ≈ 230/s	50K
Avg query response time	Given	500 ms
Dashboard definitions storage	Given	< 1 GB (JSON configs)
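The peak QPS figure is worth reproducing explicitly. The sketch below is one way to show the math; it assumes every active dashboard is open at once and auto-refreshes every 10 seconds. That refresh interval (`REFRESH_INTERVAL_S`) is an assumption chosen to match the 50K figure, not a number given in the table.

```python
# Back-of-envelope capacity math for the table above.
ACTIVE_DASHBOARDS = 50_000
PANELS_PER_DASHBOARD = 10
VIEWS_PER_DAY = 2_000_000
REFRESH_INTERVAL_S = 10  # assumption: not stated in the table


def initial_load_qps(views_per_day: int, panels: int) -> float:
    """Average queries/sec caused by dashboards being opened."""
    return views_per_day * panels / 86_400


def refresh_qps(active: int, panels: int, interval_s: int) -> float:
    """Queries/sec caused by open dashboards auto-refreshing every panel."""
    return active * panels / interval_s


load = initial_load_qps(VIEWS_PER_DAY, PANELS_PER_DASHBOARD)
refresh = refresh_qps(ACTIVE_DASHBOARDS, PANELS_PER_DASHBOARD, REFRESH_INTERVAL_S)
print(f"initial loads: {load:,.0f} qps")
print(f"auto-refresh:  {refresh:,.0f} qps")
```

Under these assumptions auto-refresh traffic, not page opens, dominates the load, which is why caching and refresh jitter matter so much later in the design.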


Query Proxy: The Performance Layer

Query Proxy:
  1. Parse dashboard template variables
  2. Route to correct data source
  3. Check query cache (Redis): same query within last 30s → return cached
  4. If cache miss → execute query against data source
  5. Apply transformations
  6. Cache result with TTL = refresh_interval
  7. Return to frontend

Query caching is critical: 100 users viewing the same dashboard issue 100 identical queries per panel
With caching: 1 query per panel reaches the backend, the other 99 are served from cache
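The seven proxy steps can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `TTLCache` is an in-memory stand-in for Redis, `QueryProxy` and its `backend_calls` counter are hypothetical names, and the "transformation" step is reduced to rounding values.

```python
import hashlib
import time
from typing import Callable, Dict, List


class TTLCache:
    """In-memory stand-in for Redis (assumption: production would use Redis with a TTL)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key: str):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if expires_at <= self._clock():
            del self._data[key]  # expired
            return None
        return value

    def set(self, key: str, value, ttl_s: float) -> None:
        self._data[key] = (self._clock() + ttl_s, value)


class QueryProxy:
    def __init__(self, datasources: Dict[str, Callable[[str], List[float]]], cache: TTLCache):
        self.datasources = datasources
        self.cache = cache
        self.backend_calls = 0  # only for demonstrating the cache hit ratio

    def run(self, datasource: str, template: str, variables: Dict[str, str],
            refresh_interval_s: float = 30) -> List[float]:
        # 1. Substitute template variables (longest name first to avoid prefix clashes).
        query = template
        for name in sorted(variables, key=len, reverse=True):
            query = query.replace(f"${name}", variables[name])
        # 2. Route to the correct data source.
        execute = self.datasources[datasource]
        # 3. Check the query cache.
        digest = hashlib.sha256(f"{datasource}|{query}".encode("utf-8")).hexdigest()
        key = f"query_cache:{digest}"
        cached = self.cache.get(key)
        if cached is not None:
            return cached
        # 4. Cache miss: execute against the data source.
        self.backend_calls += 1
        raw = execute(query)
        # 5. Apply transformations (here: just rounding).
        result = [round(v, 2) for v in raw]
        # 6. Cache with TTL = refresh interval, then 7. return to the frontend.
        self.cache.set(key, result, refresh_interval_s)
        return result
```

Calling `run` 100 times with the same arguments inside the TTL window should reach the data source once, which is the 1-vs-99 split described above.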

Dashboard Definition Model

JSON

{
  "dashboard_id": "dash-uuid",
  "title": "Production Overview",
  "variables": [
    {"name": "service", "type": "query", "query": "label_values(service_name)"}
  ],
  "panels": [
    {
      "panel_id": 1,
      "title": "Request Rate",
      "type": "timeseries",
      "datasource": "prometheus",
      "query": "rate(http_requests_total{service=\"$service\"}[5m])",
      "interval": "1m",
      "position": {"x": 0, "y": 0, "w": 12, "h": 8}
    }
  ]
}

Dashboard CRUD

HTTP

POST   /api/v1/dashboards        ← Create dashboard
GET    /api/v1/dashboards/{id}   ← Get dashboard config
PUT    /api/v1/dashboards/{id}   ← Update dashboard
DELETE /api/v1/dashboards/{id}   ← Delete dashboard

Query Data for Panel

HTTP

POST /api/v1/query
{
  "datasource": "prometheus",
  "query": "rate(http_requests_total{service='user-service'}[5m])",
  "from": "2026-03-14T04:00:00Z",
  "to": "2026-03-14T10:00:00Z",
  "interval": "1m"
}
→ { "data": [{"timestamp": 1773460800, "value": 342.5}, ...] }

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or stale version; re-fetch the latest version before retrying
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
504 Gateway Timeout: data source query too slow; narrow the time range or retry

PostgreSQL: Dashboard Definitions

SQL

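CREATE TABLE dashboards (
    dashboard_id    UUID PRIMARY KEY,
    org_id          UUID NOT NULL,
    title           VARCHAR(256),
    config          JSONB NOT NULL,
    version         INT DEFAULT 1,   -- incremented on every update (optimistic locking)
    created_by      UUID,
    updated_at      TIMESTAMP
);

-- PostgreSQL has no inline INDEX clause; the index is created separately.
CREATE INDEX idx_org ON dashboards (org_id);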

CREATE TABLE dashboard_versions (
    dashboard_id    UUID,
    version         INT,
    config          JSONB,
    updated_by      UUID,
    updated_at      TIMESTAMP,
    PRIMARY KEY (dashboard_id, version)
);

Redis: Query Cache

Key:    query_cache:{hash(datasource + query + time_range)}
        (snap time_range to the query interval so near-simultaneous requests share a key)
Value:  compressed JSON result
TTL:    30 seconds (or dashboard refresh interval)
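One subtlety with this key: a "last 6 hours" panel has a slightly different absolute time range on every request, so hashing the raw range would almost never hit. A possible fix, sketched below, is to snap both ends of the range down to the query interval before hashing. The function name `cache_key` and the `|` separator are illustrative choices, not part of the original design.

```python
import hashlib


def cache_key(datasource: str, query: str, from_ts: int, to_ts: int, interval_s: int) -> str:
    """Build a cache key that is stable for requests within the same interval bucket."""
    # Snap both ends down to an interval boundary so requests issued a few
    # seconds apart for the same relative range produce the same key.
    start = from_ts - from_ts % interval_s
    end = to_ts - to_ts % interval_s
    raw = f"{datasource}|{query}|{start}|{end}|{interval_s}"
    return "query_cache:" + hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

With a 60-second interval, two users opening the same panel 25 seconds apart would share one cache entry as long as both requests fall in the same minute.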

Concern	Solution
Data source down	Show stale cached data with 'Data source unavailable' warning
Query timeout	30-second timeout; show partial results or error per panel
Dashboard store down	Read from PostgreSQL replica; dashboard configs are small and cacheable
Concurrent dashboard edits	Optimistic locking (version field)
Query abuse	Per-user query rate limiting; max query complexity limits
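The optimistic-locking row can be illustrated with a conditional update on the version field. This is a sketch only: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, a simplified table, and a hypothetical helper named `save_dashboard`.

```python
import sqlite3


def save_dashboard(conn: sqlite3.Connection, dashboard_id: str,
                   new_config: str, expected_version: int) -> bool:
    """Return True if the save won, False if another editor saved first."""
    cur = conn.execute(
        "UPDATE dashboards SET config = ?, version = version + 1 "
        "WHERE dashboard_id = ? AND version = ?",
        (new_config, dashboard_id, expected_version),
    )
    conn.commit()
    # Zero rows updated means the stored version no longer matches.
    return cur.rowcount == 1


# Two editors both loaded version 1; only the first save succeeds.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dashboards (dashboard_id TEXT PRIMARY KEY, config TEXT, version INTEGER)"
)
conn.execute("INSERT INTO dashboards VALUES ('dash-1', '{}', 1)")
first = save_dashboard(conn, "dash-1", '{"title": "A"}', expected_version=1)
second = save_dashboard(conn, "dash-1", '{"title": "B"}', expected_version=1)
```

The losing editor would typically receive a 409 and be asked to reload the latest version before reapplying the change.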

Dashboard Stampede: Incident Causes Everyone to Open Dashboards

500 engineers open the same 10-panel dashboard → 5,000 queries fired simultaneously!

Solutions:
  1. Query result caching ⭐: First request cache miss → 1 backend query; rest cached
  2. Request coalescing (singleflight): In-flight queries wait, don't duplicate
  3. Auto-refresh stagger: Add random jitter to refresh intervals
  4. Pre-computed dashboards for P0 incidents
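Solutions 2 and 3 can be sketched in a few lines. The `SingleFlight` class below is a minimal thread-based illustration of request coalescing (the name is borrowed from Go's singleflight package; this is not a port of it), and `next_refresh_delay` is a hypothetical helper showing refresh jitter.

```python
import random
import threading
from concurrent.futures import Future


class SingleFlight:
    """Coalesce concurrent calls for the same key into one execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}

    def do(self, key, fn):
        with self._lock:
            fut = self._inflight.get(key)
            leader = fut is None
            if leader:
                fut = Future()
                self._inflight[key] = fut
        if not leader:
            return fut.result()  # block until the leader finishes
        try:
            fut.set_result(fn())
        except Exception as exc:
            fut.set_exception(exc)  # followers see the same failure
        finally:
            with self._lock:
                del self._inflight[key]
        return fut.result()


def next_refresh_delay(interval_s: float, jitter_fraction: float = 0.1) -> float:
    """Refresh interval with +/- jitter so clients do not refresh in lockstep."""
    return interval_s * (1 + random.uniform(-jitter_fraction, jitter_fraction))
```

Coalescing only covers callers that arrive while a query is in flight; a caller that arrives just after it completes starts a new one, which is why it is paired with the result cache rather than replacing it.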

Data Source Timeout: Panel Shows Error While Others Load

Bad UX: Entire dashboard waits for the slowest panel → 30s to load

Solution: Independent panel loading ⭐
  Each panel queries independently (parallel)
  Panels with fast sources render immediately
  Slow panels show "Loading..." then result or timeout
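Independent panel loading might look like the following on the server side, assuming an asyncio-based backend. The function names and the `on_ready` callback are illustrative; in a browser the same idea is one fetch per panel, each rendering as it resolves.

```python
import asyncio
from typing import Awaitable, Callable, Dict


async def load_panel(panel_id: str, fetch: Callable[[], Awaitable], timeout_s: float):
    """Fetch one panel with its own timeout; never raise to the caller."""
    try:
        data = await asyncio.wait_for(fetch(), timeout=timeout_s)
        return panel_id, {"status": "ok", "data": data}
    except asyncio.TimeoutError:
        return panel_id, {"status": "timeout"}
    except Exception as exc:
        return panel_id, {"status": "error", "message": str(exc)}


async def load_dashboard(panels: Dict[str, Callable[[], Awaitable]],
                         timeout_s: float = 30.0,
                         on_ready: Callable[[str, dict], None] = lambda pid, outcome: None):
    """Start all panel queries in parallel and report each one as it finishes."""
    tasks = [asyncio.create_task(load_panel(pid, fetch, timeout_s))
             for pid, fetch in panels.items()]
    results = {}
    for finished in asyncio.as_completed(tasks):
        panel_id, outcome = await finished
        results[panel_id] = outcome
        on_ready(panel_id, outcome)  # e.g. push this panel to the client now
    return results
```

Because each panel carries its own timeout, total load time is bounded by the timeout rather than by the sum of slow sources, and one failing data source cannot blank the whole dashboard.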

Alert Evaluation: Avoiding False Positives

Rule: "Alert if error rate > 5% for 5 minutes"

State machine: OK → (breached) → PENDING → (for duration) → FIRING → (normal) → OK

Hysteresis: fire at 5%, resolve at 3% → prevents flapping
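The state machine and hysteresis rule above can be sketched as a small class. The thresholds come from the example rule (fire at 5%, resolve at 3%, hold for 5 minutes); the class name `AlertRule` and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    fire_threshold: float = 0.05     # breach level
    resolve_threshold: float = 0.03  # must drop below this to clear (hysteresis)
    for_seconds: int = 300           # breach must persist this long before firing
    state: str = "OK"
    pending_since: Optional[float] = None

    def evaluate(self, value: float, now: float) -> str:
        """Feed one sample (value at time `now`, in seconds) and return the new state."""
        if self.state == "OK":
            if value > self.fire_threshold:
                self.state, self.pending_since = "PENDING", now
        elif self.state == "PENDING":
            if value <= self.fire_threshold:
                # Breach did not last: back to OK without notifying anyone.
                self.state, self.pending_since = "OK", None
            elif now - self.pending_since >= self.for_seconds:
                self.state = "FIRING"
        elif self.state == "FIRING":
            # Stay firing between 3% and 5% to avoid flapping.
            if value < self.resolve_threshold:
                self.state, self.pending_since = "OK", None
        return self.state
```

A metric hovering at 4% after firing keeps the alert in FIRING, so a value oscillating around the 5% line does not produce a stream of fire and resolve notifications.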

Interview Walkthrough

Separate dashboard rendering (WebSocket push to browsers) from metric storage (time-series DB) — they scale independently.
Pre-aggregate common queries at ingest (1-minute rollups) so dashboard panels query pre-computed buckets, not raw samples.
Push live updates via WebSocket subscriptions keyed by dashboard ID — clients receive delta refreshes, not full re-queries.
Cache query results with TTL and singleflight coalescing so 500 engineers opening the same incident dashboard share one backend query.
Load each panel independently with per-panel timeouts — fast panels render immediately while slow data sources show loading states.
Implement alert evaluation as a state machine (OK → PENDING → FIRING) with hysteresis to prevent flapping on noisy metrics.
Add random jitter to auto-refresh intervals so all clients don't hit the backend on the same second.
Common pitfall: letting every dashboard viewer trigger a fresh backend query on each refresh — the query stampede during incidents takes down the metrics store.
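The 1-minute rollup mentioned in the walkthrough can be sketched as a simple bucketing pass. This assumes samples arrive as `(unix_timestamp, value)` pairs and that average, max, and count are the aggregates worth keeping; a real pipeline would choose aggregates per metric type.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def rollup(samples: Iterable[Tuple[int, float]], bucket_s: int = 60) -> List[Dict]:
    """Aggregate raw samples into fixed-width time buckets."""
    # bucket start -> [sum, count, max]
    buckets = defaultdict(lambda: [0.0, 0, float("-inf")])
    for ts, value in samples:
        bucket = buckets[ts - ts % bucket_s]
        bucket[0] += value
        bucket[1] += 1
        bucket[2] = max(bucket[2], value)
    return [
        {"timestamp": start, "avg": total / count, "max": peak, "count": count}
        for start, (total, count, peak) in sorted(buckets.items())
    ]
```

The same function with a larger `bucket_s` doubles as display downsampling: a 6-hour panel that is 600 pixels wide has no use for more than a few hundred points.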

Server-Side vs Client-Side Rendering

Approach	Interactivity	Server Load	Large Dataset Handling
Client-side (Grafana) ⭐	Rich (zoom, hover, pan)	Lower (browser renders)	Weak: browser struggles with large result sets unless downsampled
Server-side (image-based)	None (static image)	Higher (server renders)	Good (only the rendered image is sent)

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
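To make the availability target concrete, it helps to convert it into an error budget. The helper below is a small sketch; the 30-day window is an assumption, since the table does not state the measurement period.

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    return (1 - slo) * days * 24 * 60


budget = error_budget_minutes(0.9995)
print(f"99.95% over 30 days allows about {budget:.1f} minutes of downtime")
```

At 99.95% the budget is roughly 21.6 minutes per 30 days, so a single slow failover can consume most of a month's allowance.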

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.