Interview Prompt
Design a real-time dashboard and metrics system.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: WebSocket push, Materialized view refresh, Time-range queries? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- WebSocket push
- Materialized view refresh
- Time-range queries
- Downsampling for display
- Capacity estimation with shown math
Out of scope (state explicitly)
- Application instrumentation SDK design
- Full distributed tracing system (#33)
- On-call paging and escalation policy (#37)
Assumptions
- Clarify scale (DAU, QPS, data volume) for the real-time dashboard and metrics system in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
Functional Requirements
- Create dashboards: Users create dashboards with multiple panels
- Widget types: Line charts, bar charts, pie charts, heatmaps, tables, single-stat, alerts
- Data sources: Query from Prometheus, ClickHouse, Elasticsearch, PostgreSQL, custom APIs
- Auto-refresh: Dashboards auto-refresh at configurable intervals
- Templating: Dashboard variables (dropdown for service name, region)
- Alerting: Define alert rules; trigger notifications when thresholds breached
- Sharing: Share dashboards via link; embed in other tools
- Annotations: Mark events on time-series charts
Non-Functional Requirements
- Low Latency: Dashboard renders all panels in < 3 seconds
- High Availability: 99.99%, because dashboards are used for incident response
- Concurrent Users: Support 10K users viewing dashboards simultaneously
- Scalability: Thousands of dashboards, each querying billions of data points
- Responsive: Work on desktop and mobile
| Metric | Calculation | Value |
|---|---|---|
| Active dashboards | Given | 50K |
| Panels per dashboard (avg) | Given | 10 |
| Dashboard views / day | Given | 2M |
| Queries / sec (peak) | 2M views × 10 panels ÷ 86,400 s ≈ 230/s average from initial loads; the peak target adds auto-refresh traffic and an incident burst factor | 50K |
| Avg query response time | Given | 500 ms |
| Dashboard definitions storage | Given | < 1 GB (JSON configs) |
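The peak figure is easier to defend when the arithmetic is written out. The sketch below uses only the inputs from the table and assumes one query per panel per dashboard view (an assumption, not stated in the table). It shows that initial page loads alone give a modest average, so the 50K peak presumably rests on auto-refresh traffic and incident-time bursts that the table does not itemize.

```python
# Back-of-envelope check using only the figures from the capacity table.
SECONDS_PER_DAY = 86_400

dashboard_views_per_day = 2_000_000
panels_per_dashboard = 10
stated_peak_qps = 50_000

# Assumption: each dashboard view issues one query per panel on initial load.
panel_queries_per_day = dashboard_views_per_day * panels_per_dashboard
average_qps = panel_queries_per_day / SECONDS_PER_DAY

# Multiplier needed to get from the load-only average to the stated peak.
implied_multiplier = stated_peak_qps / average_qps

print(f"panel queries/day: {panel_queries_per_day:,}")
print(f"average QPS (initial loads only): {average_qps:.0f}")
print(f"implied multiplier to reach the stated peak: {implied_multiplier:.0f}x")
```

In an interview, stating the multiplier explicitly (refresh interval, concurrent viewers, burst factor) is more convincing than quoting the peak number on its own.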
Query Proxy: The Performance Layer
Query Proxy:
1. Parse dashboard template variables
2. Route to correct data source
3. Check query cache (Redis): same query within last 30s → return cached
4. If cache miss → execute query against data source
5. Apply transformations
6. Cache result with TTL = refresh_interval
7. Return to frontend
Query caching is critical: 100 users viewing the same dashboard = 100 identical queries. With caching: 1 query to backend, 99 served from cache.
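The seven steps above can be sketched as a small in-memory proxy. This is an illustration under stated assumptions, not a production design: a Python dict stands in for Redis, the `QueryProxy` class and its method names are hypothetical, and the data source is a plain callable rather than a real Prometheus client.

```python
import hashlib
import json
import time
from string import Template
from typing import Callable, Dict, List, Tuple


class QueryProxy:
    """Hypothetical proxy; a dict plays the role of the Redis query cache."""

    def __init__(self, backends: Dict[str, Callable[[str], List]], clock=time.monotonic):
        self.backends = backends
        self.clock = clock
        self._cache: Dict[str, Tuple[float, List]] = {}  # key -> (expires_at, result)

    @staticmethod
    def cache_key(datasource: str, query: str, time_range: str) -> str:
        raw = json.dumps([datasource, query, time_range])
        return "query_cache:" + hashlib.sha256(raw.encode()).hexdigest()

    def run(self, datasource, query_template, variables, time_range, refresh_interval_s=30.0):
        # 1. Resolve template variables such as $service.
        query = Template(query_template).substitute(variables)
        # 2. Route to the matching data source.
        backend = self.backends[datasource]
        # 3. Serve from cache if an unexpired entry exists.
        key = self.cache_key(datasource, query, time_range)
        now = self.clock()
        hit = self._cache.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        # 4. Cache miss: execute against the data source.
        result = backend(query)
        # 5. Transformations would go here (identity in this sketch).
        # 6. Cache with TTL equal to the refresh interval.
        self._cache[key] = (now + refresh_interval_s, result)
        # 7. Return to the frontend.
        return result


calls = []

def fake_prometheus(query: str) -> List:
    calls.append(query)
    return [{"timestamp": 0, "value": 1.0}]

proxy = QueryProxy({"prometheus": fake_prometheus})
for _ in range(100):
    proxy.run(
        "prometheus",
        'rate(http_requests_total{service="$service"}[5m])',
        {"service": "user-service"},
        "now-6h/now",
    )
print(len(calls))  # 1 backend query; the other 99 calls are cache hits
```

Keying on the resolved query (after variable substitution) means two users who pick different `$service` values do not share a cache entry, while users with the same selection do.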
Dashboard Definition Model
{
  "dashboard_id": "dash-uuid",
  "title": "Production Overview",
  "variables": [
    {"name": "service", "type": "query", "query": "label_values(service_name)"}
  ],
  "panels": [
    {
      "panel_id": 1,
      "title": "Request Rate",
      "type": "timeseries",
      "datasource": "prometheus",
      "query": "rate(http_requests_total{service=\"$service\"}[5m])",
      "interval": "1m",
      "position": {"x": 0, "y": 0, "w": 12, "h": 8}
    }
  ]
}
Dashboard CRUD
POST /api/v1/dashboards ← Create dashboard
GET /api/v1/dashboards/{id} ← Get dashboard config
PUT /api/v1/dashboards/{id} ← Update dashboard
DELETE /api/v1/dashboards/{id} ← Delete dashboard
Query Data for Panel
POST /api/v1/query
{
"datasource": "prometheus",
"query": "rate(http_requests_total{service='user-service'}[5m])",
"from": "2026-03-14T04:00:00Z",
"to": "2026-03-14T10:00:00Z",
"interval": "1m"
}
→ { "data": [{"timestamp": 1773460800, "value": 342.5}, ...] }
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
- 504 Gateway Timeout: index shard slow; narrow query or retry
PostgreSQL: Dashboard Definitions
CREATE TABLE dashboards (
    dashboard_id UUID PRIMARY KEY,
    org_id UUID NOT NULL,
    title VARCHAR(256),
    config JSONB NOT NULL,
    version INT DEFAULT 1,
    created_by UUID,
    updated_at TIMESTAMP
);
-- PostgreSQL has no inline INDEX clause; the index is created separately.
CREATE INDEX idx_org ON dashboards (org_id);
CREATE TABLE dashboard_versions (
    dashboard_id UUID,
    version INT,
    config JSONB,
    updated_by UUID,
    updated_at TIMESTAMP,
    PRIMARY KEY (dashboard_id, version)
);
Redis: Query Cache
Key: query_cache:{hash(datasource + query + time_range)}
Value: compressed JSON result
TTL: 30 seconds (or dashboard refresh interval)

| Concern | Solution |
|---|---|
| Data source down | Show stale cached data with 'Data source unavailable' warning |
| Query timeout | 30-second timeout; show partial results or error per panel |
| Dashboard store down | Read from PostgreSQL replica; dashboard configs are small and cacheable |
| Concurrent dashboard edits | Optimistic locking (version field) |
| Query abuse | Per-user query rate limiting; max query complexity limits |
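The optimistic-locking row in the table can be illustrated with a conditional update on the `version` column. This is a minimal sketch: SQLite (from the Python standard library) stands in for PostgreSQL so the example runs anywhere, and the `save_dashboard` helper is a hypothetical name. The same `WHERE version = ...` pattern applies to the `dashboards` table.

```python
import sqlite3


def save_dashboard(conn, dashboard_id, new_config, expected_version):
    """Return True if the save won; False if another editor saved first."""
    cur = conn.execute(
        "UPDATE dashboards SET config = ?, version = version + 1 "
        "WHERE dashboard_id = ? AND version = ?",
        (new_config, dashboard_id, expected_version),
    )
    conn.commit()
    # Zero rows updated means the stored version no longer matches.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dashboards ("
    "dashboard_id TEXT PRIMARY KEY, config TEXT NOT NULL, version INTEGER DEFAULT 1)"
)
conn.execute("INSERT INTO dashboards (dashboard_id, config) VALUES ('dash-1', '{}')")

# Two editors both loaded version 1; only the first save succeeds.
first = save_dashboard(conn, "dash-1", '{"title": "A"}', expected_version=1)
second = save_dashboard(conn, "dash-1", '{"title": "B"}', expected_version=1)
print(first, second)  # True False
```

An API layer would typically map the `False` case to 409 Conflict so the client can reload the latest version and reapply its change.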
Dashboard Stampede: Incident Causes Everyone to Open Dashboards
500 engineers open the same dashboard → 5000 queries fired simultaneously!
Solutions:
1. Query result caching ⭐: First request cache miss → 1 backend query; rest cached
2. Request coalescing (singleflight): In-flight queries wait, don't duplicate
3. Auto-refresh stagger: Add random jitter to refresh intervals
4. Pre-computed dashboards for P0 incidents
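Request coalescing is the least obvious of these, so here is a rough sketch of the idea using `asyncio`. The `SingleFlight` class is a hypothetical name modeled on the pattern, not a library API: the first caller for a key starts the backend query, and every caller that arrives while it is in flight awaits the same task.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class SingleFlight:
    """Share one in-flight call per key among concurrent callers."""

    def __init__(self) -> None:
        self._in_flight: Dict[str, asyncio.Future] = {}

    async def do(self, key: str, fn: Callable[[], Awaitable]):
        existing = self._in_flight.get(key)
        if existing is not None:
            # A query for this key is already running; wait for its result.
            return await existing
        task = asyncio.ensure_future(fn())
        self._in_flight[key] = task
        try:
            return await task
        finally:
            # Later callers start a fresh query (or hit the result cache).
            self._in_flight.pop(key, None)


async def demo():
    backend_calls = 0

    async def slow_query():
        nonlocal backend_calls
        backend_calls += 1
        await asyncio.sleep(0.05)  # simulated backend latency
        return [342.5]

    sf = SingleFlight()
    results = await asyncio.gather(*(sf.do("panel-1", slow_query) for _ in range(500)))
    return backend_calls, len(results)


backend_calls, served = asyncio.run(demo())
print(backend_calls, served)  # 1 500
```

Coalescing covers the window before the first result lands in the cache, which is exactly when a cold-cache stampede would otherwise reach the backend.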
Data Source Timeout: Panel Shows Error While Others Load
Bad UX: Entire dashboard waits for the slowest panel → 30s to load
Solution: Independent panel loading ⭐
- Each panel queries independently (parallel)
- Panels with fast sources render immediately
- Slow panels show "Loading..." then result or timeout
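Independent loading can be sketched with one task per panel and a per-panel timeout. The function names and result shape below are illustrative assumptions. Results are collected in completion order, which is what lets a frontend render fast panels first.

```python
import asyncio


async def load_panel(panel_id, fetch, timeout_s):
    """Fetch one panel; a slow source yields a timeout marker, not an exception."""
    try:
        data = await asyncio.wait_for(fetch(), timeout=timeout_s)
        return {"panel_id": panel_id, "status": "ok", "data": data}
    except asyncio.TimeoutError:
        return {"panel_id": panel_id, "status": "timeout", "data": None}


async def load_dashboard(panels, timeout_s=30.0):
    """Run all panel queries in parallel and return results as they finish."""
    tasks = [
        asyncio.ensure_future(load_panel(panel_id, fetch, timeout_s))
        for panel_id, fetch in panels.items()
    ]
    results = []
    for finished in asyncio.as_completed(tasks):
        results.append(await finished)  # a real server would push this to the client
    return results


async def demo():
    async def fast():
        await asyncio.sleep(0.01)
        return [1, 2, 3]

    async def slow():
        await asyncio.sleep(1.0)
        return [4, 5, 6]

    # Short timeout so the example finishes quickly.
    return await load_dashboard({"fast-panel": fast, "slow-panel": slow}, timeout_s=0.1)


results = asyncio.run(demo())
for r in results:
    print(r["panel_id"], r["status"])
```

The fast panel is reported first with status `ok`, and the slow one follows with status `timeout` once its own deadline passes, without delaying the other.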
Alert Evaluation: Avoiding False Positives
Rule: "Alert if error rate > 5% for 5 minutes"
State machine: OK → (breached) → PENDING → (for duration) → FIRING → (normal) → OK
Hysteresis: fire at 5%, resolve at 3% → prevents flapping
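The state machine and hysteresis rule can be written down in a few lines. This is a sketch under stated assumptions: the `AlertRule` class and its field names are hypothetical, timestamps are plain seconds, and a PENDING alert drops back to OK as soon as the value is no longer above the fire threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    fire_threshold: float = 0.05     # breach above 5%
    resolve_threshold: float = 0.03  # resolve only below 3% (hysteresis)
    for_seconds: float = 300.0       # breach must persist for 5 minutes
    state: str = "OK"
    pending_since: Optional[float] = None

    def evaluate(self, value: float, now: float) -> str:
        if self.state == "OK":
            if value > self.fire_threshold:
                self.state, self.pending_since = "PENDING", now
        elif self.state == "PENDING":
            if value <= self.fire_threshold:
                self.state, self.pending_since = "OK", None
            elif now - self.pending_since >= self.for_seconds:
                self.state = "FIRING"
        elif self.state == "FIRING":
            # Values between 3% and 5% keep the alert firing.
            if value < self.resolve_threshold:
                self.state, self.pending_since = "OK", None
        return self.state


rule = AlertRule()
samples = [(0, 0.06), (60, 0.07), (300, 0.06), (360, 0.04), (420, 0.02)]
states = [rule.evaluate(value, now) for now, value in samples]
print(states)  # ['PENDING', 'PENDING', 'FIRING', 'FIRING', 'OK']
```

The sample at 4% is the interesting one: without the separate resolve threshold it would clear the alert, and a metric hovering around 5% would flap between FIRING and OK.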
Interview Walkthrough
- Separate dashboard rendering (WebSocket push to browsers) from metric storage (time-series DB) — they scale independently.
- Pre-aggregate common queries at ingest (1-minute rollups) so dashboard panels query pre-computed buckets, not raw samples.
- Push live updates via WebSocket subscriptions keyed by dashboard ID — clients receive delta refreshes, not full re-queries.
- Cache query results with TTL and singleflight coalescing so 500 engineers opening the same incident dashboard share one backend query.
- Load each panel independently with per-panel timeouts — fast panels render immediately while slow data sources show loading states.
- Implement alert evaluation as a state machine (OK → PENDING → FIRING) with hysteresis to prevent flapping on noisy metrics.
- Add random jitter to auto-refresh intervals so all clients don't hit the backend on the same second.
- Common pitfall: letting every dashboard viewer trigger a fresh backend query on each refresh — the query stampede during incidents takes down the metrics store.
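The pre-aggregation point in the walkthrough is worth making concrete. Below is a minimal batch sketch of 1-minute rollups. The `rollup_1m` function and bucket field names are assumptions for illustration, and a real ingest path would update buckets incrementally as samples arrive rather than in one pass.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def rollup_1m(samples: Iterable[Tuple[int, float]]) -> List[Dict]:
    """Group (unix_seconds, value) samples into 60-second buckets."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % 60].append(value)  # align to the start of the minute
    return [
        {
            "bucket_start": start,
            "count": len(values),
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
        }
        for start, values in sorted(buckets.items())
    ]


raw = [(0, 1.0), (15, 3.0), (59, 2.0), (60, 10.0), (119, 20.0)]
rollups = rollup_1m(raw)
for bucket in rollups:
    print(bucket)
```

Storing `sum` and `count` instead of a precomputed average lets coarser buckets (5-minute, 1-hour) be derived from the 1-minute ones without going back to raw samples, which is also what downsampling for display needs.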
Server-Side vs Client-Side Rendering
| Approach | Interactivity | Server Load | Large Dataset Handling |
|---|---|---|---|
| Client-side (Grafana) ⭐ | Rich (zoom, hover, pan) | Reduced load | Browser chokes on big data |
| Server-side (image-based) | None (static image) | Higher load | Fast (renders on server) |
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving the core real-time dashboard and metrics flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
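An availability target becomes easier to reason about once it is converted to an error budget. The short calculation below assumes a 30-day window (the window length is an assumption; the table does not specify one).

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    return (1.0 - slo) * window_days * 24 * 60


for slo in (0.999, 0.9995, 0.9999):
    print(f"{slo:.2%} -> {error_budget_minutes(slo):.1f} min per 30 days")
```

At 99.95% the budget is about 21.6 minutes per 30 days, and at 99.99% it shrinks to roughly 4.3 minutes, which is why the tighter target usually rules out manual failover.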
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
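Rate limiting appears in the traffic-spike mitigation above. One common way to implement it is a token bucket, sketched here with an injectable clock so the behavior is deterministic. The `TokenBucket` class is illustrative and in-process only; a shared limiter across proxy instances would need a shared store.

```python
import time


class TokenBucket:
    """Allow short bursts up to `burst`, refilling at `rate_per_s` tokens per second."""

    def __init__(self, rate_per_s: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.clock = clock
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


fake_now = [0.0]  # manual clock for a repeatable demo
bucket = TokenBucket(rate_per_s=1.0, burst=2, clock=lambda: fake_now[0])

decisions = [bucket.allow(), bucket.allow(), bucket.allow()]
fake_now[0] = 1.0  # one second later, one token has been refilled
decisions.append(bucket.allow())
print(decisions)  # [True, True, False, True]
```

A rejected request would typically be answered with 429 Too Many Requests and a Retry-After header. Passing a larger `cost` for expensive queries is one way to approximate a query complexity limit.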
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.