Interview Prompt
Design an On-Call Escalation System (like PagerDuty / OpsGenie).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Escalation policy state machine, Schedule management, Acknowledgment timeout? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
Assumptions
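- Clarify scale (DAU, QPS, data volume) for the on-call escalation system in the first 5 minutes
- Standard reliability target is 99.9%–99.99% unless the problem implies higher (payments, booking); paging does, which is why the Five Nines Availability requirement below asks for 99.999%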
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
Functional Requirements
- Alert ingestion: Accept incoming alarm webhooks from Prometheus, Datadog, CloudWatch, and custom monitoring backends.
- On-call rotations: Define calendar schedules (weekly, daily, custom) with primary and secondary engineers.
- Escalation chains: Automatically escalate alerts if the primary engineer fails to acknowledge within N minutes.
- Multi-channel alerts: Page responders across push notifications, SMS texts, interactive voice calls, emails, and Slack.
- Acknowledge & Resolve: Provide clear ACK APIs to silence active alerts, stopping downstream escalation loops.
- Alert aggregation: Correlate related alerts into unified incidents to prevent on-call notification overload.
- Maintenance silencing: Allow scheduling maintenance windows to suppress active alerts during scheduled down times.
- Incident timeline: Log complete histories from initial trigger to final resolution, tracking team MTTR.
Non-Functional Requirements
- Five Nines Availability: Deliver 99.999% availability. If the alerting engine is down, failures elsewhere go unnoticed.
- Low Propagation Latency: Route alerts from initial ingestion to device notification in under 30 seconds.
- Durable Notification: Ensure notifications reach the on-call person, falling back through channels if delivery fails.
- Cascading Scale: Handle 100K+ alerts per minute during severe, cascading site outages.
- Alert deduplication: At-least-once delivery with a 5-minute dedup window per incident + escalation level. This prevents redundant pages without claiming exactly-once transport.
- Immutable Audit Logs: Record an unalterable log tracking alert timing, responders paged, and ACK response actions.
| Metric | Calculation | Value |
|---|---|---|
| Alerts / day | Assumed input | 10,000,000 |
| Alerts / sec (average) | 10,000,000 ÷ 86,400 | ≈ 116 |
| Alerts / sec (peak) | Burst during cascading failures | 10,000 |
| Active on-call schedules | Assumed input | 50,000 teams |
| Notifications / day | Alerts × channels; 5,000,000 ÷ 86,400 ≈ 58/s | 5,000,000 |
| Escalations / day | 500,000 ÷ 86,400 ≈ 5.8/s | 500,000 |
I/O and Scale Derivations:
- Average alert ingestion: 10,000,000 / 86,400 sec ≈ 116 alerts/sec.
- Peak ingestion: 10,000 alerts/sec during cascading core failures.
- Daily notification dispatch volume: 5,000,000 notifications/day. This requires high connection concurrency with SMS/voice providers (Twilio, MessageBird) to avoid throttling queues.
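The arithmetic behind these derivations can be checked with a few lines. This is a minimal sketch that only restates the daily volumes and the peak figure given above; the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def per_second(events_per_day: int) -> float:
    """Average events per second for a given daily volume."""
    return events_per_day / SECONDS_PER_DAY


# Daily volumes taken from the capacity table.
DAILY_VOLUMES = {
    "alerts": 10_000_000,
    "notifications": 5_000_000,
    "escalations": 500_000,
}
PEAK_ALERTS_PER_SEC = 10_000  # stated peak during cascading failures

avg_rates = {name: per_second(volume) for name, volume in DAILY_VOLUMES.items()}
peak_factor = PEAK_ALERTS_PER_SEC / avg_rates["alerts"]

for name, rate in avg_rates.items():
    print(f"{name}: {rate:.1f}/s average")
print(f"peak-to-average factor: {peak_factor:.1f}x")
```

The peak is roughly 86 times the average, so the ingestion tier and the event bus should be sized for the burst rather than for the daily mean.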
Alarms from Prometheus or Datadog land on the Alert Ingestion Service, which normalizes, validates, and de-duplicates them. Events flow through Kafka topics to Alert Routers and Escalation Engines. The stateful Escalation Engine tracks incident lifecycles, and a Redis Sorted Set manages timers. Unacknowledged alerts route through Notification Routers across multiple delivery channels.
1. Stateful Escalation Policy State Machine
Escalating alarms through responder levels must occur reliably, regardless of user interaction. The escalation path transitions alerts across states:
Alert received for team "backend-infra".
Escalation Policy Definition:
- Level 0: Primary on-call (Alice), notify via push + SMS. Wait 5 min.
- Level 1: Secondary on-call (Bob), notify via push + SMS + phone call. Wait 10 min.
- Level 2: Engineering Manager (Carol), notify via phone call. Wait 15 min.
- Level 3: VP of Engineering (Dave), notify via phone call + SMS.
Escalation Execution Timeline:
- T=0:00 Alert ingested → dispatch to Alice (push + SMS)
- T=5:00 No ACK received from Alice → escalate to Level 1 → dispatch to Bob (push + SMS + call)
- T=15:00 No ACK received from Bob → escalate to Level 2 → dispatch to Carol (phone call)
- T=30:00 No ACK received from Carol → escalate to Level 3 → dispatch to Dave (phone call + SMS)
Stop Condition: If any engineer acknowledges (ACK) at any point, the escalation timer is cancelled immediately.
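The timeline follows mechanically from the per-level waits. The sketch below derives it from the policy above; the `Level` type and the function names are illustrative, not part of any real paging API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Level:
    target: str
    channels: Tuple[str, ...]
    wait_min: Optional[int]  # minutes to wait for an ACK; None on the last level


POLICY = [
    Level("Alice", ("push", "sms"), 5),
    Level("Bob", ("push", "sms", "call"), 10),
    Level("Carol", ("call",), 15),
    Level("Dave", ("call", "sms"), None),
]


def page_times(policy: List[Level]) -> List[Tuple[int, str]]:
    """Minute offsets at which each level is paged if nobody ever ACKs."""
    times, elapsed = [], 0
    for level in policy:
        times.append((elapsed, level.target))
        if level.wait_min is None:
            break
        elapsed += level.wait_min
    return times


def paged_before_ack(policy: List[Level], ack_minute: float) -> List[str]:
    """Targets paged strictly before an ACK at ack_minute stops escalation."""
    return [target for minute, target in page_times(policy) if minute < ack_minute]


timeline = page_times(POLICY)
```

With this policy, an ACK at minute 7 means only Alice and Bob were ever paged; Carol and Dave are never contacted.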
2. Escalation Timer Implementation (Redis Sorted Set) ⭐
Scanning a database table for due timers creates heavy query overhead. We instead implement a low-overhead, horizontally scalable timer using Redis Sorted Sets; firing precision is bounded by the worker's polling interval:
Redis Sorted Set: ZADD escalation_timers {fire_epoch_timestamp} {alert_id}:{level}
Worker execution loop (polling every 10 seconds):
# Fetch all timers that are due
due_timers = ZRANGEBYSCORE escalation_timers 0 {current_epoch_time}
for each timer in due_timers:
    trigger_escalation(timer.alert_id, timer.level)
    # Remove only after firing: a crash in between re-fires the timer (at-least-once),
    # and the dedup window absorbs the duplicate page
    ZREM escalation_timers timer
    if has_next_escalation_level(timer.alert_id):
        next_lvl = get_next_level(timer.alert_id)
        timeout = get_level_timeout(timer.alert_id, next_lvl)
        ZADD escalation_timers {now + timeout} {timer.alert_id}:{next_lvl}
3. Alert Grouping & Noise Reduction
During cascading infrastructure failures, hundreds of downstream microservices will fire alerts simultaneously. We group incoming alarms:
Without grouping: the on-call engineer receives 500 pages → alarm fatigue → ignored alerts.
With grouping:
- Collect alerts in sliding windows (e.g., 5 minutes).
- Group by shared attributes: service, alert_name, or custom labels.
- Outcome: a single consolidated incident ticket: "DB connection timeout (503 occurrences across 50 downstream microservices)".
- Action: the engineer receives 1 notification page; acknowledging it silences all related alerts.
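A minimal in-memory sketch of this grouping, assuming the window slides from the most recent alert in each group and that alerts arrive as plain dicts with `ts`, `alert_name`, and `service` fields. The field names and the `Incident` type are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set, Tuple

WINDOW_SEC = 300  # 5-minute grouping window from the text


@dataclass
class Incident:
    key: Tuple[str, ...]
    first_seen: float
    last_seen: float
    count: int = 0
    services: Set[str] = field(default_factory=set)


def group_alerts(alerts: Iterable[dict], group_by=("alert_name",),
                 window: float = WINDOW_SEC) -> List[Incident]:
    """Fold alerts into incidents; a gap longer than `window` opens a new one."""
    open_by_key: Dict[Tuple[str, ...], Incident] = {}
    incidents: List[Incident] = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = tuple(alert[name] for name in group_by)
        incident = open_by_key.get(key)
        if incident is None or alert["ts"] - incident.last_seen > window:
            incident = Incident(key, alert["ts"], alert["ts"])
            open_by_key[key] = incident
            incidents.append(incident)
        incident.last_seen = alert["ts"]
        incident.count += 1
        incident.services.add(alert["service"])
    return incidents


# 503 occurrences across 50 services within about four minutes, then a late one.
storm = [{"ts": i * 0.5, "alert_name": "DBConnectionTimeout",
          "service": f"svc-{i % 50}"} for i in range(503)]
late = {"ts": 1000.0, "alert_name": "DBConnectionTimeout", "service": "svc-0"}
incidents = group_alerts(storm + [late])
```

The storm collapses into one incident (one page), while the late alert opens a second incident because it arrives more than five minutes after the group went quiet.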
Event Bus Design (Kafka)
Topic: oncall_escalation_system-events
- Partitions: 64 (scale consumers horizontally)
- Partition key: entity_id (for this system, e.g. alert_id; preserves per-entity ordering)
- Retention: 7 days (compliance) or 24h (high-volume telemetry)
- Replication factor: 3, min.insync.replicas: 2
- Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "oncall_escalation_system-processors"
- At-least-once delivery + idempotent handlers (dedup by event_id)
- DLQ topic: oncall_escalation_system-events-dlq (poison messages after 3 retries)
- Lag alert: consumer lag > 60s → scale workers
Async side effects MUST NOT block the synchronous API response.
- Sync path: validate → persist source of truth → publish event → return 202 Accepted
- Async path: consumers update caches, indexes, notifications, aggregates
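The consumer-side rules (dedup by event_id, dead-letter after 3 retries) can be sketched without a broker. This is an in-memory illustration only: the `IdempotentConsumer` class, the `poison` flag, and the set-based dedup store are assumptions, and a real deployment would keep processed IDs in a durable store with a TTL.

```python
from typing import Callable, List, Set

MAX_RETRIES = 3  # retries after the first attempt, per the DLQ rule above


class IdempotentConsumer:
    def __init__(self, handler: Callable[[dict], None]) -> None:
        self.handler = handler
        self.processed_ids: Set[str] = set()  # stand-in for a durable dedup store
        self.dlq: List[dict] = []

    def consume(self, event: dict) -> str:
        event_id = event["event_id"]
        if event_id in self.processed_ids:
            return "duplicate"  # redelivery under at-least-once semantics
        for _attempt in range(1 + MAX_RETRIES):
            try:
                self.handler(event)
            except Exception:
                continue  # a real consumer would back off between retries
            self.processed_ids.add(event_id)
            return "processed"
        self.dlq.append(event)  # poison message: park it and move on
        return "dead_lettered"


pages: List[str] = []


def send_page(event: dict) -> None:
    if event.get("poison"):
        raise ValueError("malformed event")
    pages.append(event["alert_id"])


consumer = IdempotentConsumer(send_page)
results = [
    consumer.consume({"event_id": "e1", "alert_id": "a1"}),
    consumer.consume({"event_id": "e1", "alert_id": "a1"}),  # redelivered
    consumer.consume({"event_id": "e2", "alert_id": "a2", "poison": True}),
]
```

Because the ID is recorded only after the handler succeeds, a crash between the two steps still causes a redelivery, so the handler's own side effects should be safe to repeat.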
1. Ingest Alert
POST /api/v1/alerts
Content-Type: application/json
Authorization: Bearer <integration_token>
{
  "source": "prometheus",
  "alert_name": "HighCPUUsage",
  "severity": "critical",
  "service": "payment-service",
  "labels": {
    "env": "production",
    "cluster": "us-east-1"
  },
  "description": "CPU usage > 90% for 5 consecutive minutes",
  "dedup_key": "high_cpu_payment_us_east"
}
Response: 202 Accepted
{
  "alert_id": "8a219b1b-640a-4289-9812-42171542fca1",
  "status": "triggered"
}
2. Acknowledge Alert
POST /api/v1/alerts/8a219b1b-640a-4289-9812-42171542fca1/acknowledge
Content-Type: application/json
{
  "acknowledged_by": "alice@company.com"
}
Response: 200 OK
{
  "alert_id": "8a219b1b-640a-4289-9812-42171542fca1",
  "status": "acknowledged",
  "acknowledged_at": "2026-03-14T10:05:12Z"
}
3. Retrieve Active On-Call Schedule
GET /api/v1/schedules/team-backend-infra/on-call?at=2026-03-14T10:00:00Z
Response: 200 OK
{
  "team_id": "team-backend-infra",
  "primary": {
    "user_id": "user-881",
    "email": "alice@company.com",
    "phone": "+15550199"
  },
  "secondary": {
    "user_id": "user-902",
    "email": "bob@company.com",
    "phone": "+15550212"
  }
}
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
PostgreSQL Database Schema
-- Primary alerts store
CREATE TABLE alerts (
    alert_id UUID PRIMARY KEY,
    dedup_key VARCHAR(256) NOT NULL,
    alert_name VARCHAR(256) NOT NULL,
    severity VARCHAR(20) NOT NULL,
    service VARCHAR(128) NOT NULL,
    labels JSONB,
    description TEXT,
    status VARCHAR(20) DEFAULT 'triggered', -- triggered, acknowledged, resolved
    escalation_level INT DEFAULT 0,
    team_id UUID NOT NULL,
    acknowledged_by VARCHAR(256),
    acknowledged_at TIMESTAMP,
    resolved_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW()
);
-- Enforce single active alert per unique dedup_key (partial unique index)
CREATE UNIQUE INDEX idx_alerts_active_dedup ON alerts(dedup_key) WHERE status != 'resolved';
CREATE INDEX idx_alerts_status ON alerts(status) WHERE status = 'triggered';
-- Escalation Policy configuration table
CREATE TABLE escalation_policies (
    policy_id UUID PRIMARY KEY,
    team_id UUID NOT NULL,
    levels JSONB NOT NULL -- [{level: 0, wait_min: 5, channels: ["push", "sms"]}, ...]
);
-- Rotation Schedules table
CREATE TABLE on_call_schedules (
    schedule_id UUID PRIMARY KEY,
    team_id UUID NOT NULL,
    rotation_type VARCHAR(20) NOT NULL, -- weekly, daily, custom
    participants JSONB NOT NULL, -- [{user_id: "usr-1", start: "timestamp", end: "timestamp"}]
    timezone VARCHAR(64) DEFAULT 'UTC'
);
| Concern Scenario | System Solution Design |
|---|---|
| Timer Worker Node Crashes | Timer state resides in highly-available Redis. New instances leverage leader election to resume polling pending keys immediately. |
| Telecom Provider Blackouts | Integrate redundant message gateways. Switch Twilio SMS to Vonage or MessageBird automatically on timeout. |
| Timer Resolution Drift | Align system clocks using NTP across all hosts, keeping clock skew under 10 ms. |
1. Concurrent Acknowledge Race Conditions ⭐
If two engineers click to acknowledge an alert simultaneously, both must be handled cleanly. We employ atomic Compare-And-Swap (CAS) updates:
Alice (primary) and Bob (secondary) both receive notifications. Alice clicks ACK at T=5:00.000. Bob clicks ACK at T=5:00.050.
Atomic CAS query (Alice's request; Bob's is identical except for acknowledged_by):
UPDATE alerts SET status = 'acknowledged', acknowledged_by = 'alice', acknowledged_at = NOW() WHERE alert_id = 'alert-9912' AND status = 'triggered';
Results:
- Alice's query: finds status = 'triggered' → updates the row. rows_affected = 1. Success; cancel any remaining escalation timers.
- Bob's query: status is now 'acknowledged' → matches 0 rows. rows_affected = 0. Return a friendly error: "Alert already acknowledged by Alice".
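The compare-and-swap behaviour can be reproduced locally. The sketch below uses SQLite from the Python standard library purely as a stand-in for PostgreSQL, with a trimmed-down `alerts` table; the `acknowledge` helper is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE alerts ("
    " alert_id TEXT PRIMARY KEY,"
    " status TEXT NOT NULL,"
    " acknowledged_by TEXT,"
    " acknowledged_at TEXT)"
)
conn.execute("INSERT INTO alerts (alert_id, status) VALUES ('alert-9912', 'triggered')")
conn.commit()


def acknowledge(db: sqlite3.Connection, alert_id: str, user: str):
    """Return (won, acknowledged_by). Only one caller can flip 'triggered'."""
    cur = db.execute(
        "UPDATE alerts"
        " SET status = 'acknowledged', acknowledged_by = ?,"
        "     acknowledged_at = datetime('now')"
        " WHERE alert_id = ? AND status = 'triggered'",
        (user, alert_id),
    )
    db.commit()
    if cur.rowcount == 1:
        return True, user  # this caller won; cancel escalation timers here
    row = db.execute(
        "SELECT acknowledged_by FROM alerts WHERE alert_id = ?", (alert_id,)
    ).fetchone()
    return False, (row[0] if row else None)


alice_result = acknowledge(conn, "alert-9912", "alice")
bob_result = acknowledge(conn, "alert-9912", "bob")
```

Bob's call affects zero rows, so the API can report who already acknowledged the alert without taking any lock beyond the single-row update.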
2. Alert Storm Failures ⭐
When a core database crashes, thousands of connection alerts can overwhelm downstream responders. We defend the team using multi-level filtering:
Mitigation Hierarchy:
1. Ingestion Deduplication:
   - Calculate a hash (dedup_key) from alert name + service + labels.
   - If an active alert exists with that key, merge the payload and increment its occurrence counter instead of generating a new page.
2. Sliding-Window Grouping:
   - Group related alerts arriving within 5 minutes into a single consolidated incident.
3. Downstream Correlation Engines:
   - Map dependencies (e.g., if the physical database is down, suppress downstream connection-failure alarms).
4. Strict Personal Rate Limiting:
   - Cap paging frequency at 10 pages per hour per responder.
   - Pages over the cap queue into hourly digest summaries; P0 pages are exempt from the cap.
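Steps 1 and 4 of this hierarchy are small enough to sketch. The hashing scheme (SHA-256 over canonical JSON) and the `PageRateLimiter` class are assumptions for illustration; the text specifies only the inputs to the hash and the cap of 10 pages per hour.

```python
import hashlib
import json
from collections import defaultdict, deque
from typing import Deque, Dict


def dedup_key(alert_name: str, service: str, labels: Dict[str, str]) -> str:
    """Stable fingerprint: the same alert hashes identically regardless of label order."""
    canonical = json.dumps(
        {"alert_name": alert_name, "service": service, "labels": labels},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class PageRateLimiter:
    """Sliding one-hour cap per responder; P0 pages bypass the cap."""

    def __init__(self, max_pages: int = 10, window_sec: int = 3600) -> None:
        self.max_pages = max_pages
        self.window_sec = window_sec
        self.sent: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, responder: str, now: float, severity: str = "P1") -> bool:
        if severity == "P0":
            return True  # never hold back the most severe pages
        history = self.sent[responder]
        while history and now - history[0] >= self.window_sec:
            history.popleft()  # forget pages older than the window
        if len(history) >= self.max_pages:
            return False  # caller should route this page to the hourly digest
        history.append(now)
        return True


key_a = dedup_key("HighCPUUsage", "payment-service",
                  {"env": "production", "cluster": "us-east-1"})
key_b = dedup_key("HighCPUUsage", "payment-service",
                  {"cluster": "us-east-1", "env": "production"})

limiter = PageRateLimiter()
first_ten = [limiter.allow("alice", now=t) for t in range(10)]
eleventh = limiter.allow("alice", now=10)
p0_page = limiter.allow("alice", now=11, severity="P0")
after_window = limiter.allow("alice", now=3600)
```

In this sketch P0 pages do not count against the hourly budget; whether they should is a policy choice the text leaves open.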
3. Handling Schedule Gaps ⭐
If team transitions leave a calendar gap where no responder is active, alarms must not be lost:
Alice's shift ends Saturday 18:00. Bob's shift starts Sunday 09:00. An alert fires Saturday 23:00. Who gets paged?
Pre-emptive Health Checks:
- Run a daily background cron job that scans all team schedules for gaps in the next 7 days.
- Alert administrators if any gaps are found.
Active Escalation Resolution Rules:
1. Fallback to Secondary: Page the secondary on-call directly.
2. Fallback to Team Lead: Escalate to the team's designated Engineering Manager.
3. VP Emergency: As a last-resort safety net, escalate to the VP of Engineering.
4. Alerts MUST be delivered to someone; silently dropping them is not permitted.
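The pre-emptive gap scan reduces to an interval sweep. A minimal sketch, assuming shifts are available as `(start, end)` datetime pairs; the concrete dates are illustrative and chosen so that 2026-03-14 is the Saturday in the example.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Interval = Tuple[datetime, datetime]


def find_gaps(shifts: List[Interval], start: datetime, end: datetime) -> List[Interval]:
    """Return the uncovered sub-intervals of [start, end)."""
    gaps: List[Interval] = []
    cursor = start  # everything before the cursor is known to be covered
    for shift_start, shift_end in sorted(shifts):
        if shift_end <= cursor:
            continue  # shift lies entirely in already-covered time
        if shift_start >= end:
            break  # later shifts start after the scan horizon
        if shift_start > cursor:
            gaps.append((cursor, shift_start))
        cursor = max(cursor, shift_end)
        if cursor >= end:
            break
    if cursor < end:
        gaps.append((cursor, end))
    return gaps


shifts = [
    (datetime(2026, 3, 9, 9, 0), datetime(2026, 3, 14, 18, 0)),   # Alice
    (datetime(2026, 3, 15, 9, 0), datetime(2026, 3, 21, 9, 0)),   # Bob
]
scan_start = datetime(2026, 3, 14, 0, 0)
gaps = find_gaps(shifts, scan_start, scan_start + timedelta(days=7))
```

The scan reports a single 15-hour gap from Saturday 18:00 to Sunday 09:00, which is exactly the window in which the Saturday 23:00 alert would otherwise have no primary.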
1. Interactive Voice Response (IVR) Verification ⭐
Phone calls provide the most reliable way to wake engineers during off-hours incidents. We implement Twilio Voice workflows:
Push notifications and SMS texts are easily ignored or suppressed by "Do Not Disturb" profiles.
Voice calls force audible phone rings on most modern devices.
Twilio Interactive Voice Response (IVR) Implementation:
1. Alert ingested → trigger a voice call through the Twilio REST API.
2. Twilio establishes a connection and streams synthesized Text-to-Speech (TTS):
"Alert: payment-service CPU usage exceeded 90%.
Press 1 to acknowledge this alert. Press 2 to escalate to the secondary responder."
3. User interaction captures DTMF tones:
- User presses 1 → triggers Twilio webhook payload → updates DB using atomic CAS → terminates timer.
- User hangs up or presses 2 → triggers immediate escalation, paging secondary.
4. Call Failure Recovery:
- If the call hits busy signals or fails, retry within 60 seconds.
- Max 3 call attempts per escalation level.
2. Schedule Overrides & Swap Handling
On-call engineers frequently need to cover slots for peers temporarily. The routing engine handles overrides dynamically:
Alice is on-call but has a doctor's appointment Tuesday 14:00 to 16:00. She swaps coverage with Bob.
System Data Architecture:
- on_call_schedules: Base rotations (weekly/daily).
- schedule_overrides: Table containing override records:
{ user: "bob", start: "Tuesday 14:00", end: "Tuesday 16:00", covers_for: "alice" }
Resolution Hierarchy:
1. Check schedule_overrides for active overrides matching current timestamp (highest priority).
2. If none, check the primary rotation schedules.
3. Fall back to designated managers if both paths are empty.
3. Achieving Five Nines (99.999% SLA) ⭐
Building a system that may be down for only about 5 minutes per year requires deep infrastructure redundancy:
Target: 99.999% availability (no more than about 5.26 minutes of downtime per calendar year).
Architectural Path to Five Nines:
1. Active-Active Cross-Region Deployment:
   - Replicate primary database clusters synchronously across regions (e.g., us-east-1 and eu-west-1), accepting the added cross-region write latency.
   - If a regional infrastructure failure occurs, Route 53 DNS failover redirects traffic automatically.
2. Provider Redundancy (No Single Point of Failure):
   - Push APIs: Support both Apple APNs and Google FCM pipelines.
   - SMS Gateways: Route through Twilio; fall back to MessageBird or Sinch if Twilio degrades.
   - Voice Calls: Route through Twilio, falling back to Vonage / Nexmo.
3. Chaos Engineering:
   - Regularly run automated tests that cut regional data links, drop primary database writes, or simulate complete Twilio API blackouts to confirm that fallback routing works.
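The downtime budget and the case for provider redundancy both come down to simple arithmetic. The sketch below assumes a 365-day year and statistically independent provider failures; the 99.9% provider figure is a made-up input for illustration, not a quoted SLA.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 in a non-leap year


def downtime_budget_min(availability: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial(*availabilities: float) -> float:
    """All components are required: availabilities multiply."""
    return math.prod(availabilities)


def redundant(*availabilities: float) -> float:
    """Any one component suffices (assumes independent failures)."""
    return 1.0 - math.prod(1.0 - a for a in availabilities)


five_nines_budget = downtime_budget_min(0.99999)

# Hypothetical: a five-nines core that depends on one 99.9% SMS provider.
single_provider = serial(0.99999, 0.999)
# Same core with two independent 99.9% providers behind a failover.
dual_provider = serial(0.99999, redundant(0.999, 0.999))

print(f"five-nines budget: {five_nines_budget:.2f} min/year")
print(f"single provider:   {single_provider:.5%}")
print(f"dual provider:     {dual_provider:.5%}")
```

Under these assumptions a single 99.9% dependency in the paging path caps the whole system near three nines, while two independent providers restore most of the budget. Real provider outages are often correlated, so the redundant figure is an upper bound.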
Interview Walkthrough
- Clarify the core loop: alert ingested → match on-call schedule → notify primary → start escalation timer if unacknowledged.
- Model on-call schedules as time-bounded rotations with override support for temporary swaps and holiday coverage.
- Design an escalation ladder: push notification → SMS → phone call (IVR press-1-to-ack) with configurable per-step timeouts.
- Deduplicate alerts by fingerprint (service + error signature) to prevent notification storms during cascading failures.
- Use a durable timer engine (Redis ZSET or dedicated scheduler) so escalation steps fire even if the alerting service restarts.
- Track delivery and acknowledgment state per channel — an unacked push should trigger the next escalation tier automatically.
- For 99.999% SLA, describe multi-region active-active deployment with paging provider failover (e.g., Twilio falling back to Vonage).
- Common pitfall: synchronous phone-call initiation blocking the alert ingestion path — the hot path must enqueue and return immediately.
1. Timer Engine Implementations
| Approach | Precision | Durability | DB Load | Complexity |
|---|---|---|---|---|
| Redis Sorted Set ⭐ | Bounded by poll interval (sub-second with tight polling) | Volatile (mitigated via AOF/Snapshotting) | Extremely Low | Low |
| Database Polling | Low (bounded by polling frequency) | High | High (frequent sequential queries) | Low |
| Kafka Delayed Messages | Low (requires bucketed queues) | High | None | High |
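The sorted-set approach in the first row can be sketched without a Redis server. The class below is an in-memory stand-in whose methods mirror ZADD, ZREM, and a combined ZRANGEBYSCORE-then-ZREM; the names `SortedSetTimers`, `start_alert`, and `tick`, and the wait values (taken from the example policy), are illustrative.

```python
import heapq
from typing import Dict, List, Set, Tuple


class SortedSetTimers:
    """In-memory stand-in for a Redis sorted set (member -> score)."""

    def __init__(self) -> None:
        self._scores: Dict[str, float] = {}
        self._heap: List[Tuple[float, str]] = []

    def zadd(self, member: str, score: float) -> None:
        self._scores[member] = score
        heapq.heappush(self._heap, (score, member))

    def zrem(self, member: str) -> int:
        """Return 1 if the member existed and was removed, else 0."""
        return 1 if self._scores.pop(member, None) is not None else 0

    def pop_due(self, now: float) -> List[str]:
        """Remove and return members whose score is <= now."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            score, member = heapq.heappop(self._heap)
            if self._scores.get(member) == score:  # skip cancelled/rescheduled
                del self._scores[member]
                due.append(member)
        return due


WAIT_SEC = [300, 600, 900]  # wait after paging levels 0, 1, 2; level 3 is last


def start_alert(timers: SortedSetTimers, alert_id: str, now: float) -> None:
    """Level 0 is paged at ingestion; arm the timer that escalates to level 1."""
    timers.zadd(f"{alert_id}:1", now + WAIT_SEC[0])


def tick(timers: SortedSetTimers, now: float, acked: Set[str]) -> List[Tuple[str, int]]:
    """One worker poll: page every due level and arm the following one."""
    paged = []
    for member in timers.pop_due(now):
        alert_id, level_str = member.rsplit(":", 1)
        level = int(level_str)
        if alert_id in acked:
            continue  # an ACK that raced with the timer wins
        paged.append((alert_id, level))
        if level < len(WAIT_SEC):
            timers.zadd(f"{alert_id}:{level + 1}", now + WAIT_SEC[level])
    return paged


timers, acked = SortedSetTimers(), set()
start_alert(timers, "a1", 0)
first = tick(timers, 300, acked)    # level 1 paged at T=5:00
second = tick(timers, 900, acked)   # level 2 paged at T=15:00
acked.add("a1")
cancelled = timers.zrem("a1:3")     # the ACK cancels the pending level-3 timer
third = tick(timers, 1800, acked)
```

This single-process sketch removes a timer before paging, which is at-most-once if the worker crashes mid-tick. Against real Redis, firing first and removing afterwards gives at-least-once behaviour and relies on the dedup window to absorb repeats.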
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving the core on-call escalation flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
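| Core user-facing availability (UI and configuration APIs) | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. The paging path itself carries the stricter 99.999% target from the requirements. |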
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.