Design an On-Call Escalation System (like PagerDuty / OpsGenie)

Interview Prompt

Design an On-Call Escalation System (like PagerDuty / OpsGenie).

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Escalation policy state machine, Schedule management, Acknowledgment timeout?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Escalation policy state machine
Schedule management
Acknowledgment timeout
Multi-channel alert
Capacity estimation with shown math

Out of scope (state explicitly)

Full incident management war-room UI
Building PagerDuty from scratch vs integrating
Log search and trace analysis (#41, #33)

Assumptions

Clarify scale (DAU, QPS, data volume) for the on-call escalation system in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Alert ingestion: Accept incoming alarm webhooks from Prometheus, Datadog, CloudWatch, and custom monitoring backends.
On-call rotations: Define calendar schedules (weekly, daily, custom) with primary and secondary engineers.
Escalation chains: Automatically escalate alerts if the primary engineer fails to acknowledge within N minutes.
Multi-channel alerts: Page responders across push notifications, SMS texts, interactive voice calls, emails, and Slack.
Acknowledge & Resolve: Provide clear ACK APIs to silence active alerts, stopping downstream escalation loops.
Alert aggregation: Correlate related alerts into unified incidents to prevent on-call notification overload.
Maintenance silencing: Allow scheduling maintenance windows that suppress alerts during planned downtime.
Incident timeline: Log complete histories from initial trigger to final resolution, tracking team MTTR.

Metric	Calculation	Value
Alerts / day	Given	10,000,000
Alerts / sec	10,000,000 ÷ 86,400 ≈ 116/s average	~116 avg (peak 10,000 during cascading failures)
Active On-Call Schedules	Assumed	50,000 teams
Notifications / day	5,000,000 ÷ 86,400 ≈ 58/s average	5,000,000
Escalations / day	500,000 ÷ 86,400 ≈ 5.8/s average	500,000

I/O and Scale Derivations:
- Average Alerts Ingestion: 10,000,000 / 86,400 sec ≈ 116 alerts/sec.
- Peak Load Ingestion: 10,000 alerts/sec during cascading core failures.
- Daily Notification Dispatch Volume:
  - 5,000,000 notifications/day.
  - Requires enough concurrent SMS/Voice API capacity with providers (Twilio, MessageBird) to avoid provider-side throttling.
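
The derivations above can be reproduced with a few lines of arithmetic. This is a minimal sketch; the constant and function names are illustrative, and the daily volumes are the ones stated in this section.

```python
SECONDS_PER_DAY = 86_400

def per_second(events_per_day: int) -> float:
    """Average events per second for a given daily volume."""
    return events_per_day / SECONDS_PER_DAY

alerts_avg = per_second(10_000_000)        # about 115.7/s
notifications_avg = per_second(5_000_000)  # about 57.9/s
escalations_avg = per_second(500_000)      # about 5.8/s

PEAK_ALERTS_PER_SEC = 10_000               # peak figure stated above
peak_to_avg = PEAK_ALERTS_PER_SEC / alerts_avg

print(f"alerts avg: {alerts_avg:.1f}/s, peak/avg: {peak_to_avg:.0f}x")
# prints "alerts avg: 115.7/s, peak/avg: 86x"
```

The peak-to-average ratio of roughly 86x is the number that argues for buffering ingestion behind a queue rather than sizing every downstream service for the peak.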

Alarms from Prometheus or Datadog land on the Alert Ingestion Service, which normalizes, validates, and de-duplicates them. Events flow through Kafka topics to Alert Routers and Escalation Engines. The stateful Escalation Engine tracks incident lifecycles, and a Redis Sorted Set manages timers. Unacknowledged alerts route through Notification Routers across multiple delivery channels.

1. Stateful Escalation Policy State Machine

Escalating alarms through responder levels must occur reliably, regardless of user interaction. The escalation path transitions alerts across states:

Alert received for team "backend-infra":
Escalation Policy Definition:
  Level 0: Primary on-call (Alice) — notify via push + SMS. Wait 5 min.
  Level 1: Secondary on-call (Bob) — notify via push + SMS + phone call. Wait 10 min.
  Level 2: Engineering Manager (Carol) — notify via phone call. Wait 15 min.
  Level 3: VP of Engineering (Dave) — notify via phone call + SMS.

Escalation Execution Timeline:
  T=0:00  Alert Ingested → Dispatch alerts to Alice (Push + SMS)
  T=5:00  No ACK received from Alice → Escalate to Level 1 → Dispatch to Bob (Push + SMS + Call)
  T=15:00 No ACK received from Bob → Escalate to Level 2 → Dispatch to Carol (Phone Call)
  T=30:00 No ACK received from Carol → Escalate to Level 3 → Dispatch to Dave (Phone Call + SMS)

Stop Condition:
  If any engineer acknowledges (ACK) at any point, the escalation timer is immediately terminated.
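
The timeline above follows mechanically from the per-level waits. The sketch below derives the page times from the policy and shows who gets paged before an ACK arrives; `Level`, `page_times`, and `pages_sent` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class Level:
    responder: str
    channels: Tuple[str, ...]
    wait_min: Optional[int]  # None marks the last level (nothing left to wait for)

POLICY = [
    Level("Alice", ("push", "sms"), 5),
    Level("Bob", ("push", "sms", "call"), 10),
    Level("Carol", ("call",), 15),
    Level("Dave", ("call", "sms"), None),
]

def page_times(policy: List[Level]) -> List[Tuple[int, str]]:
    """Minute offsets at which each level is paged if nobody ACKs."""
    times, elapsed = [], 0
    for level in policy:
        times.append((elapsed, level.responder))
        if level.wait_min is None:
            break
        elapsed += level.wait_min
    return times

def pages_sent(policy: List[Level], ack_at_min: Optional[float]) -> List[str]:
    """Responders paged strictly before the ACK (everyone if ack_at_min is None)."""
    return [who for t, who in page_times(policy)
            if ack_at_min is None or t < ack_at_min]

print(page_times(POLICY))     # [(0, 'Alice'), (5, 'Bob'), (15, 'Carol'), (30, 'Dave')]
print(pages_sent(POLICY, 7))  # ['Alice', 'Bob']
```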

2. Escalation Timer Implementation (Redis Sorted Set) ⭐

Relying on database table scanning creates heavy query overhead. We implement a scalable timer using Redis Sorted Sets; firing precision is bounded by the worker's poll interval:

PYTHON

Redis Sorted Set: ZADD escalation_timers {fire_epoch_timestamp} {alert_id}:{level}
On ACK:           ZREM escalation_timers {alert_id}:{level}

Worker execution loop (polling every 10 seconds):
  # Fetch all timers that are due
  due_timers = ZRANGEBYSCORE escalation_timers 0 {current_epoch_time}

  for each timer in due_timers:
      # timer.level is the level to page now
      trigger_escalation(timer.alert_id, timer.level)
      ZREM escalation_timers {timer.alert_id}:{timer.level}

      if has_next_escalation_level(timer.alert_id):
          next_lvl = get_next_level(timer.alert_id)
          # Wait for the level that was just paged, not the next one,
          # so the schedule matches the 0:00 / 5:00 / 15:00 / 30:00 timeline
          timeout = get_level_timeout(timer.alert_id, timer.level)
          ZADD escalation_timers {now + timeout} {timer.alert_id}:{next_lvl}
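
To make the loop concrete without a Redis server, here is a minimal runnable sketch. `TimerStore` is a hypothetical in-memory stand-in that mimics only the three sorted-set operations used here (ZADD, ZRANGEBYSCORE, ZREM); it is not a Redis client. One assumption goes beyond the text: the worker treats a successful ZREM as a claim on the timer, so two workers polling at once would not both page.

```python
from typing import Dict, List, Tuple

class TimerStore:
    """In-memory stand-in for a Redis sorted set (member -> score)."""

    def __init__(self) -> None:
        self._scores: Dict[str, float] = {}

    def zadd(self, member: str, score: float) -> None:
        self._scores[member] = score

    def zrangebyscore(self, lo: float, hi: float) -> List[str]:
        due = [m for m, s in self._scores.items() if lo <= s <= hi]
        return sorted(due, key=lambda m: self._scores[m])

    def zrem(self, member: str) -> int:
        # Returns 1 if the member existed and was removed, else 0 (as Redis does)
        return 1 if self._scores.pop(member, None) is not None else 0

def poll_once(store: TimerStore, now: float, waits: List[int]) -> List[Tuple[str, int]]:
    """Fire due timers. waits[i] = seconds to wait after paging level i.

    A member "alert_id:level" means "page this level when its score is due".
    Returns the (alert_id, level) pairs paged on this pass.
    """
    paged = []
    for member in store.zrangebyscore(0, now):
        if store.zrem(member) != 1:      # already claimed or cancelled by an ACK
            continue
        alert_id, level_str = member.rsplit(":", 1)
        level = int(level_str)
        paged.append((alert_id, level))  # a real worker would notify here
        if level < len(waits):           # a further level exists
            store.zadd(f"{alert_id}:{level + 1}", now + waits[level])
    return paged

def acknowledge(store: TimerStore, alert_id: str, num_levels: int) -> None:
    """An ACK cancels every pending timer for the alert."""
    for level in range(num_levels):
        store.zrem(f"{alert_id}:{level}")

waits = [300, 600, 900]              # levels 0..2 wait 5, 10, 15 min; level 3 is last
store = TimerStore()
store.zadd("a1:1", 300.0)            # level 0 was paged at t=0
print(poll_once(store, 300.0, waits))  # [('a1', 1)]
```

Claiming before paging trades one failure mode for another: a worker that crashes between the claim and the page drops that step, whereas paging first and removing afterwards can page twice. Which is preferable depends on whether a single leader-elected poller is used.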

3. Alert Grouping & Noise Reduction

During cascading infrastructure failures, hundreds of downstream microservices will fire alerts simultaneously. We group incoming alarms:

Without grouping: On-call engineer receives 500 pages → alarm fatigue → ignores alerts.

With Grouping Architecture:
- Collect alerts in sliding windows (e.g., 5 minutes).
- Group by shared attributes: service, alert_name, or custom labels.
- Outcome: A single consolidated Incident ticket:
  "DB connection timeout (503 occurrences across 50 downstream microservices)"
- Action: The engineer receives 1 notification page; acknowledging it silences all related alerts.
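
A minimal sketch of the grouping step, under stated assumptions: alerts are plain dicts with `alert_name`, `service`, and `ts` (epoch seconds), the group key is the alert name alone, and an incident stays open as long as each new alert arrives within 5 minutes of the previous one. All field and function names are illustrative.

```python
from typing import Dict, List

WINDOW_SEC = 300  # 5-minute window from the text

def group_alerts(alerts: List[dict]) -> List[dict]:
    """Merge alerts that share a key and arrive within WINDOW_SEC of each other."""
    incidents: List[dict] = []
    open_by_key: Dict[str, dict] = {}
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = alert["alert_name"]
        incident = open_by_key.get(key)
        if incident is None or alert["ts"] - incident["last_seen"] > WINDOW_SEC:
            # No open incident for this key, or the window has lapsed: open a new one
            incident = {"key": key, "count": 0, "services": set(),
                        "last_seen": alert["ts"]}
            incidents.append(incident)
            open_by_key[key] = incident
        incident["count"] += 1
        incident["services"].add(alert["service"])
        incident["last_seen"] = alert["ts"]
    return incidents

storm = [{"alert_name": "DBConnectionTimeout", "service": f"svc-{i % 50}", "ts": i % 200}
         for i in range(503)]
incident = group_alerts(storm)[0]
print(f'{incident["key"]} ({incident["count"]} occurrences '
      f'across {len(incident["services"])} services)')
# prints "DBConnectionTimeout (503 occurrences across 50 services)"
```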

Event Bus Design (Kafka)

Topic: oncall_escalation_system-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (e.g., alert_id or dedup_key; preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "oncall_escalation_system-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: oncall_escalation_system-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Rule for this design: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 202
  Async path: consumers update caches, indexes, notifications, aggregates
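
The consumer-side guarantees listed above (at-least-once delivery, dedup by event_id, DLQ for poison messages) can be sketched without a broker. This is an illustration only: `consume`, the in-memory `seen` set, and the `dlq` list are hypothetical stand-ins for a Kafka consumer loop, a durable dedup store, and the DLQ topic, and the sketch treats "3 retries" as three total attempts.

```python
from typing import Callable, Dict, List, Set

MAX_ATTEMPTS = 3  # assumption: three attempts in total before dead-lettering

def consume(events: List[Dict], handler: Callable[[Dict], None],
            seen: Set[str], dlq: List[Dict]) -> None:
    """Apply each event at most once per event_id; dead-letter repeated failures."""
    for event in events:
        event_id = event["event_id"]
        if event_id in seen:       # redelivery under at-least-once semantics
            continue
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                handler(event)
            except Exception:
                if attempt == MAX_ATTEMPTS:
                    dlq.append(event)   # poison message: park it, keep consuming
            else:
                seen.add(event_id)
                break

applied: List[str] = []

def handler(event: Dict) -> None:
    if event.get("poison"):
        raise ValueError("cannot process event")
    applied.append(event["event_id"])

seen: Set[str] = set()
dlq: List[Dict] = []
consume([{"event_id": "e1"}, {"event_id": "e1"},
         {"event_id": "e2", "poison": True}, {"event_id": "e3"}],
        handler, seen, dlq)
print(applied, [e["event_id"] for e in dlq])  # ['e1', 'e3'] ['e2']
```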

1. Ingest Alert

HTTP

POST /api/v1/alerts
Content-Type: application/json
Authorization: Bearer <integration_token>

{
  "source": "prometheus",
  "alert_name": "HighCPUUsage",
  "severity": "critical",
  "service": "payment-service",
  "labels": {
    "env": "production",
    "cluster": "us-east-1"
  },
  "description": "CPU usage > 90% for 5 consecutive minutes",
  "dedup_key": "high_cpu_payment_us_east"
}

Response: 202 Accepted
{
  "alert_id": "8a219b1b-640a-4289-9812-42171542fca1",
  "status": "triggered"
}

2. Acknowledge Alert

HTTP

POST /api/v1/alerts/8a219b1b-640a-4289-9812-42171542fca1/acknowledge
Content-Type: application/json

{
  "acknowledged_by": "alice@company.com"
}

Response: 200 OK
{
  "alert_id": "8a219b1b-640a-4289-9812-42171542fca1",
  "status": "acknowledged",
  "acknowledged_at": "2026-03-14T10:05:12Z"
}

3. Retrieve Active On-Call Schedule

HTTP

GET /api/v1/schedules/team-backend-infra/on-call?at=2026-03-14T10:00:00Z

Response: 200 OK
{
  "team_id": "team-backend-infra",
  "primary": {
    "user_id": "user-881",
    "email": "alice@company.com",
    "phone": "+15550199"
  },
  "secondary": {
    "user_id": "user-902",
    "email": "bob@company.com",
    "phone": "+15550212"
  }
}

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff

PostgreSQL Database Schema

SQL

-- Primary alerts store
CREATE TABLE alerts (
    alert_id         UUID PRIMARY KEY,
    dedup_key        VARCHAR(256) NOT NULL,
    alert_name       VARCHAR(256) NOT NULL,
    severity         VARCHAR(20) NOT NULL,
    service          VARCHAR(128) NOT NULL,
    labels           JSONB,
    description      TEXT,
    status           VARCHAR(20) DEFAULT 'triggered', -- triggered, acknowledged, resolved
    escalation_level INT DEFAULT 0,
    team_id          UUID NOT NULL,
    acknowledged_by  VARCHAR(256),
    acknowledged_at  TIMESTAMP,
    resolved_at      TIMESTAMP,
    created_at       TIMESTAMP DEFAULT NOW()
);
-- Enforce single active alert per unique dedup_key (partial unique index)
CREATE UNIQUE INDEX idx_alerts_active_dedup ON alerts(dedup_key) WHERE status != 'resolved';
CREATE INDEX idx_alerts_status ON alerts(status) WHERE status = 'triggered';

-- Escalation Policy configuration table
CREATE TABLE escalation_policies (
    policy_id        UUID PRIMARY KEY,
    team_id          UUID NOT NULL,
    levels           JSONB NOT NULL  -- [{level: 0, wait_min: 5, channels: ["push", "sms"]}, ...]
);

-- Rotation Schedules table
CREATE TABLE on_call_schedules (
    schedule_id      UUID PRIMARY KEY,
    team_id          UUID NOT NULL,
    rotation_type    VARCHAR(20) NOT NULL, -- weekly, daily, custom
    participants     JSONB NOT NULL,       -- [{user_id: "usr-1", start: "timestamp", end: "timestamp"}]
    timezone         VARCHAR(64) DEFAULT 'UTC'
);

Concern Scenario	System Solution Design
Timer Worker Node Crashes	Timer state resides in highly-available Redis. New instances leverage leader election to resume polling pending keys immediately.
Telecom Provider Blackouts	Integrate redundant message gateways. Switch Twilio SMS to Vonage or MessageBird automatically on timeout.
Timer Resolution Drift	Synchronize system clocks with NTP across all hosts, keeping clock skew under 10 ms.

1. Concurrent Acknowledge Race Conditions ⭐

If two engineers click to acknowledge an alert simultaneously, both must be handled cleanly. We employ atomic Compare-And-Swap (CAS) updates:

Alice (primary) and Bob (secondary) both receive notifications.
Alice clicks ACK at T=5:00.000. Bob clicks ACK at T=5:00.050.

Atomic CAS Query Execution:
  UPDATE alerts 
  SET status = 'acknowledged', acknowledged_by = 'alice', acknowledged_at = NOW()
  WHERE alert_id = 'alert-9912' AND status = 'triggered';

Results:
- Alice's Query: Finds status = 'triggered' → Updates database. rows_affected = 1. Success!
  - Cancel any remaining escalation timers for the alert.
- Bob's Query: status is now 'acknowledged' → Query matches 0 rows. rows_affected = 0.
  - Return a friendly, non-fatal response: "Alert already acknowledged by Alice".
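
The same compare-and-swap pattern can be exercised end to end with Python's built-in `sqlite3` module. This sketch substitutes SQLite for PostgreSQL purely so it runs without a server; the `acknowledge` helper and the reduced table are illustrative, not the schema above.

```python
import sqlite3

def acknowledge(conn: sqlite3.Connection, alert_id: str, user: str) -> bool:
    """Return True only for the caller whose UPDATE flipped the row."""
    cur = conn.execute(
        "UPDATE alerts SET status = 'acknowledged', acknowledged_by = ?, "
        "acknowledged_at = CURRENT_TIMESTAMP "
        "WHERE alert_id = ? AND status = 'triggered'",
        (user, alert_id),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows means someone else already acknowledged

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alerts (alert_id TEXT PRIMARY KEY, status TEXT, "
             "acknowledged_by TEXT, acknowledged_at TEXT)")
conn.execute("INSERT INTO alerts (alert_id, status) VALUES ('alert-9912', 'triggered')")

alice_won = acknowledge(conn, "alert-9912", "alice")
bob_won = acknowledge(conn, "alert-9912", "bob")
print(alice_won, bob_won)  # True False
```

Only the winning caller should go on to cancel escalation timers; the losing caller can simply read back `acknowledged_by` to build its "already acknowledged" message.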

2. Alert Storm Failures ⭐

When a core database crashes, thousands of connection alerts can overwhelm downstream responders. We defend the team using multi-level filtering:

Mitigation Hierarchy:
1. Ingestion Deduplication:
   - Calculate a hash (dedup_key) from alert name + service + labels.
   - If an active alert exists with that key, discard/merge the payload, incrementing the counter instead of generating new pages.

2. Sliding-Window Grouping:
   - Window related alerts within 5 minutes into single consolidated incidents.

3. Downstream Correlation Engines:
   - Map dependencies (e.g. if the physical database is down, suppress downstream connection failure alarms).

4. Strict Personal Rate Limiting:
   - Cap paging frequency to max 10 pages per hour per responder.
   - Pages beyond the cap are queued into an hourly digest; P0 alerts bypass the cap and always page.
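
Steps 1 and 4 of the hierarchy can be sketched as follows. The hashing scheme (SHA-256 over a canonical JSON form) and the rolling one-hour window are assumptions chosen for illustration; the text specifies only which fields feed the key and the limit of 10 pages per hour.

```python
import hashlib
import json
from collections import defaultdict, deque
from typing import Deque, Dict

def dedup_key(alert_name: str, service: str, labels: Dict[str, str]) -> str:
    """Stable fingerprint: identical name, service, and labels give the same key."""
    canonical = json.dumps([alert_name, service, sorted(labels.items())])
    return hashlib.sha256(canonical.encode()).hexdigest()

class PageLimiter:
    """At most `limit` pages per responder per rolling window; P0 bypasses the cap."""

    def __init__(self, limit: int = 10, window_sec: int = 3600) -> None:
        self.limit = limit
        self.window_sec = window_sec
        self._sent: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, responder: str, now: float, priority: str = "P1") -> bool:
        if priority == "P0":
            return True
        sent = self._sent[responder]
        while sent and now - sent[0] >= self.window_sec:
            sent.popleft()             # forget pages older than the window
        if len(sent) >= self.limit:
            return False               # caller queues this page for the digest
        sent.append(now)
        return True

limiter = PageLimiter()
decisions = [limiter.allow("alice", now=float(i)) for i in range(11)]
print(decisions.count(True), decisions[-1])  # 10 False
```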

3. Handling Schedule Gaps ⭐

If team transitions leave a calendar gap where no responder is active, alarms must not be lost:

Alice's shift ends Saturday 18:00. Bob's shift starts Sunday 09:00.
An alert fires Saturday 23:00. Who gets paged?

Pre-emptive Health Checks:
- Run background cron jobs daily to scan all team schedules for gaps 7 days out.
- Alert administrators if schedule gaps are identified.

Active Escalation Resolution Rules:
1. Fallback to Secondary: Try paging the secondary on-call directly.
2. Fallback to Team Lead: Escalate to the team's designated Engineering Manager.
3. VP Emergency: Escalate to the VP of Engineering as the last-resort safety net.
4. Alerts MUST be delivered; ignoring them is not permitted.
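
The pre-emptive gap scan reduces to an interval sweep. The sketch below assumes shifts are `(start, end)` datetime pairs with an exclusive end; `find_gaps` is a hypothetical helper, and the dates are chosen so that 2026-03-14 is the Saturday from the example.

```python
from datetime import datetime
from typing import List, Tuple

Shift = Tuple[datetime, datetime]  # (start, end), end exclusive

def find_gaps(shifts: List[Shift], start: datetime, end: datetime) -> List[Shift]:
    """Return uncovered intervals within [start, end)."""
    gaps, cursor = [], start
    for shift_start, shift_end in sorted(shifts):
        if shift_start > cursor:
            gaps.append((cursor, min(shift_start, end)))
        cursor = max(cursor, shift_end)
        if cursor >= end:
            break
    if cursor < end:
        gaps.append((cursor, end))
    return gaps

shifts = [
    (datetime(2026, 3, 9, 9), datetime(2026, 3, 14, 18)),    # Alice, ends Saturday 18:00
    (datetime(2026, 3, 15, 9), datetime(2026, 3, 21, 18)),   # Bob, starts Sunday 09:00
]
gaps = find_gaps(shifts, datetime(2026, 3, 14), datetime(2026, 3, 16))
print(gaps)  # one gap: Saturday 18:00 to Sunday 09:00
```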

1. Interactive Voice Response (IVR) Verification ⭐

Phone calls provide the most reliable way to wake engineers during off-hours incidents. We implement Twilio Voice workflows:

Push notifications and SMS texts are easily ignored or suppressed by "Do Not Disturb" profiles. 
Voice calls force audible phone rings on most modern devices.

Twilio Interactive Voice Response (IVR) Implementation:
1. Alert ingested → place a voice call through the Twilio REST API.
2. Twilio establishes a connection and streams synthesized Text-to-Speech (TTS):
   "Alert: payment-service CPU usage exceeded 90%. 
    Press 1 to acknowledge this alert. Press 2 to escalate to the secondary responder."
3. User interaction captures DTMF tones:
   - User presses 1 → triggers Twilio webhook payload → updates DB using atomic CAS → terminates timer.
   - User hangs up or presses 2 → triggers immediate escalation, paging secondary.
4. Call Failure Recovery:
   - If the call hits busy signals or fails, retry within 60 seconds.
   - Max 3 call attempts per escalation level.
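
The branching in steps 3 and 4 fits in one small decision function. This sketch is provider-agnostic and does not call any Twilio API; the outcome strings are hypothetical labels, and escalating once the three attempts are exhausted is an assumption, since the text does not say what happens after the last retry.

```python
MAX_CALL_ATTEMPTS = 3  # per escalation level, from the text

def next_action(outcome: str, attempt: int) -> str:
    """Map a call outcome to the next step.

    outcome: 'digit:1', 'digit:2', 'hangup', 'busy', or 'failed' (illustrative labels)
    attempt: 1-based number of the call attempt that just finished
    """
    if outcome == "digit:1":
        return "acknowledge"          # CAS update, then cancel the timer
    if outcome in ("digit:2", "hangup"):
        return "escalate"             # page the secondary immediately
    if outcome in ("busy", "failed") and attempt < MAX_CALL_ATTEMPTS:
        return "retry"                # call again within 60 seconds
    return "escalate"                 # assumption: out of attempts or unknown outcome

print(next_action("busy", 1), next_action("busy", 3))  # retry escalate
```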

2. Schedule Overrides & Swap Handling

On-call engineers frequently need to cover slots for peers temporarily. The routing engine handles overrides dynamically:

Alice is on-call but has a doctor's appointment Tuesday 14:00 to 16:00. She swaps coverage with Bob.

System Data Architecture:
- on_call_schedules: Base rotations (weekly/daily).
- schedule_overrides: Table containing override records:
  { user: "bob", start: "Tuesday 14:00", end: "Tuesday 16:00", covers_for: "alice" }

Resolution Hierarchy:
1. Check schedule_overrides for active overrides matching current timestamp (highest priority).
2. If none, check the primary rotation schedules.
3. Fall back to designated managers if both paths are empty.
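
The three-step resolution order can be sketched as a single lookup. The `(user, start, end)` slot tuples and the `resolve_on_call` helper are illustrative; 2026-03-17 is used as the Tuesday in the example.

```python
from datetime import datetime
from typing import List, Optional, Tuple

Slot = Tuple[str, datetime, datetime]  # (user, start, end), end exclusive

def _active(slots: List[Slot], at: datetime) -> Optional[str]:
    for user, start, end in slots:
        if start <= at < end:
            return user
    return None

def resolve_on_call(at: datetime, overrides: List[Slot],
                    rotation: List[Slot], manager: str) -> str:
    """Overrides win, then the base rotation, then the designated manager."""
    return _active(overrides, at) or _active(rotation, at) or manager

rotation = [("alice", datetime(2026, 3, 16, 9), datetime(2026, 3, 23, 9))]
overrides = [("bob", datetime(2026, 3, 17, 14), datetime(2026, 3, 17, 16))]

print(resolve_on_call(datetime(2026, 3, 17, 15), overrides, rotation, "carol"))  # bob
print(resolve_on_call(datetime(2026, 3, 17, 13), overrides, rotation, "carol"))  # alice
```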

3. Achieving Five Nines (99.999% SLA) ⭐

Building a system that can be down for only about 5 minutes per year requires deep infrastructure redundancy:

Target: 99.999% Availability (at most about 5.26 minutes of downtime per year).

Architectural Path to Five Nines:
1. Active-Active Cross-Region Deployment:
   - Replicate primary database clusters synchronously across regions (e.g., us-east-1 and eu-west-1).
   - If a regional infrastructure failure occurs, Route 53 DNS failover shifts traffic to the healthy region automatically.

2. Provider Redundancy (No Single Point of Failure):
   - Push APIs: Support both Apple APNs and Google FCM pipelines.
   - SMS Gateways: Route through Twilio, fallback to MessageBird or Sinch if Twilio suffers latency.
   - Voice Calls: Route calls through Twilio, falling back to Vonage / Nexmo.

3. Chaos Engineering:
   - Regularly execute automated tests cutting regional data links, dropping primary database writes, or simulating complete Twilio API blackouts to confirm flawless fallback routing.
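
For reference, the downtime budget behind each availability target is a one-line calculation. The sketch assumes a 365-day year; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_min(availability_pct: float) -> float:
    """Allowed downtime in minutes per year for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_min(target):.2f} min/year")
# 99.9% -> 525.60, 99.99% -> 52.56, 99.999% -> 5.26
```

At five nines a single failover that takes a few minutes consumes the budget for the whole year, which is why the redundancy above has to be automatic rather than operator-driven.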

Interview Walkthrough

Clarify the core loop: alert ingested → match on-call schedule → notify primary → start escalation timer if unacknowledged.
Model on-call schedules as time-bounded rotations with override support for temporary swaps and holiday coverage.
Design an escalation ladder: push notification → SMS → phone call (IVR press-1-to-ack) with configurable per-step timeouts.
Deduplicate alerts by fingerprint (service + error signature) to prevent notification storms during cascading failures.
Use a durable timer engine (Redis ZSET or dedicated scheduler) so escalation steps fire even if the alerting service restarts.
Track delivery and acknowledgment state per channel — an unacked push should trigger the next escalation tier automatically.
For 99.999% SLA, describe multi-region active-active deployment with failover between external telephony providers (e.g., Twilio to Vonage).
Common pitfall: synchronous phone-call initiation blocking the alert ingestion path — the hot path must enqueue and return immediately.

1. Timer Engine Implementations

Approach	Precision	Durability	DB Load	Complexity
Redis Sorted Set ⭐	Bounded by poll interval (sub-second if polled that often)	Volatile (mitigated via AOF/Snapshotting)	Extremely Low	Low
Database Polling	Low (bounded by polling frequency)	High	High (frequent sequential queries)	Low
Kafka Delayed Messages	Low (requires bucketed queues)	High	None	High

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.

Phase 1: MVP (0 to 100K users)

Phase 2: Growth (100K to 10M users)

Phase 3: Scale (10M+ users)