System Design Problem

Design an On-Call Escalation System (like PagerDuty / OpsGenie)

Commonly Asked By: PagerDuty, Atlassian, Microsoft, Google

  • Alert ingestion: Accept incoming alarm webhooks from Prometheus, Datadog, CloudWatch, and custom monitoring backends.
  • On-call rotations: Define calendar schedules (weekly, daily, custom) with primary and secondary engineers.
  • Escalation chains: Automatically escalate alerts if the primary engineer fails to acknowledge within N minutes.
  • Multi-channel alerts: Page responders across push notifications, SMS texts, interactive voice calls, emails, and Slack.
  • Acknowledge & Resolve: Provide clear ACK APIs to silence active alerts, stopping downstream escalation loops.
  • Alert aggregation: Correlate related alerts into unified incidents to prevent on-call notification overload.
  • Maintenance silencing: Allow scheduling maintenance windows that suppress alerts during planned downtime.
  • Incident timeline: Log complete histories from initial trigger to final resolution, tracking team MTTR.
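
To make the rotation and escalation requirements concrete, here is a minimal Python sketch of how they might fit together. Everything in it is an assumption for illustration: the `Rotation` class, the `current_target` function, and the rule that the secondary is simply the next engineer in the hand-off order are hypothetical, not part of any real product's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Rotation:
    """A fixed-length on-call rotation (hypothetical model)."""
    engineers: tuple[str, ...]  # hand-off order
    start: datetime             # start of the first shift
    shift: timedelta            # e.g. 7 days for a weekly rotation

    def on_call(self, at: datetime, offset: int = 0) -> str:
        """Engineer on call at `at`; offset=1 yields the secondary."""
        shifts_elapsed = (at - self.start) // self.shift
        return self.engineers[(shifts_elapsed + offset) % len(self.engineers)]


def current_target(
    rotation: Rotation,
    triggered_at: datetime,
    now: datetime,
    ack_timeout: timedelta,
    acked: bool,
    max_level: int = 1,
) -> Optional[str]:
    """Who should be paged right now, or None once the alert is acknowledged.

    The escalation level rises by one for each `ack_timeout` that passes
    without an ACK, capped at `max_level` (0 = primary, 1 = secondary).
    """
    if acked:
        return None
    level = min((now - triggered_at) // ack_timeout, max_level)
    return rotation.on_call(now, offset=level)
```

With a weekly rotation of three engineers and a 5-minute ACK timeout, an alert that is 3 minutes old targets the primary, one that is 6 minutes old targets the secondary, and an acknowledged alert targets nobody.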

Alarms from Prometheus or Datadog land on the Alert Ingestion Service, which normalizes, validates, and de-duplicates them. Events flow through Kafka topics to Alert Routers and Escalation Engines. The stateful Escalation Engine tracks incident lifecycles, and a Redis Sorted Set manages escalation timers. Alerts that are not acknowledged in time route through Notification Routers across multiple delivery channels.
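
The timer mechanism can be sketched without a live Redis instance. The Python class below is an in-memory stand-in for the sorted set, assuming each incident ID is a member and its escalation deadline (epoch seconds) is the score. The class and method names are hypothetical; the comments note which Redis commands (`ZADD`, `ZREM`, `ZRANGEBYSCORE`) each method is meant to mirror.

```python
class EscalationTimers:
    """In-memory stand-in for a Redis Sorted Set of escalation deadlines."""

    def __init__(self) -> None:
        # member (incident ID) -> score (deadline in epoch seconds)
        self._deadlines: dict[str, float] = {}

    def schedule(self, incident_id: str, fire_at: float) -> None:
        # Like ZADD: re-adding an existing member overwrites its score,
        # so rescheduling the next escalation step is a single call.
        self._deadlines[incident_id] = fire_at

    def cancel(self, incident_id: str) -> bool:
        # Like ZREM: called on ACK so no further escalation fires.
        # Returns True if a timer was actually pending.
        return self._deadlines.pop(incident_id, None) is not None

    def pop_due(self, now: float) -> list[str]:
        # Like ZRANGEBYSCORE key -inf <now> followed by ZREM on each hit.
        # Results come back ordered by deadline, earliest first.
        due = sorted(
            (score, member)
            for member, score in self._deadlines.items()
            if score <= now
        )
        for _, member in due:
            del self._deadlines[member]
        return [member for _, member in due]
```

An Escalation Engine worker would call `pop_due` on a short polling interval and hand each returned incident to the Notification Router. Against a real Redis with several workers, the read-and-remove step would need to be atomic (for example via a Lua script) so that two workers do not escalate the same incident; this single-process sketch sidesteps that.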
