Design a Notification System (Push, Email, SMS) – System Design Walkthrough

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 25	Multi-channel fan-out (push/SMS/email), priority queues, idempotent dedup keys, and per-user rate limits. Separate hot path (enqueue) from delivery workers talking to FCM/APNs/Twilio/SES.
Arch 50	Template rendering pipeline, quiet hours, device token lifecycle, bounce handling, and poison-message DLQ for bad payloads.
Arch 75	Staff: global incident blast vs per-user caps, cross-channel dedup ('already notified via push'), and compliance (opt-out, regional SMS rules, PII in payload logs).

Interview Prompt

Design a notification system that sends push notifications, SMS, and email to users based on events (order shipped, friend request, security alert). Support priorities, user preferences, and delivery status tracking.

Clarifying Questions (ask before designing)

Question	Why it matters
Real-time (< 5s) or best-effort batch delivery?	Security alerts need priority queues; marketing email can lag minutes.
Exactly-once delivery or at-least-once with dedup?	Providers duplicate; idempotency keys and dedup store are standard.
Per-user daily caps and quiet hours?	Prevents notification fatigue and legal issues (TCPA, GDPR marketing consent).
Who produces events — 100 microservices or one monolith?	Kafka topic design, schema registry, and fan-in to notification service.

Scope

In scope

Multi-channel delivery (push, SMS, email)
Priority and scheduling
User preferences and opt-out
Dedup and per-user rate limiting
Delivery status / retry

Out of scope (state explicitly)

In-app notification UI component library
Full email marketing campaign builder
ML for send-time optimization

Assumptions

100M DAU, ~1B notifications/day (about 10 per user)
6:3:1 push : email : SMS mix by volume
At-least-once from event bus; dedup window 24h

Send notifications via multiple channels: Push (iOS/Android/Web), Email, SMS
Support both real-time and scheduled notifications
Support user preferences: users choose which channels they want, per notification type
Support template-based notifications with variable substitution
Support bulk/batch notifications (e.g., marketing campaign to 10M users)
Track notification delivery status: sent, delivered, read, failed, bounced
Rate limit notifications per user to prevent spam
Support notification grouping/bundling (e.g., "5 people liked your photo")
Priority levels: critical (immediately), high, normal, low (batched)

Metric	Calculation	Value
DAU	Given (product assumption)	100M
Notifications / day	100M DAU × 10/user	1B (10 per user/day average)
Notifications / sec	1B ÷ 86400	~12K (peak 5×: ~60K)
Push notifications	Assumed share of 1B/day	60% → 600M/day
Emails	Assumed share of 1B/day	30% → 300M/day
SMS	Assumed share of 1B/day	10% → 100M/day
Notification record size	Given	~500 bytes
Storage / day	1B × 500B	500 GB
Storage / year	500 GB × 365	~180 TB
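
The table's arithmetic can be reproduced in a few lines. This is only a check of the numbers above; the variable names are illustrative, and the 5× peak factor and channel split are the table's own assumptions.

```python
# Back-of-the-envelope check of the estimation table (inputs are the table's assumptions).
DAU = 100_000_000
PER_USER_PER_DAY = 10
RECORD_BYTES = 500
PEAK_FACTOR = 5
CHANNEL_MIX_PCT = {"push": 60, "email": 30, "sms": 10}

per_day = DAU * PER_USER_PER_DAY                      # 1B notifications/day
avg_per_sec = per_day / 86_400                        # seconds per day
peak_per_sec = avg_per_sec * PEAK_FACTOR
per_channel = {ch: per_day * pct // 100 for ch, pct in CHANNEL_MIX_PCT.items()}
storage_day_gb = per_day * RECORD_BYTES / 1e9         # decimal GB
storage_year_tb = storage_day_gb * 365 / 1e3          # decimal TB

print(f"{avg_per_sec:,.0f}/s avg, {peak_per_sec:,.0f}/s peak")
print(per_channel, storage_day_gb, storage_year_tb)
```

The average works out to roughly 11.6K/s, which the table rounds to ~12K, and the yearly storage to 182.5 TB, rounded to ~180 TB.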

Notification Service (API Layer)

Purpose: Entry point for all notification requests from internal services
Responsibilities:
1. Validate request (required fields, valid user IDs)
2. Check user preferences (does user want push? email?)
3. Apply rate limiting (max 10 push notifications/hour per user)
4. Check quiet hours (don't send at 3am unless critical)
5. Render template with variables
6. Fan out to appropriate Kafka topics per channel
Idempotency: Each notification has a request_id. Check Redis dedup cache before processing
Why separate service: Decouples notification logic from business services. Any service just sends a notification request; this service handles all complexity
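
The responsibilities above can be sketched as a single filtering function. This is a minimal sketch, not the service itself: `Request`, `process`, and the in-memory `seen_ids` / `sent_this_hour` structures are hypothetical stand-ins for the Redis dedup cache and rate counters, and template rendering (step 5) is left out.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    request_id: str
    user_id: str
    notification_type: str
    priority: str = "normal"
    channels: list = field(default_factory=lambda: ["push", "email", "sms"])

def process(req, prefs, seen_ids, sent_this_hour, in_quiet_hours, hourly_cap=10):
    """Return (channels, outcome) after validation, dedup, preferences, caps, quiet hours."""
    if not req.request_id or not req.user_id:
        return [], "invalid"
    if req.request_id in seen_ids:                     # idempotency check
        return [], "duplicate"
    seen_ids.add(req.request_id)
    allowed = [ch for ch in req.channels if prefs.get(ch, False)]
    allowed = [ch for ch in allowed if sent_this_hour.get(ch, 0) < hourly_cap]
    if not allowed:
        return [], "filtered"
    if in_quiet_hours and req.priority != "critical":
        return allowed, "deferred"                     # schedule for quiet-hours end
    for ch in allowed:                                 # count only what is enqueued now
        sent_this_hour[ch] = sent_this_hour.get(ch, 0) + 1
    return allowed, "enqueued"                         # caller publishes one message per channel
```

Keeping the function free of I/O makes the ordering of the checks easy to test; the real service would swap the sets and dicts for Redis and PostgreSQL lookups.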

User Preferences & Template Service

User Preferences: Stored in a user preferences DB (PostgreSQL)
- Per notification type (e.g., "marketing", "order_update", "social")
- Per channel (push: yes, email: yes, SMS: no)
- Quiet hours (10pm - 8am)
- Language preference
Templates: Stored in a template DB with version control
- Support variable substitution: Hello {{user_name}}, your order {{order_id}} has shipped
- Templates per channel (push is short, email is HTML, SMS is 160 chars)
- A/B testing support for template variants
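
Variable substitution for the `{{name}}` syntax shown above can be sketched with a regular expression. This is an illustrative helper, not a specific template library; `render` and its fail-on-missing-variable behavior are assumptions.

```python
import re

_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")

def render(template: str, variables: dict) -> str:
    """Replace each {{name}} with variables[name]; raise if a variable is missing."""
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing template variable: {name}")
        return str(variables[name])
    return _PLACEHOLDER.sub(substitute, template)

message = render(
    "Hello {{user_name}}, your order {{order_id}} has shipped",
    {"user_name": "Alice", "order_id": "ORD-12345"},
)
fits_single_sms = len(message) <= 160  # SMS templates are budgeted at 160 characters
```

Failing loudly on a missing variable is a deliberate choice here: sending "Hello {{user_name}}" to a customer is worse than routing the message to a DLQ.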

Kafka (Message Queue)

Why Kafka over RabbitMQ/SQS:
- Massive throughput (millions of messages/sec)
- Durable: messages persisted to disk with replication
- Consumer groups: easy to scale workers independently per channel
- Replay capability: if a worker has a bug, fix it and replay
Topics: Separate topic per channel (push_notifications, email_notifications, sms_notifications)
- Allows independent scaling of each channel's worker pool
- If email provider is slow, email queue backs up without affecting push
Partitioning: By user_id → ensures notifications for the same user are processed in order
Config: RF=3, min.insync.replicas=2, retention = 7 days
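
To illustrate why keying by user_id preserves per-user ordering: the producer hashes the key and takes it modulo the partition count, so the same user always lands on the same partition. The sketch below uses MD5 purely as a stand-in hash; Kafka's default Java partitioner uses murmur2, so the actual partition numbers will differ.

```python
import hashlib

def partition_for(user_id: str, num_partitions: int) -> int:
    """Map a key to a partition deterministically (stand-in hash, not Kafka's murmur2)."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every message keyed by "user123" goes to the same partition, hence the same consumer.
same = partition_for("user123", 64) == partition_for("user123", 64)
```

One consequence worth stating in an interview: changing the partition count remaps keys, so ordering guarantees only hold while the count is stable.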

Push Worker Pool

APNs (Apple Push Notification Service): For iOS devices. HTTP/2 persistent connections. Must handle token invalidation (user uninstalled app)
FCM (Firebase Cloud Messaging): For Android and web. REST API. Supports topic messaging for broadcast
Flow:
1. Consume from push_notifications topic
2. Look up device tokens for user from Device Token DB
3. Send to APNs/FCM
4. Handle response: success → update status; invalid token → remove token; rate limited → retry with backoff
Connection pooling: Maintain persistent HTTP/2 connections to APNs/FCM (connection setup is expensive)
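
The response handling in step 4 can be sketched as follows. `PushResult`, `deliver_push`, and the `send` callable are hypothetical; a real worker would map APNs/FCM response codes onto these three outcomes and sleep between retries.

```python
from enum import Enum

class PushResult(Enum):
    SUCCESS = "success"
    INVALID_TOKEN = "invalid_token"
    RATE_LIMITED = "rate_limited"

def deliver_push(user_id, tokens_by_user, send, max_attempts=3):
    """Send to every device token of a user; return {token: final_status}."""
    statuses = {}
    for token in list(tokens_by_user.get(user_id, ())):   # copy: we mutate the set below
        for attempt in range(max_attempts):
            result = send(token)
            if result is PushResult.SUCCESS:
                statuses[token] = "sent"
                break
            if result is PushResult.INVALID_TOKEN:
                tokens_by_user[user_id].discard(token)    # app uninstalled: drop the token
                statuses[token] = "token_removed"
                break
            # RATE_LIMITED: a real worker would sleep base * 2**attempt before retrying
        else:
            statuses[token] = "failed"                    # retries exhausted
    return statuses
```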

Email Worker Pool

Providers: SendGrid, AWS SES, Mailgun (use multiple for redundancy)
Flow:
1. Consume from email_notifications topic
2. Render HTML template
3. Send via primary provider (SendGrid)
4. If primary fails → failover to secondary (SES)
5. Track bounce/complaint callbacks via webhooks
Considerations: SPF, DKIM, DMARC for deliverability. Warm up IPs for bulk sends

SMS Worker Pool

Providers: Twilio, Nexmo/Vonage (use multiple; some are better in certain regions)
Flow: Similar to email worker
Considerations: SMS costs money ($0.01-0.05 per SMS). Apply strict rate limiting. Support country-specific routing (cheapest provider per country)

Delivery Tracker

Purpose: Receive delivery receipts from providers (webhooks/callbacks)
Tracks: QUEUED → SENT → DELIVERED → READ on the happy path; FAILED and BOUNCED are terminal states that can replace any later step
Webhooks: Providers call back with delivery status updates
Stores: Status updates in Cassandra notification log
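
Provider webhooks can arrive late, repeated, or out of order, so the tracker is easier to reason about as a small state machine. The sketch below is one possible rule set, assumed for illustration: statuses only move forward, and FAILED / BOUNCED are terminal.

```python
_RANK = {"QUEUED": 0, "SENT": 1, "DELIVERED": 2, "READ": 3}
_TERMINAL = {"FAILED", "BOUNCED"}

def apply_status(current: str, incoming: str) -> str:
    """Return the status to store after a webhook event; never move backwards."""
    if current in _TERMINAL or current == "READ":
        return current                    # nothing can follow a final state
    if incoming in _TERMINAL:
        return incoming
    return incoming if _RANK[incoming] > _RANK[current] else current
```

With this rule a delayed SENT callback cannot overwrite a DELIVERED status, and replaying the same webhook is harmless.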

Redis (Deduplication Cache)

Purpose: Prevent duplicate notifications
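How: SET notif:dedup:{request_id} 1 EX 86400 NX. The key is written only if it does not already exist; a nil reply means this request_id was seen in the last 24 hours, so the request is a duplicate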
Also used for: Rate limiting counters per user per channel
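
The set-if-absent-with-TTL behavior can be illustrated without a Redis server. `DedupCache` below is a hypothetical in-memory stand-in that mimics the semantics of `SET key 1 EX ttl NX`; it is not thread-safe and exists only to show the logic.

```python
import time

class DedupCache:
    """In-memory stand-in for Redis `SET key 1 EX ttl NX`."""

    def __init__(self, ttl_seconds=86_400, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = {}

    def first_time(self, request_id: str) -> bool:
        """True if the request_id is new (or its key expired); False for a duplicate."""
        now = self._clock()
        key = f"notif:dedup:{request_id}"
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False                      # key still live: duplicate
        self._expires_at[key] = now + self._ttl
        return True
```

With the redis-py client the same check is a single call, `r.set(key, 1, nx=True, ex=86400)`, which returns a truthy value when the key was written and `None` when it already existed. The check and the write happen atomically on the server, which is what makes it safe across many workers.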

Event Bus Design (Kafka)

Topic: notification_system-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (user_id / order_id — preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "notification_system-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: notification_system-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Rule: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 202
  Async path: consumers update caches, indexes, notifications, aggregates

Send Notification

POST /api/v1/notifications
Authorization: Bearer <service_token>

{
  "request_id": "uuid-v4",        // idempotency key
  "user_ids": ["user123", "user456"],
  "notification_type": "order_shipped",
  "priority": "high",             // critical, high, normal, low
  "channels": ["push", "email"],  // optional; if omitted, use user prefs
  "template_id": "order_shipped_v2",
  "template_vars": {
    "order_id": "ORD-12345",
    "tracking_url": "https://track.ly/abc"
  },
  "scheduled_at": null,           // null = immediate
  "metadata": {
    "campaign_id": "spring_sale_2026"
  }
}

Response: 202 Accepted
{
  "notification_id": "notif-uuid",
  "status": "queued",
  "channels_targeted": ["push", "email"]
}

Get Notification Status

GET /api/v1/notifications/{notification_id}

Response: 200 OK
{
  "notification_id": "notif-uuid",
  "user_id": "user123",
  "channels": {
    "push": {"status": "delivered", "delivered_at": "..."},
    "email": {"status": "sent", "sent_at": "..."}
  }
}

Update User Preferences

PUT /api/v1/users/{user_id}/notification-preferences
{
  "channels": {
    "push": true,
    "email": true,
    "sms": false
  },
  "quiet_hours": {
    "start": "22:00",
    "end": "08:00",
    "timezone": "America/New_York"
  },
  "notification_types": {
    "marketing": {"push": false, "email": true},
    "social": {"push": true, "email": false},
    "order_updates": {"push": true, "email": true, "sms": true}
  }
}

Get User's Notification History

GET /api/v1/users/{user_id}/notifications?page=1&limit=20

Response: 200 OK
{
  "notifications": [
    {
      "notification_id": "...",
      "type": "order_shipped",
      "title": "Your order has shipped!",
      "body": "Order ORD-12345 is on its way.",
      "channel": "push",
      "status": "read",
      "created_at": "2026-03-13T10:00:00Z"
    }
  ],
  "pagination": {"page": 1, "total": 150}
}

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
202 Accepted (not an error): notification queued; poll GET /api/v1/notifications/{notification_id} for status
408 Request Timeout: server timed out waiting for the complete request; retry with the same request_id

PostgreSQL: User Preferences

Why PostgreSQL: Relational data with clear schema, strong consistency needed for preferences.

CREATE TABLE user_notification_preferences (
    user_id             UUID PRIMARY KEY,
    push_enabled        BOOLEAN DEFAULT TRUE,
    email_enabled       BOOLEAN DEFAULT TRUE,
    sms_enabled         BOOLEAN DEFAULT FALSE,
    quiet_hours_start   TIME,
    quiet_hours_end     TIME,
    timezone            VARCHAR(64),
    language            VARCHAR(10) DEFAULT 'en',
    updated_at          TIMESTAMP
);

CREATE TABLE user_type_preferences (
    user_id             UUID,
    notification_type   VARCHAR(64),
    push_enabled        BOOLEAN DEFAULT TRUE,
    email_enabled       BOOLEAN DEFAULT TRUE,
    sms_enabled         BOOLEAN DEFAULT FALSE,
    PRIMARY KEY (user_id, notification_type)
);

Cassandra: Notification Log

Why Cassandra: High write throughput (billions of notifications), time-series access pattern (user's recent notifications), TTL support.

CREATE TABLE notification_log (
    user_id           UUID,
    created_at        TIMESTAMP,
    notification_id   UUID,
    notification_type VARCHAR,
    channel           VARCHAR,
    title             TEXT,
    body              TEXT,
    status            VARCHAR,   -- queued, sent, delivered, read, failed
    metadata          MAP<TEXT, TEXT>,
    PRIMARY KEY (user_id, created_at, notification_id)
) WITH CLUSTERING ORDER BY (created_at DESC, notification_id ASC)
  AND default_time_to_live = 7776000;  -- 90 days retention

Redis: Dedup & Rate Limiting

# Deduplication
Key:    notif:dedup:{request_id}
Value:  1
TTL:    86400 (24 hours)

# Rate limiting (per user per channel)
Key:    notif:rate:{user_id}:{channel}:{hour}
Value:  counter (INCR)
TTL:    3600
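
The per-hour counter can be sketched as a fixed-window limiter. `HourlyRateLimiter` is a hypothetical in-memory stand-in: the dict increment mirrors Redis INCR, and in Redis the 3600-second TTL would expire old windows, which this sketch does not do.

```python
import time

class HourlyRateLimiter:
    """Fixed-window limiter keyed like notif:rate:{user_id}:{channel}:{hour}."""

    def __init__(self, limit=10, clock=time.time):
        self._limit = limit
        self._clock = clock
        self._counters = {}

    def allow(self, user_id: str, channel: str) -> bool:
        hour = int(self._clock() // 3600)               # window index since epoch
        key = f"notif:rate:{user_id}:{channel}:{hour}"
        count = self._counters.get(key, 0) + 1          # mirrors Redis INCR
        self._counters[key] = count
        return count <= self._limit
```

A fixed window is simple but allows up to twice the limit across a window boundary (a burst at the end of one hour and another at the start of the next); a sliding window or token bucket avoids that at the cost of more state.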

Redis: Device Tokens

Key:    device_tokens:{user_id}
Value:  SET of {device_token, platform, app_version, last_active}

Alternatively, store in PostgreSQL if you need complex queries.

Kafka Topics

Topic: push_notifications    (partitioned by user_id)
Topic: email_notifications   (partitioned by user_id)
Topic: sms_notifications     (partitioned by user_id)
Topic: notification_status   (delivery status callbacks)

Message Schema (push_notifications):
{
  "notification_id": "uuid",
  "user_id": "user123",
  "title": "Your order has shipped!",
  "body": "Order ORD-12345 is on its way.",
  "data": {"order_id": "ORD-12345", "deep_link": "app://orders/12345"},
  "priority": "high",
  "created_at": "2026-03-13T10:00:00Z"
}

MySQL: Notification Templates

CREATE TABLE notification_templates (
    template_id     VARCHAR(128) NOT NULL,
    version         INT NOT NULL,
    channel         ENUM('push', 'email', 'sms') NOT NULL,
    subject         TEXT,           -- for email
    title           TEXT,           -- for push
    body_template   TEXT,           -- "Hello {{user_name}}, ..."
    html_template   TEXT,           -- for email HTML
    language        VARCHAR(10) NOT NULL,
    active          BOOLEAN,
    created_at      TIMESTAMP,
    -- one row per version/channel/language; a key on template_id alone would allow only one row per template
    PRIMARY KEY (template_id, version, channel, language)
);

General

Technique	Application
Kafka durability	RF=3, min.insync.replicas=2. Notifications survive broker failures
Consumer group rebalancing	If a push worker dies, Kafka rebalances partitions to surviving workers
Retry with exponential backoff	On provider errors (APNs, SendGrid timeouts)
Dead Letter Queue (DLQ)	After N retries, move to DLQ for manual investigation
Idempotent processing	Dedup by notification_id in Redis before sending
Circuit breaker	Per provider — if Twilio is down, stop sending to it, alert ops
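
Retry with exponential backoff and a DLQ can be sketched together. This is a minimal sketch under assumptions: `process_with_dlq`, the list-based `dlq`, and the full-jitter delay formula are illustrative choices, and the default `sleep` is a no-op so the example runs instantly (a real worker would pass `time.sleep`).

```python
import random

def backoff_delays(max_retries=3, base=0.5, cap=30.0, rng=random.random):
    """Exponential backoff with full jitter: delay i is in [0, min(cap, base * 2**i))."""
    return [rng() * min(cap, base * 2 ** i) for i in range(max_retries)]

def process_with_dlq(message, handler, dlq, max_retries=3, sleep=lambda seconds: None):
    """Run handler once plus up to max_retries retries; on exhaustion append to dlq."""
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):
        try:
            return handler(message)
        except Exception as exc:
            if attempt == max_retries:
                dlq.append({"message": message, "error": repr(exc)})
                return None
            sleep(delays[attempt])
```

Jitter matters when many workers fail at once: without it they all retry in lockstep and hit the recovering provider with synchronized bursts.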

Problem-Specific Fault Tolerance

1. Provider Outage (e.g., SendGrid is down)

Circuit breaker trips after 5 consecutive failures
Automatic failover to secondary provider (AWS SES)
Provider abstraction layer makes switching transparent
Queue keeps growing → process backlog when provider recovers

2. Push Token Invalidation

APNs/FCM return "invalid token" (user uninstalled app)
Worker removes the invalid token from device token store
If all tokens invalid → can't send push; fall back to email/SMS based on preferences

3. Duplicate Notifications

Kafka consumer commits offset after processing → if crash before commit, message re-processed
Solution: Check Redis dedup cache (notification_id) before calling provider
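Also: some providers help on the device side (APNs coalesces notifications that share an apns-collapse-id into a single displayed notification), but treat this as a backstop, not a substitute for the dedup check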

4. Notification Storm (Bulk Campaign)

Marketing sends a campaign to 50M users at once
Solution:
- Separate Kafka topic/partition for bulk vs. transactional notifications
- Bulk notifications are throttled (processed at controlled rate)
- Transactional notifications (order confirmation) always prioritized

5. User Device Offline

Push notification is sent to APNs/FCM but device is offline
APNs/FCM handle this: they store the notification and deliver when device comes online
We track status as "sent" (not "delivered") until device acks

Notification Grouping / Bundling

Instead of "User A liked your photo", "User B liked your photo" × 50 times
Group into: "User A, User B, and 48 others liked your photo"
Implementation: Hold notifications in a buffer (a Redis sorted set per user, scored by arrival time) for 5 minutes. If multiple notifications of the same type arrive in that window, merge them. A timer then sends the single bundled notification
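
The merge step can be sketched as a pure function over the buffered actors. `bundle_text` and its `max_named` parameter are hypothetical; the buffer and timer are outside the sketch.

```python
def bundle_text(actors, action="liked your photo", max_named=2):
    """Merge buffered actors for one (user, notification type) into a single message."""
    unique = list(dict.fromkeys(actors))      # de-duplicate while keeping arrival order
    if not unique:
        return ""
    if len(unique) <= max_named:
        return f"{' and '.join(unique)} {action}"
    others = len(unique) - max_named
    noun = "other" if others == 1 else "others"
    return f"{', '.join(unique[:max_named])}, and {others} {noun} {action}"

likers = ["User A", "User B"] + [f"User {i}" for i in range(48)]
print(bundle_text(likers))  # User A, User B, and 48 others liked your photo
```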

Priority Queue Implementation

Kafka topics by priority:
  notifications_critical  → immediate processing, dedicated worker pool
  notifications_high      → normal processing
  notifications_normal    → best-effort
  notifications_low       → batched processing (hourly digest)

Analytics

Track delivery rate per channel, per provider
Track open rates, click-through rates for emails
Track notification-to-action conversion
Store in ClickHouse for dashboarding

Quiet Hours / Timezone Handling

Store user's timezone in preferences
Before sending, check if current time in user's timezone is within quiet hours
If yes → schedule for delivery at quiet hours end (unless priority = critical)
Use a scheduled notification queue backed by a distributed scheduler
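
The quiet-hours check has one subtle case: a window such as 22:00 to 08:00 wraps past midnight. A minimal sketch, assuming the function names below; the example uses a fixed UTC-5 offset to stay self-contained, whereas a real service would pass `zoneinfo.ZoneInfo("America/New_York")` so daylight saving time is handled.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo

def in_quiet_hours(now_utc: datetime, start: time, end: time, tz: tzinfo) -> bool:
    """True if the user's local time falls inside [start, end), wrapping past midnight."""
    local = now_utc.astimezone(tz).time()
    if start <= end:
        return start <= local < end
    return local >= start or local < end      # e.g. 22:00 -> 08:00

def should_defer(now_utc, start, end, tz, priority) -> bool:
    """Critical notifications bypass quiet hours; everything else is deferred."""
    return priority != "critical" and in_quiet_hours(now_utc, start, end, tz)

eastern_fixed = timezone(timedelta(hours=-5))             # stand-in for a named zone
night = datetime(2026, 3, 13, 7, 0, tzinfo=timezone.utc)  # 02:00 local
defer_marketing = should_defer(night, time(22, 0), time(8, 0), eastern_fixed, "normal")
```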

Unsubscribe / Compliance

Every email must have an unsubscribe link (CAN-SPAM / GDPR)
SMS requires opt-in (TCPA compliance)
Push can be disabled at OS level (handle gracefully)
Maintain a suppression list (bounced emails, unsubscribed users)

Interview Walkthrough

Clarify channels upfront — push, email, SMS, in-app — each has different latency, cost, and delivery guarantees.
Separate the hot path (accept notification request, return 202) from the async delivery pipeline via a durable message queue.
Design per-channel workers with provider-specific rate limits (APNs, FCM, SendGrid, Twilio) and circuit breakers on provider failures.
Store user preferences and suppression lists before enqueueing — GDPR/CAN-SPAM compliance is a first-class filter, not an afterthought.
Use idempotency keys on the enqueue API so retries from client apps do not duplicate notifications.
Quantify throughput with a back-of-the-envelope estimate (~12K/s average, ~60K/s peak), then explain how you would batch similar notifications and prioritize transactional over marketing traffic.
Common pitfall: synchronous delivery in the API handler — a slow SMS provider blocks the entire request and causes cascading timeouts.

Push vs Pull for Notification Delivery

Push (Server-initiated) ⭐:
  Server pushes notification to client via APNs/FCM/WebSocket
  ✓ Real-time delivery (< 1 second)
  ✓ No wasted bandwidth (only sent when there's something to send)
  ✗ Requires persistent connection or OS-level push service
  ✗ APNs/FCM can throttle or drop notifications under load
  Best for: Real-time alerts, chat messages, critical notifications

Pull (Client-initiated):
  Client polls server: "Any new notifications?"
  ✓ Simple server-side implementation
  ✓ Client controls frequency
  ✗ Wastes bandwidth (most polls return nothing)
  ✗ Latency = poll interval (if polling every 30s, avg delay = 15s)
  Best for: Non-real-time (email digests, weekly summaries)

Hybrid:
  Push a lightweight "you have notifications" signal → client pulls full details
  ✓ Push is tiny (no payload, just a trigger)
  ✓ Client fetches exactly what it needs
  Used by: Instagram, Twitter (push wake-up → pull feed)

Why Kafka Per Channel (Not a Single Topic)?

Single topic approach:
  All notifications → one "notifications" topic → one consumer group

  Problem: SMS delivery takes 2s; push takes 50ms; email takes 500ms
  A burst of 100K emails BLOCKS push notifications waiting in the same queue

Per-channel topics ⭐:
  push_notifications   → fast consumers (50ms avg)
  email_notifications  → medium consumers (500ms avg)
  sms_notifications    → slow consumers (2s avg)

  Each channel scales independently:
    Push: 20 consumers (high volume, fast)
    Email: 50 consumers (high volume, medium speed)
    SMS: 10 consumers (lower volume, slow provider)

  Bonus: If email provider is down, only email topic backs up.
         Push and SMS continue normally. Circuit breaker per channel.

At-Least-Once vs Exactly-Once vs At-Most-Once

At-Most-Once:
  Send and forget. If delivery fails, don't retry.
  ✓ No duplicates ever
  ✗ Notifications can be lost
  Use for: Marketing, non-critical (better to miss than annoy with duplicates)

At-Least-Once ⭐ (Recommended for most):
  Retry on failure. May result in duplicates.
  ✓ No notification is ever lost
  ✗ User might get "You have a new message" twice
  Mitigation: Client-side deduplication by notification_id
  Use for: Transactional (order confirmation, OTP, alerts)

Exactly-Once:
  Guarantee each notification delivered exactly once.
  ✗ Not achievable end to end across third-party providers; approximated as at-least-once delivery plus idempotency and dedup ("effectively once")
  ✗ Higher latency (need to check dedup before every delivery)
  Implementation:
    Store notification_id in Redis SET → before sending, check if already sent
    TTL = 24 hours (dedup window)
  Use for: Financial alerts, critical one-time codes

Provider Failover Strategy

For each channel, maintain primary and fallback providers:
  Email:  Primary: SendGrid → Fallback: AWS SES  → Fallback: Mailgun
  SMS:    Primary: Twilio   → Fallback: AWS SNS  → Fallback: MessageBird
  Push:   Primary: APNs/FCM → (no fallback — platform-specific)

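Failover logic (Python-style pseudocode):
  if circuit_breaker.is_open(primary):
    fallback_provider.send(notification)       # skip a provider known to be down
  else:
    try:
      retry(primary_provider.send, notification, max_retries=2, backoff="exponential")
    except (ProviderError, TimeoutError):      # tuple form; "except A, B:" is invalid in Python 3
      circuit_breaker.record_failure(primary)
      fallback_provider.send(notification)     # retries exhausted: deliver via fallback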

Circuit breaker thresholds:
  - Open after 5 consecutive failures OR > 50% error rate in last 60s
  - Half-open after 30 seconds → try 1 request → if success, close
  - Closed (healthy) → normal operation
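
The closed / open / half-open cycle can be sketched as a small class. This is a simplified illustration, not a production breaker: it implements only the consecutive-failure threshold (not the 50% error-rate rule), it does not limit half-open traffic to a single probe, and `CircuitBreaker` and `send_with_failover` are hypothetical names.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._reset_timeout:
            return "half_open"                 # timeout elapsed: allow a trial request
        return "open"

    def allow_request(self) -> bool:
        return self.state != "open"

    def record_success(self):
        self._failures = 0
        self._opened_at = None                 # close the breaker

    def record_failure(self):
        self._failures += 1
        if self.state == "half_open" or self._failures >= self._threshold:
            self._opened_at = self._clock()    # (re)open and restart the timer

def send_with_failover(notification, primary, fallback, breaker):
    """Try the primary unless its breaker is open; otherwise, or on error, use the fallback."""
    if breaker.allow_request():
        try:
            result = primary(notification)
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
    return fallback(notification)
```

Injecting the clock keeps the 30-second half-open transition testable without sleeping.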

Why not always use the cheapest provider?
  - Reliability varies (provider-specific outages)
  - Deliverability differs (SES has better inbox placement than some)
  - Regional performance (Twilio better in US, MessageBird better in EU)
  - Cost optimization: Route 80% through cheapest, 20% through most reliable

Template Engine: Server-Side vs Client-Side Rendering

Server-side rendering (compile template + data → final content):
  Template: "Hi {{name}}, your order #{{order_id}} is confirmed!"
  Data: {name: "Alice", order_id: "12345"}
  Output: "Hi Alice, your order #12345 is confirmed!"

  ✓ Works for all channels (email, SMS, push — all get final content)
  ✓ Consistent rendering regardless of client
  ✗ Template changes require redeployment (unless stored in DB)

  Best practice: Store templates in DB (versioned), compile at runtime
  Cache compiled templates in Redis (TTL = 5 min)

Client-side rendering:
  Send template + data separately → client renders
  ✓ Reduces payload size (template cached on client)
  ✗ Only works for push/in-app (not email/SMS)
  ✗ Client must handle rendering logic

SLOs & Error Budgets

Metric	Target	Rationale
P0 delivery latency	p99 < 5s	Security and fraud alerts
Transactional delivery success	99.5%	After retries; excludes invalid tokens
Dedup false-send rate	0	Duplicate OTP is UX and cost disaster
Ingress availability	99.99%	Enqueue must survive provider outages

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
FCM global outage	Push worker error rate 100%, provider status page	Fail over to SMS for P0 only if user opted in; queue push with TTL; extend retry window; status comms
Bug sends duplicate billing emails to 2M users	SES complaint rate spike, support tickets	Kill email consumer; dedup retroactive by campaign_id; apology campaign; root cause in idempotency key gap
Redis dedup cluster down	Redis error-rate and latency alerts; if the dedup check fails open, duplicate sends follow	Fail closed (drop non-P0) or fallback to DB dedup with higher latency; restore Redis; reconcile duplicate metrics

Cost Drivers (Staff lens)

SMS: $0.01–0.08 per message × volume — dominant if SMS ratio high
SES email: cheap at scale but attachment storage adds up
Kafka retention for 7-day replay and audit

Multi-Region & DR

Users homed to region for data residency; events processed in home region. Cross-region only for global P0 (security) with explicit routing. Provider credentials per region (EU SMS sender ID). Async replication of preferences is eventual — accept stale opt-out up to 60s with fail-closed on marketing.