This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 25 | Multi-channel fan-out (push/SMS/email), priority queues, idempotent dedup keys, and per-user rate limits. Separate hot path (enqueue) from delivery workers talking to FCM/APNs/Twilio/SES. |
| Arch 50 | Template rendering pipeline, quiet hours, device token lifecycle, bounce handling, and poison-message DLQ for bad payloads. |
| Arch 75 | Staff: global incident blast vs per-user caps, cross-channel dedup ('already notified via push'), and compliance (opt-out, regional SMS rules, PII in payload logs). |
Interview Prompt
Design a notification system that sends push notifications, SMS, and email to users based on events (order shipped, friend request, security alert). Support priorities, user preferences, and delivery status tracking.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Real-time (< 5s) or best-effort batch delivery? | Security alerts need priority queues; marketing email can lag minutes. |
| Exactly-once delivery or at-least-once with dedup? | The event bus and provider retries can both produce duplicates, so at-least-once with idempotency keys and a dedup store is the standard answer. |
| Per-user daily caps and quiet hours? | Prevents notification fatigue and legal issues (TCPA, GDPR marketing consent). |
| Who produces events — 100 microservices or one monolith? | Kafka topic design, schema registry, and fan-in to notification service. |
Scope
In scope
- Multi-channel delivery (push, SMS, email)
- Priority and scheduling
- User preferences and opt-out
- Dedup and per-user rate limiting
- Delivery status / retry
Out of scope (state explicitly)
- In-app notification UI component library
- Full email marketing campaign builder
- ML for send-time optimization
Assumptions
- 100M DAU, 500M notifications/day peak
- Roughly 6:3:1 push : email : SMS mix by volume
- At-least-once from event bus; dedup window 24h
Functional Requirements
- Send notifications via multiple channels: Push (iOS/Android/Web), Email, SMS
- Support both real-time and scheduled notifications
- Support user preferences: users choose which channels they want, per notification type
- Support template-based notifications with variable substitution
- Support bulk/batch notifications (e.g., marketing campaign to 10M users)
- Track notification delivery status: sent, delivered, read, failed, bounced
- Rate limit notifications per user to prevent spam
- Support notification grouping/bundling (e.g., "5 people liked your photo")
- Priority levels: critical (immediately), high, normal, low (batched)
Non-Functional Requirements
- High Throughput: Send 1M+ notifications per minute
- Low Latency: Real-time notifications delivered within 1-2 seconds
- Reliability: No notification should be lost (at-least-once delivery)
- Scalability: Handle 100M+ users, billions of notifications/day
- Fault Tolerant: If one channel (e.g., SMS provider) is down, other channels still work
- Ordered: Notifications for a user should arrive in chronological order (best-effort)
- Idempotent: Same notification should not be sent twice to the same user
- Extensible: Easy to add new channels (WhatsApp, Slack, etc.)
Back-of-the-Envelope Estimation
| Metric | Calculation | Value |
|---|---|---|
| DAU | Given (product assumption) | 100M |
| Notifications / day | 100M DAU × 10/user | 1B (10 per user/day average) |
| Notifications / sec | 1B ÷ 86400 | ~12K (peak 5×: ~60K) |
| Push notifications | 1B × 60% (assumed mix) | 600M/day |
| Emails | 1B × 30% (assumed mix) | 300M/day |
| SMS | 1B × 10% (assumed mix) | 100M/day |
| Notification record size | Given | ~500 bytes |
| Storage / day | 1B × 500 B | 500 GB |
| Storage / year | 500 GB × 365 | ~180 TB |
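The arithmetic in the table can be reproduced with a few lines of Python. This is only a sketch: the inputs (100M DAU, 10 notifications per user per day, a 60/30/10 channel mix, 500-byte records, a 5× peak factor) are the table's assumptions, and the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400

# Inputs taken from the estimation table (all are assumptions, not measurements).
dau = 100_000_000
per_user_per_day = 10
record_bytes = 500
peak_factor = 5
channel_mix_pct = {"push": 60, "email": 30, "sms": 10}

per_day = dau * per_user_per_day                      # total notifications per day
avg_per_sec = per_day / SECONDS_PER_DAY               # average send rate
peak_per_sec = avg_per_sec * peak_factor              # assumed peak send rate
per_channel = {ch: per_day * pct // 100 for ch, pct in channel_mix_pct.items()}
storage_per_day_gb = per_day * record_bytes / 1e9     # decimal gigabytes
storage_per_year_tb = storage_per_day_gb * 365 / 1e3  # decimal terabytes

if __name__ == "__main__":
    print(f"{per_day:,} notifications/day")
    print(f"{avg_per_sec:,.0f}/sec average, {peak_per_sec:,.0f}/sec peak")
    print(per_channel)
    print(f"{storage_per_day_gb:.0f} GB/day, {storage_per_year_tb:.1f} TB/year")
```

The yearly figure comes out slightly above the rounded ~180 TB in the table because 500 GB × 365 is 182.5 TB.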
Notification Service (API Layer)
- Purpose: Entry point for all notification requests from internal services
- Responsibilities:
  - Validate request (required fields, valid user IDs)
  - Check user preferences (does user want push? email?)
  - Apply rate limiting (max 10 push notifications/hour per user)
  - Check quiet hours (don't send at 3am unless critical)
  - Render template with variables
  - Fan out to appropriate Kafka topics per channel
- Idempotency: Each notification has a request_id. Check Redis dedup cache before processing
- Why separate service: Decouples notification logic from business services. Any service just sends a notification request; this service handles all complexity
User Preferences & Template Service
- User Preferences: Stored in a user preferences DB (PostgreSQL)
  - Per notification type (e.g., "marketing", "order_update", "social")
  - Per channel (push: yes, email: yes, SMS: no)
  - Quiet hours (10pm - 8am)
  - Language preference
- Templates: Stored in a template DB with version control
  - Support variable substitution: Hello {{user_name}}, your order {{order_id}} has shipped
  - Templates per channel (push is short, email is HTML, SMS is 160 chars)
  - A/B testing support for template variants
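Variable substitution for the {{name}} placeholder style shown above could look roughly like the following. It is a minimal sketch, not a full template engine: the function names `render` and `render_sms` are hypothetical, and the SMS check simply counts characters against the 160-character limit mentioned above without handling multi-part or non-GSM encodings.

```python
import re

# Matches {{ name }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, variables: dict) -> str:
    """Replace every {{name}} placeholder; fail loudly if a variable is missing."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in variables:
            raise KeyError(f"missing template variable: {key}")
        return str(variables[key])

    return PLACEHOLDER.sub(substitute, template)


def render_sms(template: str, variables: dict, limit: int = 160) -> str:
    """Render an SMS body and reject it if it exceeds the per-message limit."""
    text = render(template, variables)
    if len(text) > limit:
        raise ValueError(f"SMS body is {len(text)} chars, limit is {limit}")
    return text


if __name__ == "__main__":
    print(render(
        "Hello {{user_name}}, your order {{order_id}} has shipped",
        {"user_name": "Alice", "order_id": "ORD-12345"},
    ))
```

Failing on a missing variable at render time keeps a half-filled message from reaching a provider; a production system would more likely validate variables when the request is accepted.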
Kafka (Message Queue)
- Why Kafka over RabbitMQ/SQS:
  - Massive throughput (millions of messages/sec)
  - Durable: messages persisted to disk with replication
  - Consumer groups: easy to scale workers independently per channel
  - Replay capability: if a worker has a bug, fix it and replay
- Topics: Separate topic per channel (push_notifications, email_notifications, sms_notifications)
  - Allows independent scaling of each channel's worker pool
  - If email provider is slow, email queue backs up without affecting push
- Partitioning: By user_id → ensures notifications for the same user are processed in order
- Config: RF=3, min.insync.replicas=2, retention = 7 days
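To illustrate why keying by user_id preserves per-user order, here is a small sketch of key-based partition selection. It is an illustration only: Kafka's default partitioner hashes the key bytes with murmur2, whereas this stand-in uses MD5 from the standard library, and `partition_for` is a hypothetical helper name.

```python
import hashlib


def partition_for(user_id: str, num_partitions: int) -> int:
    """Map a key to a partition deterministically (stand-in for Kafka's murmur2 hash)."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


if __name__ == "__main__":
    # Every message for the same user lands on the same partition, and a
    # partition is consumed in order by exactly one worker in the group.
    for user in ("user123", "user456", "user123"):
        print(user, "->", partition_for(user, 64))
```

Because the mapping depends on the partition count, changing the number of partitions remaps keys, so ordering only holds for messages produced under the same partition count.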
Push Worker Pool
- APNs (Apple Push Notification Service): For iOS devices. HTTP/2 persistent connections. Must handle token invalidation (user uninstalled app)
- FCM (Firebase Cloud Messaging): For Android and web. REST API. Supports topic messaging for broadcast
- Flow:
  - Consume from push_notifications topic
  - Look up device tokens for user from Device Token DB
  - Send to APNs/FCM
  - Handle response: success → update status; invalid token → remove token; rate limited → retry with backoff
- Connection pooling: Maintain persistent HTTP/2 connections to APNs/FCM (connection setup is expensive)
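The "handle response" step of the push flow can be sketched as a pure decision function. Everything named here is an assumption for illustration: `PushResult`, `Action`, the "retrying" and "dead_letter" statuses, and the limit of 5 attempts are not part of APNs or FCM, which report these conditions through their own response codes.

```python
import random
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PushResult(Enum):
    SUCCESS = "success"
    INVALID_TOKEN = "invalid_token"
    RATE_LIMITED = "rate_limited"


@dataclass
class Action:
    status: str                        # value to record in the notification log
    remove_token: bool = False         # delete the device token from the token store
    retry_in: Optional[float] = None   # seconds to wait before re-enqueueing


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def handle_push_result(result: PushResult, attempt: int, max_attempts: int = 5) -> Action:
    """Decide what the worker does after one provider call (attempt is zero-based)."""
    if result is PushResult.SUCCESS:
        return Action(status="sent")
    if result is PushResult.INVALID_TOKEN:
        return Action(status="failed", remove_token=True)
    # Rate limited: retry until the attempt budget is spent, then dead-letter.
    if attempt + 1 >= max_attempts:
        return Action(status="dead_letter")
    return Action(status="retrying", retry_in=backoff_delay(attempt))
```

Jitter spreads retries out so that many workers throttled at the same moment do not all hit the provider again at once.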
Email Worker Pool
- Providers: SendGrid, AWS SES, Mailgun (use multiple for redundancy)
- Flow:
  - Consume from email_notifications topic
  - Render HTML template
  - Send via primary provider (SendGrid)
  - If primary fails → failover to secondary (SES)
  - Track bounce/complaint callbacks via webhooks
- Considerations: SPF, DKIM, DMARC for deliverability. Warm up IPs for bulk sends
SMS Worker Pool
- Providers: Twilio, Nexmo/Vonage (use multiple; some are better in certain regions)
- Flow: Similar to email worker
- Considerations: SMS costs money ($0.01-0.05 per SMS). Apply strict rate limiting. Support country-specific routing (cheapest provider per country)
Delivery Tracker
- Purpose: Receive delivery receipts from providers (webhooks/callbacks)
- Tracks: QUEUED → SENT → DELIVERED → READ on the success path, with FAILED and BOUNCED as terminal error states
- Webhooks: Providers call back with delivery status updates
- Stores: Status updates in Cassandra notification log
Redis (Deduplication Cache)
- Purpose: Prevent duplicate notifications
- How: SET notification:{request_id} 1 EX 86400 NX. With NX the write is rejected when the key already exists, which marks the request as a duplicate
- Also used for: Rate limiting counters per user per channel
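The dedup check can be sketched as below. To keep the example runnable without a Redis server, `FakeRedis` is a hypothetical in-memory stand-in that mimics only the `set(name, value, ex=..., nx=True)` call shape of redis-py, which returns a truthy value when the key was written and `None` when NX blocked the write. `first_time_seen` is likewise an illustrative name.

```python
import time


class FakeRedis:
    """In-memory stand-in that mimics only SET with the NX and EX options."""

    def __init__(self, clock=time.monotonic):
        self._expires_at = {}
        self._clock = clock

    def set(self, name, value, ex=None, nx=False):
        now = self._clock()
        expires_at = self._expires_at.get(name)
        key_is_live = expires_at is not None and expires_at > now
        if nx and key_is_live:
            return None  # NX: refuse to overwrite an existing key
        self._expires_at[name] = now + ex if ex is not None else float("inf")
        return True


def first_time_seen(client, request_id: str, ttl_seconds: int = 86_400) -> bool:
    """Return True only for the first request_id seen inside the dedup window."""
    return bool(client.set(f"notification:{request_id}", 1, ex=ttl_seconds, nx=True))


if __name__ == "__main__":
    client = FakeRedis()
    print(first_time_seen(client, "req-1"))  # first sighting: process it
    print(first_time_seen(client, "req-1"))  # repeat inside the window: skip it
```

Doing the check and the write in one atomic command matters: a separate GET followed by SET would let two workers both see the key as absent and both send.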
Event Bus Design (Kafka)
Topic: notification_system-events
Partitions: 64 (scale consumers horizontally)
Partition key: entity_id (user_id / order_id; preserves per-entity ordering)
Retention: 7 days (compliance) or 24h (high-volume telemetry)
Replication factor: 3, min.insync.replicas: 2
Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "notification_system-processors"
- At-least-once delivery + idempotent handlers (dedup by event_id)
- DLQ topic: notification_system-events-dlq (poison messages after 3 retries)
- Lag alert: consumer lag > 60s → scale workers
Design a Notification System (Push, Email, SMS): async side effects MUST NOT block the synchronous API response.
Sync path: validate → persist source of truth → publish event → return 202
Async path: consumers update caches, indexes, notifications, aggregates
Send Notification
POST /api/v1/notifications
Authorization: Bearer <service_token>
{
  "request_id": "uuid-v4", // idempotency key
  "user_ids": ["user123", "user456"],
  "notification_type": "order_shipped",
  "priority": "high", // critical, high, normal, low
  "channels": ["push", "email"], // optional; if omitted, use user prefs
  "template_id": "order_shipped_v2",
  "template_vars": {
    "order_id": "ORD-12345",
    "tracking_url": "https://track.ly/abc"
  },
  "scheduled_at": null, // null = immediate
  "metadata": {
    "campaign_id": "spring_sale_2026"
  }
}
Response: 202 Accepted
{
  "notification_id": "notif-uuid",
  "status": "queued",
  "channels_targeted": ["push", "email"]
}
Get Notification Status
GET /api/v1/notifications/{notification_id}
Response: 200 OK
{
  "notification_id": "notif-uuid",
  "user_id": "user123",
  "channels": {
    "push": {"status": "delivered", "delivered_at": "..."},
    "email": {"status": "sent", "sent_at": "..."}
  }
}
Update User Preferences
PUT /api/v1/users/{user_id}/notification-preferences
{
  "channels": {
    "push": true,
    "email": true,
    "sms": false
  },
  "quiet_hours": {
    "start": "22:00",
    "end": "08:00",
    "timezone": "America/New_York"
  },
  "notification_types": {
    "marketing": {"push": false, "email": true},
    "social": {"push": true, "email": false},
    "order_updates": {"push": true, "email": true, "sms": true}
  }
}
Get User's Notification History
GET /api/v1/users/{user_id}/notifications?page=1&limit=20
Response: 200 OK
{
  "notifications": [
    {
      "notification_id": "...",
      "type": "order_shipped",
      "title": "Your order has shipped!",
      "body": "Order ORD-12345 is on its way.",
      "channel": "push",
      "status": "read",
      "created_at": "2026-03-13T10:00:00Z"
    }
  ],
  "pagination": {"page": 1, "total": 150}
}
Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
202 Accepted: notification queued; poll GET /api/v1/notifications/{notification_id} for status
408 Request Timeout: the server timed out waiting for the request; retry with the same idempotency key
PostgreSQL: User Preferences
Why PostgreSQL: Relational data with clear schema, strong consistency needed for preferences.
CREATE TABLE user_notification_preferences (
  user_id UUID PRIMARY KEY,
  push_enabled BOOLEAN DEFAULT TRUE,
  email_enabled BOOLEAN DEFAULT TRUE,
  sms_enabled BOOLEAN DEFAULT FALSE,
  quiet_hours_start TIME,
  quiet_hours_end TIME,
  timezone VARCHAR(64),
  language VARCHAR(10) DEFAULT 'en',
  updated_at TIMESTAMP
);
CREATE TABLE user_type_preferences (
  user_id UUID,
  notification_type VARCHAR(64),
  push_enabled BOOLEAN DEFAULT TRUE,
  email_enabled BOOLEAN DEFAULT TRUE,
  sms_enabled BOOLEAN DEFAULT FALSE,
  PRIMARY KEY (user_id, notification_type)
);
Cassandra: Notification Log
Why Cassandra: High write throughput (billions of notifications), time-series access pattern (user's recent notifications), TTL support.
CREATE TABLE notification_log (
  user_id UUID,
  created_at TIMESTAMP,
  notification_id UUID,
  notification_type VARCHAR,
  channel VARCHAR,
  title TEXT,
  body TEXT,
  status VARCHAR, -- queued, sent, delivered, read, failed
  metadata MAP<TEXT, TEXT>,
  PRIMARY KEY (user_id, created_at, notification_id)
) WITH CLUSTERING ORDER BY (created_at DESC)
  AND default_time_to_live = 7776000; -- 90 days retention
Redis: Dedup & Rate Limiting
# Deduplication
Key: notif:dedup:{request_id}
Value: 1
TTL: 86400 (24 hours)
# Rate limiting (per user per channel)
Key: notif:rate:{user_id}:{channel}:{hour}
Value: counter (INCR)
TTL: 3600
Redis: Device Tokens
Key: device_tokens:{user_id}
Value: SET of {device_token, platform, app_version, last_active}
Alternatively, store in PostgreSQL if you need complex queries.
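The per-user, per-channel rate-limit key described above (notif:rate:{user_id}:{channel}:{hour}) behaves like a fixed-window counter. The sketch below mirrors that logic with an in-memory dictionary standing in for Redis INCR plus a TTL; the class name `FixedWindowLimiter` and the limit of 10 per hour (taken from the rate-limiting example earlier in this guide) are illustrative assumptions.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Fixed-window counter; the dict stands in for Redis INCR with a 1-hour TTL."""

    def __init__(self, limit_per_hour: int):
        self.limit = limit_per_hour
        self._counters = defaultdict(int)

    def allow(self, user_id: str, channel: str, epoch_seconds: float) -> bool:
        hour_bucket = int(epoch_seconds // 3600)
        key = f"notif:rate:{user_id}:{channel}:{hour_bucket}"
        self._counters[key] += 1  # like INCR, this also counts rejected attempts
        return self._counters[key] <= self.limit


if __name__ == "__main__":
    limiter = FixedWindowLimiter(limit_per_hour=10)
    decisions = [limiter.allow("user123", "push", 1_000.0) for _ in range(11)]
    print(decisions)  # the 11th send in the same hour is rejected
```

A fixed window is simple but allows a burst of up to twice the limit across a window boundary; a sliding window or token bucket avoids that at the cost of more state.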
Kafka Topics
Topic: push_notifications (partitioned by user_id)
Topic: email_notifications (partitioned by user_id)
Topic: sms_notifications (partitioned by user_id)
Topic: notification_status (delivery status callbacks)
Message Schema (push_notifications):
{
  "notification_id": "uuid",
  "user_id": "user123",
  "title": "Your order has shipped!",
  "body": "Order ORD-12345 is on its way.",
  "data": {"order_id": "ORD-12345", "deep_link": "app://orders/12345"},
  "priority": "high",
  "created_at": "2026-03-13T10:00:00Z"
}
MySQL: Notification Templates
CREATE TABLE notification_templates (
  template_id VARCHAR(128) NOT NULL,
  version INT NOT NULL,
  channel ENUM('push', 'email', 'sms') NOT NULL,
  subject TEXT, -- for email
  title TEXT, -- for push
  body_template TEXT, -- "Hello {{user_name}}, ..."
  html_template TEXT, -- for email HTML
  language VARCHAR(10) NOT NULL,
  active BOOLEAN,
  created_at TIMESTAMP,
  -- one row per template, version, channel, and language
  PRIMARY KEY (template_id, version, channel, language)
);
General
| Technique | Application |
|---|---|
| Kafka durability | RF=3, min.insync.replicas=2. Notifications survive broker failures |
| Consumer group rebalancing | If a push worker dies, Kafka rebalances partitions to surviving workers |
| Retry with exponential backoff | On provider errors (APNs, SendGrid timeouts) |
| Dead Letter Queue (DLQ) | After N retries, move to DLQ for manual investigation |
| Idempotent processing | Dedup by notification_id in Redis before sending |
| Circuit breaker | Per provider — if Twilio is down, stop sending to it, alert ops |
Problem-Specific Fault Tolerance
1. Provider Outage (e.g., SendGrid is down)
- Circuit breaker trips after 5 consecutive failures
- Automatic failover to secondary provider (AWS SES)
- Provider abstraction layer makes switching transparent
- Queue keeps growing → process backlog when provider recovers
2. Push Token Invalidation
- APNs/FCM return "invalid token" (user uninstalled app)
- Worker removes the invalid token from device token store
- If all tokens invalid → can't send push; fall back to email/SMS based on preferences
3. Duplicate Notifications
- Kafka consumer commits offset after processing → if crash before commit, message re-processed
- Solution: Check Redis dedup cache (notification_id) before calling provider
- Also: some providers can coalesce repeats on the device (APNs has apns-collapse-id)
4. Notification Storm (Bulk Campaign)
- Marketing sends a campaign to 50M users at once
- Solution:
- Separate Kafka topic/partition for bulk vs. transactional notifications
- Bulk notifications are throttled (processed at controlled rate)
- Transactional notifications (order confirmation) always prioritized
5. User Device Offline
- Push notification is sent to APNs/FCM but device is offline
- APNs/FCM handle this: they store the notification and deliver when device comes online
- We track status as "sent" (not "delivered") until device acks
Notification Grouping / Bundling
- Instead of "User A liked your photo", "User B liked your photo" × 50 times
- Group into: "User A, User B, and 48 others liked your photo"
- Implementation: Hold notifications in a buffer (Redis sorted set by user_id) for 5 minutes. If multiple notifications of same type arrive, merge them. Timer triggers the bundled notification
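The merge step can be sketched as follows, assuming the buffered events have already been read out of the buffer when the timer fires. The tuple layout `(recipient_id, notification_type, actor_name)` and the function names are hypothetical.

```python
from collections import defaultdict


def group_events(events):
    """Group buffered (recipient_id, notification_type, actor_name) tuples."""
    groups = defaultdict(list)
    for recipient_id, notification_type, actor_name in events:
        groups[(recipient_id, notification_type)].append(actor_name)
    return groups


def bundle_title(actors, action: str) -> str:
    """Build a title such as 'User A, User B, and 48 others liked your photo'."""
    unique = list(dict.fromkeys(actors))  # drop repeat actors, keep arrival order
    if not unique:
        raise ValueError("no actors to bundle")
    if len(unique) == 1:
        return f"{unique[0]} {action}"
    if len(unique) == 2:
        return f"{unique[0]} and {unique[1]} {action}"
    others = len(unique) - 2
    noun = "other" if others == 1 else "others"
    return f"{unique[0]}, {unique[1]}, and {others} {noun} {action}"


if __name__ == "__main__":
    events = [("owner1", "photo_like", f"User {i}") for i in range(1, 51)]
    for (recipient, _type), actors in group_events(events).items():
        print(recipient, "->", bundle_title(actors, "liked your photo"))
```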
Priority Queue Implementation
Kafka topics by priority:
notifications_critical → immediate processing, dedicated worker pool
notifications_high → normal processing
notifications_normal → best-effort
notifications_low → batched processing (hourly digest)
Analytics
- Track delivery rate per channel, per provider
- Track open rates, click-through rates for emails
- Track notification-to-action conversion
- Store in ClickHouse for dashboarding
Quiet Hours / Timezone Handling
- Store user's timezone in preferences
- Before sending, check if current time in user's timezone is within quiet hours
- If yes → schedule for delivery at quiet hours end (unless priority = critical)
- Use a scheduled notification queue backed by a distributed scheduler
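The quiet-hours check above has one subtle case: a window such as 22:00 to 08:00 wraps past midnight. The sketch below handles that case and computes the release time. The function names are illustrative, and the caller is assumed to pass a `tzinfo` for the user's stored timezone (for example `zoneinfo.ZoneInfo("America/New_York")`); daylight-saving transitions inside the window are not treated specially here.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_quiet_hours(local: time, start: time, end: time) -> bool:
    """True if local wall-clock time falls inside [start, end), wrapping midnight."""
    if start <= end:
        return start <= local < end
    return local >= start or local < end  # e.g. 22:00 to 08:00


def deliver_at(now_utc: datetime, tz: tzinfo, start: time, end: time,
               priority: str) -> datetime:
    """Return now for immediate delivery, or the UTC instant quiet hours end."""
    local_now = now_utc.astimezone(tz)
    if priority == "critical" or not in_quiet_hours(local_now.time(), start, end):
        return now_utc
    release = local_now.replace(hour=end.hour, minute=end.minute,
                                second=0, microsecond=0)
    if release <= local_now:
        release += timedelta(days=1)  # window ends tomorrow morning
    return release.astimezone(timezone.utc)


if __name__ == "__main__":
    eastern_standard = timezone(timedelta(hours=-5))  # fixed offset for the demo
    now = datetime(2026, 3, 13, 8, 0, tzinfo=timezone.utc)  # 03:00 local
    print(deliver_at(now, eastern_standard, time(22, 0), time(8, 0), "normal"))
```

The returned instant is what would be handed to the scheduled notification queue as the delivery time.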
Unsubscribe / Compliance
- Every email must have an unsubscribe link (CAN-SPAM / GDPR)
- SMS requires opt-in (TCPA compliance)
- Push can be disabled at OS level (handle gracefully)
- Maintain a suppression list (bounced emails, unsubscribed users)
Interview Walkthrough
- Clarify channels upfront — push, email, SMS, in-app — each has different latency, cost, and delivery guarantees.
- Separate the hot path (accept notification request, return 202) from the async delivery pipeline via a durable message queue.
- Design per-channel workers with provider-specific rate limits (APNs, FCM, SendGrid, Twilio) and circuit breakers on provider failures.
- Store user preferences and suppression lists before enqueueing — GDPR/CAN-SPAM compliance is a first-class filter, not an afterthought.
- Use idempotency keys on the enqueue API so retries from client apps do not duplicate notifications.
- Quantify throughput with a back-of-the-envelope estimate, then batch similar notifications and prioritize transactional over marketing.
- Common pitfall: synchronous delivery in the API handler — a slow SMS provider blocks the entire request and causes cascading timeouts.
Push vs Pull for Notification Delivery
Push (Server-initiated) ⭐:
Server pushes notification to client via APNs/FCM/WebSocket
✓ Real-time delivery (< 1 second)
✓ No wasted bandwidth (only sent when there's something to send)
✗ Requires persistent connection or OS-level push service
✗ APNs/FCM can throttle or drop notifications under load
Best for: Real-time alerts, chat messages, critical notifications
Pull (Client-initiated):
Client polls server: "Any new notifications?"
✓ Simple server-side implementation
✓ Client controls frequency
✗ Wastes bandwidth (most polls return nothing)
✗ Latency = poll interval (if polling every 30s, avg delay = 15s)
Best for: Non-real-time (email digests, weekly summaries)
Hybrid:
Push a lightweight "you have notifications" signal → client pulls full details
✓ Push is tiny (no payload, just a trigger)
✓ Client fetches exactly what it needs
Used by: Instagram, Twitter (push wake-up → pull feed)
Why Kafka Per Channel (Not a Single Topic)?
Single topic approach:
All notifications → one "notifications" topic → one consumer group
Problem: SMS delivery takes 2s; push takes 50ms; email takes 500ms
A burst of 100K emails BLOCKS push notifications waiting in the same queue
Per-channel topics ⭐:
push_notifications → fast consumers (50ms avg)
email_notifications → medium consumers (500ms avg)
sms_notifications → slow consumers (2s avg)
Each channel scales independently:
Push: 20 consumers (high volume, fast)
Email: 50 consumers (high volume, medium speed)
SMS: 10 consumers (lower volume, slow provider)
Bonus: If email provider is down, only email topic backs up.
Push and SMS continue normally. Circuit breaker per channel.
At-Least-Once vs Exactly-Once vs At-Most-Once
At-Most-Once:
Send and forget. If delivery fails, don't retry.
✓ No duplicates ever
✗ Notifications can be lost
Use for: Marketing, non-critical (better to miss than annoy with duplicates)
At-Least-Once ⭐ (Recommended for most):
Retry on failure. May result in duplicates.
✓ No notification is ever lost
✗ User might get "You have a new message" twice
Mitigation: Client-side deduplication by notification_id
Use for: Transactional (order confirmation, OTP, alerts)
Exactly-Once:
Guarantee each notification delivered exactly once.
✗ Extremely hard in distributed systems (requires idempotency + dedup)
✗ Higher latency (need to check dedup before every delivery)
Implementation:
Store notification_id in Redis SET → before sending, check if already sent
TTL = 24 hours (dedup window)
Use for: Financial alerts, critical one-time codes
Provider Failover Strategy
For each channel, maintain primary and fallback providers:
Email: Primary: SendGrid → Fallback: AWS SES → Fallback: Mailgun
SMS: Primary: Twilio → Fallback: AWS SNS → Fallback: MessageBird
Push: Primary: APNs/FCM → (no fallback — platform-specific)
Failover logic:
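try:
    primary_provider.send(notification)
except (ProviderError, TimeoutError):
    # Count the failure toward the breaker's threshold.
    circuit_breaker.record_failure(primary_provider)
    if circuit_breaker.is_open(primary_provider):
        # Primary is considered down: route to the fallback.
        fallback_provider.send(notification)
    else:
        # Transient error: retry the primary with exponential backoff.
        retry(primary_provider, notification, max_retries=2, backoff="exponential")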
Circuit breaker thresholds:
- Open after 5 consecutive failures OR > 50% error rate in last 60s
- Half-open after 30 seconds → try 1 request → if success, close
- Closed (healthy) → normal operation
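A runnable sketch of these thresholds is shown below. It is simplified on purpose: only the "5 consecutive failures" rule is implemented (not the 50% error-rate rule), the half-open state allows any request after the cool-down rather than exactly one probe, and failed sends skip straight to the next provider instead of retrying. `CircuitBreaker` and `send_with_failover` are hypothetical names, and providers are modeled as plain callables.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed (healthy)

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Half-open: let a trial request through once the cool-down has passed.
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed probe


def send_with_failover(notification, providers, breakers) -> str:
    """Try providers in priority order; return the name of the one that sent.

    providers: list of (name, send_callable); breakers: dict of name -> breaker.
    """
    for name, send in providers:
        breaker = breakers[name]
        if not breaker.allow_request():
            continue  # breaker open: skip this provider without calling it
        try:
            send(notification)
        except Exception:
            breaker.record_failure()
            continue
        breaker.record_success()
        return name
    raise RuntimeError("all providers unavailable")
```

Once the primary's breaker is open, requests stop paying the primary's timeout and go directly to the fallback, which is the main latency benefit of the pattern.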
Why not always use the cheapest provider?
- Reliability varies (provider-specific outages)
- Deliverability differs (SES has better inbox placement than some)
- Regional performance (Twilio better in US, MessageBird better in EU)
- Cost optimization: Route 80% through cheapest, 20% through most reliable
Template Engine: Server-Side vs Client-Side Rendering
Server-side rendering (compile template + data → final content):
Template: "Hi {{name}}, your order #{{order_id}} is confirmed!"
Data: {name: "Alice", order_id: "12345"}
Output: "Hi Alice, your order #12345 is confirmed!"
✓ Works for all channels (email, SMS, push — all get final content)
✓ Consistent rendering regardless of client
✗ Template changes require redeployment (unless stored in DB)
Best practice: Store templates in DB (versioned), compile at runtime
Cache compiled templates in Redis (TTL = 5 min)
Client-side rendering:
Send template + data separately → client renders
✓ Reduces payload size (template cached on client)
✗ Only works for push/in-app (not email/SMS)
✗ Client must handle rendering logic
Staff interviews expect you to articulate how the system evolves under real growth, not jump straight to the final architecture.
Phase 1 — Monolith + direct provider calls
Synchronous send on event handler. Works to ~100/sec then blocks request path.
Key components: Monolith · FCM/SES SDK inline
Move to next phase when: Provider latency and failures cascade to core API
Phase 2 — Async pipeline
Kafka ingress, channel workers, PostgreSQL for preferences and delivery log, Redis dedup + rate limits.
Key components: Kafka · Channel workers · PostgreSQL · Redis
Move to next phase when: Priority mixing and multi-region users need scheduling
Phase 3 — Global + compliance
Regional SMS sender IDs, template versioning, digest queue, analytics on delivery funnel. Separate blast infrastructure from transactional tier.
Key components: Regional providers · Digest service · Blast tier · Opt-out sync
Move to next phase when: Regulatory audit requires provable consent and channel-specific caps
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| P0 delivery latency | p99 < 5s | Security and fraud alerts |
| Transactional delivery success | 99.5% | After retries; excludes invalid tokens |
| Dedup false-send rate | 0 | Duplicate OTP is UX and cost disaster |
| Ingress availability | 99.99% | Enqueue must survive provider outages |
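To make the targets in the table concrete, an error budget can be derived from each SLO. The sketch below assumes a 30-day window and, for the success-rate example, a round volume of 1M transactional sends; both are illustrative choices rather than figures from this guide.

```python
def allowed_downtime_minutes(availability: float, days: int = 30) -> float:
    """Minutes of unavailability permitted by an availability SLO over the window."""
    return (1.0 - availability) * days * 24 * 60


def allowed_failures(success_target: float, total_sends: int) -> int:
    """Number of failed deliveries permitted by a success-rate SLO."""
    return int(round((1.0 - success_target) * total_sends))


if __name__ == "__main__":
    print(f"{allowed_downtime_minutes(0.9999):.2f} minutes per 30 days at 99.99%")
    print(f"{allowed_failures(0.995, 1_000_000)} failures per 1M sends at 99.5%")
```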
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| FCM global outage | Push worker error rate 100%, provider status page | Fail over to SMS for P0 only if user opted in; queue push with TTL; extend retry window; status comms |
| Bug sends duplicate billing emails to 2M users | SES complaint rate spike, support tickets | Kill email consumer; dedup retroactive by campaign_id; apology campaign; root cause in idempotency key gap |
| Redis dedup cluster down | Dedup check fails open → duplicate sends | Fail closed (drop non-P0) or fallback to DB dedup with higher latency; restore Redis; reconcile duplicate metrics |
Cost Drivers (Staff lens)
- SMS: $0.01–0.08 per message × volume — dominant if SMS ratio high
- SES email: cheap at scale but attachment storage adds up
- Kafka retention for 7-day replay and audit
Multi-Region & DR
Users homed to region for data residency; events processed in home region. Cross-region only for global P0 (security) with explicit routing. Provider credentials per region (EU SMS sender ID). Async replication of preferences is eventual — accept stale opt-out up to 60s with fail-closed on marketing.