Design an Email Service (like Gmail)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 75	Staff level: multi-region, cost at scale, migration path, and production metrics.

Interview Prompt

Design Email Service (like Gmail).

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: SMTP relay, Mailbox storage, Search indexing?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

SMTP relay
Mailbox storage
Search indexing
Spam filtering (Bayesian + ML)
Threading
Attachment storage

Out of scope (state explicitly)

Detailed frontend/UI pixel implementation
Org structure, staffing, and hiring plan

Assumptions

Clarify scale (DAU, QPS, data volume) for the email service in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Functional Requirements

Send and receive emails (SMTP) with attachments (up to 25 MB)
Inbox, Sent, Drafts, Spam, Trash folders + custom labels/folders
Full-text search across all emails (subject, body, sender, attachments)
Conversation threading: group related emails into threads
Spam filtering using ML + rule-based system
Push notifications for new emails
Rich text compose (HTML email) with inline images
Contact management and autocomplete
Filters and rules: auto-label, auto-archive, auto-forward
Calendar integration (event invitations, RSVP)

Metric	Calculation	Value
Users	Given (assumption)	1B
Emails sent / day	Given (assumption)	300B (50% spam)
Emails received per user / day	Given (assumption)	50 (after spam filtering)
Avg email size	Given (typical workload assumption)	50 KB (body) + 200 KB avg attachment
Storage per user	Given (assumption)	15 GB
Total storage	1B × 15 GB	15 EB
Emails / sec (inbound)	300B ÷ 86,400	≈ 3.5M (average, before any peak factor)
Search queries / sec	Given (assumption; daily query volume not stated)	100K
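
The derived rows in the table can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the table's own inputs and assumes decimal units (1 GB = 10^9 bytes, 1 EB = 10^18 bytes); the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400

users = 1_000_000_000          # 1B users (from the table)
emails_per_day = 300 * 10**9   # 300B emails/day, roughly half of it spam
quota_bytes = 15 * 10**9       # 15 GB per user, decimal units assumed

# Average inbound rate; a peak factor would be multiplied on top of this.
avg_inbound_per_sec = emails_per_day / SECONDS_PER_DAY

# Provisioned storage if every user filled the full quota.
total_storage_eb = users * quota_bytes / 10**18

print(f"{avg_inbound_per_sec / 1e6:.2f}M emails/sec (average)")
print(f"{total_storage_eb:.0f} EB total storage")
```

The average comes out just under 3.5M emails/sec, so the table's 3.5M figure is an average rate and not a peak.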


Email Send Flow

1. User clicks "Send" → POST /api/messages/send
2. Validate: recipient addresses are well-formed (existence can only be checked for local recipients), attachment size ≤ 25 MB, rate limit check
3. Store email body + attachments to Blob Store (S3)
4. Store email metadata to Bigtable/Cassandra
5. Enqueue to send queue (Kafka topic: outgoing-emails)
6. SMTP Sender Worker picks up from queue:
   a. DNS MX lookup: recipient domain → find receiving mail server
   b. Connect to the receiving server and upgrade the session to TLS (STARTTLS)
   c. Sign the message with the DKIM key for the sender's domain
   d. Transmit email via SMTP protocol
   e. Receiving server ACKs → mark as delivered
   f. If rejected → generate bounce (DSN) → deliver to sender's inbox
7. If temporary failure: retry with exponential backoff (1min → 72h)
   After 72h of retries → permanent failure → bounce to sender

Optimization for internal emails (sender@gmail → recipient@gmail):
   Skip SMTP entirely → directly store in recipient's mailbox → 100ms delivery
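
The retry policy in step 7 can be sketched as a schedule generator. The 1-minute start and 72-hour deadline come from the flow above; the doubling factor and the 6-hour cap on a single delay are assumptions made for illustration, and `backoff_schedule` is a hypothetical helper name. Production senders usually add random jitter, which is omitted here to keep the output deterministic.

```python
def backoff_schedule(base_s=60, factor=2.0, cap_s=6 * 3600, deadline_s=72 * 3600):
    """Return a list of (attempt, delay_s, elapsed_s) retries that fit in the deadline."""
    schedule = []
    elapsed = 0.0
    delay = float(base_s)
    attempt = 1
    # Keep scheduling retries while the next one still lands inside the deadline.
    while elapsed + delay <= deadline_s:
        elapsed += delay
        schedule.append((attempt, delay, elapsed))
        delay = min(delay * factor, cap_s)  # exponential growth, capped (assumed cap)
        attempt += 1
    return schedule


if __name__ == "__main__":
    for attempt, delay, elapsed in backoff_schedule():
        print(f"retry {attempt:2d}: wait {delay / 60:6.1f} min, elapsed {elapsed / 3600:5.1f} h")
    # Once the schedule is exhausted, the worker would generate the bounce (DSN).
```

Capping the individual delay keeps late retries reasonably frequent, so a receiving server that recovers after a day is not left waiting for a single very long sleep.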

Incoming Email Flow (SMTP Receive)

1. External sender's MTA connects to our SMTP gateway (MX record)
2. SMTP handshake: EHLO, MAIL FROM, RCPT TO
3. Before accepting DATA:
   a. SPF check: is sender's IP authorized for their domain?
   b. Rate limiting: too many emails from this IP? → 421 temporary reject
   c. Recipient exists? → 550 user unknown if not
4. Accept DATA → email content streamed
5. DKIM verification: check cryptographic signature
6. DMARC evaluation: combine SPF + DKIM → pass/fail policy
7. Spam classification: ML model scores email (0-1)
   Score > 0.85 → spam folder; 0.5-0.85 → show warning; < 0.5 → inbox
8. Virus scan: check attachments for malware (ClamAV)
9. Store body to Blob Store, metadata to Bigtable
10. Index in Elasticsearch for search
11. Push notification to recipient (if enabled)
12. Return 250 OK to sender's MTA
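
The checks in step 3 that happen before DATA is accepted can be sketched as a small gate that returns an SMTP reply code. This is an illustrative sketch, not a real MTA: `PreDataGate` and its parameters are hypothetical, recipients live in an in-memory set, the rate limit is a per-IP sliding window, and the SPF lookup is left out because it needs DNS.

```python
from collections import defaultdict, deque
import time


class PreDataGate:
    """Decide the reply to RCPT TO before any message content is accepted."""

    def __init__(self, known_recipients, max_per_window=1000, window_s=3600):
        self.known = {r.lower() for r in known_recipients}
        self.max_per_window = max_per_window
        self.window_s = window_s
        self.hits = defaultdict(deque)  # client IP -> timestamps of recent accepts

    def check(self, client_ip, rcpt_to, now=None):
        now = time.time() if now is None else now
        recent = self.hits[client_ip]
        # Drop timestamps that have fallen out of the sliding window.
        while recent and now - recent[0] >= self.window_s:
            recent.popleft()
        if len(recent) >= self.max_per_window:
            return 421, "rate limit exceeded, try again later"  # temporary reject
        recent.append(now)
        if rcpt_to.lower() not in self.known:
            return 550, "user unknown"  # permanent reject
        return 250, "recipient ok"


if __name__ == "__main__":
    gate = PreDataGate(["alice@example.com"], max_per_window=2, window_s=60)
    print(gate.check("203.0.113.7", "alice@example.com", now=0))
    print(gate.check("203.0.113.7", "bob@example.com", now=1))
    print(gate.check("203.0.113.7", "alice@example.com", now=2))
```

Rejecting at this point is cheap because the message body has not been transferred yet, which is why the flow runs these checks before step 4.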

Spam Filtering Architecture

Layer 1: Connection-level filters (< 1ms per connection)
  - IP reputation: is this IP in known spammer lists? (Spamhaus ZEN)
  - Volume limits: > 1000 emails from this IP in last hour → rate limit
  - DNS checks: does IP reverse-resolve to a legitimate hostname?
  Catches: ~70% of spam before content inspection

Layer 2: Content analysis (10-50ms per email)
  - SPF/DKIM/DMARC: sender authentication (SPF checks the sending IP, DKIM verifies a cryptographic signature)
  - Spam rules: regex patterns, known spam phrases, HTML structure
  - URL analysis: links in email → check against phishing databases
  - Attachment scanning: ClamAV for malware
  Catches: additional ~20% of spam

Layer 3: ML classification (100-300ms per email)
  - NLP model: BERT fine-tuned on spam/ham corpus
  - Features: text, sender reputation, social graph, historical interaction
  - Score > 0.85 → spam folder; 0.5-0.85 → warning; < 0.5 → inbox
  Catches: additional ~9% of spam

Result: < 0.1% false positive rate (legitimate email marked as spam)
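
The three layers form a short-circuiting pipeline: cheap checks run first and the expensive model only sees what survives. Together the stated catch rates add up to roughly 99% of spam (70% + 20% + 9%). The sketch below uses the Layer 3 thresholds from the text; the function names, the boolean inputs for Layers 1 and 2, and the choice to send a Layer 2 rule hit straight to the spam folder are assumptions for illustration.

```python
def route_by_score(score, spam_threshold=0.85, warn_threshold=0.5):
    """Map an ML spam score in [0, 1] to a delivery decision (Layer 3)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score > spam_threshold:
        return "spam"
    if score >= warn_threshold:
        return "inbox_with_warning"
    return "inbox"


def classify(message, ip_blocklisted, rule_hits, ml_score_fn):
    """Run the layers in cost order and stop at the first decisive result."""
    if ip_blocklisted:   # Layer 1: connection-level reputation
        return "rejected"
    if rule_hits:        # Layer 2: content rules, URL checks, malware scan
        return "spam"    # assumed outcome; the text does not specify it
    return route_by_score(ml_score_fn(message))  # Layer 3: ML model


if __name__ == "__main__":
    fake_model = lambda msg: 0.62  # stand-in for the fine-tuned classifier
    print(classify("Quarterly report attached", False, [], fake_model))
```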

Push vs Pull (IMAP IDLE vs Polling)

IMAP IDLE: server holds connection open, pushes EXISTS notification.
Google Sync: FCM (Android) / APNs (iOS) for battery-efficient push.
Web client: WebSocket for real-time inbox updates.

SPF, DKIM, DMARC

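SPF: DNS record listing the IPs authorized to send mail for a domain.
DKIM: cryptographic signature in an email header, verified against a public key published in the signing domain's DNS.
DMARC: DNS-published policy telling receivers what to do when neither SPF nor DKIM passes in alignment with the From domain (none/quarantine/reject).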
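
How a receiver combines the three mechanisms can be sketched as follows. DMARC passes when at least one of SPF or DKIM passes and the domain it authenticated aligns with the From header domain. This sketch assumes relaxed alignment and approximates the organizational domain as the last two labels, which is wrong for suffixes such as co.uk (a real implementation would consult the Public Suffix List). The function names are hypothetical.

```python
def org_domain(domain):
    """Crude organizational domain: the last two labels (simplifying assumption)."""
    return ".".join(domain.lower().rstrip(".").split(".")[-2:])


def dmarc_disposition(from_domain, spf_pass, spf_domain,
                      dkim_pass, dkim_domain, policy="none"):
    """Return 'pass', or the action the domain owner's policy asks for on failure."""
    spf_aligned = spf_pass and org_domain(spf_domain) == org_domain(from_domain)
    dkim_aligned = dkim_pass and org_domain(dkim_domain) == org_domain(from_domain)
    if spf_aligned or dkim_aligned:
        return "pass"
    # Policy values defined by DMARC: none, quarantine, reject.
    return {"none": "deliver", "quarantine": "quarantine", "reject": "reject"}[policy]


if __name__ == "__main__":
    # SPF passed, but for an unrelated domain; DKIM failed; owner asks for reject.
    print(dmarc_disposition("example.com", True, "mailer.evil.net",
                            False, "", policy="reject"))
```

The alignment requirement is what stops a spammer from passing SPF for a domain they control while forging someone else's address in the From header.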

HTTP

# Email operations
GET    /api/messages?label=INBOX&page_token=...    → List emails
GET    /api/messages/{id}                          → Get email
POST   /api/messages/send                          → Send email
PUT    /api/messages/{id}/labels                   → Add/remove labels
PUT    /api/messages/{id}/read                     → Mark read/unread
DELETE /api/messages/{id}                          → Move to trash
POST   /api/messages/{id}/reply                    → Reply
POST   /api/messages/{id}/forward                  → Forward
GET    /api/threads/{thread_id}                    → Get thread
GET    /api/messages/search?q=from:alice+subject:meeting  → Full-text search
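
On the server side, the search endpoint has to turn the `q` parameter into structured filters before querying the index. A minimal sketch, assuming `field:value` operators separated by spaces (the `+` in the URL decodes to a space); only `from:` and `subject:` appear in the example above, so the other operators in `KNOWN_FIELDS` and the `parse_search` name are hypothetical.

```python
from urllib.parse import parse_qs

# Operators beyond from: and subject: are illustrative assumptions.
KNOWN_FIELDS = {"from", "to", "subject", "label"}


def parse_search(query_string):
    """Split the q parameter into field filters and free-text terms."""
    q = parse_qs(query_string).get("q", [""])[0]
    filters, terms = {}, []
    for token in q.split():
        field, sep, value = token.partition(":")
        if sep and value and field.lower() in KNOWN_FIELDS:
            filters.setdefault(field.lower(), []).append(value)
        else:
            terms.append(token)  # unrecognized tokens become full-text terms
    return filters, terms


if __name__ == "__main__":
    print(parse_search("q=from:alice+subject:meeting+budget"))
```

Field filters can then be mapped to exact-match clauses and the remaining terms to a full-text query against the search index.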

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
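202 Accepted (not an error): job queued; poll GET /jobs/{id} for status
408 Request Timeout: server timed out waiting for the client to finish sending the request; resend it. A job that is still processing is reported through the GET /jobs/{id} status, not through 408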
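
A client can encode the retry guidance from this list in one small function. This is a sketch under stated assumptions: `retry_delay` is a hypothetical helper, the base delay and cap are arbitrary, and only the delay-in-seconds form of the Retry-After header is handled (the header may also carry an HTTP date). Retries of writes are assumed to resend the same idempotency key.

```python
RETRYABLE = {429, 500, 503}


def retry_delay(status, attempt, retry_after=None, base_s=0.5, cap_s=30.0):
    """Return seconds to wait before retrying, or None if the error is not retryable."""
    if status == 429 and retry_after is not None:
        return float(retry_after)  # the server told us exactly how long to wait
    if status in RETRYABLE:
        return min(cap_s, base_s * 2 ** attempt)  # capped exponential backoff
    return None  # 400, 401, 403, 404, 409, 422: fix the request instead of retrying


if __name__ == "__main__":
    for status in (429, 503, 404):
        print(status, retry_delay(status, attempt=3, retry_after="7"))
```

Treating 4xx validation and permission errors as terminal avoids retry storms that can never succeed.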

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
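
An availability target translates directly into an error budget. The sketch below converts the 99.95% target into allowed downtime and computes a burn rate for an observed error ratio; the 30-day window and the function names are assumptions.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full unavailability the SLO permits within the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio, slo):
    """How fast the budget is consumed; 1.0 means exactly on budget."""
    return observed_error_ratio / (1.0 - slo)


if __name__ == "__main__":
    print(f"{error_budget_minutes(0.9995):.1f} minutes per 30 days")  # 21.6
    print(f"burn rate at 0.1% errors: {burn_rate(0.001, 0.9995):.1f}x")
```

A sustained 0.1% error ratio consumes a 99.95% budget at about twice the allowed rate, so the 5xx target and the availability target in the table should be reconciled when stating them in an interview.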

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
