Design a Backup and Disaster Recovery System

Interview Prompt

Design a Backup and Disaster Recovery System.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: RPO/RTO definitions, Incremental backups, Cross-region replication?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

RPO/RTO definitions
Incremental backups
Cross-region replication
Failover orchestration
DR drills
Capacity estimation with shown math

Out of scope (state explicitly)

Detailed frontend/UI pixel implementation
Org structure, staffing, and hiring plan

Assumptions

Clarify scale (DAU, QPS, data volume) for the backup and disaster recovery system in the first 5 minutes
Standard reliability target of 99.9%–99.99% unless the problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Full backups: Complete copy of all data (databases, object storage, config)
Incremental backups: Only changes since last backup
Point-in-time recovery (PITR): Restore to any second within retention window
Cross-region replication: Backups stored in geographically separate region
Backup scheduling: Configurable policies (hourly incremental, daily full, weekly archive)
Restore testing: Automated periodic restore verification
Multi-tier storage: Recent on fast storage, older on cold/archive
RPO and RTO enforcement
Disaster recovery runbook: Automated failover to DR region
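
To make the full vs. incremental distinction concrete, here is a minimal sketch of how a restore might pick which backups to apply for a given target time: the latest full backup at or before the target, followed by every later incremental up to the target. The `Backup` class and `restore_chain` function are hypothetical names for illustration, not part of any real tool. Restoring to an exact second would additionally need log (WAL) replay after the last incremental, which this sketch leaves out.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Backup:
    backup_id: str
    backup_type: str        # "full" or "incremental"
    completed_at: datetime


def restore_chain(backups: List[Backup], target: datetime) -> List[Backup]:
    """Return the backups to apply, in order, to reach `target`."""
    # Only backups that finished at or before the target are usable.
    eligible = sorted(
        (b for b in backups if b.completed_at <= target),
        key=lambda b: b.completed_at,
    )
    fulls = [b for b in eligible if b.backup_type == "full"]
    if not fulls:
        raise ValueError("no full backup at or before the target time")
    base = fulls[-1]
    # Apply every incremental taken after the chosen full backup.
    incrementals = [
        b for b in eligible
        if b.backup_type == "incremental" and b.completed_at > base.completed_at
    ]
    return [base] + incrementals
```

One consequence of this shape: a long chain of incrementals lengthens restore time, which is one reason to take periodic full backups rather than incrementals forever.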

Metric	Calculation	Value
Total production data	Given	500 TB
Daily change rate	500 TB × 5%	25 TB/day incremental
Full backup size	500 TB at ~2.5:1 compression	~200 TB compressed
Full backup frequency	Given	Weekly
Incremental backup frequency	Given	Hourly
Retention	Given	30 days hot, 1 year warm, 7 years cold
Total backup storage	Estimate (after retention + dedup)	~5 PB
Restore throughput needed (RTO = 15 min)	500 TB ÷ 900 s	≈ 556 GB/s
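
The derived rows in the table can be checked with a few lines of arithmetic. This is a sketch that assumes decimal units (1 TB = 10^12 bytes) and that the full 500 TB uncompressed dataset must be restored within the RTO; the variable names are illustrative only.

```python
TB = 10**12  # decimal terabyte in bytes (assumption)
GB = 10**9   # decimal gigabyte in bytes (assumption)

total_data_bytes = 500 * TB
daily_change_rate = 0.05
rto_seconds = 15 * 60

# 5% of 500 TB changes per day.
daily_incremental_tb = total_data_bytes * daily_change_rate / TB
# Hourly incrementals split the daily change into 24 slices (assumes an even spread).
hourly_incremental_tb = daily_incremental_tb / 24
# Restoring everything inside the RTO window sets the required throughput.
restore_gb_per_sec = total_data_bytes / rto_seconds / GB

print(f"daily incremental:  {daily_incremental_tb:.1f} TB")
print(f"hourly incremental: {hourly_incremental_tb:.2f} TB")
print(f"restore throughput: {restore_gb_per_sec:.0f} GB/s")
```

A throughput in the hundreds of GB/s is far beyond a single restore stream, which is why a 15-minute RTO at this scale points toward parallel restores or an already-running standby rather than restoring from backup.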


DR Strategy Tiers

Strategy	RPO	RTO	Cost	Description
Backup & Restore	Hours	Hours	$	Restore from S3 on demand
Pilot Light	Minutes	30 min	$$	Minimal infra, scale on failover
Warm Standby	Seconds	15 min	$$$	Reduced capacity, always running
Active-Active	0	~0	$$$$	Both regions serve traffic

HTTP

POST /api/backups/trigger        → Trigger manual backup {source, type}
GET  /api/backups                → List backups with status
GET  /api/backups/{id}/status    → Backup job status
POST /api/restore                → Initiate restore {backup_id, target}
POST /api/dr/failover            → Initiate DR failover (manual trigger)
GET  /api/dr/status              → DR region health, replication lag

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
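
A client-side retry policy that follows the guidance above might look like the sketch below. It assumes the `Retry-After` header carries a number of seconds (the header may also be an HTTP date, which is not handled here), and the function name, base delay, and cap are hypothetical choices. Retries of the POST endpoints should reuse the same idempotency key so a repeated request does not start a second backup or restore.

```python
import random
from typing import Callable, Optional

# Status codes the table above marks as worth retrying.
RETRYABLE = {429, 500, 503}


def retry_delay(
    status: int,
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> Optional[float]:
    """Return seconds to wait before retrying, or None if the call should not be retried."""
    if status not in RETRYABLE:
        return None
    # Honor the server's hint on rate limiting (seconds form only).
    if status == 429 and retry_after is not None:
        return float(retry_after)
    # Exponential backoff with full jitter: wait a random time up to the ceiling.
    ceiling = min(cap, base * 2 ** attempt)
    return rng() * ceiling
```

The `rng` parameter exists only so the jitter can be made deterministic in tests.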

PostgreSQL (Backup Metadata: Control Plane)

SQL

CREATE TABLE backup_jobs (
    backup_id      UUID PRIMARY KEY,
    source_system  TEXT NOT NULL,
    backup_type    TEXT NOT NULL,
    status         TEXT DEFAULT 'running',
    storage_path   TEXT,
    size_bytes     BIGINT,
    checksum       TEXT,
    started_at     TIMESTAMPTZ DEFAULT NOW(),
    completed_at   TIMESTAMPTZ,
    retention_tier TEXT DEFAULT 'hot',
    expires_at     TIMESTAMPTZ,
    encrypted      BOOLEAN DEFAULT TRUE
);

CREATE TABLE restore_jobs (
    restore_id     UUID PRIMARY KEY,
    backup_id      UUID REFERENCES backup_jobs(backup_id),
    target_system  TEXT NOT NULL,
    restore_point  TIMESTAMPTZ,
    status         TEXT DEFAULT 'running',
    validated      BOOLEAN DEFAULT FALSE
);

Automated Failover Runbook

Trigger: Primary region health check fails for > 60 seconds

Automated sequence:
  T=0s:   Health check failure detected (Route53 health checker)
  T=5s:   Alert PagerDuty + Slack channel
  T=10s:  Fence old primary (revoke DNS, block writes via security group)
  T=15s:  Promote DR PostgreSQL replica to primary (pg_promote(), < 5s)
  T=20s:  Update service discovery (Consul/etcd) → DR endpoints
  T=25s:  Scale DR auto-scaling groups to full capacity
  T=30s:  DNS failover: Route53 switches to DR region
  T=60s:  Traffic flowing to DR region
  T=120s: Run smoke tests
  T=180s: Declare DR active

Total RTO: ~3 minutes (automated) vs 30+ minutes (manual)

CRITICAL: Fence old primary BEFORE promoting DR
  If old primary comes back online → split-brain → data corruption
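
The ordering constraint in the runbook (fence first, promote second, abort if fencing fails) can be expressed as a small orchestrator. This is a sketch only: `run_failover` and the step callables are hypothetical, and each callable stands in for a real action such as revoking DNS or calling `pg_promote()`.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_failover(
    fence: Callable[[], bool],
    promote: Callable[[], bool],
    post_steps: List[Step],
) -> Tuple[bool, List[str]]:
    """Run failover steps in order; never promote unless fencing succeeded."""
    log: List[str] = []
    if not fence():
        # Promoting without a confirmed fence risks split-brain, so stop here.
        log.append("fence old primary: FAILED (promotion skipped)")
        return False, log
    log.append("fence old primary: ok")
    if not promote():
        log.append("promote DR replica: FAILED")
        return False, log
    log.append("promote DR replica: ok")
    # Remaining steps: service discovery, scaling, DNS, smoke tests, etc.
    for name, step in post_steps:
        ok = step()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            return False, log
    return True, log
```

Returning the log alongside the result gives the on-call engineer a record of exactly which step stopped the sequence.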

Restore Testing

"A backup that has never been tested is not a backup"

Weekly automated restore test:
1. Pick random recent backup
2. Restore to isolated environment (separate VPC)
3. Run validation checks:
   a. Row counts match expected (within 0.1%)
   b. Checksums of critical tables match
   c. Application can connect and query
   d. Measure actual restore time vs RTO target
4. Report results → dashboard + alert on failure
5. Tear down isolated environment
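
The validation checks in step 3 could be collected into one function that returns a list of failures. The sketch below assumes the row counts, table checksums, connectivity result, and measured restore time have already been gathered from the isolated environment; `RestoreResult` and `validate_restore` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RestoreResult:
    expected_rows: Dict[str, int]
    restored_rows: Dict[str, int]
    expected_checksums: Dict[str, str]
    restored_checksums: Dict[str, str]
    can_query: bool
    restore_seconds: float


def validate_restore(r: RestoreResult, rto_seconds: float,
                     row_tolerance: float = 0.001) -> List[str]:
    """Return human-readable failures; an empty list means the restore passed."""
    failures: List[str] = []
    # a. Row counts within 0.1% of expected.
    for table, expected in r.expected_rows.items():
        actual = r.restored_rows.get(table, 0)
        if abs(actual - expected) > row_tolerance * expected:
            failures.append(f"row count mismatch in {table}: {actual} vs {expected}")
    # b. Checksums of critical tables match exactly.
    for table, checksum in r.expected_checksums.items():
        if r.restored_checksums.get(table) != checksum:
            failures.append(f"checksum mismatch in {table}")
    # c. Application can connect and query.
    if not r.can_query:
        failures.append("application could not connect or query")
    # d. Measured restore time against the RTO target.
    if r.restore_seconds > rto_seconds:
        failures.append(f"restore took {r.restore_seconds:.0f}s, RTO is {rto_seconds:.0f}s")
    return failures
```

A non-empty result would feed the dashboard and alert in step 4.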

Immutable Backups (Ransomware Protection)

Solution: S3 Object Lock (WORM — Write Once Read Many)
  COMPLIANCE mode: NOBODY can delete (not even root/admin) for 30 days

  Combined with:
  - Cross-account replication: backups in separate AWS account
  - MFA delete: require MFA token to delete any backup
  - Separate KMS keys: backup encryption keys in different account

Crypto-Shredding (GDPR Right to Erasure from Backups)

Problem: User requests data deletion, but data exists in 90 days of backups

Crypto-shredding:
  1. Each user's data encrypted with per-user key before backup
  2. User deletion request → delete their encryption key from KMS
  3. Backup data still exists but is unreadable (key is gone)
  4. Effectively deleted without modifying backup files
  5. Immutable backups + GDPR compliance = both satisfied
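
The flow above can be demonstrated end to end with a toy example. Everything here is an assumption made for illustration: `KeyStore` is an in-memory stand-in for a KMS, and the cipher is an HMAC-SHA256 keystream XOR chosen only because it needs nothing beyond the standard library. It is not production cryptography; a real system would use an authenticated cipher such as AES-GCM with keys held in a managed KMS.

```python
import hashlib
import hmac
import secrets


class KeyStore:
    """In-memory stand-in for a KMS holding one key per user (illustrative)."""

    def __init__(self) -> None:
        self._keys = {}

    def get_or_create(self, user_id: str) -> bytes:
        if user_id not in self._keys:
            self._keys[user_id] = secrets.token_bytes(32)
        return self._keys[user_id]

    def get(self, user_id: str) -> bytes:
        return self._keys[user_id]  # raises KeyError once the key is shredded

    def shred(self, user_id: str) -> None:
        self._keys.pop(user_id, None)


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # Toy keystream: HMAC-SHA256 over nonce || counter. Demo only.
    out = b""
    counter = 0
    while len(out) < length:
        out += hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:length]


def encrypt_for_user(store: KeyStore, user_id: str, plaintext: bytes) -> bytes:
    nonce = secrets.token_bytes(16)
    stream = _keystream(store.get_or_create(user_id), nonce, len(plaintext))
    return nonce + bytes(a ^ b for a, b in zip(plaintext, stream))


def decrypt_for_user(store: KeyStore, user_id: str, blob: bytes) -> bytes:
    nonce, ciphertext = blob[:16], blob[16:]
    stream = _keystream(store.get(user_id), nonce, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, stream))
```

After `store.shred(user_id)`, the backup blob is untouched but `decrypt_for_user` raises `KeyError`, which is the property the immutable backup relies on. The approach only works if the key itself is never included in the backups it protects.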

RPO vs RTO: The Fundamental Trade-off

RPO 0 (zero data loss):
  Requires: synchronous replication to DR site
  Sync replication: primary WAITS for DR to confirm write
  Latency cost: +20-100ms per write
  Use for: Financial transactions, ledger systems, payment data

RPO 1 minute:
  Requires: async replication + WAL shipping every minute
  Primary writes freely → ships WAL segments to DR every 60 seconds
  Use for: E-commerce orders, user data

RPO 1 hour:
  Requires: hourly backups to S3 (snapshot + incremental)
  Cheapest approach — no replication, just scheduled backups
  Use for: Analytics, non-critical data

RTO → cost relationship:
  RTO = 4 hours: ~$500/month (just S3 storage)
  RTO = 15 minutes: ~$10,000/month (duplicate infrastructure)
  RTO = 30 seconds: ~$50,000/month (full infra in 2+ regions)
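
As a quick reference, the three RPO bands above can be reduced to a lookup. The thresholds between bands are an assumption (the text only gives the points 0, 1 minute, and 1 hour), and `replication_for_rpo` is a hypothetical helper name.

```python
def replication_for_rpo(rpo_seconds: float) -> str:
    """Map an RPO target to the cheapest approach from the list above that meets it."""
    if rpo_seconds < 0:
        raise ValueError("RPO cannot be negative")
    if rpo_seconds == 0:
        # Zero data loss: the primary must wait for DR to confirm each write.
        return "synchronous replication"
    if rpo_seconds < 3600:
        # Sub-hour RPO: ship WAL at an interval no longer than the RPO.
        return "asynchronous replication + WAL shipping"
    # An hour or more of tolerated loss: scheduled backups are enough.
    return "scheduled backups (snapshot + incremental)"


for rpo in (0, 60, 3600):
    print(f"RPO {rpo:>4}s -> {replication_for_rpo(rpo)}")
```

In an interview, the useful part is stating which data classes fall into which band, since most systems mix all three.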

Split-Brain Prevention During Failover

1. Fencing (STONITH — Shoot The Other Node In The Head):
   Before promoting DR: revoke IAM write permissions, block traffic, shut down the DB
   Only AFTER fencing succeeds → promote DR

2. Witness/Quorum:
   Deploy "witness" node in third region. Failover requires 2-of-3 agreement.

3. Lease-based leadership:
   Primary holds a lease in a distributed lock service; the lease expires every 30s
   If primary can't renew, it must stop accepting writes → safe to promote once the lease has expired

4. Epoch-based writes:
   Every write includes epoch number. On failover: new primary increments epoch.
   Old primary's epoch < current → writes rejected.
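
Epoch-based writes (item 4) are sometimes called fencing tokens. A minimal sketch, assuming the storage layer itself tracks the current epoch and checks it on every write; `EpochStore` and `StaleEpochError` are hypothetical names.

```python
class StaleEpochError(Exception):
    """Raised when a writer presents an epoch older than the current one."""


class EpochStore:
    def __init__(self) -> None:
        self.current_epoch = 1
        self.data = {}

    def promote_new_primary(self) -> int:
        # Failover bumps the epoch; the new primary writes with this value.
        self.current_epoch += 1
        return self.current_epoch

    def write(self, epoch: int, key: str, value: str) -> None:
        # Reject any writer still holding an epoch from before the failover.
        if epoch < self.current_epoch:
            raise StaleEpochError(
                f"write with epoch {epoch} rejected, current epoch is {self.current_epoch}"
            )
        self.data[key] = value
```

The point of the design is that safety does not depend on the old primary noticing it was demoted: even if it comes back and keeps writing, the store refuses its writes.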

Failback: Returning to Primary After DR

Step 1: Rebuild primary as replica.
Step 2: Validate (row counts, checksums).
Step 3: Planned failover during maintenance window.
Step 4: Post-failback monitoring.

Total failback time: 2-12 hours (depending on data volume).

Danger: if you fail back too quickly without a full sync, transactions committed on DR during the outage are LOST.

Active-Active Multi-Region

Instead of active-passive, run BOTH regions as active simultaneously.

Why it's hard:
  Conflict resolution: Last-write-wins (LWW), application-level merge, CRDTs, region-owned data
  Referential integrity: route user + all related entities to same region

When active-active makes sense:
  ✓ Read-heavy workloads (global CDN-like caching)
  ✓ Partition-able data (each user assigned to home region)
  ✓ CRDT-compatible operations (counters, sets, append-only logs)

When active-passive is better:
  ✓ Strong consistency required (financial, inventory)
  ✓ Complex transactions spanning multiple entities
  ✓ Write-heavy workloads
  ✓ Simpler operations
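
To show what "CRDT-compatible" means in practice, here is a sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. Each region increments only its own slot, and merging takes the per-region maximum, so replicas converge regardless of the order or number of merges. The class and region names are illustrative.

```python
from typing import Dict


class GCounter:
    """Grow-only counter: one slot per region, merged by element-wise max."""

    def __init__(self, region: str) -> None:
        self.region = region
        self.counts: Dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        if n < 0:
            raise ValueError("a G-Counter can only grow")
        # A region only ever writes to its own slot.
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def merge(self, other: "GCounter") -> None:
        # Max is commutative, associative, and idempotent, so merges are safe to repeat.
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


us, eu = GCounter("us-east"), GCounter("eu-west")
us.increment(3)
eu.increment(2)
us.merge(eu)
eu.merge(us)
print(us.value(), eu.value())  # both replicas converge on the same total
```

Operations that cannot be phrased this way, such as decrementing inventory without going below zero, are the ones that push a design back toward active-passive or region-owned data.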

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
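
An availability target becomes easier to reason about as minutes of allowed downtime. The sketch below assumes a 30-day window, which is a common but not universal choice; `error_budget_minutes` is a hypothetical helper.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * window_days * 24 * 60


for target in (0.999, 0.9995, 0.9999):
    print(f"{target:.2%} -> {error_budget_minutes(target):.1f} min per 30 days")
```

At 99.95% the budget is about 21.6 minutes per 30 days, so a single failover that takes the runbook's ~3 minutes fits comfortably, while a 30-minute manual failover would exhaust the budget on its own.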

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.