Interview Prompt
Design a Backup and Disaster Recovery System.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: RPO/RTO definitions, Incremental backups, Cross-region replication? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- RPO/RTO definitions
- Incremental backups
- Cross-region replication
- Failover orchestration
- DR drills
- Capacity estimation with shown math
Out of scope (state explicitly)
- Detailed frontend/UI pixel implementation
- Org structure, staffing, and hiring plan
Assumptions
- Clarify scale (DAU, QPS, data volume) for the backup and disaster recovery system in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
Functional Requirements
- Full backups: Complete copy of all data (databases, object storage, config)
- Incremental backups: Only changes since last backup
- Point-in-time recovery (PITR): Restore to any second within retention window
- Cross-region replication: Backups stored in geographically separate region
- Backup scheduling: Configurable policies (hourly incremental, daily full, weekly archive)
- Restore testing: Automated periodic restore verification
- Multi-tier storage: Recent on fast storage, older on cold/archive
- RPO and RTO enforcement
- Disaster recovery runbook: Automated failover to DR region
Non-Functional Requirements
- RPO: < 1 minute for critical data
- RTO: < 15 minutes for critical services
- Durability: 11 nines for backup data
- Consistency: Backup must represent a consistent point-in-time snapshot
- Encryption: All backups encrypted; key management via KMS
- Compliance: Retention policies per regulation (GDPR)
Capacity Estimation
| Metric | Calculation | Value |
|---|---|---|
| Total production data | Given | 500 TB |
| Daily change rate | 500 TB × 5% | 25 TB/day incremental |
| Full backup size | 500 TB at an assumed ~2.5:1 compression | ~200 TB compressed |
| Full backup frequency | Given | Weekly |
| Incremental backup frequency | Given | Hourly |
| Retention | Given | 30 days hot, 1 year warm, 7 years cold |
| Total backup storage | Rough estimate across retention tiers, after dedup | ~5 PB |
| Restore throughput needed (RTO = 15 min) | 500 TB ÷ 900 s | ≈ 556 GB/s |
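The derived rows in the table can be reproduced with a few lines of arithmetic. This is a minimal sketch; the function names are hypothetical and decimal units (1 TB = 1000 GB) are an assumption.

```python
# Capacity math for the backup system (decimal units assumed: 1 TB = 1000 GB).
GB_PER_TB = 1000


def daily_incremental_tb(total_tb: float, daily_change_rate: float) -> float:
    """Data changed per day, i.e. the size of one day's incrementals."""
    return total_tb * daily_change_rate


def restore_throughput_gb_per_s(total_tb: float, rto_seconds: float) -> float:
    """Sustained throughput needed to restore the full dataset within the RTO."""
    return total_tb * GB_PER_TB / rto_seconds


if __name__ == "__main__":
    print(daily_incremental_tb(500, 0.05))                  # TB per day
    print(round(restore_throughput_gb_per_s(500, 15 * 60))) # GB per second
```

A figure in the hundreds of GB/s is hard to reach when restoring from object storage, which is one reason a 15-minute RTO usually points toward a standby that already holds the data rather than a restore from backup.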
DR Strategy Tiers
| Strategy | RPO | RTO | Cost | Description |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | $ | Restore from S3 on demand |
| Pilot Light | Minutes | 30 min | $$ | Minimal infra, scale on failover |
| Warm Standby | Seconds | 15 min | $$$ | Reduced capacity, always running |
| Active-Active | 0 | ~0 | $$$$ | Both regions serve traffic |
PITR (Point-in-Time Recovery) for PostgreSQL
Base backup via pg_basebackup. Continuous WAL archiving to S3. Restore: restore latest base backup before target time, replay WAL segments up to target time. Result: exact database state at time T (second-level precision).
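The restore-planning step can be sketched as follows: pick the latest base backup that completed before the target time, then collect the WAL segments to replay up to the target. The `BaseBackup` and `WalSegment` types and the `plan_pitr` function are hypothetical, and tracking WAL by commit timestamps is a simplification (PostgreSQL itself tracks WAL positions by LSN).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class BaseBackup:
    path: str
    started_at: datetime   # WAL replay must begin from here
    finished_at: datetime  # backup is only consistent after this point


@dataclass(frozen=True)
class WalSegment:
    name: str
    first_commit: datetime
    last_commit: datetime


def plan_pitr(
    backups: Sequence[BaseBackup],
    segments: Sequence[WalSegment],
    target: datetime,
) -> Tuple[BaseBackup, List[WalSegment]]:
    """Return the base backup to restore and the WAL segments to replay, in order."""
    usable = [b for b in backups if b.finished_at <= target]
    if not usable:
        raise ValueError("no base backup completed before the target time")
    base = max(usable, key=lambda b: b.finished_at)
    # Keep every segment that overlaps the window [backup start, target].
    replay = [
        s for s in segments
        if s.last_commit >= base.started_at and s.first_commit <= target
    ]
    replay.sort(key=lambda s: s.first_commit)
    return base, replay
```

Replay would then stop inside the last returned segment at the target timestamp; in PostgreSQL that stopping point is configured with `recovery_target_time`.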
Backup Consistency
Database-native: pg_dump --serializable-deferrable. Filesystem: LVM snapshot → back up from the frozen snapshot. Cloud: EBS snapshot (crash-consistent). Application-level: quiesce writes → snapshot → resume.
Backup Layout (S3)
s3://backups/
├── postgresql/
│   └── 2026-03-14/
│       ├── full/base.tar.gz.enc (weekly full backup)
│       ├── wal/000000010000001A000000FF (continuous WAL archive)
│       └── manifest.json
├── cassandra/
│   └── keyspace_orders/sstable-*.gz
├── redis/
│   └── rdb-snapshot.rdb.enc
└── config/
    └── etcd-snapshot.db
API Endpoints
POST /api/backups/trigger → Trigger manual backup {source, type}
GET /api/backups → List backups with status
GET /api/backups/{id}/status → Backup job status
POST /api/restore → Initiate restore {backup_id, target}
POST /api/dr/failover → Initiate DR failover (manual trigger)
GET /api/dr/status → DR region health, replication lag
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
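The retry guidance above (idempotency key, Retry-After, exponential backoff) can be sketched from the client side. This is an illustration under assumptions: `send` is a hypothetical callable that performs one request with the given idempotency key and returns `(status, headers)`, Retry-After is assumed to carry seconds rather than an HTTP date, and the full-jitter backoff is one common choice, not something the API mandates.

```python
import random
import time
import uuid
from typing import Callable, Dict, Tuple

RETRYABLE_STATUSES = {429, 500, 503}
Response = Tuple[int, Dict[str, str]]


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(
    send: Callable[[str], Response],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry transient failures, reusing one idempotency key across attempts."""
    idempotency_key = str(uuid.uuid4())  # same key on every retry
    status = 0
    for attempt in range(max_attempts):
        status, headers = send(idempotency_key)
        if status not in RETRYABLE_STATUSES:
            return status
        if attempt + 1 < max_attempts:
            retry_after = headers.get("Retry-After")
            # Prefer the server's hint; otherwise back off exponentially.
            sleep(float(retry_after) if retry_after else backoff_delay(attempt))
    return status
```

Reusing the key matters most for `POST /api/backups/trigger` and `POST /api/restore`, where a blind retry could otherwise start a second job.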
PostgreSQL (Backup Metadata: Control Plane)
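-- One row per backup run
CREATE TABLE backup_jobs (
    backup_id      UUID PRIMARY KEY,
    source_system  TEXT NOT NULL,              -- e.g. postgresql, cassandra, redis
    backup_type    TEXT NOT NULL,              -- full or incremental
    status         TEXT DEFAULT 'running',
    storage_path   TEXT,                       -- location of the backup in object storage
    size_bytes     BIGINT,
    checksum       TEXT,                       -- used to verify integrity on restore
    started_at     TIMESTAMPTZ DEFAULT NOW(),
    completed_at   TIMESTAMPTZ,
    retention_tier TEXT DEFAULT 'hot',         -- hot, warm, or cold
    expires_at     TIMESTAMPTZ,
    encrypted      BOOLEAN DEFAULT TRUE
);

-- One row per restore attempt, including automated restore tests
CREATE TABLE restore_jobs (
    restore_id     UUID PRIMARY KEY,
    backup_id      UUID REFERENCES backup_jobs(backup_id),
    target_system  TEXT NOT NULL,
    restore_point  TIMESTAMPTZ,                -- PITR target time, if any
    status         TEXT DEFAULT 'running',
    validated      BOOLEAN DEFAULT FALSE       -- set once validation checks pass
);
Automated Failover Runbook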
Trigger: Primary region health check fails for > 60 seconds
Automated sequence:
- T=0s: Health check failure detected (Route53 health checker)
- T=5s: Alert PagerDuty + Slack channel
- T=10s: Fence old primary (revoke DNS, block writes via security group)
- T=15s: Promote DR PostgreSQL replica to primary (pg_promote(), < 5s)
- T=20s: Update service discovery (Consul/etcd) → DR endpoints
- T=25s: Scale DR auto-scaling groups to full capacity
- T=30s: DNS failover: Route53 switches to DR region
- T=60s: Traffic flowing to DR region
- T=120s: Run smoke tests
- T=180s: Declare DR active
Total RTO: ~3 minutes (automated) vs 30+ minutes (manual)
CRITICAL: Fence old primary BEFORE promoting DR. If old primary comes back online → split-brain → data corruption
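The ordering constraint in the runbook (fence first, promote only after fencing succeeds) is worth encoding so it cannot be skipped under pressure. A minimal sketch follows; the four step callables are hypothetical placeholders for real integrations such as security-group changes, `pg_promote()`, Consul updates, and Route53.

```python
from typing import Callable, List


class FencingFailed(RuntimeError):
    """Raised when the old primary could not be confirmed fenced."""


def run_failover(
    fence_old_primary: Callable[[], bool],
    promote_dr_replica: Callable[[], None],
    update_discovery: Callable[[], None],
    switch_dns: Callable[[], None],
) -> List[str]:
    """Run failover steps in order; abort before promotion if fencing fails."""
    completed: List[str] = []
    if not fence_old_primary():
        # Promoting without fencing risks split-brain, so stop here.
        raise FencingFailed("old primary not fenced; refusing to promote DR")
    completed.append("fence")
    for name, step in (
        ("promote", promote_dr_replica),
        ("discovery", update_discovery),
        ("dns", switch_dns),
    ):
        step()
        completed.append(name)
    return completed
```

Returning the list of completed steps gives the on-call engineer a record of how far the automation got if a later step raises.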
Restore Testing
"A backup that has never been tested is not a backup"
Weekly automated restore test:
1. Pick random recent backup
2. Restore to isolated environment (separate VPC)
3. Run validation checks:
   a. Row counts match expected (within 0.1%)
   b. Checksums of critical tables match
   c. Application can connect and query
   d. Measure actual restore time vs RTO target
4. Report results → dashboard + alert on failure
5. Tear down isolated environment
Immutable Backups (Ransomware Protection)
Solution: S3 Object Lock (WORM: Write Once Read Many)
COMPLIANCE mode: NOBODY can delete (not even root/admin) for 30 days
Combined with:
- Cross-account replication: backups in separate AWS account
- MFA delete: require MFA token to delete any backup
- Separate KMS keys: backup encryption keys in different account
Crypto-Shredding (GDPR Right to Erasure from Backups)
Problem: User requests data deletion, but data exists in 90 days of backups
Crypto-shredding:
1. Each user's data encrypted with per-user key before backup
2. User deletion request → delete their encryption key from KMS
3. Backup data still exists but is unreadable (key is gone)
4. Effectively deleted without modifying backup files
5. Immutable backups + GDPR compliance = both satisfied
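The crypto-shredding flow can be demonstrated end to end with the standard library. Everything here is illustrative: `InMemoryKeyStore` stands in for a real KMS, and `toy_cipher` is a SHA-256 counter-mode XOR stream used only so the example runs without third-party packages. It is not authenticated encryption and should not be used for real data; a production system would use a vetted AEAD cipher.

```python
import hashlib
import secrets
from typing import Dict


def _keystream(key: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from the key (toy construction)."""
    blocks = bytearray()
    counter = 0
    while len(blocks) < length:
        blocks += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(blocks[:length])


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR with the keystream; applying it twice with the same key decrypts."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


class InMemoryKeyStore:
    """Hypothetical stand-in for a KMS holding one key per user."""

    def __init__(self) -> None:
        self._keys: Dict[str, bytes] = {}

    def create_key(self, user_id: str) -> bytes:
        self._keys[user_id] = secrets.token_bytes(32)
        return self._keys[user_id]

    def get_key(self, user_id: str) -> bytes:
        return self._keys[user_id]  # raises KeyError once the key is shredded

    def shred(self, user_id: str) -> None:
        self._keys.pop(user_id, None)


if __name__ == "__main__":
    store = InMemoryKeyStore()
    blob = toy_cipher(store.create_key("user-42"), b"alice@example.com")
    store.shred("user-42")  # erasure request: the blob stays, the key is gone
    print(len(blob))        # ciphertext still present in the "backup"
```

The backup file is never touched, which is what lets this coexist with Object Lock.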
- Cost optimization: Dedup + compression + lifecycle policies reduce storage 5-10×
- Chaos engineering: Regularly test DR failover (GameDay exercises)
- Multi-database consistency: Coordinated snapshot across systems (or accept small inconsistency window)
- Backup bandwidth: Use incremental + local snapshot + async replication
- Monitoring: Track backup success rate, size trends, replication lag, last successful restore test date
Interview Walkthrough
- Start by defining RPO (max acceptable data loss) and RTO (max acceptable downtime) — every architectural choice flows from these two numbers.
- 3-2-1 backup rule: 3 copies, 2 media types, 1 offsite — immutable/object-lock backups protect against ransomware that encrypts live systems.
- Incremental backups + WAL/binlog archiving for databases: full snapshots weekly, incrementals hourly, continuous log shipping for point-in-time recovery.
- Active-passive DR with documented failover runbooks — DNS/load balancer switch, promote replica, validate data integrity before accepting traffic.
- Scheduled restore tests are non-negotiable (weekly and automated in this design): a backup never tested is a backup you cannot trust during an actual outage.
- Crypto-shredding for GDPR: encrypt PII with per-user keys, destroy keys on erasure request — data in backups becomes unreadable without rewriting tapes.
- Common pitfall: backing up without ever performing a full restore test — corrupted or incomplete backups are discovered only during a real disaster.
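The walkthrough's point about restore testing can be made concrete. Below is a minimal sketch of the validation step, using the thresholds from the Restore Testing section (row counts within 0.1%, checksums of critical tables match, restore time within the RTO). The function names and the dictionary inputs are hypothetical.

```python
from typing import Dict, List


def row_count_within_tolerance(expected: int, actual: int, tolerance: float = 0.001) -> bool:
    """True when the restored row count is within `tolerance` (0.1%) of expected."""
    if expected == 0:
        return actual == 0
    return abs(actual - expected) / expected <= tolerance


def validate_restore(
    expected_rows: Dict[str, int],
    actual_rows: Dict[str, int],
    expected_checksums: Dict[str, str],
    actual_checksums: Dict[str, str],
    restore_seconds: float,
    rto_seconds: float,
) -> List[str]:
    """Return a list of failure descriptions; an empty list means the test passed."""
    failures: List[str] = []
    for table, expected in expected_rows.items():
        actual = actual_rows.get(table)
        if actual is None or not row_count_within_tolerance(expected, actual):
            failures.append(f"row count mismatch: {table}")
    for table, digest in expected_checksums.items():
        if actual_checksums.get(table) != digest:
            failures.append(f"checksum mismatch: {table}")
    if restore_seconds > rto_seconds:
        failures.append("restore time exceeded RTO")
    return failures
```

A non-empty result would feed the dashboard and alerting step, and a passing run could set `validated` on the corresponding `restore_jobs` row.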
RPO vs RTO: The Fundamental Trade-off
RPO 0 (zero data loss):
- Requires: synchronous replication to DR site
- Sync replication: primary WAITS for DR to confirm write
- Latency cost: +20-100ms per write
- Use for: Financial transactions, ledger systems, payment data
RPO 1 minute:
- Requires: async replication + WAL shipping every minute
- Primary writes freely → ships WAL logs to DR every 60 seconds
- Use for: E-commerce orders, user data
RPO 1 hour:
- Requires: hourly backups to S3 (snapshot + incremental)
- Cheapest approach: no replication, just scheduled backups
- Use for: Analytics, non-critical data
RTO → cost relationship:
- RTO = 4 hours: ~$500/month (just S3 storage)
- RTO = 15 minutes: ~$10,000/month (duplicate infrastructure)
- RTO = 30 seconds: ~$50,000/month (full infra in 2+ regions)
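The RPO tiers above amount to a lookup: choose the cheapest approach whose achievable RPO does not exceed the requirement. A small sketch, with the tier list taken from the text and the function name hypothetical:

```python
from typing import List, Tuple

# (achievable RPO in seconds, approach), ordered cheapest first.
RPO_TIERS: List[Tuple[int, str]] = [
    (3600, "hourly snapshot + incremental backups to S3"),
    (60, "async replication + WAL shipping every minute"),
    (0, "synchronous replication to DR site"),
]


def cheapest_strategy(required_rpo_seconds: float) -> str:
    """Pick the cheapest tier that still meets the required RPO."""
    if required_rpo_seconds < 0:
        raise ValueError("RPO cannot be negative")
    for achievable_rpo, approach in RPO_TIERS:
        if achievable_rpo <= required_rpo_seconds:
            return approach
    raise AssertionError("unreachable: the RPO 0 tier satisfies any requirement")


if __name__ == "__main__":
    for rpo in (0, 60, 300, 3600):
        print(rpo, "->", cheapest_strategy(rpo))
```

Any requirement tighter than one minute falls through to synchronous replication here only because the text defines three tiers; a real design could add intermediate options.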
Split-Brain Prevention During Failover
1. Fencing (STONITH: Shoot The Other Node In The Head): Before promoting DR: revoke IAM write permissions, block traffic, shut DB. Only AFTER fencing succeeds → promote DR
2. Witness/Quorum: Deploy "witness" node in third region. Failover requires 2-of-3 agreement.
3. Lease-based leadership: Primary holds lease in distributed lock, expires every 30s. If primary can't renew → safe to promote
4. Epoch-based writes: Every write includes epoch number. On failover: new primary increments epoch. Old primary's epoch < current → writes rejected.
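Epoch-based writes (item 4) are the easiest of these to show in code. In this minimal sketch, `EpochGuardedStore` is a hypothetical storage layer that tracks the current epoch and rejects any write stamped with a different one.

```python
from typing import Dict


class EpochGuardedStore:
    """Accepts writes only from the node holding the current epoch."""

    def __init__(self) -> None:
        self.current_epoch = 1
        self.data: Dict[str, str] = {}

    def start_new_epoch(self) -> int:
        """Called on failover: the newly promoted primary gets a higher epoch."""
        self.current_epoch += 1
        return self.current_epoch

    def write(self, epoch: int, key: str, value: str) -> bool:
        # A stale (or unknown) epoch means the writer is not the current primary.
        if epoch != self.current_epoch:
            return False
        self.data[key] = value
        return True


if __name__ == "__main__":
    store = EpochGuardedStore()
    old_primary_epoch = store.current_epoch
    new_primary_epoch = store.start_new_epoch()
    print(store.write(old_primary_epoch, "order-1", "stale"))  # old primary rejected
    print(store.write(new_primary_epoch, "order-1", "fresh"))  # new primary accepted
```

The check lives in the storage layer rather than in the primaries, so a revived old primary is rejected even if it never learns that a failover happened.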
Failback: Returning to Primary After DR
- Step 1: Rebuild primary as replica.
- Step 2: Validate (row counts, checksums).
- Step 3: Planned failover during maintenance window.
- Step 4: Post-failback monitoring.
Total failback time: 2-12 hours (depending on data volume).
Danger: if you fail back too quickly without a full sync, transactions committed on DR during the outage are LOST.
Active-Active Multi-Region
Instead of active-passive, run BOTH regions as active simultaneously.
Why it's hard:
- Conflict resolution: Last-write-wins (LWW), application-level merge, CRDTs, region-owned data
- Referential integrity: route user + all related entities to same region
When active-active makes sense:
✓ Read-heavy workloads (global CDN-like caching)
✓ Partition-able data (each user assigned to home region)
✓ CRDT-compatible operations (counters, sets, append-only logs)
When active-passive is better:
✓ Strong consistency required (financial, inventory)
✓ Complex transactions spanning multiple entities
✓ Write-heavy workloads
✓ Simpler operations
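To make "CRDT-compatible operations" concrete, here is a sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. Each region increments only its own slot, and merging takes the per-region maximum, so replicas converge regardless of merge order. The class and region names are illustrative.

```python
from typing import Dict


class GCounter:
    """Grow-only counter: one slot per region, merged by element-wise max."""

    def __init__(self, region: str) -> None:
        self.region = region
        self.counts: Dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.region] = self.counts.get(self.region, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        """Commutative, associative, and idempotent, so merge order is irrelevant."""
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)


if __name__ == "__main__":
    us, eu = GCounter("us-east"), GCounter("eu-west")
    us.increment(3)
    eu.increment(2)
    us.merge(eu)
    eu.merge(us)
    print(us.value(), eu.value())  # both replicas report the same total
```

Operations that cannot be expressed this way, such as decrementing inventory with a non-negative constraint, are where active-passive remains the simpler choice.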
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core backup and disaster recovery flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
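An availability target translates directly into an error budget. The sketch below converts an SLO into allowed downtime; the 30-day window and the function name are assumptions.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given availability SLO."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * window_days * 24 * 60


if __name__ == "__main__":
    # 99.95% over 30 days leaves roughly 21.6 minutes of budget.
    print(round(error_budget_minutes(0.9995), 1))
```

Set against the ~3 minute automated failover from the runbook, a 99.95% target leaves room for a handful of failovers per month, while a 30+ minute manual failover would exhaust the budget in a single incident.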
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.