Interview Prompt
Design Blob Storage System (like S3).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Object metadata store, Erasure coding vs replication, Multipart upload? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Object metadata store
- Erasure coding vs replication
- Multipart upload
- Garbage collection
- Pre-signed URLs
- Consistency model (strong read-after-write)
Out of scope (state explicitly)
- Client desktop/mobile app implementation
- End-user file preview rendering for every format
- Building raw block storage hardware
Assumptions
- Clarify scale (DAU, QPS, data volume) for the blob storage system in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
Functional Requirements
- Upload objects (blobs) of any size (1 byte to 5 TB) via PUT
- Download objects via GET with support for range requests (partial downloads)
- Delete objects (immediate metadata removal, lazy storage reclamation)
- List objects in a bucket with prefix filtering and pagination
- Multipart upload: upload large objects in parts, assemble on completion
- Versioning: maintain multiple versions of the same object key
- Object metadata: custom key-value headers, content-type, content-encoding
- Access control: per-bucket and per-object ACLs, IAM policies
- Pre-signed URLs: generate time-limited URLs for upload/download without credentials
- Lifecycle policies: auto-transition to cheaper storage tiers, auto-delete after N days
Non-Functional Requirements
- Durability: 99.999999999% (11 nines), i.e. committed data is virtually never lost
- Availability: 99.99% for reads, 99.9% for writes
- Scalability: Exabytes of storage, millions of objects per bucket
- Throughput: 100K+ requests/sec per bucket, multi-Gbps per object
- Consistency: Strong read-after-write consistency
- Low Latency: First-byte in < 100ms for most objects
- Cost Efficiency: Tiered storage (hot, warm, cold, archive)
Capacity Estimation
| Metric | Calculation | Value |
|---|---|---|
| Total stored objects | Given (assumption) | 100+ trillion (S3-scale) |
| Total storage | Given (assumption) | Exabytes |
| Requests / sec (global) | Requests per day ÷ 86,400, times a peak factor | 100M+ |
| Avg object size | Given (typical workload assumption) | 100 KB (highly variable) |
| Metadata per object | Given | ~1 KB |
| Metadata storage | 100T objects × 1 KB | 100 PB |
| Write throughput | Given (assumption) | 10M objects/sec |
| Replication factor | Given (assumption) | 3 (cross-AZ) for hot data; erasure coding for cold |
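The arithmetic behind the table can be checked in a few lines. This is a minimal sketch that assumes decimal units (1 KB = 1,000 bytes) and the table's own figures; the raw-data and replicated totals are derived here for illustration and do not appear in the table.

```python
# Back-of-envelope check of the capacity table (decimal units assumed).
KB, PB, EB = 10**3, 10**15, 10**18

objects = 100 * 10**12          # 100 trillion objects
metadata_per_object = 1 * KB    # ~1 KB per object
avg_object_size = 100 * KB      # 100 KB average (highly variable)
replication_factor = 3          # 3 copies across AZs

metadata_bytes = objects * metadata_per_object
raw_data_bytes = objects * avg_object_size              # derived, not in the table
replicated_bytes = raw_data_bytes * replication_factor  # derived, not in the table

print(f"metadata:   {metadata_bytes / PB:.0f} PB")
print(f"raw data:   {raw_data_bytes / EB:.0f} EB")
print(f"replicated: {replicated_bytes / EB:.0f} EB")
```

The derived totals land in the exabyte range, which is consistent with the "Exabytes" row.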
S3-style blob storage with an Object Service, a Placement Service, a Metadata Store (KV DB), and replicated Data Nodes. The write path streams data to the primary node, which replicates it to the secondaries; metadata is written last so that reads are strongly consistent.
Write Path (PUT Object)
- Client → API Gateway → Object Service (authenticates + authorizes)
- If object > 5GB → require multipart upload
- Object Service asks Placement Service: "Where to store 3 replicas?"
- Object Service streams data to primary data node
- Primary replicates to secondary nodes (chain replication)
- After all 3 replicas written + checksums verified: write metadata
- Metadata write is LAST step → ensures strong read-after-write consistency
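The ordering in the write path above can be sketched as follows. This is a toy model under stated assumptions: `DataNode`, `put_object`, and the dict-based metadata store are hypothetical names, storage is in memory, and chain replication is simplified to a sequential loop.

```python
import hashlib

class DataNode:
    """Hypothetical in-memory stand-in for a data node."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.blobs = {}

    def write(self, blob_id, data):
        self.blobs[blob_id] = data
        # Report the checksum of what was actually stored.
        return hashlib.sha256(data).hexdigest()

def put_object(metadata, nodes, bucket, key, data):
    """Write every replica, verify checksums, and commit metadata last."""
    expected = hashlib.sha256(data).hexdigest()
    blob_id = f"{bucket}/{key}/{expected[:12]}"
    for node in nodes:  # simplified: sequential instead of a replication chain
        if node.write(blob_id, data) != expected:
            raise IOError(f"checksum mismatch on {node.node_id}")
    # Only now does the object become visible to readers.
    metadata[(bucket, key)] = {
        "blob_id": blob_id,
        "checksum_sha256": expected,
        "locations": [n.node_id for n in nodes],
    }
    return expected
```

If any replica fails verification, the function raises before touching the metadata store, so a reader never sees a key whose data is incomplete. The already-written replicas become unreferenced blocks that garbage collection can reclaim later.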
Read Path (GET Object)
- Client → API Gateway → Object Service
- Object Service queries Metadata Store with (bucket, key)
- Gets data locations: [{dn-17, vol-3, offset-4096}, ...]
- Selects closest/healthiest data node
- Streams data directly from data node to client
- Verifies checksum on the fly
- If checksum mismatch → try next replica + trigger repair
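The replica fallback in the read path can be sketched like this. It is a simplification under stated assumptions: `get_object` and its arguments are hypothetical, replicas are plain byte strings in a dict, and the checksum is verified on the whole object before returning rather than incrementally while streaming.

```python
import hashlib

def get_object(record, replicas, repair_queue):
    """Return the bytes from the first replica whose checksum matches.

    record:       metadata entry with 'checksum_sha256' and ordered 'locations'
    replicas:     dict node_id -> bytes (hypothetical stand-in for data nodes)
    repair_queue: list that collects node_ids needing re-replication
    """
    for node_id in record["locations"]:
        data = replicas.get(node_id)
        if data is not None and hashlib.sha256(data).hexdigest() == record["checksum_sha256"]:
            return data
        # Missing or corrupt replica: schedule repair and try the next one.
        repair_queue.append(node_id)
    raise IOError("no healthy replica found")
```

A production read path would hash chunks as they stream to the client, so a mismatch is only detected at the end; that is one reason clients are expected to validate checksums as well.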
Optimization for small objects (< 256 KB): cache in Redis/in-memory. For large objects: range requests for parallel download.
Data Node: Block Format
Storage is organized as an append-only volume file (log-structured). Volume file contains fixed-size 64 MB blocks appended sequentially. Objects ≥ 64 MB span multiple blocks. Multiple small objects are packed into a single 64 MB block to avoid wasted space. An in-memory block index maps object_id → (volume_id, block_offset, length) for O(1) random access. Deletion is soft-delete (remove index entry); compaction reclaims space.
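The append-only layout with an in-memory index, soft delete, and compaction can be sketched in miniature. The `Volume` class below is hypothetical and deliberately simplified: one volume held in memory, no fixed 64 MB blocks, and no persistence of the index.

```python
class Volume:
    """Append-only volume with an in-memory index (toy model, single volume)."""

    def __init__(self):
        self.log = bytearray()
        self.index = {}  # object_id -> (offset, length)

    def put(self, object_id, data):
        self.index[object_id] = (len(self.log), len(data))
        self.log += data  # append only; earlier bytes are never overwritten

    def get(self, object_id):
        offset, length = self.index[object_id]  # O(1) lookup, then one slice
        return bytes(self.log[offset:offset + length])

    def delete(self, object_id):
        del self.index[object_id]  # soft delete: bytes stay until compaction

    def compact(self):
        """Copy only live objects into a fresh log and rebuild the index."""
        new_log, new_index = bytearray(), {}
        for object_id, (offset, length) in self.index.items():
            new_index[object_id] = (len(new_log), length)
            new_log += self.log[offset:offset + length]
        self.log, self.index = new_log, new_index
```

Deleting an object leaves its bytes in the log, so disk usage only drops after `compact()` runs; this is the space-versus-write-amplification trade-off that log-structured stores accept.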
Event Bus Design (Kafka)
Topic: blob_storage_s3-events
Partitions: 64 (scale consumers horizontally)
Partition key: entity_id (for this system, e.g. bucket + object key; preserves per-entity ordering)
Retention: 7 days (compliance) or 24h (high-volume telemetry)
Replication factor: 3, min.insync.replicas: 2
Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "blob_storage_s3-processors"
- At-least-once delivery + idempotent handlers (dedup by event_id)
- DLQ topic: blob_storage_s3-events-dlq (poison messages after 3 retries)
- Lag alert: consumer lag > 60s → scale workers
Async side effects MUST NOT block the synchronous API response.
Sync path: validate → persist source of truth → publish event → return 2xx
Async path: consumers update caches, indexes, notifications, aggregates
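The consumer-side rules (at-least-once delivery, dedup by event_id, dead-letter after 3 retries) can be illustrated without a Kafka client. This is a sketch under stated assumptions: `process_events` is a hypothetical helper, events are plain dicts, and the dedup set lives in memory, whereas a real consumer would persist it.

```python
def process_events(events, handler, max_retries=3):
    """At-least-once consumer loop: dedup by event_id, retry, then dead-letter."""
    seen, dlq = set(), []
    for event in events:
        if event["event_id"] in seen:
            continue  # duplicate delivery: already handled once
        for attempt in range(max_retries):
            try:
                handler(event)
                seen.add(event["event_id"])
                break
            except Exception:
                if attempt == max_retries - 1:
                    dlq.append(event)  # poison message: park it, keep going
    return seen, dlq
```

Because the event is marked as seen only after the handler succeeds, a crash between the two steps causes a redelivery rather than a lost event, which is why handlers must be idempotent.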
# Object operations
PUT /{bucket}/{key} → Upload object
GET /{bucket}/{key} → Download object
HEAD /{bucket}/{key} → Get object metadata
DELETE /{bucket}/{key} → Delete object
GET /{bucket}/{key}?versionId={v} → Get specific version
# Multipart upload
POST /{bucket}/{key}?uploads → Initiate multipart upload
PUT /{bucket}/{key}?partNumber={n}&uploadId={id} → Upload part
POST /{bucket}/{key}?uploadId={id} → Complete multipart upload
# Bucket operations
PUT /{bucket} → Create bucket
DELETE /{bucket} → Delete bucket (must be empty)
GET /{bucket}?list-type=2 → List objects
# Pre-signed URLs
POST /api/presign
{ "method": "PUT", "bucket": "my-bucket", "key": "photo.jpg", "expires": 3600 }
→ Returns: "https://s3.example.com/my-bucket/photo.jpg?X-Sig=xxx&X-Expires=3600"
Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
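The pre-signed URL endpoint listed above can be approximated with an HMAC over the method, path, and expiry. This is an illustrative sketch, not AWS Signature Version 4: `SECRET`, `presign`, and `verify` are hypothetical names, and `X-Expires` carries an absolute Unix timestamp here (an assumption) rather than a duration.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"demo-secret"  # hypothetical server-side signing key

def _sign(method, path, expires_at):
    msg = f"{method}\n{path}\n{expires_at}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def presign(method, bucket, key, expires, now=None):
    """Build a URL that authorizes one method on one object until it expires."""
    now = int(time.time()) if now is None else now
    path = f"/{bucket}/{key}"
    expires_at = now + expires
    query = urlencode({"X-Expires": expires_at, "X-Sig": _sign(method, path, expires_at)})
    return f"https://s3.example.com{path}?{query}"

def verify(method, path, expires_at, sig, now=None):
    """Reject expired URLs first, then compare signatures in constant time."""
    now = int(time.time()) if now is None else now
    if now > expires_at:
        return False
    return hmac.compare_digest(sig, _sign(method, path, expires_at))
```

Binding the HTTP method into the signed message means a URL issued for PUT cannot be replayed as a GET or DELETE.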
Metadata Store (FoundationDB / DynamoDB)
Primary key: (bucket_name, object_key, version_id)
{
"bucket": "my-photos",
"key": "2026/march/sunset.jpg",
"version_id": "v_abc123",
"is_latest": true,
"size": 2457600,
"etag": "\"d41d8cd98f00b204e9800998ecf8427e\"",
"content_type": "image/jpeg",
"storage_class": "STANDARD",
"checksum_sha256": "abc123...",
"data_locations": [
{"node_id": "dn-17", "volume_id": "vol-3", "offset": 4096, "length": 2457600},
{"node_id": "dn-42", "volume_id": "vol-1", "offset": 8192, "length": 2457600},
{"node_id": "dn-63", "volume_id": "vol-7", "offset": 2048, "length": 2457600}
],
"custom_metadata": { "x-amz-meta-photographer": "Alice" },
"delete_marker": false,
"created_at": "2026-03-14T10:00:00Z"
}
Bucket Metadata
CREATE TABLE buckets (
bucket_name TEXT PRIMARY KEY,
owner_id UUID NOT NULL,
region TEXT NOT NULL,
versioning TEXT DEFAULT 'disabled',
storage_class TEXT DEFAULT 'STANDARD',
lifecycle_rules JSONB,
cors_config JSONB,
policy JSONB,
created_at TIMESTAMPTZ DEFAULT NOW()
);
How 11-Nines Durability Is Achieved
- 3× replication across AZs: Probability of all 3 AZs failing simultaneously ~10^-9
- Checksums at every layer: Object-level SHA-256, block-level checksum, network checksum (TCP + TLS). Verified on every read → detect bit-rot.
- Background scrubbing: Weekly comparison of all replica checksums. Any mismatch → re-replicate from healthy replica.
- Rebuild on failure: Placement Service detects via heartbeat, triggers re-replication. Throttled at 100 MB/s per node.
- Erasure coding for cold data: RS(10,4): lose 4 fragments, still recoverable. 1.4× overhead vs 3× for replication.
- Cross-region replication (optional): Asynchronous replication to another region.
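Of the mechanisms above, background scrubbing is the easiest to show in code. The sketch below assumes a hypothetical `scrub` helper, replicas held as byte strings in a dict, and the object-level SHA-256 from the metadata store as the source of truth.

```python
import hashlib

def scrub(expected_sha256, replicas):
    """Compare each replica with the stored checksum and repair mismatches.

    replicas: dict node_id -> bytes, mutated in place.
    Returns the list of node_ids that were re-replicated.
    """
    healthy = [node for node, data in replicas.items()
               if hashlib.sha256(data).hexdigest() == expected_sha256]
    if not healthy:
        raise IOError("all replicas corrupt: data loss")
    source = replicas[healthy[0]]
    repaired = []
    for node in list(replicas):
        if node not in healthy:
            replicas[node] = source  # re-replicate from a healthy copy
            repaired.append(node)
    return repaired
```

Scrubbing matters because bit rot is silent: without a periodic pass, a corrupt replica is only discovered when it is read, possibly after the other copies have also degraded.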
Strong Read-After-Write Consistency
Before December 2020, S3 guaranteed read-after-write consistency only for PUTs of new keys; overwrite PUTs and DELETEs were eventually consistent, so a GET shortly after a write could return stale data or a 404. How this design achieves strong consistency: PUT writes data to storage nodes FIRST, then writes metadata atomically with a version number. GET reads from the metadata store, which always returns the latest version. If metadata says the object exists, the data must exist, because it was written first. This is the "data-before-metadata" pattern.
Multipart Upload Recovery
A 5 TB upload fails at 90%? No problem:
1) Initiate → get upload_id.
2) Upload parts independently (each 5 MB to 5 GB).
3) Retry failed parts individually.
4) Complete → server assembles parts (metadata only, no data copy!).
Incomplete uploads are cleaned up by a lifecycle rule after 7 days.
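The multipart flow above can be modeled with a small in-memory store. `MultipartStore` and its methods are hypothetical names; the point of the sketch is that completing the upload only writes an ordered manifest of part references and copies no part data.

```python
import hashlib
import uuid

class MultipartStore:
    """Toy multipart upload: parts are stored once, completion writes a manifest."""

    def __init__(self):
        self.blobs = {}    # part reference -> bytes
        self.uploads = {}  # upload_id -> {part_number: part reference}
        self.objects = {}  # object key -> ordered list of part references

    def initiate(self):
        upload_id = uuid.uuid4().hex
        self.uploads[upload_id] = {}
        return upload_id

    def upload_part(self, upload_id, part_number, data):
        ref = f"{upload_id}/{part_number}"
        self.blobs[ref] = data  # retrying a part simply overwrites it
        self.uploads[upload_id][part_number] = ref
        return hashlib.sha256(data).hexdigest()  # part checksum for the client

    def complete(self, upload_id, key):
        parts = self.uploads.pop(upload_id)
        # Metadata only: record part order, leave the bytes where they are.
        self.objects[key] = [parts[n] for n in sorted(parts)]

    def read(self, key):
        return b"".join(self.blobs[ref] for ref in self.objects[key])
```

Parts can arrive in any order and be retried independently, because ordering is decided by part number at completion time rather than by arrival time.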
Handling Hot Objects
A viral object can overload the few data nodes that hold its replicas. Solutions: CDN caching (CloudFront), request coalescing at the API layer (1000 concurrent GETs → 1 read from the data node), dynamically adding read replicas for hot objects, and pre-signed URLs that point at the CDN.
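Request coalescing is sometimes called single-flight: the first caller for a key performs the read, and concurrent callers for the same key wait for its result. The `Coalescer` class below is a hypothetical thread-based sketch; the `coalesced` counter exists only to make the effect observable.

```python
import threading

class Coalescer:
    """Collapse concurrent fetches of the same key into one backend read."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._lock = threading.Lock()
        self._inflight = {}  # key -> shared entry for the in-progress read
        self.coalesced = 0   # number of callers that piggybacked on a read

    def get(self, key):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = {"done": threading.Event(), "value": None, "error": None}
                self._inflight[key] = entry
            else:
                self.coalesced += 1
        if leader:
            try:
                entry["value"] = self._fetch(key)
            except Exception as exc:
                entry["error"] = exc
            finally:
                with self._lock:
                    del self._inflight[key]  # later callers start a fresh read
                entry["done"].set()
        else:
            entry["done"].wait()
        if entry["error"] is not None:
            raise entry["error"]
        return entry["value"]
```

This only deduplicates reads that overlap in time; pairing it with a short-lived cache covers bursts where requests arrive just after the previous read finishes.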
Garbage Collection
Lazy GC: a scanner reads the metadata store for live objects and the data nodes for stored blocks. Blocks NOT referenced by any live object are marked as garbage and reclaimed after a 48h grace period. This is mark-and-sweep, similar in spirit to JVM GC or LSM-tree compaction.
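A single scanner pass can be sketched as a mark-and-sweep over block IDs. `collect_garbage` and its arguments are hypothetical; timestamps are plain seconds so the grace period is easy to follow.

```python
def collect_garbage(live_refs, stored_blocks, first_seen_orphan, now,
                    grace_seconds=48 * 3600):
    """Return block ids that are safe to reclaim on this pass.

    live_refs:         set of block ids referenced by the metadata store
    stored_blocks:     set of block ids present on data nodes
    first_seen_orphan: dict block_id -> time first seen unreferenced (mutated)
    """
    reclaim = []
    for block in stored_blocks:
        if block in live_refs:
            # Referenced (possibly since the last scan): no longer an orphan.
            first_seen_orphan.pop(block, None)
            continue
        marked_at = first_seen_orphan.setdefault(block, now)
        if now - marked_at >= grace_seconds:
            reclaim.append(block)
    return sorted(reclaim)
```

The grace period is what makes this safe alongside a data-before-metadata write path: a block belonging to an in-flight PUT looks unreferenced until its metadata commits, and must not be swept in the meantime.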
Storage Tiers
| Tier | Use Case | Durability | Availability | Access Time |
|---|---|---|---|---|
| Standard | Frequently accessed | 11 nines | 99.99% | < 100ms |
| Infrequent Access | Monthly access | 11 nines | 99.9% | < 100ms |
| Glacier | Quarterly access | 11 nines | 99.99% | 1-5 min (expedited), 3-5 hours (standard) |
| Deep Archive | Yearly access | 11 nines | 99.99% | 12 hours |
Lifecycle automation: a rule such as "after 30 days → IA, after 90 days → Glacier, after 365 days → Deep Archive" is applied daily by the Lifecycle Manager.
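The daily evaluation can be sketched as a lookup from object age to target tier. The tier names, the `RULES` list, and `target_tier` are hypothetical; the thresholds are the 30/90/365-day example from the text.

```python
from datetime import date

# (minimum age in days, target tier), checked from oldest threshold to newest.
RULES = [(365, "DEEP_ARCHIVE"), (90, "GLACIER"), (30, "INFREQUENT_ACCESS")]

def target_tier(created, today):
    """Return the storage tier an object should be in, given its age."""
    age_days = (today - created).days
    for min_age, tier in RULES:
        if age_days >= min_age:
            return tier
    return "STANDARD"

# Example: an object created on 2026-01-01, evaluated on 2026-04-01 (90 days old).
print(target_tier(date(2026, 1, 1), date(2026, 4, 1)))
```

A Lifecycle Manager would compare this target with the object's current `storage_class` and enqueue a transition only when they differ.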
Event Notifications
S3 emits events for ObjectCreated, ObjectRemoved, and lifecycle transitions. Sent to Kafka/SQS or Lambda. Use cases: image uploaded → trigger thumbnail generation, log uploaded → trigger ETL, object deleted → update search index.
Interview Walkthrough
- Split metadata store (bucket, key, version, replica locations) from blob data (append-only volumes on data nodes) — the defining architectural boundary.
- Write path order matters: stream data to 3 replicas and verify checksums FIRST, write metadata LAST — this is how strong read-after-write consistency is achieved.
- Multipart upload for objects > 5 GB: upload parts independently, complete assembles metadata only with no data copy.
- Erasure coding (RS 10+4) for cold/archive tiers — 1.4× storage overhead vs 3× for hot replication.
- Background scrubbing compares replica checksums weekly; bit-rot detected on read triggers re-replication from healthy copy.
- Lifecycle policies automate tier transitions (Standard → IA → Glacier) to optimize cost without manual intervention.
- Common pitfall: writing metadata before all replicas confirm — GET after PUT can return 404 or stale data, the classic eventual-consistency trap.
Erasure Coding vs Replication for Durability
Pure replication (3 copies): 3x storage overhead. Simple, fast recovery. Used for hot tier.
Erasure coding (Reed-Solomon): split an object into k data chunks plus m parity chunks; any k of the k+m chunks reconstruct it. For example, 6+3 coding (6 data + 3 parity = 9 chunks) survives the loss of any 3 chunks at 1.5x storage overhead vs 3x for replication, and a wider 20+3 code brings overhead down to 1.15x. Treat these specific parameters as illustrative rather than as confirmed S3 internals. Trade-off: lower storage cost (a large saving at exabyte scale) but higher I/O during reconstruction (must read k chunks). Hot tier: 3-copy replication for instant recovery. Cold tier: erasure coding for cost efficiency.
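The overhead figures follow directly from the code parameters: a k+m code stores (k+m)/k bytes per byte of data and tolerates m lost chunks. A minimal sketch, with `storage_overhead` as a hypothetical helper and 3-copy replication modeled as 1 data chunk plus 2 extra copies:

```python
def storage_overhead(k, m):
    """Bytes stored per byte of user data for k data chunks + m redundant chunks."""
    return (k + m) / k

schemes = {
    "3x replication": (1, 2),  # 1 data copy + 2 extra copies
    "RS 6+3": (6, 3),
    "RS 10+4": (10, 4),
    "RS 20+3": (20, 3),
}

for name, (k, m) in schemes.items():
    print(f"{name:15s} overhead {storage_overhead(k, m):.2f}x, "
          f"tolerates {m} lost chunks, reads {k} chunks to rebuild")
```

Wider codes cut overhead but raise the number of chunks that must be read to rebuild a lost one, which is why they suit cold data better than hot data.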
Consistency: Eventual vs Strong Read-After-Write
S3 has offered strong read-after-write consistency since December 2020. Before that, overwrites and deletes were eventually consistent, and a GET shortly after a PUT could briefly return stale data or a 404. How to achieve it in this design: the metadata service atomically updates the "current version" pointer on PUT, GET reads through the metadata service, and the metadata service replicates with a consensus protocol such as Raft (a design assumption here, not a confirmed S3 internal). Since the December 2020 change, LIST is strongly consistent too: a write is immediately reflected in subsequent listings. Only bucket-level configuration changes (versioning, policies) remain eventually consistent. Interview tip: S3 is strongly consistent for GET, PUT, DELETE, and LIST alike.
Multipart Upload: How Large File Uploads Work
A single PUT of 5 TB is impractical: a network interruption at 99% wastes hours, a client that does not stream must buffer the whole object, and one TCP connection caps throughput. Multipart upload: upload parts in PARALLEL (e.g., 100 parts at once, for up to ~100x throughput if client bandwidth allows). Failed parts are retried independently. Server side: no data copy on completion (just a metadata update). Best practice: add a lifecycle rule to abort incomplete uploads after 7 days to prevent orphaned storage.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving the core blob storage flows (PUT, GET, DELETE, LIST). Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
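An availability target translates into a concrete error budget, which is usually the more useful number in an incident review. A minimal sketch, assuming a 30-day window; `error_budget_minutes` is a hypothetical helper.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of allowed unavailability in the window for a given SLO."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_percent / 100) * window_minutes

for slo in (99.9, 99.95, 99.99):
    print(f"{slo}% over 30 days -> {error_budget_minutes(slo):.2f} min of budget")
```

At 99.95% the budget is about 21.6 minutes per 30 days, so a single bad deploy that takes 5 minutes to roll back consumes roughly a quarter of it.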
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.