Interview Prompt
Design Ephemeral Stories (Instagram Stories).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: 24-hour TTL, Viewer list tracking, Story ring ordering? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- 24-hour TTL
- Viewer list tracking
- Story ring ordering
- CDN pre-warming
- Soft delete
- Capacity estimation with shown math
Out of scope (state explicitly)
- Full ML ranking model training pipeline
- Direct messaging / chat (#07)
- Ad insertion and monetization
Assumptions
- Clarify scale (DAU, QPS, data volume) for ephemeral stories in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
- Post stories: Upload photo/video stories that auto-delete after 24 hours
- View stories: See stories from followed users in a tray/carousel
- Story ring: Colored ring indicator for unviewed stories
- View tracking: Track who viewed your story
- Story reactions: Reply to a story or react with emoji
- Close friends: Post visible only to a selected list
- Highlights: Pin stories to profile (opt out of deletion)
- Story ordering: Closest friends first, then by recency
- Low Latency: Story tray loads in < 300 ms
- Auto-Deletion: Stories MUST disappear after exactly 24 hours
- High Throughput: 500M+ stories/day, 10B+ views/day
- Availability: 99.99%
- Consistency: View tracking must be accurate
| Metric | Calculation | Value |
|---|---|---|
| DAU | Given | 500M |
| Stories posted / day | Given | 500M |
| Story views / day | Given | 10B |
| Avg story media size | Given | 2 MB |
| Upload storage / day | 500M × 2 MB | 1 PB (deleted after 24h) |
| Peak concurrent views / sec | 10B ÷ 86,400 s ≈ 116K/s average, scaled by a peak factor of about 1.7× | 200K |
| Storage steady state | 1 PB/day × 24h retention | ~1 PB (24h window) |
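The arithmetic behind the table can be checked with a few lines of Python. This is only a back-of-envelope sketch: the variable names are illustrative, decimal units are assumed (1 MB = 10^6 bytes), and the peak factor is inferred from the 200K figure rather than stated in the requirements.

```python
# Back-of-envelope capacity math using the figures from the table above.
SECONDS_PER_DAY = 86_400

stories_per_day = 500_000_000       # given
views_per_day = 10_000_000_000      # given
avg_media_bytes = 2 * 10**6         # 2 MB, decimal units (assumption)

# Upload volume per day, in petabytes.
upload_pb_per_day = stories_per_day * avg_media_bytes / 10**15

# Average view rate, and the peak-to-average ratio implied by the 200K figure.
avg_views_per_sec = views_per_day / SECONDS_PER_DAY
implied_peak_factor = 200_000 / avg_views_per_sec

print(f"upload/day: {upload_pb_per_day:.1f} PB")
print(f"avg views/sec: {avg_views_per_sec:,.0f}")
print(f"implied peak factor: {implied_peak_factor:.2f}")
```

Because media is retained for roughly one day, the steady-state footprint is about one day of uploads.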
24-Hour Auto-Deletion
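Triple TTL approach:
- Cassandra TTL of 86400 seconds on story rows, so rows are deleted automatically.
- S3 lifecycle policy on the /stories/ prefix that expires objects after 1 day. S3 lifecycle expiration is configured in whole days and applied asynchronously, so media objects can outlive the 24-hour mark; the metadata TTL is what actually hides an expired story from readers.
- Redis EXPIRE on all story keys.
No application-level cleanup is needed. Highlights: copy the story to a permanent table/bucket before the original expires.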
View Tracking: High-Volume Writes
10B views/day = ~115K writes/sec. Cassandra table keyed by (story_id, viewer_id) with TTL. Redis SET for fast "has user viewed?" check. INCR for view count per story. Kafka buffer before Cassandra write for at-least-once delivery.
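One detail worth making explicit is that the counter should only be incremented the first time a viewer sees a story; otherwise at-least-once delivery and repeat views inflate the count. The sketch below uses an in-memory stand-in that mirrors the Redis key layout from the data model (`story_viewed:{viewer}:{author}` and `story_views:{story_id}`). The class and method names are hypothetical; with real Redis the same check would rely on SADD returning 1 only for a newly added member.

```python
class ViewTracker:
    """In-memory stand-in for the Redis SET + counter used for view tracking."""

    def __init__(self):
        # (viewer_id, author_id) -> set of story_ids, like story_viewed:{viewer}:{author}
        self._viewed = {}
        # story_id -> int, like story_views:{story_id}
        self._counts = {}

    def record_view(self, viewer_id, author_id, story_id):
        """Record a view; return True only if this is the viewer's first view."""
        seen = self._viewed.setdefault((viewer_id, author_id), set())
        if story_id in seen:
            return False  # duplicate view: do not bump the counter
        seen.add(story_id)
        self._counts[story_id] = self._counts.get(story_id, 0) + 1
        return True

    def view_count(self, story_id):
        return self._counts.get(story_id, 0)


tracker = ViewTracker()
tracker.record_view("u-bob", "u-alice", "s-1")
tracker.record_view("u-bob", "u-alice", "s-1")    # replayed event, ignored
tracker.record_view("u-carol", "u-alice", "s-1")
print(tracker.view_count("s-1"))
```

The Cassandra write needs no such guard: inserting the same (story_id, viewer_id) primary key twice is an upsert, so replays from Kafka are harmless there.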
Story Tray Ordering
Score = unviewed stories × interaction score × recency - fatigue + close friend boost. Interaction score computed offline via Spark. Cached in Redis with 5-minute TTL.
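As a rough illustration, the scoring formula can be written as a small function and used as a sort key. The field names, value ranges, and sample numbers below are assumptions made for the sketch; the source only gives the shape of the formula.

```python
from dataclasses import dataclass


@dataclass
class AuthorSignals:
    author_id: str
    unviewed_stories: int      # count of stories the viewer has not seen
    interaction_score: float   # computed offline (assumed range 0..1)
    recency: float             # assumed 0..1, higher means more recent
    fatigue: float             # penalty for authors shown too often
    close_friend_boost: float  # additive bonus for close friends


def tray_score(s: AuthorSignals) -> float:
    # Score = unviewed stories x interaction score x recency - fatigue + close friend boost
    return (s.unviewed_stories * s.interaction_score * s.recency
            - s.fatigue + s.close_friend_boost)


def order_tray(signals):
    """Return author_ids sorted by descending tray score."""
    return [s.author_id for s in sorted(signals, key=tray_score, reverse=True)]


signals = [
    AuthorSignals("u-alice", 3, 0.5, 0.8, 0.1, 0.0),
    AuthorSignals("u-bob", 1, 0.9, 1.0, 0.0, 1.0),
    AuthorSignals("u-carol", 0, 0.7, 0.9, 0.2, 0.0),
]
print(order_tray(signals))
```

Note that an author with no unviewed stories gets a zero multiplicative term, so only the boost and fatigue terms decide where they land.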
Event Bus Design (Kafka)
Topic: ephemeral_stories-events
Partitions: 64 (scale consumers horizontally)
Partition key: entity_id (for example user_id; preserves per-entity ordering)
Retention: 7 days (compliance) or 24h (high-volume telemetry)
Replication factor: 3, min.insync.replicas: 2
Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "ephemeral_stories-processors"
- At-least-once delivery + idempotent handlers (dedup by event_id)
- DLQ topic: ephemeral_stories-events-dlq (poison messages after 3 retries)
- Lag alert: consumer lag > 60s → scale workers
Async side effects MUST NOT block the synchronous API response.
Sync path: validate → persist source of truth → publish event → return 201
Async path: consumers update caches, indexes, notifications, aggregates
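The consumer-side contract (dedup by event_id, bounded retries, then dead-letter) can be sketched without a broker. Everything here is hypothetical and in-memory: `EventProcessor` is not a Kafka client API, the seen-ID set would live in a durable store in practice, and "3 retries" is read as one initial attempt plus three retries.

```python
MAX_RETRIES = 3  # retries after the first attempt (interpretation of "3 retries")


class EventProcessor:
    """At-least-once consumer logic: idempotent handling plus a dead-letter list."""

    def __init__(self, handler):
        self.handler = handler
        self.seen_event_ids = set()  # would be a durable store in a real system
        self.dead_letters = []       # stand-in for the DLQ topic

    def process(self, event):
        event_id = event["event_id"]
        if event_id in self.seen_event_ids:
            return "duplicate"  # redelivery of an already-handled event
        for _attempt in range(1 + MAX_RETRIES):
            try:
                self.handler(event)
            except Exception:
                continue  # transient failure: try again
            self.seen_event_ids.add(event_id)
            return "processed"
        self.dead_letters.append(event)  # poison message: park it and move on
        return "dead_lettered"


def always_fails(event):
    raise RuntimeError("boom")


ok = EventProcessor(lambda event: None)
print(ok.process({"event_id": "e-1"}))   # first delivery
print(ok.process({"event_id": "e-1"}))   # redelivery

bad = EventProcessor(always_fails)
print(bad.process({"event_id": "e-2"}))
```

Marking an event as seen only after the handler succeeds keeps the guarantee at-least-once: a crash between handling and marking causes a replay, which the idempotent handler must tolerate.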
Post Story
POST /api/v1/stories
{ "media_url": "s3://...", "audience": "everyone" | "close_friends",
"stickers": [...], "mentions": ["@alice"] }
→ 201 Created { "story_id": "s-uuid", "expires_at": "2026-03-15T10:00:00Z" }
Get Story Tray
GET /api/v1/stories/tray
→ 200 OK
{ "tray": [
{ "user_id": "u-alice", "username": "alice",
"has_unviewed": true, "story_count": 3, "latest_at": "..." }
]}
View Story
GET /api/v1/stories/{story_id}
→ 200 OK { "story_id": "...", "media_url": "https://cdn.../...",
"posted_at": "...", "view_count": 342 }
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
Cassandra: Stories + Viewers
CREATE TABLE stories (
author_id UUID,
story_id TIMEUUID,
media_url TEXT,
media_type TEXT,
audience TEXT,
posted_at TIMESTAMP,
PRIMARY KEY (author_id, story_id)
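) WITH CLUSTERING ORDER BY (story_id DESC)
AND default_time_to_live = 86400
AND compaction = {'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'HOURS', 'compaction_window_size': 1};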
CREATE TABLE story_viewers (
story_id TIMEUUID,
viewer_id UUID,
viewed_at TIMESTAMP,
PRIMARY KEY (story_id, viewer_id)
) WITH default_time_to_live = 86400;
Redis
active_stories → SET of user_ids with active stories
tray:{user_id} → List of author_ids (TTL: 5 min)
story_viewed:{viewer}:{author} → SET of viewed story_ids (TTL: 24h)
story_views:{story_id} → Integer (TTL: 24h)
| Concern | Solution |
|---|---|
| Story not deleted after 24h | Triple TTL: Cassandra + S3 + Redis |
| View tracking loss | Kafka buffer before Cassandra; at-least-once |
| Media processing failure | Retry 3×; mark as processing_failed |
| Story tray staleness | Cache TTL = 5 min; pull-to-refresh |
| Expires while viewing | Soft expiry: 5 min grace period |
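One way to read the soft-expiry row is as two separate checks: a story cannot be opened after 24 hours, but a viewer who is already watching may finish within the grace window. The sketch below encodes that interpretation. The function names are hypothetical, and it assumes the metadata or a client-side copy remains readable during the grace period, which a strict 86400-second store TTL alone would not guarantee.

```python
from datetime import datetime, timedelta, timezone

STORY_TTL = timedelta(hours=24)
GRACE_PERIOD = timedelta(minutes=5)  # soft-expiry window from the table above


def can_start_viewing(posted_at: datetime, now: datetime) -> bool:
    """A new viewing session may begin only before the hard 24h mark."""
    return now < posted_at + STORY_TTL


def can_continue_viewing(posted_at: datetime, now: datetime) -> bool:
    """A session already in progress may run into the grace period."""
    return now < posted_at + STORY_TTL + GRACE_PERIOD


posted = datetime(2026, 3, 14, 10, 0, tzinfo=timezone.utc)
just_after_expiry = posted + timedelta(hours=24, minutes=2)
print(can_start_viewing(posted, just_after_expiry))
print(can_continue_viewing(posted, just_after_expiry))
```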
Fan-Out for Story Tray
Pull model: on tray request, check each followed user (1000 Redis SISMEMBER checks → ~5ms with pipeline). Push for regular users, pull for celebrities (> 10K followers).
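The hybrid can be sketched in a few lines: on post, small accounts fan out to follower inboxes, while large accounts are only registered as active and get merged in at tray-load time. The dict and set below are in-memory stand-ins for Redis structures, the function names are hypothetical, and accounts with exactly 10K followers are treated as regular users (the source only says "> 10K").

```python
CELEBRITY_THRESHOLD = 10_000  # followers; above this, skip fan-out on write


def fan_out_on_post(author_id, follower_ids, inboxes, active_celebrities):
    """Push to follower inboxes for regular users; register celebrities for pull.

    Returns the number of inbox writes performed.
    """
    if len(follower_ids) > CELEBRITY_THRESHOLD:
        active_celebrities.add(author_id)  # pull path: zero writes per follower
        return 0
    for follower_id in follower_ids:
        inboxes.setdefault(follower_id, set()).add(author_id)
    return len(follower_ids)


def load_tray(user_id, followed_ids, inboxes, active_celebrities):
    """Merge pushed authors with followed celebrities that have active stories."""
    pushed = inboxes.get(user_id, set())
    # With Redis this membership test would be a pipelined batch of SISMEMBER calls.
    pulled = {a for a in followed_ids if a in active_celebrities}
    return sorted(pushed | pulled)


inboxes, active_celebrities = {}, set()
fan_out_on_post("u-small", ["u-1", "u-2"], inboxes, active_celebrities)
fan_out_on_post("u-celeb", [f"f-{i}" for i in range(10_001)], inboxes, active_celebrities)
print(load_tray("u-1", ["u-small", "u-celeb"], inboxes, active_celebrities))
```

The trade is explicit in the return value of `fan_out_on_post`: a celebrity post costs no per-follower writes, at the price of a membership check on every tray load.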
Close Friends Privacy
Redis SET of friend IDs. Permission check at story FETCH time. If removed while viewing, next tap skips remaining close-friends stories from that author.
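A fetch-time permission check might look like the sketch below. It covers only the two audience values that appear in the API ("everyone" and "close_friends"); follower-only accounts, blocks, and other privacy rules are deliberately left out, and the `close_friends` dict is an in-memory stand-in for the per-author Redis SET.

```python
def can_view(story_audience, author_id, viewer_id, close_friends):
    """Return True if viewer_id may fetch a story with the given audience.

    close_friends maps author_id -> set of viewer_ids on that author's list.
    """
    if viewer_id == author_id:
        return True  # authors can always see their own stories
    if story_audience == "everyone":
        return True
    if story_audience == "close_friends":
        return viewer_id in close_friends.get(author_id, set())
    return False  # unknown audience values are denied by default


close_friends = {"u-alice": {"u-bob"}}
print(can_view("close_friends", "u-alice", "u-bob", close_friends))

# Removing a friend takes effect on the very next fetch.
close_friends["u-alice"].discard("u-bob")
print(can_view("close_friends", "u-alice", "u-bob", close_friends))
```

Because the check runs on every fetch rather than when the tray is built, a removal mid-session is honored on the next tap, which matches the behavior described above.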
Interview Walkthrough
- Emphasize the 24-hour TTL constraint — stories are ephemeral by design; Cassandra native TTL with TWCS compaction avoids tombstone bloat.
- Separate media (CDN/S3 for photos and video) from metadata (Cassandra for story records, view counts, and expiry timestamps).
- Build the story tray with a hybrid model: push new-story signals for normal users (<10K followers), pull on tray load for celebrities.
- Track views with deduplicated counters — one view per (viewer, story) pair stored in Cassandra with a Redis cache for real-time counts.
- Enforce Close Friends privacy at fetch time by checking a Redis SET — never rely on client-side filtering alone.
- Serve story media through CDN with pre-signed URLs; metadata queries hit Cassandra partitioned by author_id.
- Quantify storage: 500M stories/day × 2 MB ≈ 1 PB/day raw media, so aggressive CDN caching and TTL expiry are mandatory.
- Common pitfall: fan-out on write to push every new story into 100M follower inboxes — celebrity posts create 100M writes per upload.
Cassandra TTL vs Application-Level Deletion
Cassandra TTL ⭐: Zero application code. The tombstone issue is mitigated by the Time-Window Compaction Strategy (TWCS): each SSTable covers a 1-hour window, and once every row in a window has expired the whole SSTable is dropped. This suits TTL-heavy workloads well.
Pull vs Push Model for Tray
Pull: Simple, no fan-out cost.
Push: Fast tray loads but expensive for celebrities.
Hybrid: Push for < 10K followers, pull for celebrities. This is reportedly close to Instagram's actual approach.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core ephemeral stories flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
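Availability targets are easier to reason about once converted into an allowed-downtime budget. The helper below is a minimal sketch; the function name and the 30-day window are assumptions, not part of the SLO table.

```python
def downtime_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given availability target."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


for target in (99.9, 99.95, 99.99):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} min per 30 days")
```

Over 30 days, 99.95% leaves roughly 21.6 minutes of budget, while 99.99% leaves only about 4.3 minutes, which is why the tighter target usually requires automated failover rather than manual response.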
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.