Design Ephemeral Stories (Instagram Stories)

Interview Prompt

Design Ephemeral Stories (Instagram Stories).

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: 24-hour TTL, Viewer list tracking, Story ring ordering?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

24-hour TTL
Viewer list tracking
Story ring ordering
CDN pre-warming
Soft delete
Capacity estimation with shown math

Out of scope (state explicitly)

Full ML ranking model training pipeline
Direct messaging / chat (#07)
Ad insertion and monetization

Assumptions

Clarify scale (DAU, QPS, data volume) for ephemeral stories in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Functional Requirements

Post stories: Upload photo/video stories that auto-delete after 24 hours
View stories: See stories from followed users in a tray/carousel
Story ring: Colored ring indicator for unviewed stories
View tracking: Track who viewed your story
Story reactions: Reply to a story or react with emoji
Close friends: Post visible only to a selected list
Highlights: Pin stories to profile (opt out of deletion)
Story ordering: Closest friends first, then by recency
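The ordering rule in the last item (close friends first, then recency) can be sketched as a two-part sort key. This is a minimal illustration, not a ranking implementation: the TrayEntry type and its is_close_friend flag are hypothetical names introduced here, and a real tray would likely use richer affinity signals.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TrayEntry:
    author_id: str
    is_close_friend: bool  # hypothetical flag: author is in the viewer's close circle
    latest_at: datetime    # timestamp of the author's newest active story


def order_tray(entries):
    """Close friends first, then most recent story first."""
    return sorted(
        entries,
        # False sorts before True, so "not is_close_friend" puts close friends first.
        # Negating the timestamp makes newer stories sort earlier.
        key=lambda e: (not e.is_close_friend, -e.latest_at.timestamp()),
    )


def at(hour):
    return datetime(2026, 3, 14, hour, 0, tzinfo=timezone.utc)


tray = order_tray([
    TrayEntry("u-alice", True, at(9)),
    TrayEntry("u-bob", False, at(11)),
    TrayEntry("u-carol", True, at(10)),
])
ordered_ids = [e.author_id for e in tray]
```

Here u-carol and u-alice come ahead of u-bob even though u-bob posted most recently, because the close-friend flag is compared before recency.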

Metric	Calculation	Value
DAU	Given	500M
Stories posted / day	Given	500M
Story views / day	Given	10B
Avg story media size	Given	2 MB
Upload storage / day	500M × 2 MB	1 PB (deleted after 24h)
Peak views / sec	10B ÷ 86,400 s ≈ 116K average, × ~1.7 peak factor	~200K
Storage steady state	One day of uploads retained (24h window)	~1 PB
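The arithmetic behind the table can be checked in a few lines. This sketch uses decimal units (1 MB = 10^6 bytes, 1 PB = 10^15 bytes); the peak factor is not given in the table, so it is derived here from the stated 200K figure rather than assumed independently.

```python
MB = 10**6
PB = 10**15
SECONDS_PER_DAY = 86_400

stories_per_day = 500_000_000
views_per_day = 10_000_000_000
avg_media_bytes = 2 * MB

# Upload volume: 500M stories x 2 MB each.
upload_pb_per_day = stories_per_day * avg_media_bytes / PB

# Average rates, spread evenly over a day.
avg_posts_per_sec = stories_per_day / SECONDS_PER_DAY
avg_views_per_sec = views_per_day / SECONDS_PER_DAY

# The table's 200K peak implies this multiplier over the average.
peak_views_per_sec = 200_000
implied_peak_factor = peak_views_per_sec / avg_views_per_sec

print(f"upload/day:      {upload_pb_per_day:.1f} PB")
print(f"avg posts/sec:   {avg_posts_per_sec:,.0f}")
print(f"avg views/sec:   {avg_views_per_sec:,.0f}")
print(f"implied peak x:  {implied_peak_factor:.2f}")
```

The read-to-write ratio is 10B ÷ 500M = 20 views per story, which is why the read path (CDN, tray cache) dominates the design.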

Post Story

HTTP

POST /api/v1/stories
{ "media_url": "s3://...", "audience": "everyone" | "close_friends",
  "stickers": [...], "mentions": ["@alice"] }
→ 201 Created { "story_id": "s-uuid", "expires_at": "2026-03-15T10:00:00Z" }
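The expires_at field in the response is the post time plus the 24-hour TTL. A minimal sketch of that computation follows; the function name and the assumption that the server passes a timezone-aware posted_at are illustrative, not part of the API above.

```python
from datetime import datetime, timedelta, timezone

STORY_TTL = timedelta(hours=24)


def compute_expires_at(posted_at: datetime) -> str:
    """Return the expiry as an ISO 8601 UTC string, e.g. 2026-03-15T10:00:00Z.

    posted_at is assumed to be timezone-aware; a naive datetime would be
    interpreted as local time by astimezone().
    """
    expiry = posted_at.astimezone(timezone.utc) + STORY_TTL
    return expiry.strftime("%Y-%m-%dT%H:%M:%SZ")


posted = datetime(2026, 3, 14, 10, 0, 0, tzinfo=timezone.utc)
example = compute_expires_at(posted)
```

Returning the expiry to the client lets the app hide an expired story locally without another round trip.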

Get Story Tray

HTTP

GET /api/v1/stories/tray
→ 200 OK
{ "tray": [
    { "user_id": "u-alice", "username": "alice",
      "has_unviewed": true, "story_count": 3, "latest_at": "..." }
]}

View Story

HTTP

GET /api/v1/stories/{story_id}
→ 200 OK { "story_id": "...", "media_url": "https://cdn.../...",
           "posted_at": "...", "view_count": 342 }

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
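A client-side retry policy matching the list above might look like the sketch below. It assumes Retry-After arrives in its delta-seconds form (the header can also carry an HTTP date, which is not handled here), and the base delay, cap, and full-jitter strategy are illustrative choices rather than values from this design.

```python
import random

# Statuses the list above marks as retryable.
RETRYABLE = {429, 500, 503}


def retry_delay(status, attempt, retry_after=None, base=0.5, cap=30.0, rng=random.random):
    """Return seconds to wait before retrying, or None if the request should not be retried.

    attempt is zero-based. rng is injectable so the jitter can be made deterministic in tests.
    """
    if status not in RETRYABLE:
        return None
    if status == 429 and retry_after is not None:
        # Honor the server's hint when it is given (assumed to be seconds).
        return float(retry_after)
    # Exponential backoff with full jitter, capped at `cap` seconds.
    return rng() * min(cap, base * 2 ** attempt)
```

Retried writes should carry the same idempotency key on every attempt so that a request that succeeded but timed out is not applied twice.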

Cassandra: Stories + Viewers

CQL

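-- One partition per author; newest story first within the partition.
CREATE TABLE stories (
    author_id   UUID,
    story_id    TIMEUUID,
    media_url   TEXT,
    media_type  TEXT,
    audience    TEXT,       -- 'everyone' or 'close_friends'
    posted_at   TIMESTAMP,
    PRIMARY KEY (author_id, story_id)
) WITH CLUSTERING ORDER BY (story_id DESC)
  AND default_time_to_live = 86400;  -- 24 hours, in seconds

-- One row per (story, viewer); a repeated view upserts the same row.
CREATE TABLE story_viewers (
    story_id    TIMEUUID,   -- same type as stories.story_id
    viewer_id   UUID,
    viewed_at   TIMESTAMP,
    PRIMARY KEY (story_id, viewer_id)
) WITH default_time_to_live = 86400;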

Redis

active_stories              → SET of user_ids with active stories
tray:{user_id}              → List of author_ids (TTL: 5 min)
story_viewed:{viewer}:{author} → SET of viewed story_ids (TTL: 24h)
story_views:{story_id}      → Integer (TTL: 24h)
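The key layout above supports two operations: recording a view and deciding whether an author's ring should show as unviewed. The sketch below uses a small in-memory stand-in that mimics the return values of Redis SADD, INCR, and SMEMBERS so it runs without a server; the helper names and the choice to increment the counter only on a viewer's first view are assumptions made for illustration.

```python
from collections import defaultdict


class InMemoryStore:
    """Tiny stand-in for Redis sets and counters; not a real client."""

    def __init__(self):
        self.sets = defaultdict(set)
        self.counters = defaultdict(int)

    def sadd(self, key, member) -> int:
        # Like SADD: returns 1 if the member was newly added, 0 otherwise.
        if member in self.sets[key]:
            return 0
        self.sets[key].add(member)
        return 1

    def incr(self, key) -> int:
        self.counters[key] += 1
        return self.counters[key]

    def smembers(self, key):
        return set(self.sets[key])


def viewed_key(viewer_id, author_id):
    return f"story_viewed:{viewer_id}:{author_id}"


def views_key(story_id):
    return f"story_views:{story_id}"


def record_view(store, viewer_id, author_id, story_id) -> bool:
    """Mark a story as viewed; bump the counter only on the first view (assumption)."""
    first_view = store.sadd(viewed_key(viewer_id, author_id), story_id) == 1
    if first_view:
        store.incr(views_key(story_id))
    return first_view


def has_unviewed(active_story_ids, viewed_story_ids) -> bool:
    """The ring is colored if any active story is missing from the viewed set."""
    return bool(set(active_story_ids) - set(viewed_story_ids))
```

Against a real Redis deployment the same calls would go through a client library, with a 24-hour expiry set on each key as listed above.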

Concern	Solution
Story not deleted after 24h	Triple TTL: Cassandra + S3 + Redis
View tracking loss	Kafka buffer before Cassandra; at-least-once
Media processing failure	Retry 3×; mark as processing_failed
Story tray staleness	Cache TTL = 5 min; pull-to-refresh
Expires while viewing	Soft expiry: 5 min grace period
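The soft-expiry row can be read as two separate checks: a hard 24-hour cutoff for opening a story, and a 5-minute grace window for a viewer who already has it open. The sketch below assumes that split; the function names are hypothetical, and under this reading the stored media would need to outlive the 24-hour mark by at least the grace period.

```python
from datetime import datetime, timedelta, timezone

STORY_TTL = timedelta(hours=24)
GRACE_PERIOD = timedelta(minutes=5)


def can_start_viewing(posted_at: datetime, now: datetime) -> bool:
    """New views are allowed only inside the 24-hour window."""
    return now < posted_at + STORY_TTL


def can_keep_viewing(posted_at: datetime, now: datetime) -> bool:
    """A viewer who already opened the story gets a 5-minute grace period."""
    return now < posted_at + STORY_TTL + GRACE_PERIOD


posted = datetime(2026, 3, 14, 10, 0, tzinfo=timezone.utc)
just_after_expiry = posted + STORY_TTL + timedelta(minutes=2)
after_grace = posted + STORY_TTL + timedelta(minutes=6)
```

Two minutes past expiry, the story can no longer be opened but an in-progress view is not cut off; six minutes past, both checks fail.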

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
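An availability target translates directly into an allowed-downtime budget. The sketch below works this out for the 99.95% target; the 30-day window is an assumption chosen for illustration, since the table does not state a measurement window.

```python
def error_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of full outage allowed within the window at the given availability target."""
    window_minutes = window_days * 24 * 60
    return (1 - availability_pct / 100) * window_minutes


budget_9995 = error_budget_minutes(99.95)  # about 21.6 minutes per 30 days
budget_999 = error_budget_minutes(99.9)    # about 43.2 minutes per 30 days
budget_9999 = error_budget_minutes(99.99)  # about 4.3 minutes per 30 days
```

At 99.95%, a single bad deploy that takes five minutes to roll back consumes roughly a quarter of the monthly budget.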

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
