This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 75 | Staff level: multi-region, cost at scale, migration path, and production metrics. |
Interview Prompt
Design TikTok (Short Video Platform).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Interest graph vs social graph, Engagement-loop optimization, Cold-start exploration? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Interest graph vs social graph
- Engagement-loop optimization
- Cold-start exploration
- Short-video pipeline
- Content safety at upload
- Capacity estimation with shown math
Out of scope (state explicitly)
- Full ads auction and monetization stack
- Content moderation at scale (#81)
- Direct messaging (#07)
Assumptions
- Clarify scale (DAU, QPS, data volume) for TikTok in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
Functional Requirements
- Upload short videos: 15s to 10min videos with music, effects, filters
- For You Page (FYP): AI-driven personalized infinite-scroll video feed
- Social features: Like, comment, share, follow, duet, stitch
- Video creation tools: In-app recording, editing, effects, music overlay
- Search & Discover: Search by hashtags, sounds, users, trending content
- Notifications: Likes, comments, new followers, trending
- Creator analytics: Views, likes, shares, audience demographics
- Live streaming: Real-time broadcast with gifts/donations
- Monetization: Creator fund, brand partnerships, in-app purchases
Non-Functional Requirements
- Low Latency Feed: FYP loads in < 500 ms with pre-fetched videos
- High Throughput: Serve 1B+ video plays/day
- Global CDN: Videos cached at edge worldwide, < 100 ms first byte
- Recommendation Quality: FYP must be highly engaging (core product differentiator)
- Upload Processing: Video available within 5 minutes of upload
- Availability: 99.99%
- Scalability: 1B+ MAU, 500M+ DAU
Capacity Estimation
| Metric | Calculation | Value |
|---|---|---|
| DAU | Given | 500M |
| Videos watched / user / day | Avg session 30–60 min | 150 |
| Total video plays / day | 500M × 150 | 75B |
| Video uploads / day | Given | 10M |
| Avg video size (encoded) | Multi-bitrate ladder | 15 MB |
| Upload storage / day | 10M × 15 MB | 150 TB |
| CDN bandwidth / day | 75B plays × 3 MB avg | 225 PB |
| Video plays / sec (average) | 75B / 86,400 s | ~870K |
| Feed requests / sec (10 videos per request) | 870K / 10 | ~87K |
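The arithmetic in the table can be checked with a few lines of Python. This is a minimal sketch that only restates the table's givens; the variable names are illustrative, and decimal units (1 TB = 10^6 MB, 1 PB = 10^9 MB) are assumed.

```python
# Back-of-the-envelope capacity math using the givens from the table above.
DAU = 500_000_000
VIDEOS_PER_USER_PER_DAY = 150
UPLOADS_PER_DAY = 10_000_000
AVG_ENCODED_MB = 15      # one upload, all renditions of the bitrate ladder
AVG_MB_PER_PLAY = 3      # bytes actually streamed per play
SECONDS_PER_DAY = 86_400

plays_per_day = DAU * VIDEOS_PER_USER_PER_DAY                      # 75B
upload_tb_per_day = UPLOADS_PER_DAY * AVG_ENCODED_MB / 1_000_000   # MB -> TB
cdn_pb_per_day = plays_per_day * AVG_MB_PER_PLAY / 1_000_000_000   # MB -> PB
plays_per_sec = plays_per_day / SECONDS_PER_DAY                    # average, not peak

print(f"plays/day      : {plays_per_day:,}")
print(f"upload storage : {upload_tb_per_day:,.0f} TB/day")
print(f"CDN egress     : {cdn_pb_per_day:,.0f} PB/day")
print(f"plays/sec      : {plays_per_sec:,.0f}")
```

These are daily averages; peak-hour traffic would need a separate multiplier that the table does not state.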
FYP Recommendation Pipeline (Two-Tower → Rank → Re-rank)
Two-Tower (TikTok's actual approach) ⭐:
Tower 1: User encoder → user embedding vector (128-dim)
Tower 2: Video encoder → video embedding vector (128-dim)
Score = dot_product(user_embedding, video_embedding)
✓ Video embeddings can be pre-computed (offline)
✓ User embedding computed once per session → fast inference
✓ Approximate Nearest Neighbor (ANN) search for candidate generation
✗ Can't capture cross-features (user×video interaction features)
Single-Tower (for ranking):
Input: concat(user_features, video_features, context_features)
→ Deep neural network → P(engagement)
✓ Captures complex interactions
✗ Must run inference for EACH (user, video) pair → expensive
✗ Can't pre-compute → only run on the ~10K recalled candidates (not all 10M)
TikTok's pipeline:
Two-Tower (recall): 10M videos → 10K candidates (fast, pre-computed)
Single-Tower (rank): 10K candidates → 500 ranked (precise, expensive)
Business rules (re-rank): 500 → 200 final (diversity, safety)
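The recall-then-rank funnel can be illustrated with a toy example. This is a sketch under stated assumptions: `recall_top_k` does an exact brute-force dot-product search as a stand-in for a real ANN index, and `ranker` is a hypothetical callable standing in for the expensive single-tower model. None of the names come from an actual TikTok system.

```python
import heapq

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def recall_top_k(user_emb, video_embs, k):
    """Recall stage: top-k video ids by dot product.
    Exact brute force here; a production system would use an ANN index."""
    return heapq.nlargest(k, video_embs, key=lambda vid: dot(user_emb, video_embs[vid]))

def recommend(user_emb, video_embs, ranker, recall_k, final_k):
    """Recall cheaply over the catalog, then run the expensive ranker
    only on the recalled candidates."""
    candidates = recall_top_k(user_emb, video_embs, recall_k)
    return sorted(candidates, key=ranker, reverse=True)[:final_k]

# Toy 2-dim embeddings (the text uses 128-dim).
video_embs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7], "d": [-1.0, 0.0]}
user_emb = [1.0, 0.2]

# Hypothetical ranker scores standing in for P(engagement).
ranker_scores = {"a": 0.3, "b": 0.5, "c": 0.9, "d": 0.99}

print(recall_top_k(user_emb, video_embs, 2))
print(recommend(user_emb, video_embs, ranker_scores.get, recall_k=3, final_k=2))
```

Video `d` has the highest ranker score but never reaches the ranker because recall filters it out first. That is the trade-off the two-tower stage accepts in exchange for speed.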
Pre-Computed vs On-Demand Feed
Pre-computed ⭐ (TikTok's approach):
Background job computes feed for each active user every 30 minutes
Stores top 200 video_ids in Redis: feed:{user_id}
✓ Feed loads in < 50 ms (just Redis read)
✓ Recommendation model has time to run complex inference
✗ Stale: doesn't reflect last 30 minutes of activity
✗ Compute cost: 500M DAU × 48 runs/day (every 30 min) = up to 24B feed generations/day
Optimization: Only recompute for users who were active in last 30 min
If user inactive for 1 hour → use last cached feed → still good enough
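The refresh policy above can be sketched as a small selection function. This is only an illustration: `users_to_recompute` and the `last_active` mapping are hypothetical names, and a real job would read activity from a store rather than an in-memory dict.

```python
from datetime import datetime, timedelta

REFRESH_INTERVAL = timedelta(minutes=30)

def users_to_recompute(last_active, now):
    """Return ids of users active within the last refresh interval.
    Everyone else keeps serving their previously cached feed."""
    cutoff = now - REFRESH_INTERVAL
    return sorted(uid for uid, ts in last_active.items() if ts >= cutoff)

def worst_case_generations_per_day(dau):
    """Upper bound if every daily user were refreshed on every run."""
    runs_per_day = timedelta(days=1) // REFRESH_INTERVAL  # 48 runs
    return dau * runs_per_day

now = datetime(2026, 3, 14, 12, 0)
last_active = {
    "u1": now - timedelta(minutes=5),    # active: recompute
    "u2": now - timedelta(minutes=45),   # idle: reuse cached feed
    "u3": now - timedelta(minutes=30),   # exactly at the cutoff: recompute
}
print(users_to_recompute(last_active, now))
print(f"{worst_case_generations_per_day(500_000_000):,}")
```

A refresh every 30 minutes is 48 runs per day, so the unfiltered worst case for 500M users is 24B generations per day. Filtering by recent activity is what brings the real number down.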
On-demand:
Each feed request → call recommendation service → wait for inference
✓ Always fresh
✗ Latency: 200-500 ms for model inference → bad UX for first video
✗ Compute spike at peak hours
Best: Pre-computed feed + on-demand refresh when user exhausts cached feed
Client pre-fetches next batch before current batch runs out.
Engagement Signal Hierarchy
TikTok's "secret sauce" is the engagement signal hierarchy:
Signal 1: Watch completion rate (MOST important)
Did the user watch the full video? 2nd loop? 3rd loop?
Completion_rate = watch_time / video_duration
If completion_rate > 1.0 → user LOVED it (rewatched)
This signal is 10× more predictive than likes
Signal 2: Share (very strong positive)
Sharing = "I want my friends to see this"
Strongest explicit signal of quality
Signal 3: Comment (strong positive)
Even negative comments = engagement
Videos with high comment rate get boosted
Signal 4: Like (moderate positive)
Easy action, high frequency → somewhat noisy signal
Signal 5: Follow after watching (strong positive)
"This creator's content is consistently good"
Signal 6: Skip / swipe away (negative)
Watch < 3 seconds then swipe → strong negative signal
Signal 7: "Not interested" (explicit negative)
User explicitly marks → downweight similar content heavily
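To make the hierarchy concrete, the sketch below computes completion rate as defined above and folds the signals into one score. The weights are hypothetical, picked only to mirror the ordering of the signals; they are not TikTok's values, and a production system would learn them rather than hard-code them.

```python
from dataclasses import dataclass

# Hypothetical weights: only their relative order reflects the list above.
WEIGHTS = {
    "completion": 10.0,
    "share": 4.0,
    "follow": 3.0,
    "comment": 3.0,
    "like": 1.0,
    "quick_skip": -5.0,
    "not_interested": -10.0,
}
QUICK_SKIP_SEC = 3.0  # "watch < 3 seconds then swipe"

@dataclass
class WatchEvent:
    watch_time_sec: float
    video_duration_sec: float
    liked: bool = False
    shared: bool = False
    commented: bool = False
    followed: bool = False
    not_interested: bool = False

def completion_rate(event):
    """watch_time / video_duration; values above 1.0 mean the video was rewatched."""
    if event.video_duration_sec <= 0:
        raise ValueError("video_duration_sec must be positive")
    return event.watch_time_sec / event.video_duration_sec

def engagement_score(event):
    """Weighted sum of the signals for one (user, video) impression."""
    score = WEIGHTS["completion"] * completion_rate(event)
    if event.shared:
        score += WEIGHTS["share"]
    if event.followed:
        score += WEIGHTS["follow"]
    if event.commented:
        score += WEIGHTS["comment"]
    if event.liked:
        score += WEIGHTS["like"]
    if event.watch_time_sec < QUICK_SKIP_SEC:
        score += WEIGHTS["quick_skip"]
    if event.not_interested:
        score += WEIGHTS["not_interested"]
    return score

rewatch = WatchEvent(watch_time_sec=90.0, video_duration_sec=45.0, liked=True)
skipped = WatchEvent(watch_time_sec=2.0, video_duration_sec=45.0, not_interested=True)
print(completion_rate(rewatch), engagement_score(rewatch), engagement_score(skipped))
```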
Training data:
Every user session generates training examples:
(user_features, video_features) → {completed, liked, shared, skipped}
Model retrained daily with latest data → adapts to trends within 24 hours
Upload → Transcode → Moderate → Publish
Every uploaded video MUST be moderated before reaching users.
Pipeline (in transcoding workers):
Stage 1: Automated ML classifiers (< 5 seconds)
• Nudity/sexual content detection (image classifier on keyframes)
• Violence/gore detection
• Hate speech detection (text classifier on captions + OCR on video text)
• Copyright music detection (audio fingerprint → match database)
• Spam/scam detection (metadata patterns)
Result: confidence score [0, 1] per category
Stage 2: Decision routing
confidence > 0.95 → AUTO-REJECT (block immediately, notify creator)
confidence 0.7-0.95 → HUMAN REVIEW QUEUE (hold, don't publish)
confidence < 0.7 → AUTO-APPROVE (publish, but monitor)
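The Stage 2 thresholds map directly onto a small routing function. One assumption is made explicit here: the text gives a score per category but does not say how categories combine, so this sketch routes on the highest score across categories. The function and constant names are illustrative.

```python
AUTO_REJECT_ABOVE = 0.95
HUMAN_REVIEW_FROM = 0.70

def route(category_scores):
    """Route an upload from per-category classifier confidences in [0, 1].
    Assumption: the worst (highest) category score decides the outcome."""
    worst = max(category_scores.values(), default=0.0)
    if worst > AUTO_REJECT_ABOVE:
        return "auto_reject"      # block immediately, notify creator
    if worst >= HUMAN_REVIEW_FROM:
        return "human_review"     # hold, do not publish
    return "auto_approve"         # publish, but keep monitoring

print(route({"nudity": 0.97, "violence": 0.10}))
print(route({"hate_speech": 0.80, "spam": 0.20}))
print(route({"spam": 0.30}))
```

Scores of exactly 0.95 and 0.70 fall into the human review band, which matches the "0.7-0.95" range in the text.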
Stage 3: Human review (for borderline cases)
Queue: 10K-50K videos/day need human review
SLA: review within 2 hours
Reviewers: trained content moderators with escalation to policy team
Stage 4: Post-publish monitoring
Published videos continuously monitored via user reports
If video gets > 10 reports → auto-deprioritize in recommendations
If > 50 reports → remove from recommendations, flag for review
Appeals: Creator can appeal removal → second human review
False positive rate target: < 1% (wrongly removed)
False negative rate target: < 0.1% (harmful content reaching users)
Event Bus Design (Kafka)
Topic: tiktok-events
Partitions: 64 (scale consumers horizontally)
Partition key: entity_id (user_id / video_id; preserves per-entity ordering)
Retention: 7 days (compliance) or 24h (high-volume telemetry)
Replication factor: 3, min.insync.replicas: 2
Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "tiktok-processors"
- At-least-once delivery + idempotent handlers (dedup by event_id)
- DLQ topic: tiktok-events-dlq (poison messages after 3 retries)
- Lag alert: consumer lag > 60s → scale workers
Async side effects MUST NOT block the synchronous API response.
Sync path: validate → persist source of truth → publish event → return 201
Async path: consumers update caches, indexes, notifications, aggregates
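The consumer-side rules (at-least-once delivery, dedup by `event_id`, dead-letter after 3 retries) can be sketched without a broker. This is an in-memory illustration only: `Processor` is a hypothetical class, the `seen` set would be a durable store in practice, and no Kafka client API is used or implied.

```python
MAX_RETRIES = 3  # retries after the first attempt, per the DLQ rule above

class Processor:
    """Idempotent event handler with a dead-letter list (in-memory sketch)."""

    def __init__(self, handler):
        self.handler = handler
        self.seen = set()   # processed event_ids; durable storage in production
        self.dlq = []       # stand-in for the tiktok-events-dlq topic

    def consume(self, event):
        event_id = event["event_id"]
        if event_id in self.seen:
            return "duplicate"              # redelivery: skip side effects
        for _ in range(1 + MAX_RETRIES):    # first attempt plus retries
            try:
                self.handler(event)
            except Exception:
                continue                    # retry; real code would back off
            self.seen.add(event_id)
            return "processed"
        self.dlq.append(event)              # poison message
        return "dead_lettered"

like_counts = {}

def apply_like(event):
    vid = event["video_id"]
    like_counts[vid] = like_counts.get(vid, 0) + 1

processor = Processor(apply_like)
event = {"event_id": "e1", "video_id": "v1"}
print(processor.consume(event), processor.consume(event), like_counts)
```

Delivering the same event twice changes the count only once, which is what makes at-least-once delivery safe for counters.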
Get Feed (For You Page)
GET /api/v1/feed?count=10&cursor={last_video_id}
→ 200 OK
{
"videos": [
{
"video_id": "v-uuid", "creator": {"id": "u1", "name": "Alice", "avatar": "..."},
"description": "Dance challenge #fyp", "music": {"id": "m1", "title": "..."},
"stats": {"plays": 1523000, "likes": 89200, "comments": 3400, "shares": 12300},
"hls_url": "https://cdn.tiktok.com/v-uuid/playlist.m3u8",
"thumbnail_url": "https://cdn.tiktok.com/v-uuid/thumb.jpg",
"duration_sec": 45, "created_at": "2026-03-14T08:00:00Z"
}, ...
],
"cursor": "v-uuid-10"
}
Upload Video
POST /api/v1/videos/upload-url
→ 200 OK { "upload_url": "https://s3.../presigned", "video_id": "v-uuid" }
POST /api/v1/videos/{video_id}/publish
{ "description": "Dance challenge #fyp", "music_id": "m1", "privacy": "public" }
→ 202 Accepted { "status": "processing" }
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
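A client-side retry policy for these responses might look like the sketch below. The retryable set (429, 500, 503) and the base and cap values are assumptions drawn from the list above, not a published client contract. "Full jitter" is one common backoff variant, used here as an example.

```python
import random

RETRYABLE_STATUS = {429, 500, 503}  # assumption: 4xx client errors are not retried

def should_retry(status):
    return status in RETRYABLE_STATUS

def next_delay(attempt, status, retry_after=None, base=0.5, cap=30.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).
    A 429 with a Retry-After value is honored as-is; otherwise use
    exponential backoff with full jitter, capped at `cap` seconds."""
    if status == 429 and retry_after is not None:
        return float(retry_after)
    return rng.uniform(0, min(cap, base * 2 ** attempt))

rng = random.Random(0)  # seeded only to make the demo repeatable
delays = [next_delay(i, 503, rng=rng) for i in range(5)]
print([round(d, 2) for d in delays])
print(next_delay(0, 429, retry_after=7))
```

Retried writes should carry the same idempotency key on every attempt so the server can detect replays.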
MySQL (Vitess): Video Metadata
CREATE TABLE videos (
video_id VARCHAR(36) PRIMARY KEY,
creator_id BIGINT NOT NULL,
description TEXT,
music_id BIGINT,
duration_sec SMALLINT,
status ENUM('processing','published','removed') DEFAULT 'processing',
privacy ENUM('public','private','friends') DEFAULT 'public',
s3_key VARCHAR(512),
thumbnail_key VARCHAR(512),
view_count BIGINT DEFAULT 0,
like_count BIGINT DEFAULT 0,
comment_count INT DEFAULT 0,
share_count INT DEFAULT 0,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_creator (creator_id, created_at DESC),
INDEX idx_status_created (status, created_at DESC)
);
Redis: Feed & Features
feed:{user_id} → LIST of video_ids (pre-computed, 200 items)
watched:{user_id} → SET of video_ids (TTL 30 days)
user_features:{user_id} → Hash (embedding, interests, language)
video_features:{vid_id} → Hash (embedding, completion_rate, like_rate)
Cassandra: Engagement Data
CREATE TABLE video_likes (
video_id UUID,
user_id UUID,
created_at TIMESTAMP,
PRIMARY KEY (video_id, user_id)
);
CREATE TABLE video_comments (
video_id UUID,
comment_id TIMEUUID,
user_id UUID,
text TEXT,
PRIMARY KEY (video_id, comment_id)
) WITH CLUSTERING ORDER BY (comment_id DESC);

| Concern | Solution |
|---|---|
| Transcoding failure | Retry 3× with exponential backoff; dead-letter for manual review |
| Recommendation cache miss | Fallback: serve trending/popular videos in user's region |
| CDN miss | Origin pull from S3; CDN cache warm-up for predicted viral videos |
| Feed service down | Client caches last 50 videos locally; offline playback for downloaded |
| Content moderation false negative | User report → human review queue; takedown within 1 hour |
Race Conditions
1. Video Published Before Transcoding Complete
User publishes → metadata created → transcoding still running →
another user searches and finds it → tries to play → 404!
Solution: Video status = "processing" until ALL transcode variants ready.
Feed/search service only returns videos with status = "published".
Client shows "Processing..." placeholder on creator's profile.
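The status gate can be sketched as follows. The variant names are taken from the bitrate ladder used elsewhere in this sheet; the dict shape and function names are hypothetical, and moderation approval (which also gates publishing) is left out to keep the race itself visible.

```python
REQUIRED_VARIANTS = {"360p", "480p", "720p", "1080p"}

def mark_variant_ready(video, variant):
    """Record a finished transcode; publish only once every variant exists."""
    video["ready_variants"].add(variant)
    if video["status"] == "processing" and REQUIRED_VARIANTS <= video["ready_variants"]:
        video["status"] = "published"

def visible_in_feed(videos):
    """Feed and search only ever see published videos."""
    return [v["video_id"] for v in videos if v["status"] == "published"]

video = {"video_id": "v1", "status": "processing", "ready_variants": set()}
for variant in ("360p", "480p", "720p"):
    mark_variant_ready(video, variant)
print(visible_in_feed([video]))   # still hidden: 1080p is missing
mark_variant_ready(video, "1080p")
print(visible_in_feed([video]))
```

In a real database the status flip should be a single conditional update so two transcode workers finishing at once cannot both act on stale state.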
2. View Count Accuracy Under Viral Load
Same approach as Like Count:
Redis INCR for real-time display → Kafka → batch write to MySQL
View count can lag by seconds — invisible at "1.5M views" display level
Anti-fraud: Don't count repeated views from same user within 30 seconds
Redis: SET view_dedup:{video_id}:{user_id} 1 NX EX 30 (count the view only when the key was newly set)
Interview Walkthrough
- Split the problem into two pipelines: upload/transcode/CDN for video delivery, and a separate recommendation engine for the For You Page.
- Describe the upload path: client → object storage → transcoding queue (multiple bitrates) → CDN edge propagation within 5 minutes.
- Propose a two-tower model: user tower + video tower embeddings, ANN search retrieves top-500 candidates in ~30ms.
- Pre-compute candidate pools in Redis (refreshed every 30 min); re-rank per request with real-time features (last watched, session history).
- Weight engagement signals by intent: watch completion > rewatch > share > like — passive scroll-past is a negative signal.
- Serve video bytes from CDN with <100ms first byte; keep metadata and feed logic in application servers, not the CDN.
- Budget total FYP latency: feature fetch (20ms) + ANN (30ms) + GPU rank (50ms) + re-rank (10ms) ≈ 110ms.
- Common pitfall: scoring the entire video catalog on every scroll — without ANN retrieval and pre-computed candidates, the 500ms SLA is unreachable.
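The latency budget in the walkthrough can be kept honest with a small check. The stage names are illustrative labels for the four stages listed; the 500 ms figure is the feed SLA stated in this sheet.

```python
# Per-stage budgets from the walkthrough, in milliseconds.
STAGE_BUDGET_MS = {
    "feature_fetch": 20,
    "ann_retrieval": 30,
    "gpu_rank": 50,
    "re_rank": 10,
}
FEED_SLA_MS = 500

total_budget_ms = sum(STAGE_BUDGET_MS.values())
headroom_ms = FEED_SLA_MS - total_budget_ms  # left for network, queuing, retries

def stages_over_budget(measured_ms):
    """Names of stages whose measured latency exceeds their budget."""
    return [s for s, budget in STAGE_BUDGET_MS.items() if measured_ms.get(s, 0) > budget]

print(total_budget_ms, headroom_ms)
print(stages_over_budget({"feature_fetch": 18, "ann_retrieval": 42, "gpu_rank": 50, "re_rank": 9}))
```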
CDN Optimization: Serving 225 PB/Day
225 PB/day = 2.6 TB/sec average, peaks at 5+ TB/sec
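The conversion from daily volume to sustained throughput is worth showing, since it is easy to slip by a factor of 1,000. This is a minimal sketch assuming decimal units (1 PB = 10^15 bytes, 1 TB = 10^12 bytes); the 5 TB/sec peak is the figure quoted above.

```python
PB = 10**15
TB = 10**12
SECONDS_PER_DAY = 86_400

daily_bytes = 225 * PB
avg_tb_per_sec = daily_bytes / SECONDS_PER_DAY / TB   # bytes/sec -> TB/sec
avg_tbit_per_sec = avg_tb_per_sec * 8                 # network gear is rated in bits
peak_to_avg = 5 / avg_tb_per_sec                      # quoted 5 TB/sec peak vs average

print(f"average: {avg_tb_per_sec:.2f} TB/s ({avg_tbit_per_sec:.1f} Tbit/s)")
print(f"peak/average ratio: {peak_to_avg:.2f}")
```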
Optimization strategies:
1. HLS Adaptive Bitrate:
360p (0.5 Mbps), 480p (1.5 Mbps), 720p (3 Mbps), 1080p (5 Mbps)
Client starts at lowest → upgrades based on bandwidth
Saves bandwidth: 60% of views are on mobile → 480p sufficient
2. Predictive pre-warming:
When video starts going viral (velocity > threshold):
→ Push to CDN edge nodes in likely regions BEFORE requests arrive
→ Avoid origin pull thundering herd
3. Short video = small files = high cache hit rate:
30-second video at 480p (1.5 Mbps) ≈ 5.6 MB; roughly 11 MB at 720p (3 Mbps)
CDN edge server with 1 TB cache → holds roughly 200K videos at ~5 MB each
If the top 200K videos cover 80%+ of views → ~80% cache hit rate
4. Range requests for seek:
HLS segments: 2-second chunks
User scrolls to middle → only fetch that segment → low TTFB
Staff interviews expect you to articulate how the system evolves under real growth, not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core TikTok flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
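An availability target turns into a concrete error budget with simple arithmetic. This sketch assumes a 30-day window, which the table does not specify, and the function names are illustrative.

```python
def allowed_downtime_minutes(availability_pct, window_days=30):
    """Minutes of full outage the SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_pct / 100)

def error_budget_consumed(failed, total, target_error_rate):
    """Fraction of the error budget used; 1.0 means the budget is exhausted."""
    if total == 0:
        return 0.0
    return (failed / total) / target_error_rate

print(f"99.95% -> {allowed_downtime_minutes(99.95):.1f} min / 30 days")
print(f"99.99% -> {allowed_downtime_minutes(99.99):.2f} min / 30 days")
print(f"budget used: {error_budget_consumed(500, 1_000_000, 0.001):.0%}")
```

At 99.95% the budget is about 21.6 minutes per 30 days. A single slow failover can spend most of that, so the choice of target should be tied to how fast the failover path actually is.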
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.