Interview Prompt
Design Mentions & Tagging System.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: @-mention parsing & indexing, Notification trigger, Privacy checks? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- @-mention parsing & indexing
- Notification trigger
- Privacy checks
- Reverse index
- Capacity estimation with shown math
Out of scope (state explicitly)
- Full ML ranking model training pipeline
- Direct messaging / chat (#07)
- Ad insertion and monetization
Assumptions
- Clarify scale (DAU, QPS, data volume) for mentions tagging system in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Mention users: @username syntax in posts, comments, stories
- Tag in media: Tag users in photos/videos at specific positions
- Mention notifications: Notify mentioned users in real-time
- Mention feed: "Posts you're mentioned in" aggregated view
- Autocomplete: Suggest usernames as user types "@"
- Mention permissions: Control who can mention you
- Remove tags: Users can remove themselves from mentions
- Low Latency: Autocomplete in < 100 ms
- Notification Speed: Notified within 5 seconds
- Scale: 100M+ mentions/day
- Accuracy: Handle dots, underscores in usernames
- Anti-Spam: Rate limiting on mentions
| Metric | Calculation | Value |
|---|---|---|
| Posts with mentions / day | Given | 200M |
| Avg mentions per post | Given | 2 |
| Total mention events / day | 200M × 2 | 400M |
| Mention events / sec | Derived from daily volume ÷ 86400 (+ peak factor) | ~4.6K |
| Autocomplete queries / sec | Derived from daily volume ÷ 86400 (+ peak factor) | 50K |
| Username lookup latency | Given | < 50 ms |
Mention Extraction and Validation
Client extracts mentions via regex. Sends structured data with offsets. Server validates: does username exist? (batch DB lookup), permission check (everyone/followers/nobody), blocklist check. Rate limit: max 20 mentions per post, 50 per hour. Invalid mentions rendered as plain text.
Autocomplete
Tier 1: Client-side cache of following list (instant). Tier 2: Server-side with Redis sorted set using ZRANGEBYLEX usernames "[al" "[al\xff" LIMIT 0 5. Ranking: is_following × mutual_friends × popularity × recency. Redis: ~20 GB for 1B usernames.
Photo/Video Tagging
Store normalized coordinates [0, 1] for resolution-independence. JSONB column or separate media_tags table. Index on tagged user_id for "photos of me" queries.
Event Bus Design (Kafka)
Topic: mentions_tagging_system-events Partitions: 64 (scale consumers horizontally) Partition key: entity_id (user_id / order_id — preserves per-entity ordering) Retention: 7 days (compliance) or 24h (high-volume telemetry) Replication factor: 3, min.insync.replicas: 2 Producer: idempotent producer enabled (enable.idempotence=true) Consumer: consumer group "mentions_tagging_system-processors" - At-least-once delivery + idempotent handlers (dedup by event_id) - DLQ topic: mentions_tagging_system-events-dlq (poison messages after 3 retries) - Lag alert: consumer lag > 60s → scale workers Design a Mentions & Tagging System: async side effects MUST NOT block the synchronous API response. Sync path: validate → persist source of truth → publish event → return 201 Async path: consumers update caches, indexes, notifications, aggregates
Post with Mentions
POST /api/v1/posts
{
"text": "Great photo with @alice and @bob!",
"mentions": [
{"username": "alice", "offset": 17, "length": 6},
{"username": "bob", "offset": 28, "length": 4}
],
"media_tags": [
{"user_id": "u-alice", "x": 0.35, "y": 0.62}
]
}
→ 201 Created { "post_id": "p-uuid" }Autocomplete
GET /api/v1/users/autocomplete?prefix=al&limit=5
→ 200 OK
{
"suggestions": [
{"user_id": "u1", "username": "alice_smith", "name": "Alice Smith"},
{"user_id": "u2", "username": "alex_jones", "name": "Alex Jones"}
]
}Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff
MySQL: Mention Records
CREATE TABLE mentions (
mention_id BIGINT PRIMARY KEY AUTO_INCREMENT,
content_type ENUM('post', 'comment', 'story') NOT NULL,
content_id BIGINT NOT NULL,
mentioned_user BIGINT NOT NULL,
mentioned_by BIGINT NOT NULL,
mention_type ENUM('text', 'photo_tag') DEFAULT 'text',
text_offset INT,
text_length INT,
tag_x FLOAT,
tag_y FLOAT,
status ENUM('active', 'removed') DEFAULT 'active',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_mentioned_user (mentioned_user, created_at DESC)
);Redis + Kafka
Redis:
usernames → Sorted Set (lexicographic autocomplete)
mention_settings:{uid} → Hash { allow_from: "everyone"|"followers"|"nobody" }
blocked_by:{uid} → SET of blocked user_ids
mentions_feed:{uid} → List (TTL: 5 min)
Kafka topic: mention-events (partitioned by mentioned_user_id)| Concern | Solution |
|---|---|
| Mention spam | Rate limit: max 20/post, 50/hour, 200/day |
| Deleted post | Cascade delete mentions (or mark inactive) |
| User renames | Store user_id not username; render current username |
| Autocomplete staleness | Add to Redis sorted set immediately on registration |
| Notification dedup | Group by (user, post): one notification |
Notification Grouping
Buffer mentions for same (user, post) pair in 5-minute window. First mention → individual notification. > 5 → grouped: "Bob, Carol, and 45 others mentioned you." Uses Redis INCR with TTL=300 for buffering.
@everyone and @channel
For group/channel mentions, store as a single "group" mention type rather than individual records. Permission: only admins can use. Users can mute per channel.
Interview Walkthrough
- Parse @mentions at write time with a tokenizer, not render time — store (post_id, mentioned_user_id, char_offset) in a separate mentions table.
- Validate usernames against a user directory before persisting — reject or queue mentions for deleted/renamed accounts.
- Build autocomplete with Redis sorted sets for prefix lookup (<2ms); fall back to Elasticsearch for fuzzy matching on typos.
- Fire notifications asynchronously via Kafka — never block the post-creation API on fan-out to all mentioned users.
- Group notifications for the same (user, post) pair in a 5-minute buffer — collapse 50 mentions into "Bob, Carol, and 48 others mentioned you."
- Handle @everyone/@channel as a single group-mention record with admin permission checks, not N individual mention rows.
- Render mentions client-side by overlaying clickable links at stored offsets — username changes don't break historical post text.
- Common pitfall: storing only the raw @username string inline — renames break old links and dedup becomes impossible across username changes.
Inline vs Separate Table for Mentions
Separate table ⭐: Post text stored as-is, mentions table links (post_id, user_id, offset). Rendering: fetch mentions, overlay clickable links. Username changes don't break old posts. User removes tag → text stays but link removed.
Redis Sorted Set vs Elasticsearch for Autocomplete
Redis: ~2ms, exact prefix match, 20GB for 1B users. Elasticsearch: ~20ms, fuzzy matching, richer ranking. Recommended: Redis for speed + Elasticsearch for fuzzy fallback.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core mentions tagging system flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.