Design a Mentions & Tagging System

Interview Prompt

Design a mentions and tagging system.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: @-mention parsing & indexing, Notification trigger, Privacy checks?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

@-mention parsing & indexing
Notification trigger
Privacy checks
Reverse index
Capacity estimation with shown math

Out of scope (state explicitly)

Full ML ranking model training pipeline
Direct messaging / chat (#07)
Ad insertion and monetization

Assumptions

Clarify the scale (DAU, QPS, data volume) of the mentions and tagging system in the first 5 minutes
Standard reliability target of 99.9%–99.99% unless the problem implies a higher one (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Mention users: @username syntax in posts, comments, stories
Tag in media: Tag users in photos/videos at specific positions
Mention notifications: Notify mentioned users in real-time
Mention feed: "Posts you're mentioned in" aggregated view
Autocomplete: Suggest usernames as user types "@"
Mention permissions: Control who can mention you
Remove tags: Users can remove themselves from mentions

Metric	Calculation	Value
Posts with mentions / day	Given	200M
Avg mentions per post	Given	2
Total mention events / day	200M × 2	400M
Mention events / sec	400M ÷ 86,400 s (average, before any peak factor)	~4.6K
Autocomplete queries / sec	Assumed (no daily query volume is stated)	50K
Username lookup latency	Given	< 50 ms
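
The arithmetic behind the table can be reproduced in a few lines. This is a minimal sketch: the daily figures come from the table, while the 3x peak factor is an assumption chosen only to illustrate the step from average to peak load.

```python
SECONDS_PER_DAY = 86_400


def per_second(daily_events, peak_factor=1.0):
    """Convert a daily event count into a per-second rate, optionally scaled for peak."""
    return daily_events / SECONDS_PER_DAY * peak_factor


posts_with_mentions_per_day = 200_000_000  # from the table
avg_mentions_per_post = 2                  # from the table
mention_events_per_day = posts_with_mentions_per_day * avg_mentions_per_post

avg_rate = per_second(mention_events_per_day)
# Assumption: peak traffic is 3x the daily average (not stated in the source).
peak_rate = per_second(mention_events_per_day, peak_factor=3.0)

print(f"{mention_events_per_day:,} mention events/day")
print(f"~{avg_rate:,.0f} events/sec on average")
print(f"~{peak_rate:,.0f} events/sec at an assumed 3x peak")
```

The ~4.6K figure in the table is the average rate; whatever peak factor you agree on with the interviewer is what should size the write path and the Kafka topic.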


Post with Mentions

HTTP

POST /api/v1/posts
{
  "text": "Great photo with @alice and @bob!",
  "mentions": [
    {"username": "alice", "offset": 17, "length": 6},
    {"username": "bob", "offset": 28, "length": 4}
  ],
  "media_tags": [
    {"user_id": "u-alice", "x": 0.35, "y": 0.62}
  ]
}
→ 201 Created { "post_id": "p-uuid" }
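
Because the client sends its own `mentions` spans, the server would normally re-derive them from `text` rather than trust them. The following is a minimal sketch of that step. The username grammar (1 to 30 letters, digits, or underscores), the helper names `extract_mentions` and `span_matches`, and the use of code-point offsets are all assumptions, not part of the original API.

```python
import re

# Assumed username grammar: 1-30 letters, digits, or underscores.
# The lookbehind skips "@" preceded by a word character or a dot (e.g. e-mail
# addresses); the lookahead rejects handles longer than the assumed limit.
MENTION_RE = re.compile(r"(?<![\w.])@([A-Za-z0-9_]{1,30})(?!\w)")


def extract_mentions(text):
    """Return one span per @-mention; offset and length include the "@"."""
    return [
        {
            "username": m.group(1).lower(),
            "offset": m.start(),
            "length": m.end() - m.start(),
        }
        for m in MENTION_RE.finditer(text)
    ]


def span_matches(text, mention):
    """Check that a client-supplied span really covers "@username" in the text."""
    start = mention["offset"]
    end = start + mention["length"]
    return text[start:end].lower() == "@" + mention["username"].lower()


if __name__ == "__main__":
    for span in extract_mentions("Great photo with @alice and @bob!"):
        print(span)
```

Run against the request body above, the parser reproduces the two spans in the example payload (offsets 17 and 28). Resolving each username to a `user_id` and applying privacy checks would follow as separate steps.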

Autocomplete

HTTP

GET /api/v1/users/autocomplete?prefix=al&limit=5
→ 200 OK
{
  "suggestions": [
    {"user_id": "u1", "username": "alice_smith", "name": "Alice Smith"},
    {"user_id": "u2", "username": "alex_jones", "name": "Alex Jones"}
  ]
}
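
The prefix lookup behind this endpoint can be sketched with a sorted list and binary search, which mirrors how a lexicographically ordered set answers prefix queries. The class `UsernameIndex` is a hypothetical in-memory stand-in, not the production store, and it returns usernames only (no `user_id` or display name).

```python
import bisect


class UsernameIndex:
    """In-memory stand-in for a lexicographically sorted username set."""

    def __init__(self, usernames=()):
        self._names = sorted({u.lower() for u in usernames})

    def add(self, username):
        """Insert a username while keeping the list sorted and free of duplicates."""
        name = username.lower()
        i = bisect.bisect_left(self._names, name)
        if i == len(self._names) or self._names[i] != name:
            self._names.insert(i, name)

    def suggest(self, prefix, limit=5):
        """Return up to `limit` usernames starting with `prefix`, in sorted order."""
        prefix = prefix.lower()
        start = bisect.bisect_left(self._names, prefix)
        results = []
        for name in self._names[start:start + limit]:
            if not name.startswith(prefix):
                break  # sorted order: no later entry can match either
            results.append(name)
        return results


if __name__ == "__main__":
    index = UsernameIndex(["alice_smith", "alex_jones", "bob"])
    index.add("alan")
    print(index.suggest("al", limit=5))
```

Results come back in alphabetical order, not by relevance. Ranking suggestions by follow relationship or popularity would need an extra scoring step that this sketch leaves out.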

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: well-formed request that fails business validation
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Server Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
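
On the client side, the retry guidance in this list might translate into something like the sketch below. The set of retryable statuses, the `base` and `cap` values, and the function names are assumptions for illustration. The jitter strategy shown is "full jitter" (a uniform draw up to the exponential ceiling).

```python
import random
from typing import Optional

# Assumption: only these statuses are retried automatically; 4xx validation
# errors are returned to the caller unchanged.
RETRYABLE_STATUSES = {429, 500, 503}


def should_retry(status: int) -> bool:
    """Return True when the status code indicates a transient failure."""
    return status in RETRYABLE_STATUSES


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    A server-provided Retry-After value takes precedence; otherwise the delay
    is drawn uniformly from [0, min(cap, base * 2**attempt)].
    """
    if retry_after is not None:
        return retry_after
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)


if __name__ == "__main__":
    for attempt in range(4):
        print(attempt, round(backoff_delay(attempt), 3))
```

Retries of `POST /api/v1/posts` are only safe if the same idempotency key is sent on every attempt, so the server can recognize a duplicate write.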

MySQL: Mention Records

SQL

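CREATE TABLE mentions (
    mention_id      BIGINT PRIMARY KEY AUTO_INCREMENT,
    content_type    ENUM('post', 'comment', 'story') NOT NULL,
    content_id      BIGINT NOT NULL,
    mentioned_user  BIGINT NOT NULL,   -- user_id, not username, so renames do not break mentions
    mentioned_by    BIGINT NOT NULL,   -- user_id of the author
    mention_type    ENUM('text', 'photo_tag') NOT NULL DEFAULT 'text',
    text_offset     INT,               -- set for text mentions only
    text_length     INT,
    tag_x           FLOAT,             -- set for media tags only (0.35, 0.62 in the API example)
    tag_y           FLOAT,
    status          ENUM('active', 'removed') NOT NULL DEFAULT 'active',
    created_at      TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    -- Reverse index: "posts you're mentioned in", newest first
    INDEX idx_mentioned_user (mentioned_user, created_at DESC),
    -- Finds every mention on one piece of content, e.g. when a post is deleted
    INDEX idx_content (content_type, content_id)
);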
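
The two main access patterns against this table, the mention feed and self-removal of a tag, can be sketched as follows. SQLite stands in for MySQL here purely so the example runs without a server. The reduced column set, the helper names, and the use of `mention_id` as a time-ordered pagination cursor are assumptions.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory SQLite table with a subset of the mentions columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE mentions (
            mention_id     INTEGER PRIMARY KEY AUTOINCREMENT,
            content_type   TEXT NOT NULL,
            content_id     INTEGER NOT NULL,
            mentioned_user INTEGER NOT NULL,
            mentioned_by   INTEGER NOT NULL,
            status         TEXT NOT NULL DEFAULT 'active'
        )""")
    conn.execute(
        "CREATE INDEX idx_mentioned_user ON mentions (mentioned_user, mention_id DESC)")
    return conn


def mention_feed(conn, user_id, before_id=None, limit=20):
    """Newest-first page of active mentions; `before_id` is a keyset cursor."""
    sql = ("SELECT mention_id, content_type, content_id, mentioned_by "
           "FROM mentions WHERE mentioned_user = ? AND status = 'active'")
    params = [user_id]
    if before_id is not None:
        sql += " AND mention_id < ?"
        params.append(before_id)
    sql += " ORDER BY mention_id DESC LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()


def remove_mention(conn, mention_id, user_id):
    """Soft-delete a mention; only the mentioned user may remove themselves."""
    cur = conn.execute(
        "UPDATE mentions SET status = 'removed' "
        "WHERE mention_id = ? AND mentioned_user = ? AND status = 'active'",
        (mention_id, user_id))
    conn.commit()
    return cur.rowcount == 1
```

Keyset pagination (`mention_id < cursor`) avoids the growing cost of `OFFSET` on deep pages. The soft delete keeps the row so the removal can be audited or reversed.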

Redis + Kafka

Redis:
  usernames              → Sorted Set, all scores 0 (lexicographic prefix scan via ZRANGEBYLEX)
  mention_settings:{uid} → Hash { allow_from: "everyone"|"followers"|"nobody" }
  blocked_by:{uid}       → Set of user_ids that {uid} has blocked
  mentions_feed:{uid}    → List of recent mentions, cached (TTL: 5 min)

Kafka topic: mention-events (partitioned by mentioned_user_id so one user's events stay ordered)
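
The privacy check that reads these keys can be sketched with plain dictionaries standing in for Redis. Several points are assumptions: that `blocked_by:{uid}` holds the users `uid` has blocked, that `"followers"` means the author must follow the target, that a missing setting defaults to `"everyone"`, and that unknown values fail closed.

```python
def can_mention(author_id, target_id, settings, blocked_by, followers):
    """Decide whether `author_id` may mention `target_id`.

    settings:   {uid: {"allow_from": "everyone" | "followers" | "nobody"}}
    blocked_by: {uid: set of user_ids that uid has blocked}
    followers:  {uid: set of user_ids that follow uid}
    """
    # A block always wins, regardless of the target's mention setting.
    if author_id in blocked_by.get(target_id, set()):
        return False

    allow_from = settings.get(target_id, {}).get("allow_from", "everyone")
    if allow_from == "everyone":
        return True
    if allow_from == "followers":
        return author_id in followers.get(target_id, set())
    # "nobody" or any unrecognized value: fail closed.
    return False


if __name__ == "__main__":
    settings = {"u-alice": {"allow_from": "followers"}}
    blocked_by = {"u-alice": {"u-mallory"}}
    followers = {"u-alice": {"u-bob"}}
    print(can_mention("u-bob", "u-alice", settings, blocked_by, followers))
    print(can_mention("u-carol", "u-alice", settings, blocked_by, followers))
```

A mention that fails this check would typically still render as plain text in the post, but it would produce no mention row and no notification.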

Concern	Solution
Mention spam	Rate limit per author: max 20 mentions per post, 50 per hour, 200 per day
Deleted post	Cascade-delete its mentions, or soft-delete by setting status = 'removed'
User renames	Store user_id, not username; render the current username at read time
Autocomplete staleness	Add the username to the Redis sorted set immediately on registration
Notification dedup	Group by (mentioned user, post): send one notification per group
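
The notification dedup rule in the last row can be sketched as a grouping step in the consumer of `mention-events`. The event field names follow the `mentions` table; the function name and the assumption that `content_id` alone identifies the post are illustrative.

```python
def group_notifications(events):
    """Collapse mention events into one notification per (mentioned user, post).

    Each event is a dict with "mentioned_user", "content_id", and "mentioned_by".
    Output order follows the first occurrence of each group.
    """
    grouped = {}
    for event in events:
        # Assumption: content_id alone identifies the post being mentioned in.
        key = (event["mentioned_user"], event["content_id"])
        if key not in grouped:
            grouped[key] = {
                "mentioned_user": event["mentioned_user"],
                "content_id": event["content_id"],
                "mentioned_by": event["mentioned_by"],
                "mention_count": 0,
            }
        grouped[key]["mention_count"] += 1
    return list(grouped.values())


if __name__ == "__main__":
    events = [
        {"mentioned_user": 7, "content_id": 100, "mentioned_by": 1},  # text mention
        {"mentioned_user": 7, "content_id": 100, "mentioned_by": 1},  # photo tag, same post
        {"mentioned_user": 8, "content_id": 100, "mentioned_by": 1},
    ]
    for notification in group_notifications(events):
        print(notification)
```

Partitioning the topic by `mentioned_user_id` means all events for one recipient reach the same consumer, so this grouping can run locally without cross-partition coordination.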

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
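
To make the availability target concrete, the error budget can be expressed in minutes, and an observed error ratio can be turned into a burn rate. This is a minimal sketch that assumes a 30-day window and that availability is measured as the ratio of successful requests; both are assumptions, not stated in the table.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full outage the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio, slo):
    """How fast the budget is being consumed; 1.0 means exactly on budget."""
    return observed_error_ratio / (1 - slo)


if __name__ == "__main__":
    slo = 0.9995  # availability target from the table
    print(f"Budget: {error_budget_minutes(slo):.1f} min per 30 days")
    print(f"Burn rate at 0.1% errors: {burn_rate(0.001, slo):.1f}x")
```

Under these assumptions a 99.95% target leaves about 21.6 minutes per 30 days, and a sustained 0.1% error ratio consumes that budget at roughly twice the sustainable rate, so the availability and 5xx targets should be reconciled when stating them.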

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
