Design Like Count for High Profile Users

Interview Prompt

Design Like Count for High Profile Users.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Eventual consistency for counters, Sharded counters, Read-your-writes for liker?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Eventual consistency for counters
Sharded counters
Read-your-writes for liker
Cache warming
Capacity estimation with shown math

Out of scope (state explicitly)

Detailed frontend/UI pixel implementation
Org structure, staffing, and hiring plan

Assumptions

Clarify scale (DAU, QPS, data volume) for the like-count system in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Like/Unlike: Users can like/unlike posts, photos, videos
Display count: Show total like count on each piece of content
Check if liked: Show whether the current user has already liked a post
Celebrity scale: Handle posts with 100M+ likes (Ronaldo, Taylor Swift, MrBeast)
Recent likers: "Liked by Alice, Bob, and 2.3M others"
Like notifications: Notify content owner of new likes (batched/grouped for celebrities)

Metric	Calculation	Value
Total likes / day	Given	5B
Likes / sec (avg)	5B ÷ 86400	~58K
Peak likes / sec (viral post)	Given (burst assumption, not derived from the daily average)	1M+
Avg post likes	Given (typical workload assumption)	50
Celebrity post likes	Given (assumption)	10M - 100M
Like record size	Given (user_id + post_id + timestamp)	32 bytes
Storage / day	5B × 32B	160 GB
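
The derived rows above can be checked with a few lines of arithmetic. This is only a sketch: the inputs are the "Given" values from the table, and the constant names are made up for illustration.

```python
# Inputs are the assumptions stated in the capacity table.
LIKES_PER_DAY = 5_000_000_000      # 5B likes/day (given)
SECONDS_PER_DAY = 86_400
LIKE_RECORD_BYTES = 32             # user_id + post_id + timestamp (given)

avg_likes_per_sec = LIKES_PER_DAY / SECONDS_PER_DAY
storage_per_day_gb = LIKES_PER_DAY * LIKE_RECORD_BYTES / 1e9  # decimal GB

print(f"avg likes/sec: {avg_likes_per_sec:,.0f}")    # avg likes/sec: 57,870
print(f"storage/day: {storage_per_day_gb:.0f} GB")   # storage/day: 160 GB
```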


The Hot Counter Problem: Why Normal DBs Fail

Celebrity post: "Ronaldo scores goal" → 1M likes in 60 seconds

Naive approach:
  UPDATE posts SET like_count = like_count + 1 WHERE post_id = ?;
  
  1M likes in 60 s ≈ 17K writes/sec (up to the 1M/sec design peak) → every write transaction hits a SINGLE ROW
  → Row-level lock contention → DB thread pool exhausted
  → All other queries on the DB slow down → cascading failure

Solution: Tiered Architecture for Normal vs Hot Posts

Tier 1 — Normal Posts (< 10K likes):
  Like → Redis SADD + INCR → synchronous Cassandra write

Tier 2 — Hot Posts (10K+ likes, detected via velocity):
  Like → Redis SADD + INCR only → Kafka event
  Cassandra batch write every 5 seconds

Hot post detection:
  Track like velocity per post in Redis:
    INCR like_velocity:{post_id}:{minute}
    If velocity > 1000/min → add to hot_posts SET → switch to Tier 2
  When velocity drops below 100/min → remove from hot_posts
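
The promote/demote rule above can be sketched as a small in-memory class. Assumptions: plain dicts and sets stand in for the Redis counters and the hot_posts SET, the class and method names are hypothetical, and demotion looks at both the current and previous minute so a post that has just gone hot is not demoted immediately.

```python
from collections import defaultdict

HOT_THRESHOLD = 1000   # likes/min to enter Tier 2 (from the text)
COOL_THRESHOLD = 100   # likes/min to leave Tier 2 (from the text)


class HotPostDetector:
    """In-memory stand-in for the Redis velocity counters and hot_posts SET."""

    def __init__(self):
        self.velocity = defaultdict(int)   # (post_id, minute) -> likes in that minute
        self.hot_posts = set()

    def record_like(self, post_id, now_seconds):
        minute = int(now_seconds // 60)
        self.velocity[(post_id, minute)] += 1          # INCR like_velocity:{post_id}:{minute}
        if self.velocity[(post_id, minute)] > HOT_THRESHOLD:
            self.hot_posts.add(post_id)                # promote to Tier 2
        return "tier2" if post_id in self.hot_posts else "tier1"

    def cool_down(self, post_id, now_seconds):
        """Demote when both the current and previous minute are below COOL_THRESHOLD."""
        minute = int(now_seconds // 60)
        recent = max(self.velocity.get((post_id, minute), 0),
                     self.velocity.get((post_id, minute - 1), 0))
        if recent < COOL_THRESHOLD:
            self.hot_posts.discard(post_id)
```

The gap between the two thresholds acts as hysteresis, so a post hovering near 1000 likes/min does not flap between tiers.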

Liked By Deduplication

Redis SET approach:
  SADD liked:{post_id} {user_id}
  Returns 1 → new like → INCR count
  Returns 0 → already liked → no-op

Bloom Filter approach ⭐ (for mega-posts):
  Space: 100M entries × 10 bits = 125 MB (vs 1.6 GB)
  False positive rate: ~1%

Hybrid ⭐⭐:
  Posts with < 100K likes → Redis SET (exact)
  Posts with 100K+ likes → Bloom Filter (approximate)
  Cassandra is always the source of truth for exact dedup
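
To make the Bloom filter numbers concrete, here is a minimal pure-Python sketch. It is not RedisBloom: the class, the double-hashing scheme, and the choice of 7 hash functions (roughly optimal for 10 bits per entry) are assumptions for illustration. `add` returns True for a new item, mirroring the "returns 1 → new like" convention above.

```python
import hashlib
import math


def bloom_false_positive_rate(bits_per_entry, num_hashes):
    """Standard approximation: (1 - e^(-k / bits_per_entry))^k."""
    return (1 - math.exp(-num_hashes / bits_per_entry)) ** num_hashes


class BloomFilter:
    def __init__(self, expected_entries, bits_per_entry=10, num_hashes=7):
        self.num_bits = expected_entries * bits_per_entry
        self.num_hashes = num_hashes
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit slices of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        """Set the item's bits. Returns True if at least one bit was newly set."""
        newly_added = False
        for pos in self._positions(item):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                newly_added = True
                self.bits[byte] |= mask
        return newly_added


likers = BloomFilter(expected_entries=100_000)
first = likers.add("user-1")    # True: treated as a new like
second = likers.add("user-1")   # False: treated as already liked
```

With 10 bits per entry and 7 hashes the approximation gives about 0.8%, in line with the ~1% quoted above. A false positive here means a genuinely new like is treated as a duplicate in Redis, which is why the exact record still has to come from Cassandra. A plain Bloom filter also cannot remove an item, so unlike on a mega-post needs a different path.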

Sharded Counters

Instead of one counter, split into N sub-counters:
  like_count:{post_id}:0 = 15234
  like_count:{post_id}:1 = 15189
  ...
  like_count:{post_id}:7 = 15122

  Write: INCR like_count:{post_id}:{random(0..7)}
  Read: SUM(like_count:{post_id}:0 through :7) = 121,234

  Each shard handles 1/8 of write traffic → 8× more throughput
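
A minimal sketch of the write and read paths described above, with a dict standing in for Redis. The class name is hypothetical; the key format and the 8 shards come from the example.

```python
import random

NUM_SHARDS = 8  # matches shards :0 through :7 in the example above


class ShardedCounter:
    """In-memory stand-in for N Redis keys like_count:{post_id}:{shard}."""

    def __init__(self, num_shards=NUM_SHARDS):
        self.num_shards = num_shards
        self.store = {}  # key -> int, plays the role of Redis

    def _key(self, post_id, shard):
        return f"like_count:{post_id}:{shard}"

    def incr(self, post_id, delta=1):
        shard = random.randrange(self.num_shards)   # spread writes across shards
        key = self._key(post_id, shard)
        self.store[key] = self.store.get(key, 0) + delta

    def total(self, post_id):
        # Read path: sum every shard; missing shards count as 0.
        return sum(self.store.get(self._key(post_id, s), 0)
                   for s in range(self.num_shards))
```

The throughput gain only materializes if the shard keys land on different Redis nodes; reads pay for it with N lookups instead of one, which is usually acceptable when the summed value is cached briefly.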

Event Bus Design (Kafka)

Topic: like_count_high_profile-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (post_id or user_id; preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "like_count_high_profile-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: like_count_high_profile-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

For like counts on high-profile posts, async side effects MUST NOT block the synchronous API response.
  Sync path: validate → record the like (Redis; Tier 1 also writes Cassandra) → publish event → return 200 OK
  Async path: consumers update caches, indexes, notifications, aggregates
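
The "at-least-once delivery + idempotent handlers" rule can be sketched without a Kafka client. Everything here is an assumption for illustration: the class name, the in-memory set standing in for a durable dedup store, and the list standing in for the DLQ topic.

```python
MAX_ATTEMPTS = 3  # from the DLQ policy above


class LikeEventConsumer:
    """At-least-once consumer made safe by deduplicating on event_id."""

    def __init__(self, handler):
        self.handler = handler          # side effect, e.g. update an aggregate
        self.seen_event_ids = set()     # stand-in for a durable dedup store
        self.dead_letters = []          # stand-in for the DLQ topic

    def process(self, event):
        if event["event_id"] in self.seen_event_ids:
            return "duplicate"          # redelivery: skip the side effect
        for _attempt in range(MAX_ATTEMPTS):
            try:
                self.handler(event)
            except Exception:
                continue                # retry; a real consumer would back off here
            self.seen_event_ids.add(event["event_id"])
            return "processed"
        self.dead_letters.append(event)  # poison message: park it and move on
        return "dead-lettered"
```

Recording the event_id only after the handler succeeds means a crash between the two steps causes a redelivery rather than a lost event, which is the trade-off at-least-once delivery implies.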

Like / Unlike

HTTP

POST /api/v1/likes
{ "post_id": "post-uuid", "action": "like" }
→ 200 OK { "liked": true, "like_count": 1523401 }

POST /api/v1/likes
{ "post_id": "post-uuid", "action": "unlike" }
→ 200 OK { "liked": false, "like_count": 1523400 }
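
A hedged sketch of a handler behind this endpoint, showing why repeated requests are safe. It is in-memory only: `LikeService` and its fields are hypothetical, and `user_id` is assumed to come from the auth token rather than the request body.

```python
class LikeService:
    """In-memory sketch: liked_by plays the Redis SET, counts the Redis counter."""

    def __init__(self):
        self.liked_by = {}   # post_id -> set of user_ids   (SADD / SREM)
        self.counts = {}     # post_id -> int               (INCR / DECR)

    def handle(self, user_id, body):
        post_id, action = body["post_id"], body["action"]
        if action not in ("like", "unlike"):
            return 400, {"error": "action must be 'like' or 'unlike'"}
        likers = self.liked_by.setdefault(post_id, set())
        if action == "like" and user_id not in likers:
            likers.add(user_id)
            self.counts[post_id] = self.counts.get(post_id, 0) + 1
        elif action == "unlike" and user_id in likers:
            likers.remove(user_id)
            self.counts[post_id] -= 1
        # A repeated like or unlike falls through: same response, no count change.
        return 200, {"liked": user_id in likers,
                     "like_count": self.counts.get(post_id, 0)}
```

Because the count only changes when set membership changes, a client retry after a timeout cannot double-count.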

Check If Liked

HTTP

GET /api/v1/likes/check?post_id=post-uuid
→ 200 OK { "liked": true }

Get Like Count (Batch)

HTTP

POST /api/v1/likes/counts
{ "post_ids": ["post-1", "post-2", "post-3"] }
→ 200 OK { "counts": {"post-1": 1523401, "post-2": 42, "post-3": 89234} }

Get Recent Likers

HTTP

GET /api/v1/likes/recent?post_id=post-uuid&limit=5
→ 200 OK { "users": [{"id": "...", "name": "Alice"}, ...], "total_count": 1523401 }
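
The response above is what a client would turn into the "Liked by Alice, Bob, and 2.3M others" line. A small sketch, assuming one-decimal K/M/B abbreviations; the function names and rounding rule are illustrative choices, not part of the API.

```python
def abbreviate(n):
    """1523 -> '1.5K', 2300000 -> '2.3M'; numbers under 1000 are shown exactly."""
    for threshold, suffix in ((1_000_000_000, "B"), (1_000_000, "M"), (1_000, "K")):
        if n >= threshold:
            return f"{n / threshold:.1f}".rstrip("0").rstrip(".") + suffix
    return str(n)


def liked_by_summary(names, total_count):
    """Build the display line from the recent likers and the total count."""
    if not names:
        return f"{abbreviate(total_count)} likes"
    others = total_count - len(names)
    if others <= 0:
        return "Liked by " + ", ".join(names)
    noun = "other" if others == 1 else "others"
    return f"Liked by {', '.join(names)}, and {abbreviate(others)} {noun}"


print(liked_by_summary(["Alice", "Bob"], 2_300_002))
# Liked by Alice, Bob, and 2.3M others
```

Since the display is rounded anyway, an eventually consistent count is usually indistinguishable from an exact one on celebrity posts.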

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
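
On the client side, the 429 and 503 guidance above might look like the following sketch. Assumptions: the function name and defaults are invented, Retry-After is taken to be a number of seconds (the header can also carry an HTTP date, which is not handled), and treating only 429, 500 and 503 as retryable is a judgment call.

```python
import random


def retry_delay_seconds(status, attempt, retry_after=None,
                        base=0.5, cap=30.0, jitter=True):
    """Seconds to wait before retry number `attempt` (0-based), or None to not retry."""
    if status == 429 and retry_after is not None:
        return float(retry_after)                 # honor the server's Retry-After
    if status in (429, 500, 503):
        delay = min(cap, base * (2 ** attempt))   # exponential backoff, capped
        # Full jitter spreads retries out so clients do not stampede together.
        return random.uniform(0, delay) if jitter else delay
    return None                                   # other 4xx: fix the request instead
```

Retries of a like request should carry the same idempotency key, so a retry that follows a successful but unacknowledged write is not counted twice.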

Redis: Real-Time (Primary Read Path)

# Like count
Key:    like_count:{post_id}
Type:   Integer

# Dedup set (normal posts)
Key:    liked:{post_id}
Type:   SET

# Bloom filter (mega-posts)
Key:    bf:liked:{post_id}
Type:   Bloom Filter (RedisBloom)

# Hot post detection
Key:    like_velocity:{post_id}:{minute}
Type:   Integer (TTL=120s)

Cassandra: Source of Truth

CQL

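-- Who liked a specific post? Newest first, for "recent likers".
CREATE TABLE likes_by_post (
    post_id     UUID,
    liked_at    TIMESTAMP,
    user_id     UUID,
    PRIMARY KEY (post_id, liked_at, user_id)
) WITH CLUSTERING ORDER BY (liked_at DESC, user_id ASC);

-- What posts did a user like? Also the exact-dedup and "check if liked" lookup:
-- keying on (user_id, post_id) makes a repeat like overwrite the same row.
-- liked_at is a regular column here, so results are not time-ordered.
CREATE TABLE likes_by_user (
    user_id     UUID,
    post_id     UUID,
    liked_at    TIMESTAMP,
    PRIMARY KEY (user_id, post_id)
);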

-- Materialized like count
CREATE TABLE like_counts (
    post_id     UUID PRIMARY KEY,
    count       COUNTER
);

Concern	Solution
Redis crash	Redis Cluster with replicas; rebuild counts from Cassandra
Double-like	Redis SET/Bloom dedup + Cassandra PRIMARY KEY dedup
Count drift	Reconciliation job every hour: read Cassandra count → fix Redis
Hot post overloads Redis	Sharded counters or in-memory batching at service layer
Kafka consumer lag	Cassandra writes may lag; Redis is always current for display
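
The hourly reconciliation job in the table can be sketched as a pure function over two count maps. Assumptions: dicts stand in for Redis and Cassandra, the function and parameter names are hypothetical, and hot posts are skipped because their Cassandra writes may lag behind Redis, so overwriting Redis would move the count backwards.

```python
def reconcile(redis_counts, cassandra_counts, hot_posts=frozenset(), tolerance=0):
    """Overwrite cached counts that drifted from the source of truth.

    Returns {post_id: (old_cached_value, corrected_value)} for every fix applied.
    """
    fixes = {}
    for post_id, true_count in cassandra_counts.items():
        if post_id in hot_posts:
            continue                      # Cassandra may be behind for Tier 2 posts
        cached = redis_counts.get(post_id)
        if cached is None or abs(cached - true_count) > tolerance:
            fixes[post_id] = (cached, true_count)
            redis_counts[post_id] = true_count
    return fixes
```

Returning the list of fixes makes drift observable: a sudden jump in the number of corrections is itself a useful alert.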

Redis SET vs Bloom Filter vs Cassandra-Only Dedup

Approach	Memory per 1M likes	Accuracy	Latency	Best For
Redis SET	16 MB	100% exact	< 1 ms	Posts with < 100K likes
Bloom Filter	1.2 MB	~99%	< 1 ms	Mega-posts (100K+)
Cassandra-only	0 (disk)	100% exact	~5 ms	When memory is constrained
Hybrid ⭐	SET for small + BF for large	99-100%	< 1 ms	Production systems at scale

Why Not PostgreSQL for Likes?

PostgreSQL:
  At 1M likes/sec: row-level lock on unique index → lock contention → ~10K/sec max
  COUNT(*) on 100M rows → sequential scan → 30+ seconds

Cassandra:
  Writes are O(1) (append-only LSM tree)
  Partition by post_id → each post's writes go to one replica set (a celebrity post is still a hot partition, hence the batched Tier 2 path)
  COUNTER type for O(1) count reads

Best: Redis for real-time count + Cassandra for persistence

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
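
To turn the availability target into an error budget, a quick calculation helps. This is a sketch assuming a 30-day window; the function name is illustrative.

```python
def allowed_downtime_minutes(availability_pct, window_days=30):
    """Error budget in minutes for a given availability target over the window."""
    return (1 - availability_pct / 100) * window_days * 24 * 60


print(round(allowed_downtime_minutes(99.95), 1))  # 21.6
```

So 99.95% leaves roughly 21.6 minutes of downtime per 30 days, about half of what 99.9% allows.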

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
