Design a Review and Rating System

Interview Prompt

Design a review and rating system.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Aggregated rating computation, Spam detection, Verified purchase?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Aggregated rating computation
Spam detection
Verified purchase
Helpful vote ranking
Capacity estimation with shown math

Out of scope (state explicitly)

Full catalog/search infrastructure (#12)
Payment checkout flow (#24)
Fraud and abuse ML pipelines

Assumptions

Clarify scale (DAU, QPS, data volume) for the review and rating system in the first 5 minutes
Standard reliability target of 99.9%–99.99% unless the problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Write reviews: Text reviews with 1-5 star rating for products/services
Photo/video reviews: Attach media to reviews
Helpful votes: Other users vote reviews as "helpful" or "not helpful"
Verified purchase badge: Mark reviews from actual buyers
Aggregate ratings: Average rating, rating distribution histogram, total count per product
Sort/filter reviews: By recency, helpfulness, rating, verified purchase
Seller responses: Sellers can respond publicly to reviews
Edit/delete own reviews: Users manage their own reviews
Anti-abuse: Detect and filter fake/spam reviews via ML + rules
Review summary: AI-generated summary of common themes across reviews

Metric	Calculation	Value
Total products	Given	500M
Total reviews	Given (≈10 reviews per product on average)	5B
New reviews / day	Given (≈2.3 writes/sec on average)	200K
Review reads / sec	Given as a peak figure (daily read volume ÷ 86,400 s × peak factor)	100K
Aggregate rating reads / sec	Given as a peak figure (daily read volume ÷ 86,400 s × peak factor)	500K
Total review storage	5B reviews × ~1 KB text each; media volume given	5 TB text + 100 TB media
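The figures in the table can be sanity-checked with a few lines of arithmetic. This is a rough sketch using only the table's values; the 1 KB average review size is not stated in the source and is simply what 5 TB ÷ 5B reviews implies, and no peak factor is applied to writes because none is given.

```python
SECONDS_PER_DAY = 86_400

# Inputs copied from the capacity table.
total_products = 500_000_000
total_reviews = 5_000_000_000
new_reviews_per_day = 200_000
review_reads_per_sec = 100_000
aggregate_reads_per_sec = 500_000
text_storage_bytes = 5 * 10**12  # 5 TB of review text

# Derived figures.
avg_write_qps = new_reviews_per_day / SECONDS_PER_DAY
avg_reviews_per_product = total_reviews / total_products
avg_text_bytes_per_review = text_storage_bytes / total_reviews
read_write_ratio = (review_reads_per_sec + aggregate_reads_per_sec) / avg_write_qps

print(f"average write QPS:        {avg_write_qps:.1f}")
print(f"reviews per product:      {avg_reviews_per_product:.0f}")
print(f"text bytes per review:    {avg_text_bytes_per_review:.0f}")
print(f"read:write ratio (approx): {read_write_ratio:,.0f}:1")
```

The takeaway is the read:write ratio of roughly five orders of magnitude, which is what justifies pre-computing aggregates and caching them rather than computing on read.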


Aggregate Rating — Pre-Computed

Event Bus Design (Kafka)

Topic: review_rating_system-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: product_id (preserves per-product ordering, so aggregate updates for one product apply in sequence)
  Retention: 7 days (long enough to replay events after a consumer bug)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "review_rating_system-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: review_rating_system-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers
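The consumer rules above (at-least-once delivery, dedup by event_id, DLQ after 3 retries) can be sketched without a broker. This is an in-memory illustration, not Kafka client code: `Consumer`, `seen_event_ids`, and `dlq` are hypothetical names, and "3 retries" is read here as one initial attempt plus three retries.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_RETRIES = 3  # from the config above: poison messages go to the DLQ after 3 retries


@dataclass
class Consumer:
    handler: Callable[[dict], None]
    seen_event_ids: set = field(default_factory=set)  # stand-in for a durable dedup store
    dlq: list = field(default_factory=list)           # stand-in for the DLQ topic

    def process(self, event: dict) -> str:
        event_id = event["event_id"]
        # At-least-once delivery means redeliveries happen; skip anything already applied.
        if event_id in self.seen_event_ids:
            return "duplicate"
        # One initial attempt plus up to MAX_RETRIES retries.
        for _attempt in range(MAX_RETRIES + 1):
            try:
                self.handler(event)
            except Exception:
                continue
            self.seen_event_ids.add(event_id)
            return "processed"
        # Every attempt failed: park the message so it does not block the partition.
        self.dlq.append(event)
        return "dead_lettered"
```

In a real deployment the dedup set would need to live in durable storage and be updated in the same transaction as the handler's side effect; otherwise a crash between the two reopens the duplicate window.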

Rule: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates
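The sync path can be illustrated as a single function that does the four steps in order and nothing else. This is a minimal sketch with in-memory stand-ins: `reviews_db`, `event_log`, and `submit_review` are hypothetical names, and the 400 and 409 status codes are assumptions, since the source only specifies the 201.

```python
import uuid

reviews_db = {}  # stand-in for the source-of-truth table, keyed by (product_id, user_id)
event_log = []   # stand-in for the Kafka topic


def submit_review(product_id: str, user_id: str, rating: int, body: str) -> tuple[int, dict]:
    # 1. Validate.
    if not isinstance(rating, int) or not 1 <= rating <= 5:
        return 400, {"error": "rating must be an integer from 1 to 5"}
    if (product_id, user_id) in reviews_db:
        return 409, {"error": "user already reviewed this product"}

    # 2. Persist the source of truth.
    review = {
        "review_id": str(uuid.uuid4()),
        "product_id": product_id,
        "user_id": user_id,
        "rating": rating,
        "body": body,
        "status": "published",
    }
    reviews_db[(product_id, user_id)] = review

    # 3. Publish an event; caches, indexes, and aggregates are updated by consumers later.
    event_log.append({
        "event_id": str(uuid.uuid4()),
        "type": "review_created",
        "product_id": product_id,
        "review_id": review["review_id"],
        "rating": rating,
    })

    # 4. Return without waiting for any consumer.
    return 201, review
```

Steps 2 and 3 are not atomic in a real system; a transactional outbox is one common way to make sure a persisted review always produces its event.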

Bayesian Average

Product A: 1 review, 5.0 avg. Product B: 10K reviews, 4.5 avg.
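Simple average ranks A higher, but one review is far too little evidence to support that.

Bayesian Average (a common fix; IMDb has published a weighted rating of this shape for its Top 250):
  bayesian_avg = (C * m + sum_of_ratings) / (C + total_reviews)

  C = 25 (prior weight: the review count at which a product's own ratings count as much as the prior)
  m = 3.7 (global average rating, used as the prior)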
  
  Product A: (25*3.7 + 5) / (25+1) = 3.75
  Product B: (25*3.7 + 45000) / (25+10000) = 4.498
  
  Product B correctly ranks higher. As review count grows,
  bayesian_avg converges to true average.
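The formula and the two worked products above translate directly into code. A minimal sketch: the function and parameter names are illustrative, while C = 25 and m = 3.7 are the values used in the text.

```python
def bayesian_average(
    sum_of_ratings: float,
    total_reviews: int,
    prior_weight: float = 25,  # C in the text
    prior_mean: float = 3.7,   # m in the text
) -> float:
    """Pull a product's mean toward the global mean until it has enough reviews."""
    return (prior_weight * prior_mean + sum_of_ratings) / (prior_weight + total_reviews)


product_a = bayesian_average(sum_of_ratings=5, total_reviews=1)            # 1 review, 5.0 avg
product_b = bayesian_average(sum_of_ratings=45_000, total_reviews=10_000)  # 10K reviews, 4.5 avg

print(round(product_a, 3), round(product_b, 3))  # 3.75 4.498
```

Storing sum_of_ratings (or the star histogram) rather than only the average is what makes this cheap to recompute on every write.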

Fake Review Detection

Signals for ML fraud detection:

Behavioral:
  - Account age < 7 days + review posted -> suspicious
  - 50 reviews in 1 day -> bot
  - All 5-star or all 1-star -> biased
  - Copy-pasted text across products

Purchase verification:
  - No purchase record -> unverified (lower weight in aggregates)
  - Purchased and returned same day -> suspicious

Text analysis (NLP):
  - Generic text: "Great product! Highly recommended!" (no specifics)
  - Sentiment mismatch: positive text + 1 star
  - Language similarity between reviews from same IP

Network analysis:
  - Multiple reviews from same IP/device for same product
  - Review rings: group of accounts all reviewing same products
  
Timing:
  - Product gets 500 5-star reviews in 1 hour (normally 5/day) -> burst

ML Model (XGBoost):
  Features: all signals above. Output: P(fake) score.
  P(fake) > 0.8: auto-suppress + queue for human review
  P(fake) 0.5-0.8: flag, reduced weight in aggregate
  P(fake) < 0.5: show normally
  
  Suppressed reviews excluded from aggregate rating.
  Nightly recomputation corrects for retroactively suppressed reviews.
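The threshold policy above is easy to get subtly wrong at the boundaries, so it helps to write it down. This sketch covers only the routing step, not the model: `route_review` and `REDUCED_WEIGHT` are hypothetical, and the value 0.5 for the reduced weight is an assumption since the source does not give one.

```python
REDUCED_WEIGHT = 0.5  # assumption: the source says "reduced weight" without a number


def route_review(p_fake: float) -> dict:
    """Map a P(fake) score to a review status and its weight in the aggregate."""
    if not 0.0 <= p_fake <= 1.0:
        raise ValueError("p_fake must be a probability in [0, 1]")
    if p_fake > 0.8:
        # Auto-suppress and queue for human review; excluded from the aggregate.
        return {"status": "suppressed", "flagged": True, "human_review": True,
                "aggregate_weight": 0.0}
    if p_fake >= 0.5:
        # Still shown, but flagged and down-weighted.
        return {"status": "published", "flagged": True, "human_review": False,
                "aggregate_weight": REDUCED_WEIGHT}
    return {"status": "published", "flagged": False, "human_review": False,
            "aggregate_weight": 1.0}
```

Keeping the thresholds in one function means the nightly recomputation and the online path apply the same policy.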

PostgreSQL: Reviews

-- PostgreSQL needs a named enum type; inline ENUM(...) is MySQL syntax.
CREATE TYPE review_status AS ENUM ('published', 'suppressed', 'pending_review', 'deleted');

CREATE TABLE reviews (
    review_id       UUID PRIMARY KEY,
    product_id      VARCHAR(50) NOT NULL,
    user_id         UUID NOT NULL,
    order_id        UUID,
    rating          SMALLINT NOT NULL CHECK (rating BETWEEN 1 AND 5),
    title           VARCHAR(200),
    body            TEXT,
    photos          JSONB,
    verified_purchase BOOLEAN DEFAULT FALSE,
    helpful_count   INT DEFAULT 0,
    unhelpful_count INT DEFAULT 0,
    status          review_status DEFAULT 'published',
    fraud_score     DECIMAL(4,3),
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    updated_at      TIMESTAMPTZ,
    UNIQUE (product_id, user_id)  -- one review per user per product
);

-- PostgreSQL has no inline INDEX clause; indexes are created separately.
CREATE INDEX idx_product ON reviews (product_id, status, created_at DESC);
CREATE INDEX idx_product_helpful ON reviews (product_id, status, helpful_count DESC);
CREATE INDEX idx_user ON reviews (user_id, created_at DESC);

CREATE TABLE product_ratings (
    product_id      VARCHAR(50) PRIMARY KEY,
    avg_rating      DECIMAL(3,2),
    total_reviews   INT DEFAULT 0,
    star_1 INT DEFAULT 0, star_2 INT DEFAULT 0, star_3 INT DEFAULT 0,
    star_4 INT DEFAULT 0, star_5 INT DEFAULT 0,
    updated_at      TIMESTAMPTZ
);

CREATE TABLE review_votes (
    user_id         UUID NOT NULL,
    review_id       UUID NOT NULL,
    helpful         BOOLEAN NOT NULL,
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (user_id, review_id)
);

CREATE TABLE seller_responses (
    response_id     UUID PRIMARY KEY,
    review_id       UUID NOT NULL UNIQUE,
    seller_id       UUID NOT NULL,
    body            TEXT NOT NULL,
    created_at      TIMESTAMPTZ DEFAULT NOW()
);
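The product_ratings row is designed to be maintained incrementally: each review event touches one or two star counters, and the average falls out of the histogram. A minimal in-memory sketch of that bookkeeping, assuming the aggregation worker receives create, edit, and delete events; `ProductRating` and its methods are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProductRating:
    # Index 0..4 mirrors the star_1..star_5 columns.
    stars: list = field(default_factory=lambda: [0, 0, 0, 0, 0])

    @property
    def total_reviews(self) -> int:
        return sum(self.stars)

    @property
    def avg_rating(self) -> Optional[float]:
        total = self.total_reviews
        if total == 0:
            return None
        weighted = sum((i + 1) * count for i, count in enumerate(self.stars))
        return round(weighted / total, 2)  # two decimals, like DECIMAL(3,2)

    def add(self, rating: int) -> None:
        self.stars[rating - 1] += 1

    def remove(self, rating: int) -> None:
        self.stars[rating - 1] -= 1

    def change(self, old_rating: int, new_rating: int) -> None:
        # An edited review moves one count between buckets; the total is unchanged.
        self.remove(old_rating)
        self.add(new_rating)
```

Deriving avg_rating from the histogram instead of updating it as a running mean avoids accumulated rounding error and makes edits and deletes exact.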

Redis

# Aggregate ratings cache
rating:{product_id}            --> Hash { avg, total, s1, s2, s3, s4, s5 }
TTL: 300 s (5 min)

# Helpful vote dedup
voted:{user_id}:{review_id}    --> "1"
TTL: 86400 s (24 h)

# Review count per user per day (rate limit)
review_rate:{user_id}:{date}   --> INT
TTL: 86400 s (24 h)
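The review_rate key is a fixed-window counter: one key per user per calendar day, incremented on each submission. The sketch below mimics that with a dictionary instead of a Redis client, so `counters` and `allow_review` are stand-ins; the limit of 5 reviews per day is the figure this document quotes for fake review floods.

```python
from datetime import date

MAX_REVIEWS_PER_DAY = 5  # per-user daily limit quoted in this document

counters: dict[str, int] = {}  # stand-in for Redis


def allow_review(user_id: str, today: date) -> bool:
    key = f"review_rate:{user_id}:{today.isoformat()}"
    # In Redis this would be INCR on the key, plus EXPIRE 86400 on the first increment.
    counters[key] = counters.get(key, 0) + 1
    return counters[key] <= MAX_REVIEWS_PER_DAY
```

Because the date is part of the key, the window resets at midnight without any cleanup job; the TTL only exists to reclaim memory.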

Elasticsearch: Review Search

{
  "review_id": "rev-uuid", "product_id": "prod-123",
  "rating": 4, "title": "Great quality",
  "body": "The product quality is excellent but shipping was slow...",
  "verified": true, "helpful_count": 234, "created_at": "2026-03-10"
}
// Queries: "battery life" within reviews of product X
// Filter by: rating >= 4, verified only, sort by helpful
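The two comment lines above describe a full-text match scoped to one product, with filters and a sort. One way to express that in the Elasticsearch Query DSL is a bool query, built here as a plain Python dict. This is a sketch: `build_review_query` is a hypothetical helper, and it assumes product_id is mapped as a keyword field and body as text.

```python
def build_review_query(
    product_id: str,
    text: str,
    min_rating: int = 4,
    verified_only: bool = True,
) -> dict:
    # Filters do not affect relevance scoring and are cacheable by Elasticsearch.
    filters = [
        {"term": {"product_id": product_id}},
        {"range": {"rating": {"gte": min_rating}}},
    ]
    if verified_only:
        filters.append({"term": {"verified": True}})
    return {
        "query": {
            "bool": {
                "must": [{"match_phrase": {"body": text}}],
                "filter": filters,
            }
        },
        "sort": [{"helpful_count": {"order": "desc"}}],
    }
```

For example, `build_review_query("prod-123", "battery life")` produces the request body for "battery life" within reviews of prod-123, rating 4 and up, verified only, most helpful first.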

Concern	Solution
Review written but aggregate not updated	Kafka at-least-once delivery to aggregation worker; idempotent UPDATE
Duplicate review	UNIQUE(product_id, user_id) constraint prevents duplicates
Rating cache stale	TTL 5 min; invalidated on write; worst case 5-min lag
Fake review flood	Rate limit per user (max 5 reviews/day); async ML detection
Vote spam	One vote per user per review (DB primary key constraint + Redis dedup)
Aggregate drift	Nightly batch: recompute all aggregates from reviews table; overwrite running values
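The nightly batch in the last row is a full recomputation from the reviews table that ignores anything not currently published. A minimal sketch over a list of dicts standing in for table rows; `recompute_aggregates` is a hypothetical name, and a production job would more likely be a single GROUP BY query.

```python
from collections import defaultdict


def recompute_aggregates(reviews: list[dict]) -> dict[str, dict]:
    """Rebuild per-product histograms and averages from the source of truth."""
    histograms = defaultdict(lambda: [0, 0, 0, 0, 0])
    for review in reviews:
        # Suppressed, pending, and deleted reviews do not count toward the aggregate.
        if review["status"] != "published":
            continue
        histograms[review["product_id"]][review["rating"] - 1] += 1

    aggregates = {}
    for product_id, stars in histograms.items():
        total = sum(stars)
        weighted = sum((i + 1) * count for i, count in enumerate(stars))
        aggregates[product_id] = {
            "total_reviews": total,
            "avg_rating": round(weighted / total, 2),
            "stars": stars,
        }
    return aggregates
```

Overwriting the running values with this output is what corrects both missed events and reviews that were suppressed after they had already been counted.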

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
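An availability target becomes concrete once it is converted to allowed downtime. A small sketch of that conversion; the 30-day window is an assumption, since the table does not say which window the SLO is measured over.

```python
def error_budget_minutes(availability_target: float, window_days: int = 30) -> float:
    """Minutes of full outage allowed per window at the given availability target."""
    return (1 - availability_target) * window_days * 24 * 60


print(round(error_budget_minutes(0.9995), 1))  # 21.6
```

At 99.95% the budget is about 21.6 minutes per 30 days, which is why a rollback that takes 5 minutes can be afforded only a handful of times per month.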

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
