Design a Price Comparison Engine

Interview Prompt

Design a price comparison engine.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Web scraping pipeline, Price normalization, Product matching (entity resolution)?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Web scraping pipeline
Price normalization
Product matching (entity resolution)
Capacity estimation with shown math

Out of scope (state explicitly)

Full catalog/search infrastructure (#12)
Payment checkout flow (#24)
Fraud and abuse ML pipelines

Assumptions

Clarify scale (DAU, QPS, data volume) for the price comparison engine in the first 5 minutes
Standard reliability target of 99.9%–99.99% unless the problem implies a higher one (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Functional Requirements

Aggregate prices: Collect prices for the same product from multiple retailers/sellers
Product matching: Identify the same product across different sites (different names/URLs)
Price tracking: Track price history over time; show price trends and charts
Price alerts: Notify users when price drops below their target
Search & browse: Search products; filter by category, brand, price range
Best deal identification: Show cheapest option with total cost (price + shipping + tax)
Coupon integration: Show applicable coupons/deals alongside prices
Retailer ratings: Show retailer reliability and shipping speed
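
Best deal identification ranks offers by total cost rather than sticker price. A minimal sketch, assuming a hypothetical Offer record and a flat tax rate applied to the item price only (real tax and shipping rules vary by jurisdiction and retailer):

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Offer:
    """Hypothetical offer record; field names are illustrative."""
    retailer: str
    price: Decimal
    shipping: Decimal
    tax_rate: Decimal  # assumption: flat rate applied to the item price only


def total_cost(offer: Offer) -> Decimal:
    """Total cost = price + shipping + tax, rounded to cents."""
    tax = offer.price * offer.tax_rate
    return (offer.price + offer.shipping + tax).quantize(Decimal("0.01"))


def best_deal(offers: list[Offer]) -> Offer:
    """Return the offer with the lowest total cost."""
    return min(offers, key=total_cost)


offers = [
    Offer("RetailerA", Decimal("179.00"), Decimal("9.99"), Decimal("0.08")),
    Offer("RetailerB", Decimal("185.00"), Decimal("0.00"), Decimal("0.08")),
]
cheapest = best_deal(offers)  # RetailerB: free shipping beats the lower sticker price
```

Decimal is used instead of float so that cent-level comparisons between offers are exact.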

Capacity Estimation

Metric	Calculation	Value
Products tracked	Given	100M
Retailers	Given	50+
Price data points / day	100M products × ~5 price updates (avg)	500M
Search queries / sec	Daily query volume ÷ 86,400 s × peak factor (daily volume not stated; treat 10K as given)	10K
Price alert checks / day	Given (assumption)	50M
Price history storage	Given	~2 TB/year
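
The table's numbers can be cross-checked with a few lines of arithmetic. The sketch below uses only the values given above; the bytes-per-point figure is derived from them, not stated in the source, and is only plausible if history is stored in a compressed columnar format.

```python
SECONDS_PER_DAY = 86_400

products = 100_000_000
updates_per_product_per_day = 5            # average, from the table
price_points_per_day = products * updates_per_product_per_day

# Average ingest rate; peak would be higher by whatever peak factor is assumed.
avg_writes_per_sec = price_points_per_day / SECONDS_PER_DAY

history_bytes_per_year = 2 * 10**12        # ~2 TB/year, from the table
points_per_year = price_points_per_day * 365
implied_bytes_per_point = history_bytes_per_year / points_per_year

print(f"price points/day: {price_points_per_day:,}")
print(f"avg ingest writes/sec: {avg_writes_per_sec:,.0f}")
print(f"implied bytes per stored point: {implied_bytes_per_point:.1f}")
```

Roughly 5.8K writes/sec on average is the number to carry into the Kafka partition count and the history store sizing.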

Product Matching

Event Bus Design (Kafka)

Topic: price_comparison_engine-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: product_id (preserves per-product ordering of price updates)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "price_comparison_engine-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: price_comparison_engine-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Rule: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates
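
At-least-once delivery means the same event can arrive twice, so handlers have to be safe to re-run. A minimal in-memory sketch of dedup by event_id with a dead-letter fallback; the class and handler names are hypothetical, and a real consumer would persist seen IDs and publish to the DLQ topic rather than append to a list:

```python
from typing import Callable

MAX_RETRIES = 3  # from the DLQ rule above: give up after 3 retries


class IdempotentProcessor:
    """In-memory stand-in for a consumer in the processors group."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self.handler = handler
        self.seen_event_ids: set[str] = set()
        self.dead_letters: list[dict] = []

    def process(self, event: dict) -> str:
        event_id = event["event_id"]
        if event_id in self.seen_event_ids:
            return "duplicate"                    # redelivery: already applied
        for _attempt in range(1 + MAX_RETRIES):   # first try plus retries
            try:
                self.handler(event)
            except Exception:
                continue                          # a real consumer would back off here
            self.seen_event_ids.add(event_id)
            return "processed"
        self.dead_letters.append(event)           # stand-in for the DLQ topic
        return "dead-lettered"


applied: list[str] = []


def apply_price_update(event: dict) -> None:
    if event.get("poison"):
        raise ValueError("unparseable payload")
    applied.append(event["event_id"])


processor = IdempotentProcessor(apply_price_update)
results = [
    processor.process({"event_id": "e1"}),
    processor.process({"event_id": "e1"}),                   # redelivered
    processor.process({"event_id": "e2", "poison": True}),   # always fails
]
```

The event is marked as seen only after the handler succeeds, so a crash mid-handler leads to a retry rather than a silently dropped update.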

Price Ingestion Pipeline

Three data sources:

1. Affiliate APIs (best quality):
   Amazon Product Advertising API, eBay API, etc.
   Structured data: price, availability, shipping, images
   Rate limited: ~1 request/sec per API key
   Coverage: major retailers only

2. Data Feeds (bulk):
   Retailers provide CSV/XML feeds daily with all products + prices
   Process: download feed -> parse -> match products -> update prices
   Coverage: retailers with affiliate programs

3. Web Scraping (fill gaps):
   Crawl retailer websites for products without API/feed
   Challenges: anti-bot measures, dynamic JS rendering, rate limiting
   Legal: must comply with robots.txt and terms of service
   Use: headless browser (Playwright) + residential proxy rotation
    
Pipeline:
  Source -> Kafka (raw price events) -> Flink (dedup, validate, match)
    -> PostgreSQL (current prices) + ClickHouse (price history)
  
  Validation:
  - Price is positive and reasonable (not $0.01 for a laptop)
  - Currency is correct
  - Product URL is still valid
  - Price change < 50% from previous (flag for review if larger)
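
The validation rules above can be sketched as a single function. This is an illustration under assumptions: PriceEvent, the currency allow-list, and the three outcome labels are hypothetical, and the URL-validity check is omitted because it needs a network call.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

MAX_CHANGE_RATIO = Decimal("0.5")            # the 50% rule from the list above
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}   # hypothetical allow-list


@dataclass(frozen=True)
class PriceEvent:
    """Hypothetical shape of one raw price event."""
    product_id: str
    price: Decimal
    currency: str


def validate(event: PriceEvent, previous: Optional[Decimal]) -> str:
    """Return 'accept', 'reject', or 'review' for one incoming price."""
    if event.price <= 0:
        return "reject"
    if event.currency not in ALLOWED_CURRENCIES:
        return "reject"
    if previous is not None and previous > 0:
        change = abs(event.price - previous) / previous
        if change >= MAX_CHANGE_RATIO:
            return "review"                  # flag instead of silently overwriting
    return "accept"


laptop_glitch = validate(PriceEvent("p1", Decimal("0.01"), "USD"), Decimal("999.00"))
normal_drop = validate(PriceEvent("p2", Decimal("175.00"), "USD"), Decimal("199.00"))
```

The "$0.01 laptop" case is caught by the change check whenever a previous price exists; a first-seen product would need a category-level sanity range, which is not shown here.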

Price Alert System

User sets alert: "Notify me when AirPods Pro drops below $180"

Implementation:
  1. Alerts stored in PostgreSQL:
     alerts: { alert_id, user_id, product_id, target_price, active }
     Index on (product_id, active, target_price)

  2. Price update arrives for product X at $175:
     SELECT alert_id, user_id FROM alerts
     WHERE product_id = 'X' AND active = true AND target_price >= 175

  3. For each matching user: send push/email notification
     Mark alert as triggered:
     UPDATE alerts SET active = false WHERE alert_id IN (<matched alert_ids>)

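  Optimization: batch check
    Flink consumer: accumulate price updates for 1 minute
    Batch query: SELECT a.alert_id, a.user_id, a.product_id
                 FROM alerts a JOIN updated_prices u ON u.product_id = a.product_id
                 WHERE a.active = true AND a.target_price >= u.current_price
    (updated_prices: temp table or VALUES list with the batch's latest price per product)
    Send notifications in batch -> reduces DB round trips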
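
The batch check can be tried end to end with an in-memory SQLite database standing in for PostgreSQL. This is a sketch under assumptions: the price_updates temp table, the sample rows, and the trigger_alerts function are hypothetical, and REAL is used for prices only for brevity (a production schema would use an exact numeric type).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE alerts (
        alert_id INTEGER PRIMARY KEY,
        user_id TEXT, product_id TEXT, target_price REAL, active INTEGER
    );
    CREATE INDEX idx_alerts ON alerts (product_id, active, target_price);
    CREATE TEMP TABLE price_updates (product_id TEXT PRIMARY KEY, current_price REAL);
""")
conn.executemany(
    "INSERT INTO alerts VALUES (?, ?, ?, ?, ?)",
    [
        (1, "alice", "airpods-pro", 180.0, 1),
        (2, "bob", "airpods-pro", 170.0, 1),
        (3, "carol", "kindle", 90.0, 1),
    ],
)


def trigger_alerts(updates: dict[str, float]) -> list[tuple]:
    """Match one batch of price updates against active alerts, then deactivate hits."""
    with conn:  # one transaction: load batch, match, deactivate
        conn.execute("DELETE FROM price_updates")
        conn.executemany("INSERT INTO price_updates VALUES (?, ?)", updates.items())
        hits = conn.execute("""
            SELECT a.alert_id, a.user_id, a.product_id, u.current_price
            FROM alerts a JOIN price_updates u ON u.product_id = a.product_id
            WHERE a.active = 1 AND a.target_price >= u.current_price
            ORDER BY a.alert_id
        """).fetchall()
        conn.executemany(
            "UPDATE alerts SET active = 0 WHERE alert_id = ?",
            [(alert_id,) for alert_id, *_ in hits],
        )
    return hits


first = trigger_alerts({"airpods-pro": 175.0, "kindle": 95.0})
second = trigger_alerts({"airpods-pro": 175.0})  # same update replayed
```

Deactivating matched alerts in the same transaction as the match is what keeps a replayed price update from notifying the same user twice.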

Concern	Solution
Stale prices	TTL-based freshness; show 'last updated X hours ago'; flag stale
Retailer API down	Serve last known price with staleness indicator; retry with backoff
Wrong product match	Human review queue for low-confidence matches; user 'report wrong match'
Price scraping blocked	Proxy rotation, rate limiting, fallback to affiliate API
Alert notification lost	Kafka at-least-once; dedup by (alert_id, trigger_time)
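
The stale-price row can be sketched as a small helper that produces the 'last updated X hours ago' label and a stale flag. The 24-hour threshold and the function name are assumptions, not values from the design above:

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=24)  # hypothetical freshness TTL


def freshness_label(last_updated: datetime, now: datetime) -> tuple[str, bool]:
    """Return a display label and whether the price should be flagged stale."""
    age = now - last_updated
    hours = int(age.total_seconds() // 3600)
    if hours < 1:
        label = "last updated less than an hour ago"
    else:
        label = f"last updated {hours} hour{'s' if hours != 1 else ''} ago"
    return label, age > STALE_AFTER


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
fresh = freshness_label(now - timedelta(hours=3), now)
stale = freshness_label(now - timedelta(hours=30), now)
```

Passing `now` in explicitly keeps the helper deterministic and easy to test.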

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
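
An availability target turns into a concrete error budget with one multiplication. A short worked example, assuming a 30-day window (the window length is an assumption; the source does not state one):

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def downtime_budget_minutes(availability_pct: float,
                            window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Allowed unavailability in minutes for a given availability target."""
    return window_minutes * (1 - availability_pct / 100)


for target in (99.9, 99.95, 99.99):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} min per 30 days")
```

At 99.95% the budget is about 21.6 minutes per 30 days, which is why the rollback target in the incident table has to be measured in minutes rather than hours.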

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
