Interview Prompt
Design Price Comparison Engine.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Web scraping pipeline, Price normalization, Product matching (entity resolution)? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Web scraping pipeline
- Price normalization
- Product matching (entity resolution)
- Capacity estimation with shown math
Out of scope (state explicitly)
- Full catalog/search infrastructure (#12)
- Payment checkout flow (#24)
- Fraud and abuse ML pipelines
Assumptions
- Clarify scale (DAU, QPS, data volume) for price comparison engine in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Aggregate prices: Collect prices for the same product from multiple retailers/sellers
- Product matching: Identify the same product across different sites (different names/URLs)
- Price tracking: Track price history over time; show price trends and charts
- Price alerts: Notify users when price drops below their target
- Search & browse: Search products; filter by category, brand, price range
- Best deal identification: Show cheapest option with total cost (price + shipping + tax)
- Coupon integration: Show applicable coupons/deals alongside prices
- Retailer ratings: Show retailer reliability and shipping speed
- Freshness: Prices updated at least every 6 hours; popular products every 1 hour
- Scale: 100M+ products, 50+ retailers, 1B+ price data points
- Search Latency: < 200 ms for product search with price comparison
- Accuracy: Prices must reflect actual retailer prices (stale prices erode trust)
- Availability: 99.9%
| Metric | Calculation | Value |
|---|---|---|
| Products tracked | Given | 100M |
| Retailers | Given | 50+ |
| Price data points / day | 100M products × ~5 price updates | 500M (100M products x ~5 price updates avg) |
| Search queries / sec | Derived from daily volume ÷ 86400 (+ peak factor) | 10K |
| Price alert checks / day | Given (assumption documented in value) | 50M |
| Price history storage | Given | ~2 TB/year |
Product Matching
Event Bus Design (Kafka)
Topic: price_comparison_engine-events Partitions: 64 (scale consumers horizontally) Partition key: entity_id (user_id / order_id — preserves per-entity ordering) Retention: 7 days (compliance) or 24h (high-volume telemetry) Replication factor: 3, min.insync.replicas: 2 Producer: idempotent producer enabled (enable.idempotence=true) Consumer: consumer group "price_comparison_engine-processors" - At-least-once delivery + idempotent handlers (dedup by event_id) - DLQ topic: price_comparison_engine-events-dlq (poison messages after 3 retries) - Lag alert: consumer lag > 60s → scale workers Design a Price Comparison Engine: async side effects MUST NOT block the synchronous API response. Sync path: validate → persist source of truth → publish event → return 201 Async path: consumers update caches, indexes, notifications, aggregates
Price Ingestion Pipeline
Three data sources:
1. Affiliate APIs (best quality):
Amazon Product Advertising API, eBay API, etc.
Structured data: price, availability, shipping, images
Rate limited: ~1 request/sec per API key
Coverage: major retailers only
2. Data Feeds (bulk):
Retailers provide CSV/XML feeds daily with all products + prices
Process: download feed -> parse -> match products -> update prices
Coverage: retailers with affiliate programs
3. Web Scraping (fill gaps):
Crawl retailer websites for products without API/feed
Challenges: anti-bot measures, dynamic JS rendering, rate limiting
Legal: must comply with robots.txt and terms of service
Use: headless browser (Playwright) + residential proxy rotation
Pipeline:
Source -> Kafka (raw price events) -> Flink (dedup, validate, match)
-> PostgreSQL (current prices) + ClickHouse (price history)
Validation:
- Price is positive and reasonable (not $0.01 for a laptop)
- Currency is correct
- Product URL is still valid
- Price change < 50% from previous (flag for review if larger)Price Alert System
User sets alert: "Notify me when AirPods Pro drops below $180"
Implementation:
1. Alerts stored in PostgreSQL:
alerts: { alert_id, user_id, product_id, target_price, active }
Index on (product_id, active, target_price)
2. Price update arrives for product X at $175:
SELECT user_id FROM alerts
WHERE product_id = 'X' AND active = true AND target_price >= 175
3. For each matching user: send push/email notification
Mark alert as triggered: UPDATE alerts SET active = false
Optimization: batch check
Flink consumer: accumulate price updates for 1 minute
Batch query: SELECT * FROM alerts WHERE product_id IN (updated_products)
AND target_price >= current_price AND active = true
Send notifications in batch -> reduces DB round tripsGET /api/v1/products/{product_id}/prices
--> { "product": { "name": "AirPods Pro 2", "image": "..." },
"prices": [
{ "retailer": "Amazon", "price": 199.99, "shipping": "Free",
"url": "https://...", "in_stock": true, "updated_at": "..." },
{ "retailer": "BestBuy", "price": 189.99, ... }
],
"price_history": { "30d_low": 179.99, "30d_high": 249.99, "current_vs_avg": "-12%" } }
GET /api/v1/search?q=airpods+pro&sort=price_asc&min_price=100&max_price=300
POST /api/v1/alerts
{ "product_id": "prod-123", "target_price": 180.00 }Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff
PostgreSQL: Products & Current Prices
CREATE TABLE products (
product_id UUID PRIMARY KEY, name TEXT, brand VARCHAR(100),
category VARCHAR(100), upc VARCHAR(20), image_url TEXT,
attributes JSONB, created_at TIMESTAMPTZ
);
CREATE TABLE current_prices (
product_id UUID, retailer_id VARCHAR(50),
price DECIMAL(10,2), currency CHAR(3), shipping_cost DECIMAL(10,2),
in_stock BOOLEAN, url TEXT, updated_at TIMESTAMPTZ,
PRIMARY KEY (product_id, retailer_id)
);
CREATE TABLE price_alerts (
alert_id UUID PRIMARY KEY, user_id UUID, product_id UUID,
target_price DECIMAL(10,2), active BOOLEAN DEFAULT TRUE,
INDEX idx_product_active (product_id, active, target_price)
);ClickHouse: Price History
CREATE TABLE price_history (
product_id UUID, retailer_id String, price Float32,
in_stock UInt8, recorded_at DateTime, recorded_date Date
) ENGINE = MergeTree() PARTITION BY toYYYYMM(recorded_at)
ORDER BY (product_id, retailer_id, recorded_at);| Concern | Solution |
|---|---|
| Stale prices | TTL-based freshness; show 'last updated X hours ago'; flag stale |
| Retailer API down | Serve last known price with staleness indicator; retry with backoff |
| Wrong product match | Human review queue for low-confidence matches; user 'report wrong match' |
| Price scraping blocked | Proxy rotation, rate limiting, fallback to affiliate API |
| Alert notification lost | Kafka at-least-once; dedup by (alert_id, trigger_time) |
Interview Walkthrough
- Frame as a crawl-and-serve system: 100M+ products with tiered refresh rates — hot products hourly, long tail every 24-48 hours.
- Explain product matching across retailers via UPC/GTIN normalization plus fuzzy title matching for products without barcodes.
- Cover total-cost comparison (price + shipping + estimated tax) — sorting by sticker price alone misleads shoppers.
- Walk through scraper fleet architecture: headless Chromium workers, residential proxies, 1 req/sec/domain rate limit.
- Mention ClickHouse for price history — rolling 90-day median detects fake "was $200, now $100" sale manipulation.
- Discuss price alert service: user sets target → background job compares on each crawl → push notification on drop.
- Common pitfall: crawling every product at the same frequency — burning proxy budget on dead listings while hot SKUs go stale.
Total Cost Comparison (Not Just Price)
Price alone is misleading: Amazon: $189.99 + Free shipping + no tax in some states = $189.99 SmallRetailer: $179.99 + $9.99 shipping + $14.40 tax = $204.38 Show total_cost = price + shipping + estimated_tax Sort by total_cost, not just price. This is what smart shoppers actually care about. Tax estimation: use user's zip code + product category + state tax rules.
Web Scraper Architecture
Not all 100M products need the same crawl frequency: Tier 1 (Hot products, top 1%): crawl every 1 hour Products with > 1000 daily views or active price alerts ~1M products x 24 crawls/day = 24M crawls/day Tier 2 (Popular, top 10%): crawl every 6 hours ~10M products x 4 crawls/day = 40M crawls/day Tier 3 (Long tail, 89%): crawl every 24-48 hours ~89M products x 0.5 crawls/day = 44.5M crawls/day Total: ~108M crawls/day = ~1,250 crawls/sec Scraper fleet: 200 scraper workers (Kubernetes pods) Each worker: headless Chromium (Playwright) + residential proxy Rate limit: 1 request/sec/worker per retailer domain
Price Manipulation Detection
Problem: Retailers may inflate prices before a "sale" to show larger discounts. Original price: $100. Retailer raises to $150, then "50% off" = $75. Misleading: actual discount from true price is only 25%. Detection: 1. Track rolling 90-day median price per product per retailer 2. If current "original price" > 120% of 90-day median -> flag as potentially manipulated 3. Show users: "Lowest price in 90 days: $75" and "Average price: $100" 4. This is exactly what CamelCamelCamel does for Amazon prices
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core price comparison engine flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.