Interview Prompt
Design a Hotel Booking System.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Variant of ticketing, Room availability calendar, Overbooking policy? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Variant of ticketing
- Room availability calendar
- Overbooking policy
- Cancellation
- Price optimization
- Capacity estimation with shown math
Out of scope (state explicitly)
- Flight and travel package bundling
- Property management / housekeeping systems
- Revenue-management ML model internals
Assumptions
- Clarify scale (DAU, QPS, data volume) for the hotel booking system in the first 5 minutes
- Standard reliability target of 99.9%–99.99% unless the problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
Functional Requirements
- Search hotels by location, dates, guests, price range, amenities, star rating
- View room types, photos, reviews, availability calendar
- Book room(s): select dates → reserve → pay → confirm
- Overbooking management: controlled overbooking with walking policy
- Cancellation with policy enforcement (free cancel before X days)
- Price management: dynamic pricing based on demand, season, events
- Loyalty program: points accrual and redemption
- Calendar-based inventory (each room-night is a separate unit)
Non-Functional Requirements
- Consistency: no double-booking of the same room-night
- High availability: 99.99%, because booking failures mean lost revenue
- Low latency: search < 200 ms, booking < 2 s
- Scalability: 1M+ hotels, 100M+ room-nights of inventory
Capacity Estimation
| Metric | Calculation | Value |
|---|---|---|
| Hotels | Given | 1M |
| Rooms (total) | Given (average of 50 rooms per hotel) | 50M |
| Bookable room-nights (next 365 days) | 50M rooms × 365 days | 18.25B |
| Searches / sec | Derived from daily volume ÷ 86,400 (+ peak factor) | 50K |
| Bookings / sec | Derived from daily volume ÷ 86,400 (+ peak factor) | 5K |
| Peak (holiday season) | Given | 5× normal |
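The derived figures in the table can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes the 50K and 5K per-second rates are normal (non-peak) values, and `ASSUMED_ROOM_TYPES_PER_HOTEL` is a hypothetical number that does not come from the table.

```python
# Figures copied from the capacity table; everything else is derived.
HOTELS = 1_000_000
ROOMS_TOTAL = 50_000_000
HORIZON_DAYS = 365
SEARCHES_PER_SEC = 50_000   # assumed to be the normal (non-peak) rate
BOOKINGS_PER_SEC = 5_000    # assumed to be the normal (non-peak) rate
PEAK_FACTOR = 5

# Hypothetical: the table does not say how many room types a hotel has.
ASSUMED_ROOM_TYPES_PER_HOTEL = 5


def capacity_summary() -> dict[str, int]:
    """Derive the secondary capacity numbers from the given ones."""
    return {
        "avg_rooms_per_hotel": ROOMS_TOTAL // HOTELS,
        "room_nights": ROOMS_TOTAL * HORIZON_DAYS,
        "peak_searches_per_sec": SEARCHES_PER_SEC * PEAK_FACTOR,
        "peak_bookings_per_sec": BOOKINGS_PER_SEC * PEAK_FACTOR,
        "searches_per_booking": SEARCHES_PER_SEC // BOOKINGS_PER_SEC,
        # Pooled inventory stores one row per (hotel, room_type, date).
        "inventory_rows": HOTELS * ASSUMED_ROOM_TYPES_PER_HOTEL * HORIZON_DAYS,
    }


if __name__ == "__main__":
    for name, value in capacity_summary().items():
        print(f"{name}: {value:,}")
```

Under the room-type assumption, the pooled inventory table holds far fewer rows than there are physical room-nights, which is one practical benefit of count-based inventory.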
Search Service + Elasticsearch
Geo queries (bounding box), full-text (hotel name, description), faceted filtering (star rating, amenities, price range). Search flow: ES returns matching hotels → Availability Service validates inventory → Pricing Service computes price → Return sorted results.
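The search flow above can be sketched as a small pipeline. This is a minimal illustration, not a real client: `is_available` and `price_cents` are hypothetical callables standing in for the Availability Service and Pricing Service, and sorting by price is an assumption, since the text only says the results are sorted.

```python
from typing import Callable, Iterable


def search_hotels(
    candidate_ids: Iterable[str],
    is_available: Callable[[str], bool],
    price_cents: Callable[[str], int],
) -> list[tuple[str, int]]:
    """Filter search-index candidates by availability, price them, sort by price."""
    # Step 1: candidate_ids are the hotels returned by the search index.
    # Step 2: keep only hotels the availability check accepts.
    available = [hotel_id for hotel_id in candidate_ids if is_available(hotel_id)]
    # Step 3: attach a price to each remaining hotel.
    priced = [(hotel_id, price_cents(hotel_id)) for hotel_id in available]
    # Step 4: return results sorted by ascending price (assumed ordering).
    return sorted(priced, key=lambda pair: pair[1])
```

Pricing only the hotels that pass the availability check keeps the more expensive call off the sold-out candidates.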
Booking Service: Critical Path
PostgreSQL ACID with SELECT FOR UPDATE prevents double-booking. Hold-and-confirm pattern: reserve (PENDING, 15-min TTL) → charge payment → confirm (CONFIRMED). Multi-night atomicity: must lock ALL room-night rows in the date range.
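The hold-and-confirm pattern can be illustrated as a small state machine. This is an in-memory sketch under stated assumptions: the `Reservation` class and function names are hypothetical, and mapping an expired hold to CANCELLED is an assumption, since the text only says unpaid holds are released.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum

HOLD_TTL = timedelta(minutes=15)  # TTL taken from the text


class Status(Enum):
    PENDING = "PENDING"
    CONFIRMED = "CONFIRMED"
    CANCELLED = "CANCELLED"


@dataclass
class Reservation:
    reservation_id: str
    created_at: datetime
    status: Status = Status.PENDING


def release_if_expired(res: Reservation, now: datetime) -> bool:
    """Cancel a PENDING hold whose TTL has elapsed; return True if released."""
    if res.status is Status.PENDING and now - res.created_at >= HOLD_TTL:
        res.status = Status.CANCELLED  # assumption: expired holds become CANCELLED
        return True
    return False


def confirm(res: Reservation, payment_succeeded: bool, now: datetime) -> Status:
    """Confirm only a live PENDING hold whose payment succeeded."""
    if release_if_expired(res, now):
        return res.status
    if res.status is Status.PENDING and payment_succeeded:
        res.status = Status.CONFIRMED
    return res.status
```

In a real system the release step would also return the held inventory inside the same database transaction; that part is omitted here.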
Pricing Service
Dynamic pricing: Base price × demand_multiplier × season_factor × day_of_week_factor. Demand signals: booking velocity, search volume, competitor rates, inventory remaining percentage. Rate parity: centralized pricing source.
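The pricing formula can be written out directly. The following is a minimal sketch: the function name and the choice of `Decimal` with half-up rounding to whole cents are assumptions for illustration, and how each factor is derived from the demand signals is left out.

```python
from decimal import Decimal, ROUND_HALF_UP


def dynamic_price_cents(
    base_price_cents: int,
    demand_multiplier: Decimal,
    season_factor: Decimal,
    day_of_week_factor: Decimal,
) -> int:
    """Base price × demand × season × day-of-week, rounded to whole cents."""
    price = (
        Decimal(base_price_cents)
        * demand_multiplier
        * season_factor
        * day_of_week_factor
    )
    # Decimal avoids binary floating-point drift when multiplying money.
    return int(price.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


if __name__ == "__main__":
    # A $200.00 base rate with illustrative factors.
    print(dynamic_price_cents(20_000, Decimal("1.2"), Decimal("1.5"), Decimal("1.1")))
```

Computing the price in one place and returning integer cents fits the centralized pricing source the text calls for, because every channel then reads the same rounded value.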
Calendar-Based Inventory Model
Unlike ticketing, where each seat is unique, hotels have POOLED inventory: "5 Deluxe King rooms available on Dec 20".
Room-night inventory: hotel_id + room_type + date → available_count
Hotel ABC, Deluxe King:
- Dec 20: 5 available (of 10 total)
- Dec 21: 3 available
- Dec 22: 0 available → SOLD OUT
- Dec 23: 7 available

Booking "Dec 20-23" requires ALL 4 nights to have availability → atomic decrement across all 4 dates in one transaction.
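The all-or-nothing rule for multi-night stays can be shown with an in-memory sketch. The function names are hypothetical, the date range is treated as inclusive of both nights (matching "Dec 20-23" = 4 nights), and the code is single-threaded: in the real system the database transaction provides the atomicity.

```python
from datetime import date, timedelta


def stay_nights(first_night: date, last_night: date) -> list[date]:
    """Every night of the stay, both ends inclusive."""
    count = (last_night - first_night).days + 1
    return [first_night + timedelta(days=i) for i in range(count)]


def reserve(available: dict[date, int], first_night: date, last_night: date) -> bool:
    """All-or-nothing decrement: check every night first, then mutate."""
    nights = stay_nights(first_night, last_night)
    if any(available.get(night, 0) <= 0 for night in nights):
        return False  # at least one night is sold out; nothing was changed
    for night in nights:
        available[night] -= 1
    return True
```

With the Hotel ABC numbers above, a request for Dec 20-23 is rejected because Dec 22 is sold out, and the counts for the other nights stay untouched.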
Overbooking Strategy
Airlines and hotels intentionally overbook by 5-10% (data-driven):
- 10 rooms, sell 11 reservations
- Historical no-show rate: 15% → expect 1-2 no-shows
- If all 11 show up → "walk" the lowest-priority guest to a partner hotel

Implementation:
- available_count can go negative (down to the overbooking limit)
- overbooking_limit = total_rooms × overbooking_factor (e.g., 1.1)

Decision: an ML model predicts no-show probability per booking
- Business traveler, booked recently: low no-show risk
- Leisure, booked 6 months ago, no prepayment: high no-show risk
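The overbooking arithmetic can be checked with a short sketch. The function names are hypothetical, and the overbooking factor is expressed as an integer percentage (10 instead of 1.1) purely to keep the limit calculation in exact integer math.

```python
def overbooking_limit(total_rooms: int, overbook_percent: int) -> int:
    """Maximum reservations to accept, rounded down to a whole room."""
    return total_rooms * (100 + overbook_percent) // 100


def can_accept_booking(booked_rooms: int, limit: int) -> bool:
    """A new reservation fits only while bookings are below the limit."""
    return booked_rooms < limit


def expected_no_shows(reservations: int, no_show_rate: float) -> float:
    """Expected number of guests who will not arrive."""
    return reservations * no_show_rate
```

For the example in the text, 10 rooms at 10% overbooking gives a limit of 11 reservations, and a 15% no-show rate on 11 reservations gives an expectation of about 1.65 no-shows, which matches the "1-2 no-shows" estimate.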
API Design
GET /api/hotels/search?location=NYC&checkin=2026-12-20&checkout=2026-12-25&guests=2
POST /api/bookings → {hotel_id, room_type, checkin, checkout, guest_info, payment}
GET /api/bookings/{id} → Booking details
DELETE /api/bookings/{id} → Cancel
GET /api/hotels/{id}/availability?start=...&end=...
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
- 402 Payment Required: insufficient funds
- 502 Bad Gateway: payment provider timeout; poll status endpoint
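A client-side retry policy for these responses might look like the sketch below. It is an assumption-heavy illustration: treating only 429, 500, and 503 as blindly retryable, the 0.5-second base delay, and the 30-second cap are all hypothetical choices, and a production client would normally add random jitter and resend the same idempotency key on each attempt.

```python
from typing import Optional

# Assumption: only these statuses are retried automatically. 409 and 502
# need request-specific handling (idempotency key check, status polling).
RETRYABLE_STATUSES = {429, 500, 503}


def retry_delay_seconds(
    status: int,
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 0.5,
    cap: float = 30.0,
) -> Optional[float]:
    """Return how long to wait before retrying, or None if we should not retry.

    `attempt` is zero-based: 0 for the first retry, 1 for the second, and so on.
    """
    if status not in RETRYABLE_STATUSES:
        return None
    if status == 429 and retry_after is not None:
        return retry_after  # honor the server's Retry-After value
    return min(cap, base * (2 ** attempt))  # capped exponential backoff
```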
PostgreSQL
CREATE TABLE room_inventory (
hotel_id UUID, room_type TEXT, date DATE,
total_rooms INT, booked_rooms INT, overbooking_limit INT,
price_cents INT,
PRIMARY KEY (hotel_id, room_type, date)
);
CREATE TABLE reservations (
reservation_id UUID PRIMARY KEY,
hotel_id UUID, room_type TEXT,
guest_id UUID, checkin DATE, checkout DATE,
status TEXT, -- PENDING|CONFIRMED|CANCELLED|CHECKED_IN|COMPLETED
total_price_cents INT, payment_id UUID,
cancellation_policy TEXT,
created_at TIMESTAMPTZ
);
Booking Transaction: Race Condition Prevention
BEGIN;
-- Lock all room-night rows for the date range.
-- $3 = first night, $4 = last night (BETWEEN is inclusive on both ends).
-- ORDER BY date keeps lock acquisition order consistent across
-- concurrent transactions, which reduces deadlock risk.
SELECT date, booked_rooms, total_rooms, overbooking_limit
FROM room_inventory
WHERE hotel_id = $1 AND room_type = $2 AND date BETWEEN $3 AND $4
ORDER BY date
FOR UPDATE;
-- Check ALL dates have availability (in application code):
-- if a row is missing for any night, or any date has
-- booked_rooms >= overbooking_limit → ROLLBACK
UPDATE room_inventory SET booked_rooms = booked_rooms + 1
WHERE hotel_id = $1 AND room_type = $2 AND date BETWEEN $3 AND $4;
INSERT INTO reservations (...) VALUES (...);
COMMIT;
- Payment failure after inventory reserved: Hold for 15 min → auto-release if unpaid
- Double-booking prevention: SELECT FOR UPDATE + transaction isolation
- Search-to-book staleness: Availability shown may be stale by seconds → final check at booking time
- Rate parity: Price must be consistent across OTAs → centralized pricing service
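The double-booking guarantee can be exercised locally with a runnable sketch. Note the substitutions: this uses SQLite through Python's standard `sqlite3` module rather than PostgreSQL, and because SQLite has no SELECT ... FOR UPDATE, it uses a conditional UPDATE plus a row-count check instead of explicit row locks. The `book_stay` helper and the reduced table shape are hypothetical.

```python
import sqlite3


def book_stay(conn: sqlite3.Connection, hotel_id: str, room_type: str,
              nights: list[str]) -> bool:
    """Reserve one room for every night (ISO date strings), or none at all."""
    if not nights:
        return False
    placeholders = ",".join("?" * len(nights))
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            cur = conn.execute(
                "UPDATE room_inventory SET booked_rooms = booked_rooms + 1 "
                "WHERE hotel_id = ? AND room_type = ? "
                f"AND date IN ({placeholders}) "
                "AND booked_rooms < overbooking_limit",
                (hotel_id, room_type, *nights),
            )
            # Fewer updated rows than nights means some night was full
            # (or missing), so undo the partial increment.
            if cur.rowcount != len(nights):
                raise RuntimeError("at least one night is unavailable")
    except RuntimeError:
        return False
    return True
```

The availability condition lives inside the UPDATE itself, so the check and the increment cannot be separated by a concurrent writer; the row-count comparison then enforces the all-nights-or-nothing rule.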
| Aspect | Ticketing | Hotel |
|---|---|---|
| Inventory unit | Each seat is unique → seat-level locking | Pooled inventory → count-based → simpler locking |
| Time dimension | One event, one time → no date-range complexity | Multi-night stays → must lock ALL dates atomically |
| Overbooking | No overbooking (each seat is physical) | Controlled overbooking is standard industry practice |
Interview Walkthrough
- Pooled inventory model: guests book a room TYPE (Deluxe King × 3 nights), not a specific room — assign the physical room at check-in.
- Controlled overbooking with no-show prediction — airlines do this; cap the ratio and reconcile with walk-in/no-show data daily.
- Atomic booking via SELECT FOR UPDATE on room_inventory rows for the date range, which prevents double-booking under concurrent requests.
- Search shows eventually consistent availability; the final atomic inventory check happens at booking time, because the search index is not the source of truth.
- Payment authorization hold with 15-minute TTL — auto-release inventory if checkout is abandoned.
- Rate parity across OTAs requires a centralized pricing service — same room must show the same price everywhere.
- Common pitfall: decrementing inventory in the search index at query time — race conditions between search and book cause double-bookings.
Scalability: Sharding Strategy
Shard by hotel_id:
room_inventory: Shard key = hotel_id
Booking transaction: hits ONE shard (single-shard ACID transaction)
1M hotels / 16 shards = ~62,500 hotels per shard
Within each shard: partition room_inventory by date range
PARTITION BY RANGE (date), 1 month per partition
Search (cross-shard):
Step 1: Elasticsearch returns matching hotels (not sharded by hotel)
Step 2: Group hotels by shard → fan-out availability queries
Step 3: Merge results → return to client with prices
Optimization: Redis cache per hotel with availability bitmap
Key: avail:{hotel_id}:{room_type}:{month}
Value: bitmask (1 bit per day, 1=available, 0=booked)
Race Condition: Isolation Levels
READ COMMITTED + FOR UPDATE is sufficient because room_inventory rows are pre-created: booking only UPDATEs existing inventory rows and never INSERTs new ones, so there are no phantom rows to guard against. SERIALIZABLE is not needed.
Pooled Inventory vs Named-Room Assignment
- Pooled: "5 Deluxe King rooms available" means any of the 5 rooms. Simpler booking logic, flexible.
- Named: Room 401 is tracked individually.
- Industry practice: pooled inventory for booking, named assignment at check-in.
- Exception: luxury hotels sell specific rooms.
Eager Payment vs Lazy Payment
Hybrid (industry standard):
- Authorization hold at booking → validates card.
- No-show fee: charge 1 night if the guest doesn't cancel.
- Non-refundable rate: charge immediately (lower price, no cancellation).
Search Staleness vs Real-Time Availability
Problem: Search shows "5 rooms available" → user clicks → booking fails.
- Option 1: Real-time availability on every search result → 50K searches/sec × 50 hotels = 2.5M DB queries/sec → DB dies.
- Option 2: Cached availability with staleness → Redis bitmap per hotel, refreshed every 30 seconds. Final check at booking time (authoritative DB with FOR UPDATE). Acceptable UX: < 1% of booking attempts fail due to staleness.
- Option 3: Pessimistic display (show fewer rooms than available). Cache shows "available" only if real availability > 2.
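The cached bitmap from Option 2 and the pessimistic threshold from Option 3 can be combined in one small sketch. It uses a plain Python integer as the bitmask instead of a Redis value, and the function names and the `threshold` parameter are hypothetical.

```python
def build_bitmap(available_by_day: list[int], threshold: int = 0) -> int:
    """Set bit i when day i has more than `threshold` rooms available.

    threshold=0 mirrors real availability; threshold=2 gives the
    pessimistic display described in Option 3.
    """
    bits = 0
    for day_index, count in enumerate(available_by_day):
        if count > threshold:
            bits |= 1 << day_index
    return bits


def stay_available(bitmap: int, first_day: int, nights: int) -> bool:
    """True only if every night of the stay has its bit set."""
    mask = ((1 << nights) - 1) << first_day
    return (bitmap & mask) == mask
```

A single AND against the mask answers a multi-night availability question without touching the database, which is what makes the cache cheap enough for the search path. The authoritative check still happens at booking time.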
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services that prove the core hotel booking flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
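An availability target translates directly into an error budget. The sketch below shows the arithmetic; the 30-day window and the function name are assumptions for illustration.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_percent / 100) * window_minutes


if __name__ == "__main__":
    for slo in (99.9, 99.95, 99.99):
        print(f"{slo}% -> {error_budget_minutes(slo):.2f} min per 30 days")
```

At 99.95% the budget is about 21.6 minutes per 30 days, while 99.99% leaves only about 4.3 minutes, which is why the tighter target usually forces automated failover rather than manual response.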
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.