This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 75 | Staff level: multi-region, cost at scale, migration path, and production metrics. |
Interview Prompt
Design a bike-sharing system like Citi Bike.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: Station inventory, Rebalancing algorithm, Trip lifecycle? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- Station inventory
- Rebalancing algorithm
- Trip lifecycle
- Pricing engine
- IoT lock integration
- Capacity estimation with shown math
Out of scope (state explicitly)
- Full payment processing (#24)
- Turn-by-turn map rendering (#54)
- Rider identity verification and background checks
Assumptions
- Single metro / region unless interviewer asks for multi-city
- Mobile clients with intermittent connectivity — server is source of truth
- Managed geo + messaging infra (Kafka, Redis, RDS) is acceptable
Functional Requirements
- Station map: Show nearby docking stations with availability
- Rent a bike: Unlock a bike from a station
- Return a bike: Dock at any station; end the trip
- Pricing: Time-based pricing with free minutes for members
- Membership: Annual/monthly and single-ride passes
- Trip history: View past rides
- Rebalancing alerts: Notify when stations are too full/empty
- Station status: Real-time dock availability
- Reservation: Reserve a bike for 5 minutes
- E-bike support: Battery level tracking, premium pricing
- Effectively-once checkout: a reservation lock plus an idempotency key, so duplicate checkout requests return the same reservation
Non-Functional Requirements
- Low Latency: Bike unlock in < 3 seconds
- Real-time Station Data: Updates within 10 seconds
- Availability: 99.9%, because downtime means stranded riders
- Scalability: 50K+ bikes, 5K+ stations, 500K+ daily trips
- Durability: Trip records for billing must never be lost
- Fault Tolerant: Bikes unlockable during partial outages
Capacity Estimation
| Metric | Calculation | Value |
|---|---|---|
| Total bikes | Given | 50,000 |
| Total stations | Given | 5,000 |
| Docks per station | Given | 10-40 (avg 15) |
| Daily trips | Given | 500K |
| Trips / sec peak | 500,000 ÷ 86,400 ≈ 5.8 average; × ~9 rush-hour peak factor | ~50 (rush hour) |
| Active rides concurrently | Given | ~25,000 |
| Station status updates / sec | Assuming 2 station events per trip (checkout + return) × ~50 peak trips/sec | ~100 |
| Bike heartbeat (e-bikes) | Given | ~1,700/sec |
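
The derived rows in the table can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the "two station events per trip" factor and the reading of the heartbeat row as a fleet-wide reporting interval are interpretations, not figures given in the table.

```python
SECONDS_PER_DAY = 86_400

daily_trips = 500_000          # given
total_bikes = 50_000           # given
peak_trips_per_sec = 50        # rush-hour figure from the table
heartbeats_per_sec = 1_700     # given

# Average load if trips were spread evenly over the day.
avg_trips_per_sec = daily_trips / SECONDS_PER_DAY

# Peak factor implied by the table's ~50 trips/sec rush-hour value.
implied_peak_factor = peak_trips_per_sec / avg_trips_per_sec

# Assumption: each trip changes station state twice (checkout and return).
station_updates_per_sec = 2 * peak_trips_per_sec

# Assumption: the heartbeat rate corresponds to the whole fleet reporting
# on a fixed interval; solve for that interval in seconds.
implied_heartbeat_interval_s = total_bikes / heartbeats_per_sec

print(f"average trips/sec:       {avg_trips_per_sec:.1f}")
print(f"implied peak factor:     {implied_peak_factor:.1f}x")
print(f"station updates/sec:     {station_updates_per_sec}")
print(f"heartbeat interval (s):  {implied_heartbeat_interval_s:.0f}")
```

Stating the peak factor out loud matters in an interview: the average is under 6 trips/sec, so almost all of the capacity headroom exists for rush hour.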
Bike Checkout (Rent) Flow: The Critical Path
User taps "Unlock Bike" at Station 42, Dock 7:
Step 1: Validate user (auth, membership, balance, no active ride)
Step 2: Reserve the bike (prevent race condition)
Redis: SET bike_lock:{bike_id} {user_id} NX EX 30
NX = only one user wins. EX = 30s auto-release.
Step 3: Unlock the bike
Send unlock command via MQTT to dock controller
Wait for ACK (timeout 10s, retry once)
Step 4: Start trip
INSERT INTO trips (...) VALUES (..., 'active')
Step 5: Update station availability
DECR station:{id}:bikes_available
INCR station:{id}:docks_available
Total latency: 2-3 seconds
Bike Return (Dock) Flow
1. Dock sensor detects bike inserted → MQTT event
2. Trip Service: find active trip → complete it → calculate fare
3. Lock the bike mechanism → release Redis lock
4. Update station availability (INCR bikes, DECR docks)
5. Publish "trip-completed" → Billing Service charges user
6. E-bikes: start battery charging
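
The "calculate fare" step of the return flow could look roughly like the sketch below. It assumes time-based pricing with free minutes, as listed in the requirements; the `PricingPlan` type and every rate in it are hypothetical values chosen for illustration, not actual prices.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class PricingPlan:
    free_minutes: int            # e.g. 30 for the annual member in the API example
    base_cents: int              # flat unlock fee (hypothetical)
    overage_cents_per_min: int   # rate after the free minutes (hypothetical)
    ebike_cents_per_min: int     # e-bike premium per minute (hypothetical)


def compute_fare(duration_minutes: float, plan: PricingPlan, is_ebike: bool) -> dict:
    """Return a fare breakdown in cents for a completed trip."""
    billable = ceil(duration_minutes)  # bill started minutes as whole minutes
    overage_minutes = max(0, billable - plan.free_minutes)
    overage = overage_minutes * plan.overage_cents_per_min
    premium = billable * plan.ebike_cents_per_min if is_ebike else 0
    return {
        "base": plan.base_cents,
        "overage": overage,
        "ebike_premium": premium,
        "total_cents": plan.base_cents + overage + premium,
    }


# Hypothetical annual-member plan: 30 free minutes, no unlock fee.
ANNUAL = PricingPlan(free_minutes=30, base_cents=0,
                     overage_cents_per_min=20, ebike_cents_per_min=15)

print(compute_fare(22, ANNUAL, is_ebike=False))
```

Working in integer cents avoids floating-point rounding in billing records.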
Station Rebalancing: The Operational Challenge
Commuters ride from residential to business districts. By 9 AM, residential stations are EMPTY and business-district stations are FULL.
Strategies:
1. Truck-based (reactive): VRP solver to redistribute bikes
2. Incentive-based ⭐: "Bike Angels" program, points for returning to empty stations
3. Predictive: ML predicts demand per station per hour → pre-position bikes
Rebalancing Service (every 5 min):
For each station: delta = predicted_demand - current
Sort by abs(delta) → generate truck plan → push to Ops Dashboard
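
The Rebalancing Service loop described above can be sketched as follows. The `StationState` type and the `min_abs_delta` threshold are assumptions added for illustration; the demand prediction itself is treated as an input and is out of scope here.

```python
from typing import NamedTuple


class StationState(NamedTuple):
    station_id: str
    current_bikes: int
    predicted_demand: int  # assumed output of the demand model


def rebalancing_plan(stations, min_abs_delta=3):
    """Return (station_id, delta) pairs, largest imbalance first.

    delta > 0 means the station needs bikes delivered;
    delta < 0 means bikes should be picked up.
    Small imbalances below min_abs_delta (hypothetical threshold) are ignored.
    """
    moves = []
    for s in stations:
        delta = s.predicted_demand - s.current_bikes
        if abs(delta) >= min_abs_delta:
            moves.append((s.station_id, delta))
    moves.sort(key=lambda m: abs(m[1]), reverse=True)
    return moves


stations = [
    StationState("stn-1", current_bikes=2, predicted_demand=12),
    StationState("stn-2", current_bikes=20, predicted_demand=5),
    StationState("stn-3", current_bikes=8, predicted_demand=9),
]
print(rebalancing_plan(stations))
```

A real truck plan would feed this ranked list into a routing solver; the sketch stops at the prioritization step.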
Rent a Bike
POST /api/v1/trips/start
{
"station_id": "stn-42",
"dock_id": "dock-7",
"bike_type": "classic"
}
Response: 200 OK
{
"trip_id": "trip-uuid",
"bike_id": "B-123",
"started_at": "2025-03-14T08:30:00Z",
"pricing_plan": "annual_member",
"free_minutes": 30,
"unlock_status": "success"
}
Return a Bike
POST /api/v1/trips/end
{
"trip_id": "trip-uuid",
"station_id": "stn-55",
"dock_id": "dock-12"
}
Response: 200 OK
{
"trip_id": "trip-uuid",
"duration_minutes": 22,
"distance_km": 4.2,
"fare": 0.00,
"fare_breakdown": { "base": 0, "overage": 0, "ebike_premium": 0 }
}
Get Nearby Stations
GET /api/v1/stations?lat=37.7749&lng=-122.4194&radius=1000&limit=10
Reserve a Bike
POST /api/v1/reservations
{
"station_id": "stn-42",
"bike_type": "classic"
}
Response: 200 OK
{
"reservation_id": "res-uuid",
"station_id": "stn-42",
"bike_id": "B-123",
"expires_at": "2025-03-14T08:35:00Z"
}
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
- 440 Login Timeout: WebSocket session expired; reconnect required
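
On the client side, "retry with idempotency key" and "use exponential backoff" combine into one pattern. The sketch below assumes a hypothetical `send(payload, idempotency_key=...)` transport that returns `(status, body)`; the retryable status set and the delay parameters are illustrative choices, and a production client would also honor the `Retry-After` header on 429.

```python
import random
import time
import uuid

RETRYABLE = {429, 500, 503}  # statuses the error list marks as safe to retry


def backoff_delays(max_attempts=4, base_s=0.5, cap_s=8.0, rng=random.random):
    """Full-jitter exponential backoff: delay n is in [0, min(cap, base * 2**n))."""
    return [rng() * min(cap_s, base_s * 2 ** n) for n in range(max_attempts)]


def call_with_retry(send, payload, max_attempts=4, sleep=time.sleep):
    """Call send() until a non-retryable status or attempts run out.

    The same idempotency key is reused on every attempt so the server
    can deduplicate a request that succeeded but whose response was lost.
    """
    key = str(uuid.uuid4())
    delays = backoff_delays(max_attempts)
    status, body = None, None
    for attempt in range(max_attempts):
        status, body = send(payload, idempotency_key=key)
        if status not in RETRYABLE:
            break
        if attempt < max_attempts - 1:
            sleep(delays[attempt])
    return status, body
```

Generating the key once per logical operation, not once per HTTP attempt, is what makes the retry safe for `POST /api/v1/trips/start`.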
PostgreSQL
CREATE EXTENSION IF NOT EXISTS postgis;  -- needed for GEOMETRY columns and GIST on geometry
-- PostgreSQL has no inline ENUM(...) column syntax; enum types are created first.
CREATE TYPE station_status_enum AS ENUM ('active','maintenance','closed');
CREATE TYPE bike_type_enum AS ENUM ('classic','ebike');
CREATE TYPE bike_status_enum AS ENUM ('available','rented','maintenance','retired');
CREATE TYPE trip_status_enum AS ENUM ('active','completed','cancelled');
CREATE TYPE membership_type_enum AS ENUM ('annual','monthly','day_pass','none');
CREATE TABLE stations (
  station_id UUID PRIMARY KEY,
  name VARCHAR(255), lat DECIMAL(10,7), lng DECIMAL(10,7),
  total_docks SMALLINT, status station_status_enum,
  geometry GEOMETRY(Point, 4326)
);
CREATE INDEX idx_location ON stations USING GIST(geometry);
CREATE TABLE bikes (
  bike_id UUID PRIMARY KEY,
  bike_type bike_type_enum,
  status bike_status_enum,
  current_station UUID REFERENCES stations,
  battery_level DECIMAL(3,2), total_trips INT DEFAULT 0
);
CREATE TABLE trips (
  trip_id UUID PRIMARY KEY,
  user_id UUID NOT NULL, bike_id UUID NOT NULL,
  start_station_id UUID NOT NULL, end_station_id UUID,
  started_at TIMESTAMPTZ NOT NULL, ended_at TIMESTAMPTZ,
  duration_minutes DECIMAL(8,2), distance_km DECIMAL(8,2),
  fare_cents INT, status trip_status_enum
);
-- PostgreSQL does not accept inline INDEX clauses; indexes are created separately.
CREATE INDEX idx_user ON trips (user_id, started_at DESC);
CREATE INDEX idx_bike ON trips (bike_id, started_at DESC);
CREATE INDEX idx_active ON trips (status) WHERE status = 'active';  -- partial index
CREATE TABLE users (
  user_id UUID PRIMARY KEY, email VARCHAR(255) UNIQUE,
  name VARCHAR(100), membership_type membership_type_enum,
  membership_expires DATE, balance_cents INT DEFAULT 0
);
Redis: Real-time State
station:{id}:bikes_available → INT
station:{id}:docks_available → INT
station:{id}:ebikes_available → INT
bike_lock:{bike_id} → user_id (TTL: 30s)
reservation:{station_id}:{bike_id} → reservation_id (TTL: 300s)
active_ride:{user_id} → trip_id (TTL: 86400)
ebike_battery:{bike_id} → FLOAT (TTL: 120)
stations:geo → Sorted Set (GEOADD for nearby queries)
ClickHouse: Trip Analytics
CREATE TABLE trip_analytics (
trip_id UUID, user_id UUID, bike_id UUID,
bike_type Enum8('classic'=0,'ebike'=1),
start_station UUID, end_station UUID,
started_at DateTime, ended_at DateTime,
duration_min Float32, distance_km Float32, fare_cents UInt32,
day_of_week UInt8, hour_of_day UInt8,
trip_date Date MATERIALIZED toDate(started_at)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(started_at)
ORDER BY (start_station, started_at);
| Concern | Solution |
|---|---|
| Race condition on checkout | Redis SET NX (atomic lock) |
| Dock controller offline | Queue command; retry when online; manual override at kiosk |
| Redis failure | Fallback to PostgreSQL; Redis Cluster for HA |
| Payment failure | Start ride anyway (pre-authorized); charge async post-ride |
| Trip not ended (abandoned) | Auto-end after 24h; charge max fare; flag for ops |
| Network outage at station | Local unlock capability (cached member list); sync when online |
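
The checkout lock in the first row relies on the semantics of Redis `SET key value NX EX ttl`: only the first writer wins, and the key expires on its own. The sketch below is an in-memory stand-in for those semantics, useful for reasoning and unit tests; the `BikeLocks` class is hypothetical, it is single-threaded, and the "same user retrying succeeds" branch is an added convenience that real `SET NX` does not provide by itself (a real client would `GET` and compare the holder).

```python
import time


class BikeLocks:
    """In-memory stand-in for Redis `SET bike_lock:{bike_id} {user_id} NX EX 30`."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._locks = {}  # bike_id -> (user_id, expires_at)

    def acquire(self, bike_id, user_id, ttl_s=30):
        now = self._clock()
        holder = self._locks.get(bike_id)
        if holder is not None and holder[1] > now:
            # Lock is live: only the current holder's duplicate request succeeds.
            return holder[0] == user_id
        # No lock, or it expired: this caller wins (the NX case).
        self._locks[bike_id] = (user_id, now + ttl_s)
        return True

    def release(self, bike_id, user_id):
        # Only the holder may release, so a late caller cannot drop someone else's lock.
        holder = self._locks.get(bike_id)
        if holder is not None and holder[0] == user_id:
            del self._locks[bike_id]


locks = BikeLocks()
print(locks.acquire("B-123", "alice"))  # first request for the bike
print(locks.acquire("B-123", "bob"))    # competing request while the lock is live
```

The 30-second expiry comfortably covers the unlock path (10 s ACK timeout with one retry), so a crashed request cannot strand a bike for long.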
Interview Walkthrough
- Start with the physical-digital boundary: unlocking a dock solenoid is as critical as the API — plan for hardware failure modes upfront.
- Walk through rent flow: find nearby stations (PostGIS) → atomic trip start (lock bike, decrement dock count, open trip record) in one PostgreSQL transaction.
- Explain station inventory as the source of truth: dock sensors + e-bike GPS heartbeats, with confidence scores when sensors disagree.
- Cover return flow: validate dock availability, compute fare from duration/plan, release lock — again ACID across trips and station tables.
- Discuss rebalancing as an async ops problem: predict empty/full stations and dispatch trucks without blocking the rent/return hot path.
- Mention MQTT to station controllers for lock commands with local fallback (member card auth) when cellular is down.
- Common pitfall: treating station bike counts as eventually consistent — a race between two users renting the last bike causes double-unlock or angry riders.
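
The nearby-station lookup at the start of the rent flow would normally be served by PostGIS or a Redis geo index. To make the underlying computation concrete, here is a plain-Python sketch using the haversine formula; the station dictionaries and the linear scan are simplifications for illustration, not how an indexed query executes.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two lat/lng points."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def nearby_stations(stations, lat, lng, radius_m=1000, limit=10):
    """Return up to `limit` station IDs within radius_m, nearest first."""
    hits = []
    for s in stations:
        d = haversine_m(lat, lng, s["lat"], s["lng"])
        if d <= radius_m:
            hits.append((d, s["station_id"]))
    hits.sort()
    return [station_id for _, station_id in hits[:limit]]


stations = [
    {"station_id": "stn-a", "lat": 37.7749, "lng": -122.4194},
    {"station_id": "stn-b", "lat": 37.7790, "lng": -122.4194},  # roughly 450 m north
    {"station_id": "stn-c", "lat": 37.8049, "lng": -122.4194},  # roughly 3.3 km north
]
print(nearby_stations(stations, 37.7749, -122.4194))
```

With 5,000 stations a linear scan is already cheap, which is a useful point to raise before reaching for a spatial index.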
Dock Sensor Failure: "Ghost Bikes"
Detection: user reports + GPS check (e-bikes report real location)
Mitigation: show "reported" count, under-report, confidence scores
Opposite case, "phantom docks": kiosk override + maintenance alert
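
One simple way to "under-report" is to subtract bikes with open ghost reports from the sensor count before display. The function below is a sketch of that idea only; the formula, the confidence definition, and the name `displayed_availability` are assumptions for illustration, not a described production method.

```python
def displayed_availability(sensor_count: int, open_ghost_reports: int):
    """Return (count to show riders, confidence in [0, 1]).

    Each open report is assumed to cast doubt on one bike, so the
    displayed count errs on the low side rather than promising a bike
    that may not be there.
    """
    suspect = min(sensor_count, open_ghost_reports)
    shown = sensor_count - suspect
    confidence = 1.0 if sensor_count == 0 else shown / sensor_count
    return shown, round(confidence, 2)


print(displayed_availability(sensor_count=5, open_ghost_reports=2))
```

Showing fewer bikes than the sensors claim costs a little utilization but avoids sending a rider to an empty station.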
Handling Full / Empty Stations
Full station: extra 15 free min + show nearby alternatives
Virtual return: mark "returned" at full station (risk: theft)
Push notification: "Station 55 is filling up. Consider Station 57"
Physical Lock Mechanism
App → API → MQTT → Station Controller → Dock Lock Motor
Failure modes:
a. Solenoid stuck → "dock unavailable" + alert maintenance
b. Cellular outage → fallback: local member card auth
c. Power outage → battery backup (4hr) + fail-secure (bikes stay locked)
This is why the target is 99.9% (not 99.99%): physical hardware fails more often than software.
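
The command path above, combined with the checkout flow's "wait for ACK (timeout 10s, retry once)", can be sketched as a small retry wrapper. The `send_unlock` callable is hypothetical: it stands for whatever publishes the MQTT command and blocks until the dock controller acknowledges or the timeout elapses.

```python
from typing import Callable


def unlock_with_retry(
    send_unlock: Callable[[str, float], bool],
    dock_id: str,
    ack_timeout_s: float = 10.0,
    retries: int = 1,
) -> str:
    """Try to unlock a dock; return "unlocked" or "dock_unavailable".

    send_unlock(dock_id, timeout) must return True only when the dock
    controller acknowledged the command within the timeout.
    """
    for _ in range(1 + retries):
        if send_unlock(dock_id, ack_timeout_s):
            return "unlocked"
    # No ACK after all attempts: surface the dock as unavailable so the
    # caller can release the bike lock and alert maintenance.
    return "dock_unavailable"


# Simulated controller that misses the first command and ACKs the second.
attempts = []
def flaky_controller(dock_id, timeout):
    attempts.append(dock_id)
    return len(attempts) >= 2

print(unlock_with_retry(flaky_controller, "dock-7"))
```

The unlock command should be idempotent on the controller side, since a retry may arrive after the first command actually succeeded but its ACK was lost.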
Should You Allow Reservations?
Pros: user knows bike is waiting, reduces frustration
Cons: reduces effective supply, no-shows waste 5 min
Citi Bike: NO reservations for regular bikes. E-bikes can be reserved (premium).
If implementing: max 1 per user, 5-min window, limit to stations with ≥5 bikes
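
The three "if implementing" rules translate directly into an eligibility check. This is a minimal sketch: the in-memory `active_reservations` dict stands in for the Redis reservation keys, and the function name and signature are hypothetical.

```python
from datetime import datetime, timedelta, timezone

RESERVATION_WINDOW = timedelta(minutes=5)
MIN_BIKES_FOR_RESERVATION = 5


def try_reserve(user_id, station_bikes_available, active_reservations, now):
    """Return the reservation expiry time, or None if the request is rejected.

    active_reservations maps user_id -> expiry datetime.
    """
    existing = active_reservations.get(user_id)
    if existing is not None and existing > now:
        return None  # rule: max 1 live reservation per user
    if station_bikes_available < MIN_BIKES_FOR_RESERVATION:
        return None  # rule: only stations with at least 5 bikes
    expires_at = now + RESERVATION_WINDOW  # rule: 5-minute window
    active_reservations[user_id] = expires_at
    return expires_at


reservations = {}
now = datetime(2025, 3, 14, 8, 30, tzinfo=timezone.utc)
print(try_reserve("user-1", 6, reservations, now))
```

The station threshold keeps reservations from consuming the last few bikes, which is the "reduces effective supply" concern in the cons list.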
Why PostgreSQL (Not DynamoDB/Cassandra) for Trips?
ACID required: lock + start trip + update station (multi-table atomic)
Billing queries: SUM fares by user (JOINs + GROUP BY)
Moderate scale: 500K trips/day = ~6 writes/sec
PostgreSQL ✓: ACID, PostGIS, rich analytics, 6 writes/sec trivial
DynamoDB ✗: transactions capped at 100 items, no JOINs
Cassandra ✗: overkill, no multi-table ACID transactions, no JOINs
At Citi Bike scale: PostgreSQL is the right choice.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving the core bike-sharing flows (rent, return, station status). Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.9% | Same target as the availability requirement; budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
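
An availability target becomes actionable once it is converted into a downtime budget. The helper below does that conversion; the 30-day window is an assumed reporting period.

```python
def downtime_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given availability target."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_pct / 100)


for target in (99.9, 99.95, 99.99):
    print(f"{target}% -> {downtime_budget_minutes(target):.1f} min per 30 days")
```

At 99.9% the budget is about 43 minutes per 30 days, while 99.99% leaves just over 4 minutes, which is hard to meet when dock hardware is part of the critical path.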
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.