Interview Prompt
Design a Shared Calendar System (like Google Calendar).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- End-to-end Shared Calendar System flows and API contracts
- High-level architecture with major components
- Data model and storage choices with rationale
- Capacity estimation with shown math
- Primary failure modes and mitigations
Out of scope (state explicitly)
- Flight and travel package bundling
- Property management / housekeeping systems
- Revenue-management ML model internals
Assumptions
- Clarify scale (DAU, QPS, data volume) for the shared calendar system in the first 5 minutes
- Standard reliability target of 99.9%–99.99% unless the problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Create, update, delete events (title, time, location, description, recurrence)
- Recurring events: daily, weekly, monthly, yearly, custom (RFC 5545 RRULE)
- Invite attendees: send invitations, track RSVP
- Free/busy lookup: check availability of multiple people across calendars
- Calendar sharing: view-only or edit access to entire calendars
- Reminders and notifications (push, email) before events
- Time zone handling: events stored in UTC, displayed in user's timezone
- Room/resource booking: find and reserve conference rooms
- Calendar sync: CalDAV/iCal standard for external apps
- Multiple calendars per user (personal, work, holidays)
- Consistency: No double-booking of rooms
- Availability: 99.99% (the calendar is an always-on productivity tool)
- Low Latency: Calendar view load < 200ms, free/busy query < 500ms
- Sync: Changes propagated to all devices within 5 seconds
- Scalability: 1B+ users, 10B+ events
| Metric | Calculation | Value |
|---|---|---|
| Users | Given | 1B |
| Events per user (next 12 months) | Given | 500 (including recurring expansions) |
| Total events | 1B users × 500 events/user | 500B (with recurring expansions) |
| Stored events (without expansion) | Given | 50B |
| Calendar views / sec | Derived from daily volume ÷ 86400 (+ peak factor) | 100K |
| Event creates / sec | Derived from daily volume ÷ 86400 (+ peak factor) | 10K |
| Free/busy queries / sec | Derived from daily volume ÷ 86400 (+ peak factor) | 50K |
Recurring Events: Storage vs Expansion
Wrong approach: Store every occurrence of "daily standup for 2 years" = 730 rows
Right approach: Store ONE row with recurrence rule (RRULE)
  RRULE:FREQ=WEEKLY;BYDAY=MO,WE,FR;UNTIL=20261231
  → Expands at query time to show individual occurrences
Exception handling:
  - "Edit this occurrence": store exception (overrides that single instance)
  - "Edit this and following": split into two recurrence rules
  - "Delete this occurrence": store exclusion date (EXDATE)
Storage: 1 event row + exceptions
Query: expand RRULE in [view_start, view_end] range at read time
Cache: pre-expand next 30 days in Redis for fast calendar view
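The read-time expansion above can be sketched in a few lines. This is a minimal illustration for a FREQ=WEEKLY rule only, not a full RFC 5545 implementation; the function name `expand_weekly`, its parameters, and the default title are assumptions made for this example.

```python
from datetime import date, timedelta

# RFC 5545 weekday codes mapped to Python's date.weekday() numbering.
WEEKDAYS = {"MO": 0, "TU": 1, "WE": 2, "TH": 3, "FR": 4, "SA": 5, "SU": 6}


def expand_weekly(dtstart, byday, until, view_start, view_end,
                  exdates=(), overrides=None, title="Standup"):
    """Expand one stored FREQ=WEEKLY rule into occurrences inside the view.

    exdates   -> dates removed by "delete this occurrence"
    overrides -> {date: title} stored by "edit this occurrence"
    Only the visible window is walked, never the whole series.
    """
    overrides = overrides or {}
    wanted = {WEEKDAYS[code] for code in byday}
    day = max(dtstart, view_start)
    last = min(until, view_end)
    occurrences = []
    while day <= last:
        if day.weekday() in wanted and day not in exdates:
            occurrences.append((day, overrides.get(day, title)))
        day += timedelta(days=1)
    return occurrences


# One stored row, expanded for the week of Mon 2026-03-09.
week = expand_weekly(
    dtstart=date(2026, 1, 5), byday=["MO", "WE", "FR"], until=date(2026, 12, 31),
    view_start=date(2026, 3, 9), view_end=date(2026, 3, 15),
    exdates={date(2026, 3, 11)},
    overrides={date(2026, 3, 13): "Standup (room B)"},
)
```

Here the Wednesday is dropped by the EXDATE and the Friday carries its override, while the stored rule itself is untouched.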
Free/Busy Query (Key Algorithm)
Input: "Find when Alice, Bob, and Carol are all free next Tuesday 9am-5pm"
1. Fetch all events for each person on Tuesday (from cache or DB)
2. Build busy intervals per person:
   Alice: [9:00-10:00, 11:00-12:00, 14:00-15:00]
   Bob: [9:30-10:30, 13:00-14:00]
   Carol: [10:00-11:30]
3. Merge all busy intervals → union: [9:00-12:00, 13:00-15:00]
4. Invert → free slots: [12:00-13:00, 15:00-17:00]
5. Filter by desired meeting duration
Optimization: Pre-compute daily free/busy bitmask (one bit per 15-min slot, 1 = busy)
  8 hours × 4 slots/hr = 32 bits per day per person
  OR all busy bitmasks, then invert → free slots (one O(1) bitwise operation per attendee)
  Equivalently, AND the masks if a set bit means "free"
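Steps 2 to 5 amount to a classic merge-intervals pass followed by an inversion. A minimal sketch, assuming times are minutes since midnight and using hypothetical helper names `merge` and `free_slots`:

```python
def merge(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def free_slots(busy_by_person, window_start, window_end, min_minutes=0):
    """Return gaps in the window where nobody is busy, at least min_minutes long."""
    all_busy = [iv for ivs in busy_by_person.values() for iv in ivs]
    free, cursor = [], window_start
    for start, end in merge(all_busy):
        # Clip each busy block to the query window.
        start, end = max(start, window_start), min(end, window_end)
        if start >= end:
            continue  # block lies entirely outside the window
        if start > cursor:
            free.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < window_end:
        free.append((cursor, window_end))
    return [(s, e) for s, e in free if e - s >= min_minutes]


busy = {
    "alice": [(540, 600), (660, 720), (840, 900)],  # 9-10, 11-12, 14-15
    "bob": [(570, 630), (780, 840)],                # 9:30-10:30, 13-14
    "carol": [(600, 690)],                          # 10-11:30
}
slots = free_slots(busy, 9 * 60, 17 * 60)  # [(720, 780), (900, 1020)]
```

The result matches the walkthrough: 12:00-13:00 and 15:00-17:00. Passing `min_minutes=90` would leave only the afternoon slot.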
Room Booking: PostgreSQL Exclusion Constraint
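-- Atomic room reservation (prevent double-booking)
BEGIN;
-- Lock the room's own row first (assumes a rooms table keyed by room_id).
-- FOR UPDATE on the events query alone locks nothing when no overlapping row
-- exists, so two concurrent transactions could both see "free" and both insert.
SELECT room_id FROM rooms WHERE room_id = 'conf-room-A' FOR UPDATE;
SELECT event_id FROM events
WHERE room_id = 'conf-room-A'
AND start_time < '2026-03-14 11:00+00'
AND end_time > '2026-03-14 10:00+00';
-- If no rows returned, room is free
INSERT INTO events (room_id, start_time, end_time, ...) VALUES (...);
COMMIT;
-- Alternative: Exclusion constraint (database-level guarantee)
-- EXCLUDE USING gist (room_id WITH =, tstzrange(start_time, end_time) WITH &&)
-- (tstzrange because the columns are TIMESTAMPTZ; room_id WITH = needs btree_gist)
-- Prevents overlapping time ranges for same room automatically
Time Zone + DST Deep Dive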
Problem: "9 AM every Monday" in New York
Summer (EDT, UTC-4): 9 AM local = 13:00 UTC
Winter (EST, UTC-5): 9 AM local = 14:00 UTC
Storage: Store RRULE + original timezone ("America/New_York")
Expansion: At query time, expand RRULE using IANA tz database
NEVER store computed UTC offsets in the RRULE itself
Offsets change when DST rules change (governments update DST dates)
Must re-expand using latest IANA tzdata at render time
CalDAV Sync Protocol
Client A edits → server stores the change → version is incremented. Client B sends a sync-collection REPORT (RFC 6578) carrying its sync-token → server returns the changes since that token. Efficient delta sync: only diffs are sent. Conflict resolution: last-write-wins with server timestamp.
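The token-based delta sync can be modelled in memory as follows. This is a sketch of the idea only, not of any CalDAV server's internals; the class `CalendarCollection` and its methods are hypothetical, and a monotonically increasing integer stands in for the opaque sync-token.

```python
class CalendarCollection:
    """Toy server-side collection that answers "what changed since token T?"."""

    def __init__(self):
        self.token = 0
        self.items = {}       # uid -> (payload, token at last change)
        self.tombstones = {}  # uid -> token at deletion

    def put(self, uid, payload):
        self.token += 1
        self.items[uid] = (payload, self.token)
        self.tombstones.pop(uid, None)
        return self.token

    def delete(self, uid):
        self.token += 1
        self.items.pop(uid, None)
        self.tombstones[uid] = self.token  # remembered so clients can drop it too
        return self.token

    def changes_since(self, sync_token):
        """Return (changed items, deleted uids, new token) for a delta sync."""
        changed = {uid: p for uid, (p, t) in self.items.items() if t > sync_token}
        deleted = sorted(uid for uid, t in self.tombstones.items() if t > sync_token)
        return changed, deleted, self.token


server = CalendarCollection()
client_b_token = server.put("evt-a", "Standup")  # client B is in sync up to here
server.put("evt-b", "Lunch")                     # client A adds an event
server.delete("evt-a")                           # client A deletes one
changed, deleted, client_b_token = server.changes_since(client_b_token)
```

Client B receives only `evt-b` plus the deletion of `evt-a`, then stores the new token for its next request. Tombstones would need a retention policy in practice, which this sketch leaves out.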
Event Bus Design (Kafka)
Topic: shared_calendar_system-events
Partitions: 64 (scale consumers horizontally)
Partition key: entity_id (user_id / calendar_id; preserves per-entity ordering)
Retention: 7 days (compliance) or 24h (high-volume telemetry)
Replication factor: 3, min.insync.replicas: 2
Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "shared_calendar_system-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: shared_calendar_system-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers
Async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates
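The consumer-side rules (dedup by event_id, DLQ after 3 retries) can be sketched without a Kafka client. Everything here is an in-memory stand-in: `process_batch`, the `seen_ids` set, and the `dlq` list are hypothetical, and a real deployment would keep the dedup set in a durable store.

```python
MAX_RETRIES = 3  # retries after the first attempt, as in the DLQ rule above


def process_batch(messages, handler, seen_ids, dlq):
    """Apply handler to each message at most once; park poison messages in dlq."""
    for msg in messages:
        if msg["event_id"] in seen_ids:
            continue  # duplicate from at-least-once delivery: already applied
        for attempt in range(MAX_RETRIES + 1):
            try:
                handler(msg)
                seen_ids.add(msg["event_id"])
                break
            except Exception:
                if attempt == MAX_RETRIES:
                    dlq.append(msg)  # give up; do not block the partition


applied, attempts_on_bad = [], []


def handler(msg):
    if msg["event_id"] == "bad":
        attempts_on_bad.append(1)
        raise ValueError("poison message")
    applied.append(msg["event_id"])


seen, dlq = set(), []
batch = [{"event_id": "e1"}, {"event_id": "e1"}, {"event_id": "bad"}, {"event_id": "e2"}]
process_batch(batch, handler, seen, dlq)
```

The redelivered `e1` is applied once, the poison message is tried four times (one attempt plus three retries) and then moved aside, and `e2` still gets processed.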
POST /api/events → Create event
GET /api/events?start=...&end=...&calendar_id=... → List events in range
PUT /api/events/{id} → Update event (this/all/following)
DELETE /api/events/{id} → Delete event (this/all/following)
POST /api/events/{id}/rsvp → Accept/decline/tentative
GET /api/freebusy → Query free/busy for list of users
POST /api/rooms/search → Find available rooms for time slot
GET /api/calendars → List user's calendars
POST /api/calendars/{id}/share → Share calendar with user/group
Common Error Responses
- 400 Bad Request: invalid input, missing fields, or malformed JSON
- 401 Unauthorized: missing or invalid auth token or API key
- 403 Forbidden: authenticated but insufficient permissions
- 404 Not Found: resource ID does not exist
- 409 Conflict: duplicate write or version conflict; retry with idempotency key
- 422 Unprocessable Entity: valid syntax but invalid business logic
- 429 Too Many Requests: rate limit exceeded; honor Retry-After header
- 500 Internal Error: unexpected server fault; retry with idempotency key
- 503 Service Unavailable: dependency down or overloaded; use exponential backoff
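On the client side, the retry guidance for 429, 500, and 503 might look like the sketch below. The function name `retry_delay`, the base and cap values, and the choice of full jitter are assumptions for illustration; Retry-After is assumed to arrive as a number of seconds.

```python
import random

RETRYABLE = {429, 500, 503}


def retry_delay(status, attempt, retry_after=None, base=0.5, cap=30.0,
                rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based), or None to give up.

    429 honors the server's Retry-After value when present; otherwise the delay
    is exponential backoff with full jitter, capped at `cap` seconds.
    """
    if status not in RETRYABLE:
        return None  # e.g. 400/403/404: retrying the same request cannot help
    if status == 429 and retry_after is not None:
        return float(retry_after)
    ceiling = min(cap, base * 2 ** attempt)
    return rng() * ceiling
```

Retries of writes should reuse the same idempotency key so that a request which succeeded server-side but timed out client-side is not applied twice.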
PostgreSQL
CREATE EXTENSION IF NOT EXISTS btree_gist; -- lets the GiST exclusion constraint use = on room_id
CREATE TABLE events (
event_id UUID PRIMARY KEY,
calendar_id UUID NOT NULL,
creator_id UUID NOT NULL,
title TEXT, description TEXT, location TEXT,
start_time TIMESTAMPTZ NOT NULL,
end_time TIMESTAMPTZ NOT NULL,
timezone TEXT DEFAULT 'UTC',
is_all_day BOOLEAN DEFAULT FALSE,
recurrence TEXT,
room_id UUID,
status TEXT DEFAULT 'confirmed',
EXCLUDE USING gist (room_id WITH =, tstzrange(start_time, end_time) WITH &&)
WHERE (room_id IS NOT NULL)
);
CREATE TABLE event_attendees (
event_id UUID REFERENCES events(event_id),
user_id UUID,
rsvp_status TEXT DEFAULT 'needs-action',
PRIMARY KEY (event_id, user_id)
);
Redis (Free/Busy Cache + RRULE Expansion Cache)
freebusy:{user_id}:{date} → 32-bit bitmask
calendar:{user_id}:{month} → expanded events for month
Room Double-Booking Prevention
PostgreSQL exclusion constraint → database-level guarantee. Even if two concurrent transactions try to book overlapping times, at most one can commit; the other fails with an exclusion violation (SQLSTATE 23P01) and should be surfaced to the caller as 409 Conflict.
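The invariant the constraint enforces is plain interval overlap on half-open ranges. A minimal in-memory model, useful for reasoning and unit tests but not a substitute for the database guarantee under concurrency; `RoomLedger` and `RoomConflict` are hypothetical names.

```python
from datetime import datetime, timezone


class RoomConflict(Exception):
    """Raised when a booking overlaps an existing one for the same room."""


class RoomLedger:
    def __init__(self):
        self._bookings = {}  # room_id -> list of (start, end)

    def book(self, room_id, start, end):
        if end <= start:
            raise ValueError("end must be after start")
        for s, e in self._bookings.get(room_id, []):
            # Half-open [start, end) overlap test, same semantics as &&
            # on default range bounds: back-to-back bookings do not collide.
            if start < e and end > s:
                raise RoomConflict(f"{room_id} already booked {s} to {e}")
        self._bookings.setdefault(room_id, []).append((start, end))


def at(hour, minute=0):
    return datetime(2026, 3, 14, hour, minute, tzinfo=timezone.utc)


ledger = RoomLedger()
ledger.book("conf-room-A", at(10), at(11))
ledger.book("conf-room-A", at(11), at(12))  # adjacent slot is fine
ledger.book("conf-room-B", at(10), at(11))  # a different room is fine
try:
    ledger.book("conf-room-A", at(10, 30), at(11, 30))
    overlap_rejected = False
except RoomConflict:
    overlap_rejected = True
```

The same predicate (`start < other_end and end > other_start`) is what an application-level pre-check would run before relying on the constraint as the final arbiter.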
vs Ticketing (#23): Calendar manages TIME SLOTS not seats; recurring events are unique
vs Notification (#05): Calendar is the SOURCE of scheduled notifications
vs Job Scheduler (#28): Similar recurrence handling (RRULE ≈ cron), but calendar is user-facing
Unique challenges:
- RRULE expansion with exceptions is complex (RFC 5545 spec is 150+ pages)
- Time zone + DST: "9 AM every Monday" means different UTC times
- Free/busy across organizations: privacy (show busy, not event details)
- Room booking: exclusion constraints prevent overlapping reservations at DB level
Interview Walkthrough
- Frame scheduling conflicts as interval overlap detection — exclusion constraints or application-level locking.
- Explain free/busy aggregation across multiple calendars without exposing event details.
- Cover timezone handling: store UTC, display local — DST transitions break naive datetime math.
- Discuss recurring event expansion (RRULE) as lazy expansion vs precomputed instances.
- Mention invite flow with RSVP state machine and notification via async pipeline.
- Common pitfall: comparing local times across timezones without UTC normalization — meetings appear at wrong hours.
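The timezone points in this walkthrough can be demonstrated with the standard library. A small sketch, assuming Python 3.9+ with IANA tzdata available on the host; `local_to_utc` is a hypothetical helper.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def local_to_utc(year, month, day, hour, minute, tz_name):
    """Resolve a wall-clock time in an IANA zone to the matching UTC instant."""
    local = datetime(year, month, day, hour, minute, tzinfo=ZoneInfo(tz_name))
    return local.astimezone(timezone.utc)


# "9 AM every Monday" in New York, one summer and one winter occurrence.
summer = local_to_utc(2026, 7, 6, 9, 0, "America/New_York")  # EDT, UTC-4
winter = local_to_utc(2026, 1, 5, 9, 0, "America/New_York")  # EST, UTC-5
```

`summer` falls at 13:00 UTC and `winter` at 14:00 UTC, which is why the recurrence has to be expanded from the local time plus the zone name rather than from a fixed UTC offset.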
Scalability: Sharding Calendar Data
Shard by calendar_id (≈ user_id for primary calendar):
events table: shard by calendar_id
1B users / 256 shards = ~4M users per shard
Single-user operations: single shard query ✓
Cross-shard challenges:
1. Free/busy query for 10 attendees across 10 shards:
Solution: Pre-computed free/busy bitmasks in Redis
Key: freebusy:{user_id}:{date} → 32-bit bitmask
No cross-shard DB queries needed ✓
2. Shared calendar: Kafka topic calendar-changes → each shard's consumer
Eventual consistency: updates visible within 2-5 seconds
3. Room booking: Separate rooms shard with exclusion constraints
Offline Sync Conflict Resolution
Strategy: Field-level merge with last-write-wins per field
Example:
Phone edit (offline, T=10:00): Changed title to "Team Standup v2"
Laptop edit (online, T=10:05): Changed time to 10:30 AM
Phone reconnects at T=10:15:
- title: phone="Team Standup v2" (T=10:00), server is older
→ Accept phone's title ✓
- time: phone=10:00 AM (unchanged), server=10:30 AM (T=10:05)
→ Keep server's time ✓
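The example above reduces to a per-field comparison of modification times. A minimal sketch, assuming each side tracks a `(value, modified_at)` pair per field; `merge_fields` is a hypothetical helper and the integer timestamps are stand-ins for real clocks.

```python
def merge_fields(server, client):
    """Field-level last-write-wins: the newer timestamp wins for each field.

    Both arguments map field name -> (value, modified_at).
    Ties go to the server copy.
    """
    merged = dict(server)
    for field, (value, modified_at) in client.items():
        if field not in merged or modified_at > merged[field][1]:
            merged[field] = (value, modified_at)
    return merged


# Timestamps written as HHMM integers for readability.
server = {"title": ("Team Standup", 900), "time": ("10:30 AM", 1005)}
phone = {"title": ("Team Standup v2", 1000), "time": ("10:00 AM", 900)}
merged = merge_fields(server, phone)
```

Both edits survive: the phone's title and the laptop's time. Device clocks can drift, so a real system would likely compare server-assigned versions or hybrid timestamps rather than raw client time.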
Same-field conflict: Last-write-wins by timestamp
Deletion conflict: Option A (delete wins) vs Option B (preserve with changes)
Recurring event conflict: Apply bulk changes first, then single exceptions
RRULE Expansion at Read-Time vs Materialized
Option 1: Expand at read time (this design)
  Store: 1 row with RRULE
  ✓ Storage efficient: 1 row instead of 365+ rows per recurring event
  ✓ "Edit all future" is a single row update
  ✗ CPU cost at read time (~1ms per event)
  Mitigation: Redis cache of expanded events for next 30 days
Option 2: Materialized occurrences
  Store: 365 rows for "daily standup for 1 year"
  ✓ Simple queries (no expansion logic)
  ✗ Storage explosion: 500B+ rows for all users
  ✗ "Edit all future" requires updating 200+ rows atomically
Recommendation: Expand at read time + aggressive caching
Bitmask Granularity: 15-min vs Per-Minute
15-minute slots: 32 bits/day/user (an 8-hour working day), so 4 bytes × 365 days × 1B users ≈ 1.46 TB for a full year. Per-minute: 15× more storage. Recommendation: 15-minute slots (Google Calendar's approach). Standard meeting durations (15, 30, 60 minutes) align with slot boundaries.
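A sketch of the 15-minute bitmask, assuming a set bit means "busy", a 09:00-17:00 working day, and times in minutes since midnight. The names `busy_mask` and `common_free_slots` are hypothetical.

```python
SLOT_MINUTES = 15
DAY_START = 9 * 60  # assumed working day 09:00-17:00 -> 32 slots
SLOTS = 32


def busy_mask(intervals):
    """Encode (start, end) busy intervals as a 32-bit mask, bit i = slot i busy."""
    mask = 0
    for start, end in intervals:
        first = max(0, (start - DAY_START) // SLOT_MINUTES)
        # Ceiling division so a partially used slot still counts as busy.
        last = min(SLOTS, -(-(end - DAY_START) // SLOT_MINUTES))
        for slot in range(first, last):
            mask |= 1 << slot
    return mask


def common_free_slots(masks):
    """OR the busy masks, invert, and return start minutes of slots free for all."""
    combined = 0
    for m in masks:
        combined |= m
    free = ~combined & ((1 << SLOTS) - 1)
    return [DAY_START + i * SLOT_MINUTES for i in range(SLOTS) if (free >> i) & 1]


alice = busy_mask([(540, 600), (660, 720), (840, 900)])
bob = busy_mask([(570, 630), (780, 840)])
carol = busy_mask([(600, 690)])
free_starts = common_free_slots([alice, bob, carol])

yearly_bytes = 4 * 365 * 1_000_000_000  # 32 bits/day/user for 1B users
```

`free_starts` covers 12:00-13:00 and 15:00-17:00 in 15-minute steps, and `yearly_bytes` reproduces the 1.46 TB figure.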
Exclusion Constraint vs Application-Level Locking
Exclusion constraint: the DB enforces the invariant, so no application code path can bypass it. Application-level locking: more flexible, but carries bug risk. Recommendation: use the exclusion constraint as the safety net plus an application-level check for friendlier errors. Belt AND suspenders approach.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving core shared calendar system flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
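An availability target becomes more concrete once converted into allowed downtime. A small sketch, assuming a 30-day window; `error_budget_minutes` is a hypothetical helper.

```python
def error_budget_minutes(slo, days=30):
    """Minutes of full outage the SLO tolerates over the window."""
    return (1 - slo) * days * 24 * 60


budget_9995 = error_budget_minutes(0.9995)  # about 21.6 minutes per 30 days
budget_9999 = error_budget_minutes(0.9999)  # about 4.3 minutes per 30 days
```

The gap between roughly 21.6 and 4.3 minutes is the practical difference between the 99.95% target in this table and a 99.99% target, and it is worth stating which one the design is sized for.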
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.