Design a Shared Calendar System (like Google Calendar)

Interview Prompt

Design a Shared Calendar System (like Google Calendar).

Clarifying Questions (ask before designing)

Question	Why it matters
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

End-to-end Shared Calendar System flows and API contracts
High-level architecture with major components
Data model and storage choices with rationale
Capacity estimation with shown math
Primary failure modes and mitigations

Out of scope (state explicitly)

Flight and travel package bundling
Property management / housekeeping systems
Revenue-management ML model internals

Assumptions

Clarify scale (DAU, QPS, data volume) for shared calendar system in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Functional Requirements

Create, update, delete events (title, time, location, description, recurrence)
Recurring events: daily, weekly, monthly, yearly, custom (RFC 5545 RRULE)
Invite attendees: send invitations, track RSVP
Free/busy lookup: check availability of multiple people across calendars
Calendar sharing: view-only or edit access to entire calendars
Reminders and notifications (push, email) before events
Time zone handling: one-off events stored in UTC and displayed in the user's timezone; recurring events also keep their original IANA timezone so expansion follows DST
Room/resource booking: find and reserve conference rooms
Calendar sync: CalDAV/iCal standard for external apps
Multiple calendars per user (personal, work, holidays)

Capacity Estimation

Metric	Calculation	Value
Users	Given	1B
Events per user (next 12 months)	Given	500 (including recurring expansions)
Total events	1B users × 500 events per user	500B (with recurring expansions)
Stored events (without expansion)	Given (≈50 stored rows per user, roughly 10× fewer than expanded occurrences)	50B
Calendar views / sec	Derived from daily volume ÷ 86400 (+ peak factor)	100K
Event creates / sec	Derived from daily volume ÷ 86400 (+ peak factor)	10K
Free/busy queries / sec	Derived from daily volume ÷ 86400 (+ peak factor)	50K

Recurring Events: Storage vs Expansion

Wrong approach: Store every occurrence of "daily standup for 2 years" = 730 rows
Right approach: Store ONE row with recurrence rule (RRULE)

RRULE:FREQ=WEEKLY;BYDAY=MO,WE,FR;UNTIL=20261231
→ Expands at query time to show individual occurrences

Exception handling:
  - "Edit this occurrence": store exception (overrides that single instance)
  - "Edit this and following": split into two recurrence rules
  - "Delete this occurrence": store exclusion date (EXDATE)

Storage: 1 event row + exceptions
Query: expand RRULE in [view_start, view_end] range at read time
Cache: pre-expand next 30 days in Redis for fast calendar view
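
The read-time expansion described above can be sketched as follows. This is a minimal illustration that handles only FREQ=WEEKLY with BYDAY and a date-only UNTIL; the function name expand_weekly and the exdates/overrides parameters are hypothetical, and a real system would likely rely on a full RFC 5545 library instead.

```python
from datetime import date, datetime, timedelta

# Hypothetical helper: supports FREQ=WEEKLY;BYDAY=...;UNTIL=<date> only.
WEEKDAYS = {"MO": 0, "TU": 1, "WE": 2, "TH": 3, "FR": 4, "SA": 5, "SU": 6}

def expand_weekly(dtstart, byday, until, view_start, view_end,
                  exdates=frozenset(), overrides=None):
    """Return occurrence starts (local wall time) with dates in [view_start, view_end).

    exdates   -- dates removed by "delete this occurrence" (EXDATE)
    overrides -- original date -> replacement start ("edit this occurrence")
    """
    overrides = overrides or {}
    wanted = {WEEKDAYS[d] for d in byday}
    day = max(dtstart.date(), view_start)
    last = min(until, view_end - timedelta(days=1))
    occurrences = []
    while day <= last:
        if day.weekday() in wanted and day not in exdates:
            default_start = datetime.combine(day, dtstart.time())
            occurrences.append(overrides.get(day, default_start))
        day += timedelta(days=1)
    return occurrences

# One stored row (standup from Mon 2026-03-02, 09:00) plus two exceptions.
week = expand_weekly(
    dtstart=datetime(2026, 3, 2, 9, 0),
    byday=["MO", "WE", "FR"],
    until=date(2026, 12, 31),
    view_start=date(2026, 3, 9),
    view_end=date(2026, 3, 16),
    exdates={date(2026, 3, 11)},                               # Wednesday deleted
    overrides={date(2026, 3, 13): datetime(2026, 3, 13, 10)},  # Friday moved to 10:00
)
```

Only the requested view window is walked, so the cost depends on the window size rather than on how long the series runs. The sketch does not handle overrides that move an occurrence into or out of the window.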

Free/Busy Query (Key Algorithm)

Input: "Find when Alice, Bob, and Carol are all free next Tuesday 9am-5pm"

1. Fetch all events for each person on Tuesday (from cache or DB)
2. Build busy intervals per person:
   Alice: [9:00-10:00, 11:00-12:00, 14:00-15:00]
   Bob:   [9:30-10:30, 13:00-14:00]
   Carol: [10:00-11:30]
3. Merge all busy intervals → union: [9:00-12:00, 13:00-15:00]
4. Invert → free slots: [12:00-13:00, 15:00-17:00]
5. Filter by desired meeting duration
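
Steps 2 to 5 can be sketched as a sort-and-sweep over intervals. This is an illustrative sketch, not a definitive implementation: times are minutes since midnight, and the names hm, merge_intervals, and free_slots are hypothetical.

```python
def hm(hour, minute=0):
    """Minutes since midnight, to keep the example readable."""
    return hour * 60 + minute

def merge_intervals(intervals):
    """Union of half-open [start, end) intervals; touching intervals are joined."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]

def free_slots(busy_by_person, day_start, day_end, min_minutes=0):
    """Invert the merged busy intervals inside [day_start, day_end)."""
    all_busy = [iv for ivs in busy_by_person.values() for iv in ivs]
    free, cursor = [], day_start
    for start, end in merge_intervals(all_busy):
        if start >= day_end:
            break
        if start > cursor:
            free.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < day_end:
        free.append((cursor, day_end))
    return [(s, e) for s, e in free if e - s >= min_minutes]

busy = {
    "alice": [(hm(9), hm(10)), (hm(11), hm(12)), (hm(14), hm(15))],
    "bob":   [(hm(9, 30), hm(10, 30)), (hm(13), hm(14))],
    "carol": [(hm(10), hm(11, 30))],
}
slots = free_slots(busy, hm(9), hm(17))             # 12:00-13:00 and 15:00-17:00
long_slots = free_slots(busy, hm(9), hm(17), 90)    # only 15:00-17:00 fits 90 minutes
```

Sorting dominates the cost, so the query is O(n log n) in the total number of busy intervals across all attendees.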

Optimization: Pre-compute daily free/busy bitmask (one bit per 15-min slot, 1 = busy)
  8 working hours × 4 slots/hr = 32 bits per day per person (a full 24-hour day would need 96 bits)
  OR all busy bitmasks, then invert → common free slots, one bitwise operation per attendee
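
A minimal sketch of the bitmask variant, assuming the 32 bits cover a 09:00-17:00 window and that a set bit means "busy" (both are assumptions for illustration; the names busy_mask and common_free_mask are hypothetical). With busy bits, the masks are OR-ed and the result inverted; AND-ing works only if a set bit means "free".

```python
SLOT_MINUTES = 15
WINDOW_START = 9 * 60   # assumed window: 09:00-17:00
SLOTS = 32              # 8 hours x 4 slots per hour

def busy_mask(intervals):
    """Set bit i when slot i overlaps any busy interval (minutes since midnight)."""
    mask = 0
    for start, end in intervals:
        first = max(0, (start - WINDOW_START) // SLOT_MINUTES)
        last = min(SLOTS, -(-(end - WINDOW_START) // SLOT_MINUTES))  # ceiling division
        for slot in range(first, last):
            mask |= 1 << slot
    return mask

def common_free_mask(masks):
    """OR all busy masks, then invert within the 32-slot window."""
    combined = 0
    for mask in masks:
        combined |= mask
    return ~combined & ((1 << SLOTS) - 1)

alice = busy_mask([(540, 600), (660, 720), (840, 900)])   # 9-10, 11-12, 14-15
bob = busy_mask([(570, 630), (780, 840)])                 # 9:30-10:30, 13-14
carol = busy_mask([(600, 690)])                           # 10-11:30
free = common_free_mask([alice, bob, carol])
# Bits 12-15 (12:00-13:00) and 24-31 (15:00-17:00) are set in `free`.
```

Combining masks is one OR per attendee, so the work grows linearly with the number of attendees but is independent of how many events each person has.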

Room Booking: PostgreSQL Exclusion Constraint

SQL

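-- Atomic room reservation (prevent double-booking)
-- FOR UPDATE on events locks only rows that already exist, so two concurrent
-- transactions could both see "no conflict" and both insert. Lock the room row
-- first to serialize bookings per room (assumes a rooms table keyed by room_id).
BEGIN;
SELECT room_id FROM rooms WHERE room_id = 'conf-room-A' FOR UPDATE;
SELECT event_id FROM events
WHERE room_id = 'conf-room-A'
  AND start_time < '2026-03-14 11:00+00'
  AND end_time   > '2026-03-14 10:00+00'
FOR UPDATE;
-- If no rows returned, room is free
-- (FOR UPDATE cannot be used with COUNT/aggregates in PostgreSQL)
INSERT INTO events (room_id, start_time, end_time, ...) VALUES (...);
COMMIT;

-- Alternative: Exclusion constraint (database-level guarantee)
-- CREATE EXTENSION btree_gist;  -- needed for "room_id WITH =" in a GiST index
-- EXCLUDE USING gist (room_id WITH =, tstzrange(start_time, end_time) WITH &&)
-- Prevents overlapping time ranges for same room automatically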

Time Zone + DST Deep Dive

Problem: "9 AM every Monday" in New York
  Summer (EDT, UTC-4): 9 AM local = 13:00 UTC
  Winter (EST, UTC-5): 9 AM local = 14:00 UTC

Storage: Store RRULE + original timezone ("America/New_York")
Expansion: At query time, expand RRULE using IANA tz database

NEVER store computed UTC offsets in the RRULE itself
Offsets change when DST rules change (governments update DST dates)
Must re-expand using latest IANA tzdata at render time
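
The effect can be seen with Python's standard zoneinfo module, which reads the IANA tz database. This is a small sketch under the assumption that the stored rule is "weekly at a local wall-clock time in a named zone"; the function name weekly_occurrences_utc is hypothetical.

```python
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo

def weekly_occurrences_utc(first_day, local_time, tz_name, count):
    """Expand a weekly event in local wall time, then convert each start to UTC."""
    tz = ZoneInfo(tz_name)
    starts = []
    for week in range(count):
        day = first_day + timedelta(weeks=week)
        local_start = datetime.combine(day, local_time, tzinfo=tz)
        starts.append(local_start.astimezone(timezone.utc))
    return starts

# Two Mondays either side of the 2026 US spring-forward date (8 March).
starts = weekly_occurrences_utc(date(2026, 3, 2), time(9, 0), "America/New_York", 2)
utc_hours = [s.hour for s in starts]   # 14 (EST), then 13 (EDT)
```

The UTC offset is computed per occurrence from the zone name, never stored, so a tzdata update changes future renderings without touching the event row. Wall-clock times that do not exist or occur twice on a transition day are not handled in this sketch.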

CalDAV Sync Protocol

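Client A edits → server stores the change → the collection's version (sync-token) is incremented. Client B sends a sync-collection REPORT (RFC 6578) with its last sync-token → server returns the changes since that token plus a new token. This is an efficient delta sync: only changed and deleted resources are sent. Conflict resolution: last-write-wins with server timestamp.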
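
A minimal in-memory sketch of token-based delta sync, assuming the sync-token is simply a monotonically increasing collection version. The class and method names are hypothetical and no CalDAV wire format is modeled.

```python
class CalendarCollection:
    """Tracks, per resource, the collection version at which it last changed."""

    def __init__(self):
        self.version = 0
        self.items = {}   # uid -> (version_at_change, data or None for a tombstone)

    def put(self, uid, data):
        self.version += 1
        self.items[uid] = (self.version, data)

    def delete(self, uid):
        self.version += 1
        self.items[uid] = (self.version, None)   # keep a tombstone for syncing clients

    def changes_since(self, sync_token):
        """Return ({uid: data-or-None}, new_token) for changes after sync_token."""
        changed = {uid: data
                   for uid, (changed_at, data) in self.items.items()
                   if changed_at > sync_token}
        return changed, self.version

server = CalendarCollection()
server.put("evt-1", "Standup 09:00")
server.put("evt-2", "Lunch 12:00")
initial, token = server.changes_since(0)        # first sync: everything
server.put("evt-1", "Standup 09:30")            # client A edits
server.delete("evt-2")
delta, new_token = server.changes_since(token)  # client B gets only the diff
```

Tombstones let a client learn about deletions; a real server would also expire old tombstones and force a full resync for tokens older than that horizon.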

Event Bus Design (Kafka)

Topic: shared_calendar_system-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: entity_id (calendar_id or user_id; preserves per-entity ordering)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "shared_calendar_system-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: shared_calendar_system-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Rule: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates
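
The consumer-side policy (at-least-once delivery, dedup by event_id, DLQ after 3 retries) can be sketched without any Kafka client. Everything here is illustrative: consume, processed_ids, and dlq are hypothetical names, and in production the dedup set would live in a durable store rather than in memory.

```python
MAX_RETRIES = 3   # retries after the first attempt, matching the DLQ policy above

def consume(messages, handler, processed_ids, dlq):
    """Apply handler to each message once; park messages that keep failing."""
    for msg in messages:
        event_id = msg["event_id"]
        if event_id in processed_ids:
            continue                      # redelivered duplicate: already applied
        for attempt in range(1 + MAX_RETRIES):
            try:
                handler(msg)
            except Exception:
                if attempt == MAX_RETRIES:
                    dlq.append(msg)       # poison message: route to the DLQ topic
            else:
                processed_ids.add(event_id)
                break

calls = []

def send_notification(msg):
    """Stand-in side effect that always fails for one poison message."""
    calls.append(msg["event_id"])
    if msg["event_id"] == "bad":
        raise ValueError("cannot render notification")

processed, dead_letters = set(), []
batch = [{"event_id": "a"}, {"event_id": "a"}, {"event_id": "bad"}]
consume(batch, send_notification, processed, dead_letters)
```

Because the handler may run more than once for the same message after a crash, the dedup check is what makes redelivery safe; the DLQ keeps one bad message from blocking its partition.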

HTTP

POST   /api/events              → Create event
GET    /api/events?start=...&end=...&calendar_id=...  → List events in range
PUT    /api/events/{id}          → Update event (this/all/following)
DELETE /api/events/{id}          → Delete event (this/all/following)
POST   /api/events/{id}/rsvp     → Accept/decline/tentative
GET    /api/freebusy             → Query free/busy for list of users
POST   /api/rooms/search         → Find available rooms for time slot
GET    /api/calendars            → List user's calendars
POST   /api/calendars/{id}/share → Share calendar with user/group

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Server Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff

PostgreSQL

SQL

CREATE TABLE events (
    event_id       UUID PRIMARY KEY,
    calendar_id    UUID NOT NULL,
    creator_id     UUID NOT NULL,
    title          TEXT, description TEXT, location TEXT,
    start_time     TIMESTAMPTZ NOT NULL,
    end_time       TIMESTAMPTZ NOT NULL,
    timezone       TEXT DEFAULT 'UTC',
    is_all_day     BOOLEAN DEFAULT FALSE,
    recurrence     TEXT,
    room_id        UUID,
    status         TEXT DEFAULT 'confirmed',
    -- Requires CREATE EXTENSION btree_gist; tstzrange matches the TIMESTAMPTZ columns
    EXCLUDE USING gist (room_id WITH =, tstzrange(start_time, end_time) WITH &&)
        WHERE (room_id IS NOT NULL)
);

CREATE TABLE event_attendees (
    event_id     UUID REFERENCES events(event_id),
    user_id      UUID,
    rsvp_status  TEXT DEFAULT 'needs-action',
    PRIMARY KEY (event_id, user_id)
);

Redis (Free/Busy Cache + RRULE Expansion Cache)

freebusy:{user_id}:{date} → 32-bit bitmask
calendar:{user_id}:{month} → expanded events for month

Scalability: Sharding Calendar Data

Shard by calendar_id (≈ user_id for primary calendar):
  events table: shard by calendar_id
  1B users / 256 shards = ~4M users per shard
  Single-user operations: single shard query ✓

Cross-shard challenges:
  1. Free/busy query for 10 attendees across 10 shards:
     Solution: Pre-computed free/busy bitmasks in Redis
     Key: freebusy:{user_id}:{date} → 32-bit bitmask
     No cross-shard DB queries needed ✓

  2. Shared calendar: Kafka topic calendar-changes → each shard's consumer
     Eventual consistency: updates visible within 2-5 seconds

  3. Room booking: Separate rooms shard with exclusion constraints
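
Routing a calendar to one of the 256 shards might look like the sketch below. It assumes simple hash-modulo placement, which the text does not specify; the function name shard_for is hypothetical.

```python
import hashlib
import uuid

NUM_SHARDS = 256

def shard_for(calendar_id: uuid.UUID) -> int:
    """Map a calendar_id to a shard using a hash that is stable across processes."""
    digest = hashlib.sha256(calendar_id.bytes).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

calendar_id = uuid.UUID("12345678-1234-5678-1234-567812345678")
shard = shard_for(calendar_id)
```

A cryptographic hash is used instead of Python's built-in hash() so that every service instance computes the same shard. Plain modulo placement moves most keys when the shard count changes, so a lookup table or consistent hashing is a common alternative if resharding is expected.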

Offline Sync Conflict Resolution

Strategy: Field-level merge with last-write-wins per field

Example:
  Phone edit (offline, T=10:00): Changed title to "Team Standup v2"
  Laptop edit (online, T=10:05): Changed time to 10:30 AM
  
  Phone reconnects at T=10:15:
    - title: phone="Team Standup v2" (T=10:00), server is older
      → Accept phone's title ✓
    - time: phone=10:00 AM (unchanged), server=10:30 AM (T=10:05)
      → Keep server's time ✓

Same-field conflict: Last-write-wins by timestamp
Deletion conflict: Option A (delete wins) vs Option B (preserve with changes)
Recurring event conflict: Apply bulk changes first, then single exceptions
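
The field-level merge can be sketched as below, assuming each field carries its own last-modified timestamp. The name merge_fields is hypothetical, and the 09:00 timestamps for the untouched values are assumed for the example.

```python
from datetime import datetime

def merge_fields(server, client):
    """Per-field last-write-wins; each side maps field -> (value, modified_at).

    Ties keep the server's value, since the client must be strictly newer to win.
    """
    merged = dict(server)
    for field, (value, modified_at) in client.items():
        if field not in merged or modified_at > merged[field][1]:
            merged[field] = (value, modified_at)
    return merged

def t(hour, minute):
    return datetime(2026, 3, 14, hour, minute)

server_state = {
    "title": ("Team Standup", t(9, 0)),       # untouched on the server (assumed T=09:00)
    "time":  ("10:30 AM", t(10, 5)),          # laptop edit
}
phone_state = {
    "title": ("Team Standup v2", t(10, 0)),   # offline phone edit
    "time":  ("10:00 AM", t(9, 0)),           # unchanged on the phone
}
merged = merge_fields(server_state, phone_state)
```

Both edits survive because they touch different fields. The comparison trusts device clocks, so skewed clocks can pick the wrong winner for same-field conflicts.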

RRULE Expansion at Read-Time vs Materialized

Option 1: Expand at read time (this design) ✓
  Store: 1 row with RRULE
  ✓ Storage efficient: 1 row instead of 365+ rows per recurring event
  ✓ "Edit all future" touches a constant number of rows (end the old rule with UNTIL, insert one new rule)
  ✗ CPU cost at read time (~1ms per event)
  Mitigation: Redis cache of expanded events for next 30 days

Option 2: Materialized occurrences
  Store: 365 rows for "daily standup for 1 year"
  ✓ Simple queries (no expansion logic)
  ✗ Storage explosion: 500B+ rows for all users
  ✗ "Edit all future" requires updating 200+ rows atomically

Recommendation: Expand at read time + aggressive caching

Bitmask Granularity: 15-min vs Per-Minute

15-minute slots: 32 bits/day/user (covering an 8-hour working window), so 1B users × 365 days × 4 bytes ≈ 1.46 TB for a full year. Per-minute: 15× more storage. Recommendation: 15-minute slots; most meeting durations (15, 30, 60 minutes) align with them.
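
The storage figures can be checked with a few lines of arithmetic. The sketch assumes the same 8-hour window for both granularities; the names are illustrative.

```python
USERS = 1_000_000_000
DAYS_PER_YEAR = 365
WINDOW_MINUTES = 8 * 60

def yearly_bytes(slot_minutes):
    """Bytes needed to hold one bit per slot, per user, per day, for a year."""
    bits_per_day = WINDOW_MINUTES // slot_minutes
    return USERS * DAYS_PER_YEAR * bits_per_day // 8

quarter_hour = yearly_bytes(15)   # 32 bits/day  -> 1.46e12 bytes (1.46 TB)
per_minute = yearly_bytes(1)      # 480 bits/day -> 2.19e13 bytes (21.9 TB)
ratio = per_minute // quarter_hour
```

This counts raw bits only; per-key overhead in a store such as Redis would add to both totals.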

Exclusion Constraint vs Application-Level Locking

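Exclusion constraint: the database enforces the invariant, so no application code path can bypass it. Application-level locking: more flexible, but a missed code path or race can double-book. Recommendation: use the exclusion constraint as the safety net and keep the application check for friendly error messages. Belt AND suspenders approach.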

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
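
To make the availability target concrete, the error budget can be computed directly. The 30-day window is an assumption and the function name is hypothetical.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability in the window for a given availability SLO."""
    return (1 - slo) * window_days * 24 * 60

budget_9995 = error_budget_minutes(0.9995)   # about 21.6 minutes per 30 days
budget_999 = error_budget_minutes(0.999)     # about 43.2 minutes per 30 days
```

At 99.95% a single slow rollback can consume most of a month's budget, which is why the deploy scenario below targets automated rollback within minutes.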

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.