Design an Authentication and Authorization System (OAuth 2.0/SSO)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Show domain depth beyond the baseline: async pipelines, consistency semantics, and operational SLOs.
Arch 75	Staff angles: partition behavior, cost drivers, and MVP → production evolution with clear triggers.

Interview Prompt

Design an Authentication and Authorization System (OAuth 2.0/SSO).

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: OAuth 2.0 flows (authorization code, PKCE), JWT lifecycle, Refresh token rotation?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

OAuth 2.0 flows (authorization code, PKCE)
JWT lifecycle
Refresh token rotation
RBAC vs ABAC
Session management
Token revocation

Out of scope (state explicitly)

Full HR employee directory / SCIM provisioning product
Hardware security module manufacturing
Building a social network on top

Assumptions

Clarify scale (DAU, QPS, data volume) for the auth system in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

User registration/login: Email+password, social login (Google, GitHub, Apple)
OAuth 2.0 provider: Issue access/refresh tokens; support authorization code, PKCE, client credentials flows
Single Sign-On (SSO): Login once, access multiple applications (SAML 2.0 and OIDC)
Multi-Factor Authentication (MFA): TOTP, SMS OTP, WebAuthn/passkeys
Role-Based Access Control (RBAC): Users have roles; roles have permissions
API key management: Issue, rotate, revoke API keys for machine-to-machine auth
Session management: Active session list, revoke sessions, device tracking
Password policies: Minimum strength, breach detection, forced rotation

Metric	Calculation	Value
Total users	Given	1B
Login requests / sec	Derived from daily volume ÷ 86400 (+ peak factor)	50K
Token validation / sec	Derived from daily volume ÷ 86400 (+ peak factor)	500K
Token refresh / sec	Derived from daily volume ÷ 86400 (+ peak factor)	10K
Active sessions	Given	500M

SAML 2.0 vs OIDC: When to Use Each

OIDC (OpenID Connect) ⭐ — modern default:
  - JSON/JWT tokens, REST-friendly, mobile-native
  - Built on OAuth 2.0 authorization code + PKCE
  - Use for: SaaS apps, mobile, SPAs, microservices
  - Flow: /authorize → code → /token → id_token + access_token

SAML 2.0 — enterprise legacy:
  - XML assertions, browser POST binding, heavier
  - Use for: enterprise SSO (Okta/ADFS → legacy apps)
  - Flow: SP-initiated → IdP login → SAML assertion POST to /acs

Hybrid architecture:
  - OIDC for customer-facing apps and API access
  - SAML bridge for enterprise customers requiring federation
  - Same user store; map SAML NameID → internal user_id
  - Staff probe: "How do you handle SAML clock skew?" → NotBefore/NotOnOrAfter
    with 5-min tolerance; NTP sync on IdP and SP
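
A minimal sketch of the clock-skew answer, assuming the assertion's NotBefore and NotOnOrAfter have already been parsed into timezone-aware datetimes. The function name and the CLOCK_SKEW constant are illustrative, not part of any SAML library.

```python
from datetime import datetime, timedelta, timezone

# Tolerance taken from the text above; tune per deployment.
CLOCK_SKEW = timedelta(minutes=5)

def assertion_time_valid(not_before, not_on_or_after, now=None, skew=CLOCK_SKEW):
    """Return True if `now` falls inside [NotBefore, NotOnOrAfter) widened by `skew`."""
    now = now or datetime.now(timezone.utc)
    # NotBefore is inclusive, NotOnOrAfter is exclusive (as the names suggest).
    return (not_before - skew) <= now < (not_on_or_after + skew)

issued = datetime(2026, 3, 14, 12, 0, tzinfo=timezone.utc)
expires = issued + timedelta(minutes=10)
# SP clock runs 3 minutes behind the IdP: still accepted.
print(assertion_time_valid(issued, expires, now=issued - timedelta(minutes=3)))
# 6 minutes past expiry: outside the tolerance, rejected.
print(assertion_time_valid(issued, expires, now=expires + timedelta(minutes=6)))
```

The tolerance widens the replay window by the same amount, so it is worth pairing with one-time-use tracking of assertion IDs.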

JWKS Rotation and Key Compromise

Signing keys (RS256):
  - Publish public keys at /.well-known/jwks.json with kid header
  - Rotate monthly: add new key, sign new tokens with new kid
  - Old keys remain in JWKS for verification-only (30-day overlap)

Gateway caches JWKS for 1 hour. On unknown kid:
  1. Force refresh JWKS (don't wait for TTL)
  2. Retry validation once
  3. If still unknown → 401 (token was not signed by any published key: forged, retired kid, or misconfig)
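
The unknown-kid handling above could look roughly like this. It is a sketch: JwksCache and the injected fetch_jwks callable are hypothetical names, and the fetch is assumed to return a kid-to-key mapping already parsed from the JWKS document.

```python
import time

class JwksCache:
    """Caches kid -> public key; refreshes once when an unknown kid appears."""

    def __init__(self, fetch_jwks, ttl_seconds=3600, clock=time.monotonic):
        self._fetch = fetch_jwks      # hypothetical callable returning {kid: key}
        self._ttl = ttl_seconds       # 1 hour, as in the text above
        self._clock = clock
        self._keys = {}
        self._fetched_at = None

    def _refresh(self):
        self._keys = dict(self._fetch())
        self._fetched_at = self._clock()

    def get_key(self, kid):
        """Return the key for `kid`, or None if the caller should answer 401."""
        expired = (self._fetched_at is None
                   or self._clock() - self._fetched_at >= self._ttl)
        if expired:
            self._refresh()
        if kid in self._keys:
            return self._keys[kid]
        if not expired:
            # Unknown kid on a warm cache: force one refresh, then retry once.
            self._refresh()
        return self._keys.get(kid)
```

In production the forced refresh would likely need its own rate limit, otherwise tokens carrying random kid values can turn every request into a JWKS fetch.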

Key compromise incident:
  1. Remove the compromised kid from JWKS immediately and flush gateway JWKS caches (otherwise the key stays trusted for up to the 1-hour cache TTL)
  2. Bump token_version for all users (invalidates all access tokens)
  3. Force re-login; rotate refresh tokens
  4. Audit log: which tokens signed with compromised kid

OAuth 2.0 Authorization Code Flow with PKCE

1. Client generates a random code_verifier and derives code_challenge = BASE64URL(SHA256(code_verifier))
2. Client redirects to auth server: GET /authorize?response_type=code&code_challenge=...&code_challenge_method=S256
3. User authenticates (login + MFA if enabled)
4. Auth server redirects back with authorization code
5. Client exchanges code + code_verifier for tokens at POST /token
6. Auth server recomputes the challenge from the verifier, compares it with the one stored for the code, and returns tokens

Why PKCE? Prevents authorization code interception attack.
Without PKCE: intercepted code can be exchanged for tokens.
With PKCE: code exchange requires code_verifier that only original client has.
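
A small sketch of the verifier/challenge math, assuming the S256 method from RFC 7636 (the challenge is the base64url-encoded SHA-256 digest of the verifier, without padding). The function names are illustrative.

```python
import base64
import hashlib
import secrets

def make_code_verifier(num_bytes=32):
    """Random URL-safe verifier; 32 random bytes encode to 43 characters."""
    raw = secrets.token_bytes(num_bytes)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

def make_code_challenge(code_verifier):
    """S256 challenge: BASE64URL(SHA256(verifier)) without '=' padding."""
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")

def verify_pkce(code_verifier, stored_challenge):
    """Server side of step 6: recompute the challenge, compare in constant time."""
    return secrets.compare_digest(make_code_challenge(code_verifier), stored_challenge)

verifier = make_code_verifier()            # stays on the client
challenge = make_code_challenge(verifier)  # sent with /authorize
print(verify_pkce(verifier, challenge))              # the original client passes
print(verify_pkce(make_code_verifier(), challenge))  # an interceptor without the verifier fails
```

An attacker who intercepts the code only ever saw the challenge, and cannot invert SHA-256 to recover the verifier needed at /token.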

JWT Token Structure and Validation

Access token (JWT):
  Header: { "alg": "RS256", "kid": "key-2026-03" }
  Payload: {
    "sub": "user-uuid", "iss": "https://auth.example.com",
    "aud": "api.example.com", "exp": 1710403200,
    "scope": "read write", "roles": ["admin"],
    "tenant_id": "tenant-uuid"
  }

Validation at API Gateway (stateless, < 1ms):
  1. Decode JWT -> get kid, look up public key from JWKS
  2. Verify signature, check exp, iss, aud
  3. Extract roles/permissions
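
A standard-library sketch of the kid lookup and the exp/iss/aud checks. RS256 signature verification is deliberately left out because it needs a crypto library. check_claims must only run on a payload whose signature has already been verified. Function names and the leeway parameter are assumptions for illustration.

```python
import base64
import json
import time

def _b64url_decode(segment):
    # JWT segments drop '=' padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))

def peek_kid(token):
    """Step 1: read `kid` from the (still unverified) header."""
    header_b64 = token.split(".")[0]
    return json.loads(_b64url_decode(header_b64)).get("kid")

def check_claims(payload, issuer, audience, now=None, leeway=0):
    """Claim checks from step 2. Call only AFTER the signature has been verified."""
    now = time.time() if now is None else now
    if payload.get("iss") != issuer:
        return False
    aud = payload.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]  # aud may be a string or a list
    if audience not in audiences:
        return False
    exp = payload.get("exp")
    # A missing or non-numeric exp is treated as invalid rather than "never expires".
    return isinstance(exp, (int, float)) and now < exp + leeway
```

The gateway should also pin the accepted alg (RS256 here) rather than trusting the header, since the header is attacker-controlled until the signature checks out.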

Token revocation:
  a. Short expiry (15 min) + refresh tokens
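  b. Token blacklist in Redis (keyed by jti, TTL = remaining token lifetime) for revoking a single token
  c. Version counter in JWT + DB (bump token_version to invalidate all of a user's tokens at once)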
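
Options (b) and (c) can be combined in one check. The sketch below uses in-memory dicts as stand-ins for Redis and the users table. The "jti" and "ver" claim names and the RevocationChecker class are assumptions, not claims shown in the token example above.

```python
class RevocationChecker:
    """Combines a jti denylist (option b) with a per-user token_version (option c)."""

    def __init__(self):
        self._revoked = {}    # jti -> expiry timestamp (stand-in for a Redis key with TTL)
        self._versions = {}   # user_id -> current token_version (stand-in for DB/cache)

    def revoke_token(self, jti, exp):
        # Keep the entry only until the token would have expired anyway.
        self._revoked[jti] = exp

    def revoke_all_for_user(self, user_id):
        self._versions[user_id] = self._versions.get(user_id, 0) + 1

    def is_revoked(self, claims, now):
        # Drop denylist entries whose tokens are already expired.
        self._revoked = {j: e for j, e in self._revoked.items() if e > now}
        if claims["jti"] in self._revoked:
            return True
        # A token minted before the last version bump carries an older version.
        return claims.get("ver", 0) < self._versions.get(claims["sub"], 0)
```

Both checks add a lookup to an otherwise stateless validation path, which is the trade-off against relying on option (a) alone.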

RBAC: Role-Based Access Control

Hierarchy: User -> has -> Roles -> have -> Permissions

Example:
  User "Alice" -> roles: ["editor", "viewer"]
  Role "editor" -> permissions: ["article:create", "article:edit"]
  Role "viewer" -> permissions: ["article:read"]

Permission format: resource:action
  Storage: user_roles + role_permissions in PostgreSQL
  Cache: in JWT + Redis
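
The Alice example maps directly onto a permission check. A minimal sketch, with in-memory dicts standing in for the user_roles and role_permissions tables (helper names are illustrative):

```python
ROLE_PERMISSIONS = {
    "editor": {"article:create", "article:edit"},
    "viewer": {"article:read"},
}
USER_ROLES = {"Alice": ["editor", "viewer"]}

def effective_permissions(user):
    """Union of the permissions of every role the user holds."""
    perms = set()
    for role in USER_ROLES.get(user, []):
        perms |= ROLE_PERMISSIONS.get(role, set())
    return perms

def is_allowed(user, resource, action):
    """Check a permission in the resource:action format."""
    return f"{resource}:{action}" in effective_permissions(user)

print(sorted(effective_permissions("Alice")))
print(is_allowed("Alice", "article", "edit"))    # granted via "editor"
print(is_allowed("Alice", "article", "delete"))  # no role grants it
```

Caching the flattened permission set (in the JWT or Redis) saves the join on every request, at the cost of role changes taking effect only after the cache or token expires.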

HTTP

POST /api/v1/auth/register
{ "email": "alice@example.com", "password": "SecureP@ss123", "name": "Alice" }
-> 201 { "user_id": "...", "email_verification_sent": true }

POST /api/v1/auth/login
{ "email": "alice@example.com", "password": "SecureP@ss123" }
-> 200 { "access_token": "eyJ...", "refresh_token": "tGz...", "expires_in": 900 }
 OR 200 { "mfa_required": true, "mfa_token": "...", "mfa_methods": ["totp","sms"] }

POST /api/v1/auth/mfa/verify
{ "mfa_token": "...", "code": "123456", "method": "totp" }
-> 200 { "access_token": "eyJ...", "refresh_token": "tGz..." }

POST /api/v1/auth/token/refresh
{ "refresh_token": "tGz..." }
-> 200 { "access_token": "eyJ...(new)", "refresh_token": "tGz...(rotated)" }

POST /api/v1/auth/logout
-> 200 { "logged_out": true }

GET /api/v1/auth/sessions
-> { "sessions": [{ "device": "iPhone", "ip": "...", "current": true }] }

--- Common Errors ---
401 Unauthorized    — invalid/expired token, wrong audience
403 Forbidden       — insufficient scope or suspended account
429 Too Many Requests — login brute-force (>5 attempts/15 min)
503 Service Unavailable — auth DB failover in progress (retry with backoff)

PostgreSQL

SQL

CREATE TABLE users (
    user_id UUID PRIMARY KEY, email VARCHAR(255) UNIQUE NOT NULL,
    password_hash VARCHAR(255), name VARCHAR(100),
    mfa_enabled BOOLEAN DEFAULT FALSE, mfa_secret VARCHAR(64),
    token_version INT DEFAULT 0,
    status VARCHAR(16) CHECK (status IN ('active','suspended','deleted')), created_at TIMESTAMPTZ
);

CREATE TABLE refresh_tokens (
    token_id UUID PRIMARY KEY, user_id UUID NOT NULL,
    token_hash VARCHAR(64) NOT NULL,
    expires_at TIMESTAMPTZ NOT NULL, revoked BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE user_roles ( user_id UUID, role_name VARCHAR(50), PRIMARY KEY (user_id, role_name) );
CREATE TABLE role_permissions ( role_name VARCHAR(50), permission VARCHAR(100), PRIMARY KEY (role_name, permission) );
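
A sketch of how rotation could work against the refresh_tokens table, with a dict standing in for the table. Only the SHA-256 hash of each token is stored, matching the 64-character token_hash column. Revoking all of a user's tokens when a rotated token is replayed is a common practice assumed here, not something the schema above dictates. Expiry checks are omitted for brevity.

```python
import hashlib
import secrets

class RefreshTokenStore:
    """In-memory stand-in for the refresh_tokens table (token_hash, user_id, revoked)."""

    def __init__(self):
        self._rows = {}   # token_hash -> {"user_id": ..., "revoked": bool}

    @staticmethod
    def _hash(token):
        return hashlib.sha256(token.encode("ascii")).hexdigest()   # 64 hex chars

    def issue(self, user_id):
        token = secrets.token_urlsafe(32)
        self._rows[self._hash(token)] = {"user_id": user_id, "revoked": False}
        return token   # raw token goes to the client; only the hash is stored

    def rotate(self, token):
        """Return a new token, or None if `token` is unknown or already used."""
        row = self._rows.get(self._hash(token))
        if row is None:
            return None
        if row["revoked"]:
            # Replay of a rotated token suggests theft: revoke the user's other tokens.
            for other in self._rows.values():
                if other["user_id"] == row["user_id"]:
                    other["revoked"] = True
            return None
        row["revoked"] = True
        return self.issue(row["user_id"])
```

In the real table the mark-revoked-and-insert step would need to run in one transaction so two concurrent refreshes cannot both succeed.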

Redis

session:{session_id}         -> Hash { user_id, device, ip, last_active }
revoked_token:{jti}          -> "", TTL = token remaining lifetime
permissions:{user_id}        -> SET of permission strings, TTL 300
login_attempts:{email}       -> INT (max 5 per 15 min), TTL 900
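
The login_attempts key behaves like a fixed-window counter: increment on each failure, set the 900-second TTL when the key is first created. A sketch with a dict standing in for Redis (the class and method names are hypothetical):

```python
class LoginAttemptLimiter:
    """Fixed-window counter; the window starts at the first failed attempt."""

    def __init__(self, max_attempts=5, window_seconds=900):
        self._max = max_attempts
        self._window = window_seconds
        self._counters = {}   # email -> (count, window_expiry)

    def record_failure(self, email, now):
        count, expiry = self._counters.get(email, (0, 0))
        if now >= expiry:                 # key expired: start a new window
            count, expiry = 0, now + self._window
        self._counters[email] = (count + 1, expiry)

    def is_blocked(self, email, now):
        count, expiry = self._counters.get(email, (0, 0))
        return now < expiry and count >= self._max

    def reset(self, email):
        self._counters.pop(email, None)   # e.g. after a successful login
```

Keying only on email lets an attacker lock a victim out on purpose, so a per-IP counter or a CAPTCHA step is a reasonable companion.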

Concern	Solution
Auth service down	JWTs still validated at gateway (stateless); only login/refresh affected
Redis down	Token validation falls back to JWT-only (fail-open: no revocation check, so a revoked token stays usable until its 15-min expiry)
Brute force	Rate limit: 5 login attempts per 15 min per email; CAPTCHA after 3 failures
Credential stuffing	Check passwords against HaveIBeenPwned API; flag compromised accounts
Token theft	Short access token TTL (15 min); refresh token rotation; device binding
Key compromise	Key rotation: new signing key monthly; old keys valid for verification only
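
For the breached-password check, the password itself never needs to leave the auth service. The sketch below assumes the k-anonymity model of the Pwned Passwords range lookup: send the first 5 hex characters of the SHA-1 digest, receive "SUFFIX:COUNT" lines, and match locally. The HTTP call is left out, and range_response is whatever text body that lookup returned.

```python
import hashlib

def hibp_prefix_suffix(password):
    """Split the uppercase SHA-1 hex digest into (first 5 chars, remaining 35)."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]

def breach_count(password, range_response):
    """Look up our suffix in a range response made of 'SUFFIX:COUNT' lines."""
    _, suffix = hibp_prefix_suffix(password)
    for line in range_response.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0   # suffix absent: not found in the breach corpus
```

SHA-1 is used here only because the lookup is keyed on it; stored password hashes should still use a slow password-hashing function.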

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
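
To make the availability target concrete, it helps to convert it into a downtime budget. A quick sketch, assuming a 30-day window (the window length and function name are assumptions):

```python
def downtime_budget_minutes(availability_pct, window_days=30):
    """Minutes of allowed unavailability in the window for a given availability target."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - availability_pct) / 100

for target in (99.9, 99.95, 99.99):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} min per 30 days")
```

At 99.95% the budget works out to roughly 21.6 minutes per 30 days, which is why a single slow database failover can consume most of a month's allowance.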

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
