This problem appears in multiple sheets. Depth expectations increase as you progress.
Interview Prompt
Design Authentication and Authorization System (OAuth 2.0/SSO).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Which of these is highest priority: OAuth 2.0 flows (authorization code, PKCE), JWT lifecycle, Refresh token rotation? | Forces scope negotiation — senior candidates trim before drawing boxes. |
| What scale should we design for — DAU, QPS, data volume? | Drives every capacity decision; shows structured thinking. |
| What are the read vs write patterns on the critical path? | Determines caching, DB choice, and replication topology. |
| What consistency and durability guarantees are required? | Separates strong-consistency paths from eventual ones — a senior differentiator. |
Scope
In scope
- OAuth 2.0 flows (authorization code, PKCE)
- JWT lifecycle
- Refresh token rotation
- RBAC vs ABAC
- Session management
- Token revocation
Out of scope (state explicitly)
- Full HR employee directory / SCIM provisioning product
- Hardware security module manufacturing
- Building a social network on top
Assumptions
- Clarify scale (DAU, QPS, data volume) for the authentication and authorization system in the first 5 minutes
- Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
- Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- User registration/login: Email+password, social login (Google, GitHub, Apple)
- OAuth 2.0 provider: Issue access/refresh tokens; support authorization code, PKCE, client credentials flows
- Single Sign-On (SSO): Login once, access multiple applications (SAML 2.0 and OIDC)
- Multi-Factor Authentication (MFA): TOTP, SMS OTP, WebAuthn/passkeys
- Role-Based Access Control (RBAC): Users have roles; roles have permissions
- API key management: Issue, rotate, revoke API keys for machine-to-machine auth
- Session management: Active session list, revoke sessions, device tracking
- Password policies: Minimum strength, breach detection, forced rotation
- Low Latency: Token validation < 5 ms (stateless JWT); login < 500 ms
- High Availability: 99.999%: auth down = entire platform down
- Security: Bcrypt/Argon2 password hashing; token encryption; rate limiting on login
- Scale: 1B+ users, 500K+ auth requests/sec
- Compliance: SOC 2, GDPR (right to delete), PCI DSS for payment-adjacent
| Metric | Calculation | Value |
|---|---|---|
| Total users | Given | 1B |
| Login requests / sec | Derived from daily volume ÷ 86400 (+ peak factor) | 50K |
| Token validation / sec | Derived from daily volume ÷ 86400 (+ peak factor) | 500K |
| Token refresh / sec | Derived from daily volume ÷ 86400 (+ peak factor) | 10K |
| Active sessions | Given | 500M |
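As a rough sanity check on the table above, the per-second rates can be converted back into the daily volumes they would imply if sustained for a full day. This is only a back-of-envelope sketch: the variable names are illustrative, and because the table's rates already include a peak factor, the results are upper bounds rather than measured volumes.

```python
SECONDS_PER_DAY = 86_400

# Per-second rates taken from the capacity table above.
rates_per_sec = {
    "login": 50_000,
    "token_validation": 500_000,
    "token_refresh": 10_000,
}


def implied_daily_volume(rate_per_sec: int) -> int:
    """Daily volume if the given rate were sustained for 24 hours."""
    return rate_per_sec * SECONDS_PER_DAY


daily = {name: implied_daily_volume(rate) for name, rate in rates_per_sec.items()}
validations_per_login = rates_per_sec["token_validation"] // rates_per_sec["login"]

for name, volume in daily.items():
    print(f"{name}: {volume:,} per day")
print(f"validations per login: {validations_per_login}")
```

The 10:1 ratio of validations to logins is the number worth saying out loud: validation is the hot path, which is why it is kept stateless at the gateway.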
SAML 2.0 vs OIDC: When to Use Each
OIDC (OpenID Connect) ⭐ — modern default:
- JSON/JWT tokens, REST-friendly, mobile-native
- Built on OAuth 2.0 authorization code + PKCE
- Use for: SaaS apps, mobile, SPAs, microservices
- Flow: /authorize → code → /token → id_token + access_token
SAML 2.0 — enterprise legacy:
- XML assertions, browser POST binding, heavier
- Use for: enterprise SSO (Okta/ADFS → legacy apps)
- Flow: SP-initiated → IdP login → SAML assertion POST to /acs
Hybrid architecture:
- OIDC for customer-facing apps and API access
- SAML bridge for enterprise customers requiring federation
- Same user store; map SAML NameID → internal user_id
- Staff probe: "How do you handle SAML clock skew?" → NotBefore/NotOnOrAfter with 5-min tolerance; NTP sync on IdP and SP
JWKS Rotation and Key Compromise
Signing keys (RS256):
- Publish public keys at /.well-known/jwks.json with kid header
- Rotate monthly: add new key, sign new tokens with new kid
- Old keys remain in JWKS for verification-only (30-day overlap)
Gateway caches JWKS for 1 hour. On unknown kid:
1. Force refresh JWKS (don't wait for TTL)
2. Retry validation once
3. If still unknown → 401 (possible key compromise or misconfig)
Key compromise incident:
1. Revoke compromised kid immediately from JWKS
2. Bump token_version for all users (invalidates all access tokens)
3. Force re-login; rotate refresh tokens
4. Audit log: which tokens signed with compromised kid
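The gateway-side behavior (1-hour cache, forced refresh on an unknown kid, one retry, then 401) could be sketched as follows. This is a minimal illustration, not a definitive implementation: `JwksCache`, `fetch_jwks`, and the injectable `clock` are hypothetical names, and the fetcher stands in for an HTTP GET of /.well-known/jwks.json that returns a kid-to-key mapping.

```python
import time
from typing import Callable, Dict, Optional


class JwksCache:
    """Caches a kid -> public key mapping with a TTL and a forced refresh on unknown kid."""

    def __init__(self, fetch_jwks: Callable[[], Dict[str, str]],
                 ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch_jwks      # hypothetical fetcher for the JWKS endpoint
        self._ttl = ttl_seconds
        self._clock = clock
        self._keys: Dict[str, str] = {}
        self._fetched_at: Optional[float] = None

    def _refresh(self) -> None:
        self._keys = dict(self._fetch())
        self._fetched_at = self._clock()

    def get_key(self, kid: str) -> Optional[str]:
        """Return the key for kid, or None if the caller should respond with 401."""
        expired = (self._fetched_at is None
                   or self._clock() - self._fetched_at >= self._ttl)
        if expired:
            self._refresh()
        if kid in self._keys:
            return self._keys[kid]
        if not expired:
            # Unknown kid on a still-fresh cache: force one refresh, then retry once.
            self._refresh()
        return self._keys.get(kid)
```

In practice the forced refresh would likely need its own rate limit, since otherwise requests carrying random kid values could make the gateway hammer the JWKS endpoint.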
OAuth 2.0 Authorization Code Flow with PKCE
1. Client generates code_verifier and code_challenge = BASE64URL(SHA256(code_verifier))
2. Client redirects to auth server: GET /authorize?response_type=code&code_challenge=...&code_challenge_method=S256
3. User authenticates (login + MFA if enabled)
4. Auth server redirects back with authorization code
5. Client exchanges code + code_verifier for tokens at POST /token
6. Auth server validates code + verifier, returns tokens
Why PKCE? Prevents authorization code interception attack.
- Without PKCE: intercepted code can be exchanged for tokens.
- With PKCE: code exchange requires code_verifier that only original client has.
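The client-side and server-side halves of PKCE can be sketched with the standard library alone. The S256 challenge is the unpadded base64url encoding of the SHA-256 digest of the verifier (RFC 7636). The function names below are illustrative, not part of any particular OAuth library.

```python
import base64
import hashlib
import hmac
import secrets


def make_code_verifier() -> str:
    """Random URL-safe verifier; 32 random bytes encode to 43 characters."""
    raw = secrets.token_bytes(32)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def make_code_challenge(code_verifier: str) -> str:
    """S256 challenge: BASE64URL(SHA256(verifier)) without padding."""
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def verify_pkce(stored_challenge: str, presented_verifier: str) -> bool:
    """Server-side check at POST /token, using a constant-time comparison."""
    expected = make_code_challenge(presented_verifier)
    return hmac.compare_digest(expected, stored_challenge)
```

The auth server stores only the challenge alongside the authorization code, so an attacker who intercepts the code still cannot produce a verifier that hashes to it.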
JWT Token Structure and Validation
Access token (JWT):
Header: { "alg": "RS256", "kid": "key-2026-03" }
Payload: {
"sub": "user-uuid", "iss": "https://auth.example.com",
"aud": "api.example.com", "exp": 1710403200,
"scope": "read write", "roles": ["admin"],
"tenant_id": "tenant-uuid"
}
Validation at API Gateway (stateless, < 1ms):
1. Decode JWT -> get kid, look up public key from JWKS
2. Verify signature, check exp, iss, aud
3. Extract roles/permissions
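The claim checks in steps 2 and 3 might look like the sketch below. It assumes the RS256 signature has already been verified by a JWT library (the Python standard library has no RSA verification, so that part is not shown); `validate_claims` and `InvalidToken` are hypothetical names.

```python
import time
from typing import Any, Dict, List, Optional


class InvalidToken(Exception):
    """Raised when a claim check fails; the gateway would map this to 401."""


def validate_claims(payload: Dict[str, Any], expected_iss: str, expected_aud: str,
                    now: Optional[float] = None, leeway_seconds: int = 0) -> List[str]:
    """Check exp, iss and aud on an already signature-verified payload; return roles."""
    current = time.time() if now is None else now
    exp = payload.get("exp")
    if not isinstance(exp, (int, float)) or current >= exp + leeway_seconds:
        raise InvalidToken("token expired or exp missing")
    if payload.get("iss") != expected_iss:
        raise InvalidToken("unexpected issuer")
    aud = payload.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]  # aud may be a string or a list
    if expected_aud not in audiences:
        raise InvalidToken("wrong audience")
    return list(payload.get("roles", []))
```

Passing `now` explicitly keeps the check deterministic in tests; a small `leeway_seconds` is a common way to tolerate clock drift between issuer and gateway.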
Token revocation:
a. Short expiry (15 min) + refresh tokens
b. Token blacklist in Redis
c. Version counter in JWT + DB
RBAC: Role-Based Access Control
Hierarchy: User -> has -> Roles -> have -> Permissions
Example:
User "Alice" -> roles: ["editor", "viewer"]
Role "editor" -> permissions: ["article:create", "article:edit"]
Role "viewer" -> permissions: ["article:read"]
Permission format: resource:action
Storage: user_roles + role_permissions in PostgreSQL
Cache: in JWT + Redis
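The role-to-permission lookup can be sketched in a few lines. Here an in-memory dict stands in for the role_permissions table (or its Redis cache); `effective_permissions` and `is_allowed` are illustrative names.

```python
from typing import Dict, Iterable, Set

# Stand-in for the role_permissions table, using the example roles above.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "editor": {"article:create", "article:edit"},
    "viewer": {"article:read"},
}


def effective_permissions(roles: Iterable[str]) -> Set[str]:
    """Union of the permissions granted by each role; unknown roles grant nothing."""
    permissions: Set[str] = set()
    for role in roles:
        permissions |= ROLE_PERMISSIONS.get(role, set())
    return permissions


def is_allowed(roles: Iterable[str], resource: str, action: str) -> bool:
    """Check a resource:action permission against the user's roles."""
    return f"{resource}:{action}" in effective_permissions(roles)
```

For Alice with roles `["editor", "viewer"]`, `is_allowed(roles, "article", "edit")` is true, while an action no role grants, such as `article:delete`, is denied by default.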
POST /api/v1/auth/register
{ "email": "alice@example.com", "password": "SecureP@ss123", "name": "Alice" }
-> 201 { "user_id": "...", "email_verification_sent": true }
POST /api/v1/auth/login
{ "email": "alice@example.com", "password": "SecureP@ss123" }
-> 200 { "access_token": "eyJ...", "refresh_token": "tGz...", "expires_in": 900 }
OR 200 { "mfa_required": true, "mfa_token": "...", "mfa_methods": ["totp","sms"] }
POST /api/v1/auth/mfa/verify
{ "mfa_token": "...", "code": "123456", "method": "totp" }
-> 200 { "access_token": "eyJ...", "refresh_token": "tGz..." }
POST /api/v1/auth/token/refresh
{ "refresh_token": "tGz..." }
-> 200 { "access_token": "eyJ...(new)", "refresh_token": "tGz...(rotated)" }
POST /api/v1/auth/logout
-> 200 { "logged_out": true }
GET /api/v1/auth/sessions
-> { "sessions": [{ "device": "iPhone", "ip": "...", "current": true }] }
--- Common Errors ---
401 Unauthorized — invalid/expired token, wrong audience
403 Forbidden — insufficient scope or suspended account
429 Too Many Requests — login brute-force (>5 attempts/15 min)
503 Service Unavailable — auth DB failover in progress (retry with backoff)
PostgreSQL
CREATE TABLE users (
user_id UUID PRIMARY KEY, email VARCHAR(255) UNIQUE NOT NULL,
password_hash VARCHAR(255), name VARCHAR(100),
mfa_enabled BOOLEAN DEFAULT FALSE, mfa_secret VARCHAR(64),
token_version INT DEFAULT 0,
status VARCHAR(16) CHECK (status IN ('active','suspended','deleted')), created_at TIMESTAMPTZ
);
CREATE TABLE refresh_tokens (
token_id UUID PRIMARY KEY, user_id UUID NOT NULL,
token_hash VARCHAR(64) NOT NULL,
expires_at TIMESTAMPTZ NOT NULL, revoked BOOLEAN DEFAULT FALSE,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE user_roles ( user_id UUID, role_name VARCHAR(50), PRIMARY KEY (user_id, role_name) );
CREATE TABLE role_permissions ( role_name VARCHAR(50), permission VARCHAR(100), PRIMARY KEY (role_name, permission) );
Redis
session:{session_id} -> Hash { user_id, device, ip, last_active }
revoked_token:{jti} -> "", TTL = token remaining lifetime
permissions:{user_id} -> SET of permission strings, TTL 300
login_attempts:{email} -> INT (max 5 per 15 min), TTL 900
| Concern | Solution |
|---|---|
| Auth service down | JWTs still validated at gateway (stateless); only login/refresh affected |
| Redis down | Token validation falls back to JWT-only (no revocation check) |
| Brute force | Rate limit: 5 login attempts per 15 min per email; CAPTCHA after 3 failures |
| Credential stuffing | Check passwords against HaveIBeenPwned API; flag compromised accounts |
| Token theft | Short access token TTL (15 min); refresh token rotation; device binding |
| Key compromise | Key rotation: new signing key monthly; old keys valid for verification only |
Interview Walkthrough
- Start with OAuth 2.0 authorization code flow with PKCE for SPAs — never implicit flow, never store tokens in localStorage without rotation.
- Walk through SSO: central IdP issues JWT access tokens (15 min) + opaque refresh tokens (7 days) with rotation on every refresh.
- Explain refresh token rotation detecting reuse — an invalidated refresh token used again triggers revocation of all sessions (compromise signal).
- Cover RBAC with permissions cached in Redis (5-min TTL) and invalidated on role change via pub/sub.
- Mention credential stuffing defenses: rate limits, CAPTCHA after 3 failures, breached-password checks, step-up auth for sensitive actions.
- Discuss token revocation via Redis blocklist keyed by JWT jti with TTL matching remaining token lifetime.
- Common pitfall: long-lived JWTs without refresh rotation: a stolen access token grants hours of unauthorized access with no revocation path.
Password Storage: Why Argon2id
- bcrypt: good, but vulnerable to GPU attacks (fixed memory usage).
- scrypt: better (memory-hard), but complex to tune.
- Argon2id: hybrid variant of Argon2, the winner of the Password Hashing Competition. Memory-hard + CPU-hard. Configurable: time, memory, parallelism.
Recommended: Argon2id with 64MB memory, 3 iterations, 4 threads. Each hash takes ~200ms and 64MB RAM -> large-scale GPU attacks become impractical.
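The hash-and-verify pattern is the same for any memory-hard function. Argon2id is not in the Python standard library, so the sketch below swaps in `hashlib.scrypt` purely as a stand-in. The cost parameters and the `salt$digest` storage format are assumptions for illustration, not the recommended Argon2id settings above.

```python
import hashlib
import hmac
import secrets

# scrypt cost parameters (illustrative assumptions; roughly 128 * N * R bytes = 16 MiB).
N, R, P = 2**14, 8, 1


def hash_password(password: str) -> str:
    """Hash with a fresh random salt; returns 'salt_hex$digest_hex'."""
    salt = secrets.token_bytes(16)
    digest = hashlib.scrypt(password.encode("utf-8"), salt=salt, n=N, r=R, p=P, dklen=32)
    return f"{salt.hex()}${digest.hex()}"


def verify_password(password: str, stored: str) -> bool:
    """Recompute the hash with the stored salt and compare in constant time."""
    salt_hex, digest_hex = stored.split("$")
    digest = hashlib.scrypt(password.encode("utf-8"), salt=bytes.fromhex(salt_hex),
                            n=N, r=R, p=P, dklen=32)
    return hmac.compare_digest(digest, bytes.fromhex(digest_hex))
```

With a real Argon2id library the structure stays the same: a per-user random salt, tunable time and memory cost, and a constant-time comparison on verify.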
Refresh Token Rotation
On every token refresh:
1. Validate current refresh_token
2. Issue new access_token + NEW refresh_token
3. Invalidate old refresh_token
If stolen old refresh_token is used: It's already invalidated -> request fails
Detection: invalidated token reuse -> compromise detected -> revoke ALL tokens
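Rotation with reuse detection can be sketched against an in-memory stand-in for the refresh_tokens table. `RefreshTokenStore` and its methods are hypothetical names; expiry checks and access-token issuance are left out to keep the focus on the rotation logic. Tokens are stored as SHA-256 hex digests, which fits a 64-character token_hash column.

```python
import hashlib
import secrets
from typing import Dict, Optional


class RefreshTokenStore:
    """In-memory stand-in for the refresh_tokens table (token_hash, user_id, revoked)."""

    def __init__(self) -> None:
        self._tokens: Dict[str, Dict[str, object]] = {}

    @staticmethod
    def _hash(token: str) -> str:
        return hashlib.sha256(token.encode("utf-8")).hexdigest()

    def issue(self, user_id: str) -> str:
        """Create a new opaque refresh token and store only its hash."""
        token = secrets.token_urlsafe(32)
        self._tokens[self._hash(token)] = {"user_id": user_id, "revoked": False}
        return token

    def revoke_all(self, user_id: str) -> None:
        for record in self._tokens.values():
            if record["user_id"] == user_id:
                record["revoked"] = True

    def rotate(self, presented: str) -> Optional[str]:
        """Return a new refresh token, or None if the request must be rejected."""
        record = self._tokens.get(self._hash(presented))
        if record is None:
            return None  # unknown token
        if record["revoked"]:
            # Reuse of an invalidated token: treat as compromise, revoke everything.
            self.revoke_all(str(record["user_id"]))
            return None
        record["revoked"] = True
        return self.issue(str(record["user_id"]))
```

After a reuse is detected, even the legitimate client's newest token stops working, which forces a fresh login. That is the intended trade-off: the server cannot tell which party is the attacker.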
Credential Stuffing Defense
Defense layers:
1. Rate limiting: max 5 failed attempts per email per 15 min
2. IP rate limiting: max 50 attempts per IP per hour
3. CAPTCHA after 3 failed attempts
4. Device fingerprinting: new device + correct password -> email verification
5. Breached password check against HaveIBeenPwned API
6. Anomaly detection: new country, new device, unusual hour
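Layers 1 and 3 (per-email limit and CAPTCHA threshold) can be sketched as a fixed-window counter. The dict below mimics a Redis counter with a 900-second TTL (INCR on failure, EXPIRE on first write); `FailedLoginLimiter` and the injectable `clock` are illustrative names, and a real deployment would keep this state in Redis rather than process memory.

```python
import time
from typing import Callable, Dict, Tuple


class FailedLoginLimiter:
    """Fixed-window failure counter per email, mirroring a counter key with a TTL."""

    def __init__(self, max_failures: int = 5, window_seconds: float = 900.0,
                 captcha_after: int = 3,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = max_failures
        self._window = window_seconds
        self._captcha_after = captcha_after
        self._clock = clock
        self._counters: Dict[str, Tuple[int, float]] = {}  # email -> (count, expires_at)

    def _count(self, email: str) -> int:
        entry = self._counters.get(email)
        if entry is None or self._clock() >= entry[1]:
            self._counters.pop(email, None)  # emulate TTL expiry
            return 0
        return entry[0]

    def record_failure(self, email: str) -> None:
        count = self._count(email)
        if count == 0:
            # First failure in a window starts the TTL.
            self._counters[email] = (1, self._clock() + self._window)
        else:
            _, expires_at = self._counters[email]
            self._counters[email] = (count + 1, expires_at)

    def needs_captcha(self, email: str) -> bool:
        return self._count(email) >= self._captcha_after

    def is_blocked(self, email: str) -> bool:
        return self._count(email) >= self._max
```

The login handler would call `is_blocked` before checking the password and `record_failure` after a failed check; normalizing the email (for example, lowercasing) before using it as a key avoids trivial bypasses.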
Account Takeover (ATO) Detection
Signals of compromise:
- Password changed + email changed within 5 minutes
- Login from new country + immediate sensitive action
- Session from unrecognized device + bulk data export
Action:
1. Send notification to all active sessions
2. Require re-authentication for sensitive actions
3. If password AND email changed: lock account + send recovery to ORIGINAL email
Step-up authentication: Sensitive actions require re-entering password even if session is valid.
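The first signal (password and email both changed within 5 minutes) is simple enough to express as a rule over an account's audit events. The event shape and the names `AccountEvent` and `password_and_email_changed_together` are assumptions for illustration; a real system would read these from its audit log.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AccountEvent:
    kind: str         # e.g. "password_changed" or "email_changed"
    timestamp: float  # seconds since epoch


def password_and_email_changed_together(events: Iterable[AccountEvent],
                                        window_seconds: float = 300.0) -> bool:
    """True if any password change and any email change fall within the window."""
    events = list(events)
    password_times = [e.timestamp for e in events if e.kind == "password_changed"]
    email_times = [e.timestamp for e in events if e.kind == "email_changed"]
    return any(abs(p - e) <= window_seconds
               for p in password_times for e in email_times)
```

When the rule fires, the response described above applies: lock the account and send recovery instructions to the original email address, not the newly set one.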
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (0 to 100K users)
Monolith or minimal services proving the core authentication and authorization (OAuth 2.0/SSO) flows. Optimize for shipping speed and correctness over scale.
Key components: Single region · Primary DB + Redis cache · Synchronous core path · Basic monitoring
Move to next phase when: p99 latency exceeds SLO or DB CPU sustained above 70%
Phase 2: Growth (100K to 10M users)
Split read/write paths, introduce async processing for non-critical work, add caching layers and horizontal scaling.
Key components: Read replicas or CQRS · Message queue for async work · CDN / edge caching · Service-level SLOs
Move to next phase when: Hot keys, fan-out bottlenecks, or ops toil from manual scaling
Phase 3: Scale (10M+ users)
Shard data plane, multi-region active-active or active-passive, formal DR runbooks, cost optimization.
Key components: Database sharding / partitioning · Multi-region replication · Auto-scaling + chaos testing · Dedicated platform/SRE ownership
Move to next phase when: Regional failure domain risk, compliance data residency, or linear cost growth unsustainable
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Core user-facing availability | 99.95% | Budget for planned maintenance + unplanned failures without user-visible outage. |
| p99 latency (critical path) | Problem-specific — state target early and tie to capacity math | Interview credibility comes from connecting SLO to architecture choices. |
| Error rate (5xx) | < 0.1% | Distinguishes transient blips from systemic failure requiring rollback. |
| Data durability | 99.999999999% (11 nines) for committed writes | Define which operations require fsync/quorum vs async replication. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Primary database unavailable | Health check failures, connection pool exhaustion alerts, elevated 5xx | Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists |
| Traffic spike (10× normal) | RPS anomaly alert, autoscaling lag, latency SLO burn rate | Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations |
| Bad deploy causing elevated errors | Canary metric regression, error budget burn, deployment correlation | Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility |
Cost Drivers (Staff lens)
- Egress bandwidth and CDN (often dominates media/data-heavy systems)
- Database storage + IOPS at scale (plan compaction, TTL, tiering)
- Compute for async pipelines (right-size workers, spot instances for batch)
- Managed service premiums vs operational headcount trade-off
Multi-Region & DR
Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.