Design a Service Discovery System

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Show domain depth beyond the baseline: async pipelines, consistency semantics, and operational SLOs.
Arch 75	Staff angles: partition behavior, cost drivers, and MVP → production evolution with clear triggers.

Interview Prompt

Design a service discovery system.

Clarifying Questions (ask before designing)

Question	Why it matters
Which of these is highest priority: Client-side vs server-side discovery, Health check mechanisms, DNS-based vs registry (Consul/etcd)?	Forces scope negotiation — senior candidates trim before drawing boxes.
What scale should we design for — DAU, QPS, data volume?	Drives every capacity decision; shows structured thinking.
What are the read vs write patterns on the critical path?	Determines caching, DB choice, and replication topology.
What consistency and durability guarantees are required?	Separates strong-consistency paths from eventual ones — a senior differentiator.

Scope

In scope

Client-side vs server-side discovery
Health check mechanisms
DNS-based vs registry (Consul/etcd)
Self-registration
Stale entry eviction
Capacity estimation with shown math

Out of scope (state explicitly)

Full service mesh control plane
Application business logic in downstream services
Building a custom service registry from scratch when managed options exist

Assumptions

Clarify scale (DAU, QPS, data volume) for service discovery in the first 5 minutes
Standard reliability target 99.9%–99.99% unless problem implies higher (payments, booking)
Managed cloud services (RDS, S3, Kafka, Redis) are acceptable building blocks

Services register themselves on startup (name, address, port, health endpoint, metadata)
Services deregister on shutdown (graceful) or get auto-deregistered on failure
Clients look up healthy instances of a service by name
Health checking: periodic checks (HTTP, TCP, gRPC) to detect unhealthy instances
Support for multiple environments/namespaces (prod, staging, per-tenant)
Service metadata: version, region, weight, canary flags
Watch/subscribe: clients get notified of changes (new instances, removals)
DNS-based and API-based discovery
Load balancing integration: return instances in weighted/round-robin order

Metric	Calculation	Value
Services	Given	5,000 distinct names
Instances (total)	Given	100,000
Registrations / sec	Estimated from deploy and autoscaling churn	50
Health checks / sec	100,000 instances ÷ 10 s check interval	10K
Lookups / sec	Estimated aggregate across all clients	1M (with client caching, most served locally)
Watch subscribers	Given	50K (one per client instance)
Metadata per instance	Given	~500 bytes
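
The figures in the table can be sanity-checked with a few lines of arithmetic. This is a rough sketch: the 10-second check interval and the 1% cache-miss rate are assumptions chosen for illustration, not numbers given in the table.

```python
# Inputs taken from the capacity table.
INSTANCES = 100_000
METADATA_BYTES = 500            # approximate size of one instance record
LOOKUPS_PER_SEC = 1_000_000

# Assumptions (not stated in the table).
CHECK_INTERVAL_SEC = 10         # one health check per instance every 10 s
CACHE_MISS_PERCENT = 1          # hypothetical share of lookups that reach the registry

health_checks_per_sec = INSTANCES // CHECK_INTERVAL_SEC
registry_bytes = INSTANCES * METADATA_BYTES
registry_lookups_per_sec = LOOKUPS_PER_SEC * CACHE_MISS_PERCENT // 100

print(f"health checks/sec: {health_checks_per_sec:,}")
print(f"registry data set: {registry_bytes / 1e6:.0f} MB")
print(f"lookups reaching the registry/sec: {registry_lookups_per_sec:,}")
```

At roughly 50 MB, the full data set fits comfortably in memory on every registry node, which is what makes an in-memory, Raft-replicated store plausible here.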

The proposed design is a Consul-like service discovery system: a Raft-consistent registry, health checking by local agents, gossip-based membership with anti-entropy sync, and client-side caching with watch support for near-real-time updates.

Server-Side vs Client-Side Discovery

Approach	How	Pros	Cons
Server-Side (ELB/K8s Service)	Client → LB → Backend	Client is simple (just knows LB address)	Extra hop, LB is bottleneck
Client-Side (Consul/Eureka) ⭐	Client queries registry → picks instance directly	No extra hop, locality-aware decisions	Per-language library needed
Service Mesh (Envoy/Istio)	Client → local Envoy sidecar → Backend	Language-agnostic, rich features	Sidecar resource overhead
Kubernetes DNS	svc.namespace.svc.cluster.local → ClusterIP	Built-in, zero setup	Limited health checking, DNS caching
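
With client-side discovery, the client itself chooses an instance from the list the registry returns. A minimal sketch of weighted selection among healthy instances follows; the `Instance` type and `pick_instance` function are hypothetical names introduced for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    address: str
    port: int
    weight: int
    healthy: bool = True


def pick_instance(instances, rng=random):
    """Weighted random choice among healthy instances with a positive weight."""
    candidates = [i for i in instances if i.healthy and i.weight > 0]
    if not candidates:
        raise LookupError("no healthy instances available")
    weights = [i.weight for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


pool = [
    Instance("10.0.1.42", 8080, weight=100),
    Instance("10.0.1.43", 8080, weight=50),
    Instance("10.0.1.44", 8080, weight=100, healthy=False),
]
chosen = pick_instance(pool)
print(f"routing to {chosen.address}:{chosen.port}")
```

Passing a seeded `random.Random` as `rng` makes the selection reproducible in tests.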

Health Check Strategies

Level 0: TCP Check: Is port open? Catches process crash. < 1ms.

Level 1: HTTP Shallow ⭐: GET /healthz → 200. Catches HTTP server failure. < 5ms.

Level 2: HTTP Deep: Checks DB connection, cache, external APIs. Catches dependency failures but risks cascading. 10-100ms.

Level 3: Liveness + Readiness (K8s): /healthz/live (is the process stuck?) vs /healthz/ready (can it handle traffic?). Service starting: live=true, ready=false. DB lost: live=true, ready=false. Deadlocked: live=false, so Kubernetes restarts the container.

Recommendation: SD health check = Level 1 (shallow, every 5s). Application readiness = Level 2 (deep, every 30s). K8s liveness = Level 0 or 1. NEVER use deep health checks at high frequency with SD.
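
Whatever the check depth, a single failed probe should usually not flip an instance to unhealthy. The sketch below tracks consecutive results and only changes status after a threshold; the thresholds (3 failures, 2 successes) and the `HealthTracker` name are assumptions for illustration. The probe is injected as a callable so the logic can be exercised without a network.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HealthTracker:
    probe: Callable[[], bool]        # e.g. a shallow GET /healthz returning True on 200
    failure_threshold: int = 3       # assumed: failures in a row before "critical"
    success_threshold: int = 2       # assumed: successes in a row before "passing" again
    status: str = "passing"
    consecutive_failures: int = 0
    consecutive_successes: int = 0

    def run_check(self) -> str:
        """Run one probe and return the (possibly updated) status."""
        try:
            ok = self.probe()
        except Exception:
            ok = False               # a probe that raises counts as a failure
        if ok:
            self.consecutive_failures = 0
            self.consecutive_successes += 1
            if (self.status == "critical"
                    and self.consecutive_successes >= self.success_threshold):
                self.status = "passing"
        else:
            self.consecutive_successes = 0
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.status = "critical"
        return self.status


results = iter([False, False, False, True, True])
tracker = HealthTracker(probe=lambda: next(results))
history = [tracker.run_check() for _ in range(5)]
print(history)
```

Requiring several successes before recovery avoids flapping when an instance is only intermittently reachable.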

Anti-Entropy & Convergence

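Consul uses the Serf gossip protocol for cluster membership and node failure detection; nodes exchange these updates peer-to-peer. Service and check state follows a separate path. In this design, every 30s each agent runs an anti-entropy sync that reconciles its full local state with the servers' catalog. On conflict, the latest write wins (LWW with Lamport timestamps); note that in Consul itself the agent's local state is authoritative for the services it owns. All nodes converge within seconds (< 5s). Health checks are NOT centralized on servers: each local agent health-checks the services on its own node and reports status changes to the servers.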
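
One anti-entropy pass can be sketched as a diff between the agent's local state and the slice of the server catalog that belongs to that agent's node. This sketch makes a simplifying assumption: the agent's view always wins, so no timestamp comparison is shown. The function names are hypothetical.

```python
def reconcile(local: dict, catalog: dict) -> dict:
    """Compute the changes needed to make the catalog match the agent's state."""
    return {
        "register": {sid: rec for sid, rec in local.items() if sid not in catalog},
        "update": {sid: rec for sid, rec in local.items()
                   if sid in catalog and catalog[sid] != rec},
        "deregister": [sid for sid in catalog if sid not in local],
    }


def apply_changes(catalog: dict, changes: dict) -> None:
    """Apply a reconcile() result to the catalog in place."""
    catalog.update(changes["register"])
    catalog.update(changes["update"])
    for sid in changes["deregister"]:
        del catalog[sid]


local_state = {"web-1": {"health": "passing"}, "web-2": {"health": "critical"}}
server_catalog = {"web-2": {"health": "passing"}, "web-3": {"health": "passing"}}

changes = reconcile(local_state, server_catalog)
apply_changes(server_catalog, changes)
print(changes)
```

Running the same pass again yields an empty change set, which is the property that makes periodic full syncs safe to repeat.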

Graceful Deployment with Service Discovery

Blue-Green: Deploy v2 and register it with tag "v2", gradually shift weight v1→v2, then deregister v1.

Canary: Register one canary instance with tag "canary", route 5% of traffic to it via metadata, monitor, then promote or deregister.

Connection Draining: Before deregistration, mark the instance as "draining": SD stops sending it new requests, waits 30s for in-flight requests to complete, then fully deregisters it.
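
Connection draining amounts to a small state machine per instance. The sketch below is a toy model, with the class and method names invented for illustration: a draining instance drops out of lookups immediately but is only removed once the 30s window has elapsed. The clock is injectable so the behavior can be tested without sleeping.

```python
import time

DRAIN_SECONDS = 30  # drain window described in the text


class DrainingRegistry:
    """Toy registry tracking an 'active' or 'draining' state per instance."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._instances = {}  # instance_id -> {"state": str, "deadline": float or None}

    def register(self, instance_id):
        self._instances[instance_id] = {"state": "active", "deadline": None}

    def start_draining(self, instance_id):
        entry = self._instances[instance_id]
        entry["state"] = "draining"
        entry["deadline"] = self._clock() + DRAIN_SECONDS

    def routable(self):
        """Instances that may receive new requests."""
        return sorted(i for i, e in self._instances.items() if e["state"] == "active")

    def reap(self):
        """Remove draining instances whose window has elapsed; return their ids."""
        now = self._clock()
        expired = sorted(i for i, e in self._instances.items()
                         if e["state"] == "draining" and e["deadline"] <= now)
        for instance_id in expired:
            del self._instances[instance_id]
        return expired


now = [0.0]                                   # fake clock for the demo
registry = DrainingRegistry(clock=lambda: now[0])
registry.register("v1-a")
registry.register("v2-a")
registry.start_draining("v1-a")
print(registry.routable())
```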

Event Bus Design (Kafka)

Topic: service_discovery-events
  Partitions: 64 (scale consumers horizontally)
  Partition key: service name (preserves per-service ordering of registry events)
  Retention: 7 days (compliance) or 24h (high-volume telemetry)
  Replication factor: 3, min.insync.replicas: 2

Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "service_discovery-processors"
  - At-least-once delivery + idempotent handlers (dedup by event_id)
  - DLQ topic: service_discovery-events-dlq (poison messages after 3 retries)
  - Lag alert: consumer lag > 60s → scale workers

Rule: async side effects MUST NOT block the synchronous API response.
  Sync path: validate → persist source of truth → publish event → return 201
  Async path: consumers update caches, indexes, notifications, aggregates
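
The consumer-side guarantees above (at-least-once delivery, dedup by event_id, DLQ for poison messages) can be sketched without a broker. This in-memory version treats the limit as three failed attempts per event, which is an assumption about how "3 retries" is counted; the `consume` function and the event shape are hypothetical.

```python
MAX_ATTEMPTS = 3  # assumed: give up after three failed attempts


def consume(events, handler, seen_ids, dlq):
    """Process events at-least-once, skipping duplicates and parking poison messages."""
    for event in events:
        if event["event_id"] in seen_ids:
            continue                      # duplicate delivery: already handled
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                handler(event)
                seen_ids.add(event["event_id"])
                break
            except Exception:
                if attempt == MAX_ATTEMPTS:
                    dlq.append(event)     # poison message goes to the dead-letter queue


applied, attempts = [], {}


def handler(event):
    attempts[event["event_id"]] = attempts.get(event["event_id"], 0) + 1
    if event.get("poison"):
        raise ValueError("cannot process event")
    applied.append(event["event_id"])


events = [
    {"event_id": "e1", "type": "instance_registered"},
    {"event_id": "e1", "type": "instance_registered"},   # redelivered duplicate
    {"event_id": "e2", "type": "instance_registered", "poison": True},
]
seen, dead_letters = set(), []
consume(events, handler, seen, dead_letters)
print(applied, [e["event_id"] for e in dead_letters])
```

In a real deployment the `seen_ids` set would need to live in durable storage shared by the consumer group, since an in-process set is lost on restart.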

Service Registry (In-Memory + Raft-Replicated)

JSON

{
  "services": {
    "payment-svc": {
      "instances": {
        "payment-svc-i-abc123": {
          "id": "payment-svc-i-abc123",
          "address": "10.0.1.42",
          "port": 8080,
          "tags": ["v2.3.1", "canary"],
          "meta": {"region": "us-east", "weight": "100"},
          "health": "passing",
          "last_heartbeat": "2026-03-14T10:05:00Z",
          "registered_at": "2026-03-14T08:00:00Z",
          "check": {
            "type": "http",
            "endpoint": "http://10.0.1.42:8080/healthz",
            "interval_sec": 10,
            "consecutive_failures": 0
          }
        }
      }
    }
  }
}
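
A minimal in-memory sketch of the operations on this structure: register, heartbeat, lookup, and stale-entry eviction. Raft replication is deliberately left out, and the 30-second TTL, the class name, and the method names are assumptions for illustration.

```python
import time


class Registry:
    """Single-node, in-memory registry with heartbeat-based eviction."""

    def __init__(self, ttl_sec=30.0, clock=time.monotonic):
        self._ttl = ttl_sec
        self._clock = clock
        self._services = {}  # service name -> {instance_id: record}

    def register(self, service, instance_id, address, port):
        self._services.setdefault(service, {})[instance_id] = {
            "id": instance_id, "address": address, "port": port,
            "health": "passing", "last_heartbeat": self._clock(),
        }

    def heartbeat(self, service, instance_id):
        record = self._services.get(service, {}).get(instance_id)
        if record is None:
            return False                 # unknown instance: caller should re-register
        record["last_heartbeat"] = self._clock()
        return True

    def deregister(self, service, instance_id):
        self._services.get(service, {}).pop(instance_id, None)

    def evict_stale(self):
        """Drop instances whose last heartbeat is older than the TTL."""
        cutoff = self._clock() - self._ttl
        evicted = []
        for service, instances in self._services.items():
            stale = [i for i, r in instances.items() if r["last_heartbeat"] < cutoff]
            for instance_id in stale:
                del instances[instance_id]
                evicted.append((service, instance_id))
        return evicted

    def lookup(self, service):
        """Return (address, port) pairs for passing instances of a service."""
        return [(r["address"], r["port"])
                for r in self._services.get(service, {}).values()
                if r["health"] == "passing"]


now = [0.0]                              # fake clock for the demo
registry = Registry(ttl_sec=30.0, clock=lambda: now[0])
registry.register("payment-svc", "payment-svc-i-abc123", "10.0.1.42", 8080)
registry.register("payment-svc", "payment-svc-i-def456", "10.0.1.43", 8080)
now[0] = 25.0
registry.heartbeat("payment-svc", "payment-svc-i-abc123")
now[0] = 40.0
print(registry.evict_stale(), registry.lookup("payment-svc"))
```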

Consul vs etcd vs ZooKeeper Comparison

Feature	Consul	etcd	ZooKeeper
Consensus	Raft	Raft	ZAB
Health checks	✅ Built-in	❌ Must build	❌ Must build
DNS interface	✅ Built-in	❌ External	❌ External
Multi-DC	✅ WAN gossip	❌ Single cluster	❌ Single cluster
Watch	Long-poll	gRPC watch	ZooKeeper watches
Used by	HashiCorp stack	Kubernetes	Kafka, HBase

DNS-Based vs API-Based Service Discovery

Aspect	DNS-Based	API-Based ⭐
Universality	✓ Every language supports DNS natively	✗ Requires client library per language
Data richness	✗ IP only via A/AAAA records (SRV records add port, priority, and weight)	✓ Full metadata: tags, health, weight
Freshness	✗ DNS caching (TTL-dependent)	✓ Instant via long-poll/watch
Smart routing	✗ Round-robin only	✓ Weighted, canary, version-based

Recommendation: Use DNS for simple cases (Kubernetes internal). Use API for microservices with advanced routing needs. Many systems use both: DNS for initial discovery, API for watch/health.
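
The long-poll style of watch can be sketched as a blocking read keyed on a change index: the client sends the last index it saw, and the call returns as soon as the index advances or the timeout expires. This is an illustrative in-process model, not any particular registry's API; the class and method names are hypothetical, and the index here is global rather than per service.

```python
import threading


class WatchableCatalog:
    """In-process model of a long-poll watch using a monotonically increasing index."""

    def __init__(self):
        self._cond = threading.Condition()
        self._index = 0
        self._services = {}

    def set_instances(self, service, addresses):
        with self._cond:
            self._services[service] = list(addresses)
            self._index += 1
            self._cond.notify_all()          # wake every blocked watcher

    def blocking_get(self, service, last_index, timeout):
        """Return (index, instances) once index > last_index or the timeout expires."""
        with self._cond:
            self._cond.wait_for(lambda: self._index > last_index, timeout=timeout)
            return self._index, list(self._services.get(service, []))


catalog = WatchableCatalog()
catalog.set_instances("payment-svc", ["10.0.1.42:8080"])

# First call: the client has seen nothing yet, so it returns immediately.
index, instances = catalog.blocking_get("payment-svc", last_index=0, timeout=0.1)

# A change arrives while the client is blocked on the next call.
timer = threading.Timer(
    0.05, catalog.set_instances,
    args=("payment-svc", ["10.0.1.42:8080", "10.0.1.43:8080"]))
timer.start()
new_index, new_instances = catalog.blocking_get("payment-svc", last_index=index, timeout=2.0)
timer.join()
print(new_index, new_instances)
```

A client cache would loop on `blocking_get`, replacing its local copy whenever the returned index is higher than the one it sent.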

Health Check Depth Spectrum

Level 0 (TCP): < 1ms, catches process crash.

Level 1 ⭐ (HTTP Shallow): < 5ms, catches HTTP failure. Safe for SD at 5s intervals.

Level 2 (HTTP Deep): 10-100ms, catches dependency failures but risks cascading (DB slow → all services marked unhealthy).

Level 3 (Liveness + Readiness): K8s pattern.

Recommendation: SD health check = Level 1. Application readiness = Level 2 (every 30s). K8s liveness = Level 0 or 1. NEVER use deep checks at high frequency.

SLOs & Error Budgets

Metric	Target	Rationale
Core user-facing availability	99.95%	Budget for planned maintenance + unplanned failures without user-visible outage.
p99 latency (critical path)	Problem-specific — state target early and tie to capacity math	Interview credibility comes from connecting SLO to architecture choices.
Error rate (5xx)	< 0.1%	Distinguishes transient blips from systemic failure requiring rollback.
Data durability	99.999999999% (11 nines) for committed writes	Define which operations require fsync/quorum vs async replication.
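
An availability target translates directly into a downtime budget. The arithmetic below assumes a 30-day window, which is a common convention rather than something the table specifies.

```python
def downtime_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given availability target."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_pct / 100)


for target in (99.9, 99.95, 99.99):
    print(f"{target}% over 30 days: {downtime_budget_minutes(target):.1f} min")
```

At 99.95% the budget is about 21.6 minutes per 30 days, so a single slow failover can consume most of a month's allowance.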

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Primary database unavailable	Health check failures, connection pool exhaustion alerts, elevated 5xx	Failover to replica / promote standby; enable read-only degraded mode if writes impossible; queue writes if async path exists
Traffic spike (10× normal)	RPS anomaly alert, autoscaling lag, latency SLO burn rate	Rate limit non-critical endpoints; scale read path horizontally; pre-warm caches; shed load on expensive operations
Bad deploy causing elevated errors	Canary metric regression, error budget burn, deployment correlation	Automated rollback within 5 minutes; feature flag kill switch; maintain N-1 compatibility

Cost Drivers (Staff lens)

Egress bandwidth and CDN (often dominates media/data-heavy systems)
Database storage + IOPS at scale (plan compaction, TTL, tiering)
Compute for async pipelines (right-size workers, spot instances for batch)
Managed service premiums vs operational headcount trade-off

Multi-Region & DR

Start single-region with cross-AZ redundancy. Add read replicas in secondary region for DR. Move to active-active only when latency SLO or data residency requires it — accept conflict resolution complexity explicitly.
