This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 25 | The politeness + scale problem. Nail BFS URL frontier, per-domain rate limiting, robots.txt caching, SimHash near-duplicate detection, and DNS caching. Show why naive parallel crawling gets you blocked. |
| Arch 50 | Add priority crawling, freshness scheduling, and distributed coordination without single-point bottleneck. |
| Arch 75 | Staff: crawl budget allocation across 100M domains, adversarial sites (crawler traps), and index pipeline backpressure. |
Interview Prompt
Design a web crawler that discovers and downloads billions of web pages for a search engine index. Respect robots.txt, avoid overloading sites, and deduplicate near-identical content.
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| What's the crawl rate target and how many domains? | 1B pages/month across 100M domains = strict per-domain politeness. 1B pages from 10K domains = different bottleneck. |
| Freshness priority or breadth-first discovery? | BFS discovers new sites; freshness scheduler re-crawls known pages — often separate queues. |
| Exact dedup or near-duplicate detection? | Exact URL dedup is a hash set; near-dedup (mirrors, templates) needs SimHash/MinHash. |
| JavaScript-rendered pages in scope? | Headless browser crawling is 100× slower — usually a separate rendering pipeline. |
Scope
In scope
- URL frontier with BFS and priority scheduling
- Politeness: per-domain rate limiting
- robots.txt fetch, parse, and cache
- Content deduplication (exact URL + SimHash near-dedup)
- DNS caching and resolution
- Distributed crawler worker architecture
Out of scope (state explicitly)
- HTML parsing and index extraction (downstream indexer)
- JavaScript rendering (headless Chrome farm)
- Login/authenticated crawling
- Image/video binary crawling
Assumptions
- Crawl 1B pages/month (~400 pages/sec sustained)
- 100M unique domains, avg 10 pages/domain
- Default politeness: 1 request/sec/domain, 10 concurrent per domain max
- Average page size 50 KB, 90% HTML text
Functional Requirements
- Crawl the web starting from a set of seed URLs
- Discover new URLs by extracting links from crawled pages
- Download and store web page content for indexing
- Respect robots.txt directives (politeness)
- Handle URL deduplication (don't crawl the same page twice)
- Support recrawling to detect updated content
- Prioritize important/popular pages for crawling first
- Handle different content types (HTML, PDF, images, etc.)
Non-Functional Requirements
- High Throughput: Crawl 1 billion pages per day
- Politeness: Don't overload any single web server (rate limit per domain)
- Robustness: Handle spider traps, infinite loops, malformed HTML, server errors
- Scalability: Horizontally scalable: add machines to crawl faster
- Freshness: Re-crawl pages based on change frequency
- Extensible: Easy to add new modules (content extraction, language detection, etc.)
- Fault Tolerant: A single node failure should not lose crawl progress
Back-of-the-Envelope Estimation
| Metric | Calculation | Value |
|---|---|---|
| Pages to crawl | Assumption | 15B (entire web, roughly) |
| Target crawl rate | Assumption | 1B pages/day |
| Pages / sec | 1B ÷ 86,400 s | ~11,500 |
| Avg page size | Assumption (typical workload) | 100 KB |
| Download / day | 1B × 100 KB | 100 TB |
| Download bandwidth | 100 TB × 8 bits ÷ 86,400 s | ~9.3 Gbps (provision ~10 Gbps) |
| URL frontier size | 10B URLs × 100 bytes | 1 TB |
| Content storage / month | 100 TB/day × 30 days | 3 PB |
| Crawl workers | ~11,500 pages/sec ÷ ~12 pages/sec per machine | ~1,000 machines |
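The arithmetic behind the table can be reproduced in a few lines. This is only a sketch of the estimate: it uses the table's own inputs with decimal units (1 KB = 1,000 bytes), and the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400

# Inputs taken from the table above (decimal units: 1 KB = 1,000 bytes).
pages_per_day = 1_000_000_000
avg_page_bytes = 100 * 1_000
frontier_urls = 10_000_000_000
bytes_per_url = 100
pages_per_machine_per_sec = 12

pages_per_sec = pages_per_day / SECONDS_PER_DAY
download_tb_per_day = pages_per_day * avg_page_bytes / 1e12
bandwidth_gbps = pages_per_sec * avg_page_bytes * 8 / 1e9
storage_pb_per_month = download_tb_per_day * 30 / 1_000
frontier_tb = frontier_urls * bytes_per_url / 1e12
machines = pages_per_sec / pages_per_machine_per_sec

print(f"pages/sec          ~{pages_per_sec:,.0f}")
print(f"download/day       {download_tb_per_day:.0f} TB")
print(f"bandwidth          ~{bandwidth_gbps:.1f} Gbps")
print(f"storage/month      {storage_pb_per_month:.0f} PB")
print(f"frontier size      {frontier_tb:.0f} TB")
print(f"crawl machines     ~{machines:,.0f}")
```

The sustained bandwidth works out slightly below 10 Gbps, so the 10 Gbps figure leaves a little headroom for retries and robots.txt fetches.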
URL Frontier: The Most Complex Component
The URL frontier is NOT a simple queue. It has two sub-systems:
1. Priority Queue (Front Queue): Determines WHICH URLs to crawl first. Priority based on PageRank, change frequency, domain authority, and freshness requirement. Implementation: Multiple priority queues (P0, P1, P2, ..., PN).
2. Politeness Queue (Back Queue): Ensures we don't overwhelm any single domain. One FIFO queue per domain. A rate limiter enforces max 1 request per domain per second (or per Crawl-delay in robots.txt).
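A minimal sketch of the front-queue selection step, assuming four priority queues and illustrative weights (the names `QUEUE_WEIGHTS`, `pick_front_queue`, and `next_url` are hypothetical). Weighted random selection is used here instead of strict priority so that low-priority queues are not starved.

```python
import random
from collections import deque

# Illustrative weights (an assumption, not a tuned policy).
QUEUE_WEIGHTS = {"P0": 0.4, "P1": 0.3, "P2": 0.2, "P3": 0.1}


def pick_front_queue(front_queues, rng):
    """Pick a non-empty priority queue, biased toward higher priority."""
    candidates = [name for name, queue in front_queues.items() if queue]
    if not candidates:
        return None
    weights = [QUEUE_WEIGHTS[name] for name in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


def next_url(front_queues, rng):
    """Pop the next URL to hand to the per-domain back queues."""
    name = pick_front_queue(front_queues, rng)
    return front_queues[name].popleft() if name else None


rng = random.Random(7)
front_queues = {name: deque() for name in QUEUE_WEIGHTS}
front_queues["P0"].append("https://bbc.com/latest")
front_queues["P3"].append("https://random-site.xyz/page42")
print(next_url(front_queues, rng))
```

A URL popped here is not fetched directly; it is routed to the back queue of its domain, which decides when the fetch is allowed.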
Fetcher (HTTP Downloader)
- Connection pooling: Reuse HTTP connections to the same host
- Timeouts: Connect timeout = 5s, read timeout = 30s
- Redirect handling: Follow up to 5 redirects (301, 302)
- robots.txt compliance: Before crawling any page on a domain, fetch and cache robots.txt. Parse directives: Disallow, Allow, Crawl-delay, Sitemap. Cache robots.txt per domain with TTL = 24 hours
- Content types: Accept HTML, PDF, DOC (reject binary, video, etc.)
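The robots.txt check can be sketched with Python's standard `urllib.robotparser`. The robots.txt body and the `MyCrawler` user agent below are made-up examples; a real fetcher would download the file once per domain and cache the parsed rules for the 24-hour TTL.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt body (hypothetical content for illustration).
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
Disallow: /private
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml
"""

USER_AGENT = "MyCrawler"


def load_rules(robots_body: str) -> RobotFileParser:
    """Parse an already-downloaded robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
allowed = rules.can_fetch(USER_AGENT, "https://example.com/products/shoes")
blocked = rules.can_fetch(USER_AGENT, "https://example.com/admin/users")
delay = rules.crawl_delay(USER_AGENT)   # seconds between requests, or None
sitemaps = rules.site_maps()            # available in Python 3.8+

print(allowed, blocked, delay, sitemaps)
```

The `Crawl-delay` value, when present, should override the default per-domain rate limit in the politeness queue.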
Content Deduplication
Problem: Many pages have identical or near-identical content (mirrors, syndication, plagiarism).
- Exact dedup: MD5/SHA-256 hash of page content → check against seen hashes
- Near-dedup: SimHash or MinHash algorithm. Two pages are near-duplicates if Hamming distance of their SimHashes ≤ 3
- Storage: Bloom filter for exact hashes, SimHash table for near-dupes
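A minimal SimHash sketch, assuming whitespace-delimited word tokens with equal weights and MD5 as the per-token hash (production systems typically use weighted shingles). The function names are illustrative.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint from word tokens (unweighted)."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where the two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: int, b: int, threshold: int = 3) -> bool:
    """Threshold of 3 follows the rule stated above."""
    return hamming_distance(a, b) <= threshold


page_a = simhash("Breaking news: markets rally as rates hold steady")
page_b = simhash("markets rally as rates hold steady Breaking news")
print(hamming_distance(page_a, page_b))
```

Because the fingerprint is built from a bag of tokens, reordering the same words yields the same SimHash, while unrelated pages differ in roughly half of the 64 bits.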
URL Deduplication (Seen URLs)
- Solution: Bloom filter with 100 billion entries (~120 GB for 1% false positive rate)
- Before adding a URL to the frontier, check the Bloom filter
- URL normalization: lowercase scheme/host, remove fragment, remove default ports, sort query parameters, resolve relative paths, remove tracking parameters
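The memory figure follows from the standard Bloom filter sizing formulas: bits m = -n ln(p) / (ln 2)^2 and hash count k = (m/n) ln 2. The sketch below computes them and includes a small in-memory filter; the class is illustrative (double hashing derived from SHA-256 is an assumption), not a distributed implementation.

```python
import hashlib
import math


def bloom_parameters(n_items: int, fp_rate: float):
    """Return (bits, hash_count) for the target false positive rate."""
    bits = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    hash_count = round(bits / n_items * math.log(2))
    return bits, hash_count


class BloomFilter:
    def __init__(self, n_items: int, fp_rate: float):
        self.size, self.hash_count = bloom_parameters(n_items, fp_rate)
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k positions from two 64-bit halves.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bits, hash_count = bloom_parameters(100_000_000_000, 0.01)
print(f"{bits / 8 / 1e9:.0f} GB, {hash_count} hash functions")

seen = BloomFilter(n_items=1_000, fp_rate=0.01)
seen.add("https://example.com/page")
print("https://example.com/page" in seen)
```

A false positive means a never-seen URL is skipped, so the 1% rate is a coverage trade-off rather than a correctness bug.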
DNS Resolver
- DNS lookups are slow (50-200ms) and become a bottleneck at scale
- Solution: Local DNS cache + custom DNS resolver
- Pre-resolve DNS for URLs in the frontier (batch DNS lookups)
- Cache DNS results with TTL
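A minimal sketch of a local DNS cache with TTL. It assumes a fixed TTL because `socket.gethostbyname` does not expose the record's own TTL; the `DnsCache` class and its injectable `resolver` and `clock` parameters are hypothetical, added so the cache can be exercised without network access.

```python
import socket
import time


class DnsCache:
    """Cache host -> IP lookups for a fixed TTL (seconds)."""

    def __init__(self, ttl_seconds=300.0, resolver=socket.gethostbyname,
                 clock=time.monotonic):
        self._ttl = ttl_seconds
        self._resolver = resolver
        self._clock = clock
        self._entries = {}  # host -> (ip, expires_at)

    def resolve(self, host: str) -> str:
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and entry[1] > now:
            return entry[0]                      # cache hit
        ip = self._resolver(host)                # cache miss: real lookup
        self._entries[host] = (ip, now + self._ttl)
        return ip


# Usage with a stub resolver, so the example runs offline.
lookups = []

def stub_resolver(host):
    lookups.append(host)
    return "93.184.216.34"

cache = DnsCache(ttl_seconds=300, resolver=stub_resolver)
cache.resolve("example.com")
cache.resolve("example.com")
print(len(lookups))  # only the first call reached the resolver
```

Pre-resolution can reuse the same cache: a background task calls `resolve` for hosts near the head of the frontier so the fetcher usually hits a warm entry.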
Parser / Link Extractor
- Parse HTML using robust parser (handles malformed HTML)
- Extract: links, text content, metadata, base tag
- Handle JavaScript-rendered pages via headless browser (expensive) or only crawl server-rendered HTML
Event Bus Design (Kafka)
Topic: web_crawler-events
Partitions: 64 (scale consumers horizontally)
Partition key: entity_id (user_id / order_id; preserves per-entity ordering)
Retention: 7 days (compliance) or 24h (high-volume telemetry)
Replication factor: 3, min.insync.replicas: 2
Producer: idempotent producer enabled (enable.idempotence=true)
Consumer: consumer group "web_crawler-processors"
- At-least-once delivery + idempotent handlers (dedup by event_id)
- DLQ topic: web_crawler-events-dlq (poison messages after 3 retries)
- Lag alert: consumer lag > 60s → scale workers
Async side effects MUST NOT block the synchronous API response.
Sync path: validate → persist source of truth → publish event → return 201
Async path: consumers update caches, indexes, notifications, aggregates
Add Seed URLs
POST /api/v1/crawler/seeds
{
"urls": ["https://example.com", "https://news.ycombinator.com"],
"priority": "high"
}
Get Crawl Status
GET /api/v1/crawler/status
Response: 200 OK
{
"pages_crawled_today": 892345678,
"pages_in_frontier": 5234567890,
"crawl_rate_per_sec": 11500,
"active_workers": 980,
"errors_today": 12345
}
Block Domain
POST /api/v1/crawler/block
{
"domain": "spam-site.com",
"reason": "spam"
}
Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff
202 Accepted: job queued; poll GET /jobs/{id} for status
408 Request Timeout: job still processing; continue polling
URL Frontier Entry
{
url: "https://example.com/page/123",
normalized_url: "https://example.com/page/123",
domain: "example.com",
priority: 2,
discovered_at: "2026-03-13T10:00:00Z",
last_crawled_at: "2026-03-12T08:00:00Z",
recrawl_interval: 86400, // seconds
retries: 0,
depth: 3, // hops from seed URL
referrer_url: "https://example.com/"
}
Content Store (S3 / GFS)
Bucket: crawler-content
Key: {content_hash}
Value: {
url: "https://example.com/page/123",
content_type: "text/html",
status_code: 200,
headers: {...},
body: "<html>...",
crawled_at: "2026-03-13T10:00:00Z",
content_hash: "sha256:...",
simhash: "0xA3F2B1C4D5E6F7A8"
}
Robots.txt Cache (Redis)
Key: robots:{domain}
Value: {
rules: [
{user_agent: "*", disallow: ["/admin", "/private"]},
{user_agent: "Googlebot", allow: ["/api"]}
],
crawl_delay: 2,
sitemaps: ["https://example.com/sitemap.xml"],
fetched_at: "2026-03-13T00:00:00Z"
}
TTL: 86400 (24 hours)
Bloom Filter: Seen URLs
Type: Distributed Bloom Filter (Redis-backed or custom)
Capacity: 100 billion URLs
FPR: 1%
Size: ~120 GB (~9.6 bits per element × 100B elements)
Hash functions: 7
| Concern | Solution |
|---|---|
| Worker crash | URL remains in frontier (not removed until crawl confirmed). Reassigned to another worker |
| DNS failure | Retry with exponential backoff; fallback to secondary DNS |
| HTTP timeout | Retry up to 3 times with backoff; then mark URL as failed and deprioritize |
| Content store failure | S3 with 11 nines durability + cross-region replication |
| Frontier data loss | Checkpoint frontier to disk periodically; rebuild from content store's crawled URLs |
| Bloom filter loss | Rebuild from frontier + content store URLs (takes hours but possible) |
Spider Traps and Infinite Loops
- Defenses:
  - Max URL depth (stop after depth 15)
  - Max URLs per domain (cap at 1M per cycle)
  - URL pattern detection (if more than 1,000 URLs match the same pattern, stop following it)
  - Maximum page size (skip pages larger than 10 MB)
  - Content hash dedup (if the same content appears at different URLs, stop following)
Recrawl Strategy
- News sites: Every 15 minutes
- E-commerce (prices): Every few hours
- Wikipedia/docs: Every few days
- Static pages: Weekly or monthly
- Adaptive recrawl: Track how often a page changes → adjust recrawl interval dynamically. If page changed: halve interval (capped at MIN). If unchanged: double interval (capped at MAX).
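The adaptive rule above can be sketched as a single function. The `MIN_INTERVAL` and `MAX_INTERVAL` bounds below are assumptions chosen to match the 15-minute and monthly extremes listed in this section.

```python
MIN_INTERVAL = 15 * 60             # 15 minutes, in seconds (assumed floor)
MAX_INTERVAL = 30 * 24 * 3600      # 30 days, in seconds (assumed ceiling)


def next_recrawl_interval(current_interval: int, changed: bool) -> int:
    """Halve the interval if the page changed, double it otherwise."""
    if changed:
        return max(MIN_INTERVAL, current_interval // 2)
    return min(MAX_INTERVAL, current_interval * 2)


interval = 86_400  # start at one day
for changed in (True, True, False):
    interval = next_recrawl_interval(interval, changed)
print(interval)  # one day -> 12h -> 6h -> 12h
```

Change detection itself can reuse the stored `content_hash`: if the hash of the new fetch matches the previous one, the page counts as unchanged.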
Distributed Architecture
- Master-Worker: Master assigns URL batches to workers; workers fetch and report results
- Alternative (Masterless): Each worker manages its own set of domains via consistent hashing on domain name. No single point of failure. Workers coordinate via message queue (Kafka).
Legal and Ethical Considerations
- robots.txt MUST be respected
- Crawl-delay: Honor the specified delay between requests
- noindex meta tag: Don't index pages with noindex
- Terms of service: Some sites prohibit crawling beyond robots.txt
- Personal data: Don't store PII discovered during crawling (GDPR)
Sitemap Processing
- Parse sitemap.xml for a curated list of a site's pages
- Priority and changefreq hints from sitemap help with scheduling
- Sitemaps can reference other sitemaps (sitemap index files)
Interview Walkthrough
- Start with the URL frontier as the central scheduler — it decides what to crawl next while enforcing politeness constraints.
- Explain the two-tier frontier: priority queues (importance) feeding per-domain back queues (rate limiting) — this is the pattern interviewers probe.
- Use a Bloom filter for visited-URL deduplication to keep the frontier memory-efficient at billions of URLs.
- Describe worker distribution: master-worker for simplicity or consistent-hashing workers by domain for fault tolerance.
- Cover recrawl scheduling — adaptive intervals based on page change frequency, not a fixed global timer.
- Emphasize robots.txt and crawl-delay compliance; ignoring politeness gets your crawler blocked and is an instant red flag.
- Quantify throughput with Back-of-the-Envelope Estimation: pages/sec × average page size = bandwidth and storage needs.
- Common pitfall: a global BFS queue with no per-domain throttle — you hammer one host and trigger IP bans.
URL Frontier: Priority + Politeness Architecture
The URL Frontier acts as the central scheduler and is built as a two-tier system: **Front Queues (Priority)** and **Back Queues (Politeness)**. Together they balance finding the most important pages early while respecting domain rate limits.
1. Front Queues (Priority Allocation):
- URLs are classified into multiple priority levels from critical to low:
  - Queue P1 (Critical): e.g., [google.com/news, bbc.com/latest]
  - Queue P2 (High): e.g., [wikipedia.org/..., nytimes.com/]
  - Queue P3 (Normal): e.g., [example.com/about, blog.io/post]
  - Queue P4 (Low): e.g., [random-site.xyz/page42]
- Priority Assignment Signals:
  - PageRank: Pages on domains with higher PageRank get prioritised first.
  - Freshness Need: Frequently updated directories (e.g. news sites) are scheduled on higher-priority tiers.
  - Sitemap Hints: Explicit `<priority>` hints in sitemaps (e.g. `0.8` high vs `0.2` low).
  - Historical Change Frequency: If historical scans show a page mutates daily, its priority is increased.
- Selection Process: The prioritizer selects URLs using a weighted random probability distribution: P1: 40% chance, P2: 30% chance, P3: 20% chance, and P4: 10% chance.
2. Back Queues (Politeness Rate Limiting):
- To avoid crashing web servers (Denial-of-Service), the crawler maintains exactly **one FIFO queue per target domain**.
- Each domain queue tracks performance states:
  - Queue [google.com]: Last fetch timestamp, and active `crawl_delay` (e.g. 2s) → Next crawl window is calculated dynamically.
  - Queue [wikipedia.org]: Long crawl delay (e.g. 5s) → next fetch scheduled 5s after the last successful download.
- Worker Loop: Worker threads scan the back queues to find domains where `next_fetch_time ≤ now()`, dequeue one URL, issue the HTTP fetch, and set the next available window.
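A minimal sketch of the back-queue tier, under two stated simplifications: a min-heap keyed by `next_fetch_time` replaces the linear scan over domains, and the next window is set at dequeue time rather than after the fetch completes. The `BackQueues` class and its method names are hypothetical.

```python
import heapq
from collections import deque


class BackQueues:
    """One FIFO queue per domain plus a heap of (next_fetch_time, domain)."""

    def __init__(self, default_delay: float = 1.0):
        self._default_delay = default_delay
        self._crawl_delay = {}    # domain -> seconds (e.g. from robots.txt)
        self._queues = {}         # domain -> deque of URLs
        self._next_allowed = {}   # domain -> earliest time of next fetch
        self._ready = []          # min-heap of (next_fetch_time, domain)

    def set_crawl_delay(self, domain: str, seconds: float) -> None:
        self._crawl_delay[domain] = seconds

    def add(self, domain: str, url: str, now: float) -> None:
        if domain not in self._queues:
            self._queues[domain] = deque()
            # Respect the cool-down even if the queue was empty for a while.
            eligible_at = max(now, self._next_allowed.get(domain, now))
            heapq.heappush(self._ready, (eligible_at, domain))
        self._queues[domain].append(url)

    def next_url(self, now: float):
        """Return a URL whose domain is eligible now, or None."""
        if not self._ready or self._ready[0][0] > now:
            return None
        _, domain = heapq.heappop(self._ready)
        queue = self._queues[domain]
        url = queue.popleft()
        delay = self._crawl_delay.get(domain, self._default_delay)
        self._next_allowed[domain] = now + delay
        if queue:
            heapq.heappush(self._ready, (now + delay, domain))
        else:
            del self._queues[domain]
        return url


back = BackQueues(default_delay=1.0)
back.set_crawl_delay("wikipedia.org", 5.0)
back.add("wikipedia.org", "https://wikipedia.org/wiki/A", now=0.0)
back.add("wikipedia.org", "https://wikipedia.org/wiki/B", now=0.0)
print(back.next_url(now=0.0))  # first URL is eligible immediately
print(back.next_url(now=1.0))  # None: the 5s crawl delay has not elapsed
print(back.next_url(now=5.0))  # second URL
```

With the heap, finding the next eligible domain costs O(log n) instead of a scan over every domain queue.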
Crawl Execution Flow: One URL End-to-End
Processing a single target URL (e.g., https://example.com/products/shoes) follows a strictly ordered sequence of steps; the approximate latency of each step is shown in parentheses:
- Step 1: Dequeue from Frontier (0.1 ms)
  Checks eligibility of the back queue (e.g., [example.com]). If `last_fetch` was 3s ago and `crawl_delay` is 2s, the queue is eligible. Pops the URL from the queue.
- Step 2: DNS Resolution (1 ms: Cached)
  Checks the local DNS cache for the host IP (e.g., `example.com → 93.184.216.34`). On a cache miss, queries recursive DNS resolvers and updates the cache with a standard TTL (e.g., 300s).
  Optimization: Custom batch DNS prefetching executes in the background for upcoming back-queue items.
- Step 3: Robots.txt Constraint Validation (0.1 ms: Cached)
  Fetches cached robots rules from Redis (e.g., key `robots:example.com`). If the URL matches a block pattern (e.g. `Disallow: /products/shoes`) for our user agent, the crawler drops the URL instantly, logs a rejection, and moves to the next candidate.
- Step 4: HTTP Fetch (200 ms: Network)
  Leases a connection from the HTTP client pool for the domain. Transmits a `GET` request declaring a distinct User-Agent: `User-Agent: MyCrawler/1.0 (+https://mycrawler.com/about)`
  Follows HTTP redirects (301, 302, 307) up to 5 levels max. Imposes a 30-second absolute timeout and caps response sizes at 10 MB to prevent crawler bloat.
- Step 5: Content Deduplication (1 ms)
  Computes the content hash and the SimHash of the downloaded body. The exact hash is checked against the seen-content Bloom filter, and the SimHash is compared against the SimHash table. If the content is an exact match, or the Hamming distance to a stored SimHash is ≤ 3 (near-duplicate), the page is skipped to prevent indexing duplicate content. Otherwise, both fingerprints are recorded and processing continues.
- Step 6: Parsing and Link Extraction (5 ms)
  Runs robust HTML parsers to pull structured headers, meta tags, and all child anchor links (`<a href="...">`) along with JSON-LD micro-data and language parameters.
- Step 7: Distributing Storage (10 ms)
  Persists the raw HTML, headers, URL, and time metadata into GFS/S3 under a unique key generated by the page content hash.
  Write to S3: `{ url, content, headers, crawled_at, content_hash }` (deduplication at storage level using the `content_hash` key).
- Step 8: Discovered URLs Ingestion (1 ms)
  Each extracted child URL is normalized, checked against the global seen-URLs Bloom filter, and added to the frontier with a calculated priority score if it is brand new.
  Trap Protection: Increments `crawled:{domain}` in Redis. If a domain spans more than 1 million pages, crawling for that domain is halted to avoid infinite folder traps.
- Step 9: Release & Back Queue Cooling
  Sets `last_fetch_time = now()` for `example.com`, locking the back queue from worker dequeues for the duration of the crawl delay.
Timeline Summary: Total time per URL: ~220 ms (dominated by network I/O). A single thread achieves ~4.5 URLs/sec. To reach 1 billion pages per day (~11,500 pages/sec), we need roughly 2,500 concurrent fetch workers (11,500 ÷ 4.5 ≈ 2,550).
Distributed Coordination: Domain-Sharded Architecture
In a large-scale system with 1,000+ crawler workers, coordinating which thread crawls which domain is critical to prevent politeness rate violations and high locking overheads.
The Shared Frontier Concurrency Problem:
- If multiple workers query a single centralized queue of URLs, they might fetch from the same domain at the exact same moment.
- This issues sudden spike loads to web hosts, violating politeness agreements and generating massive distributed lock contention across workers.
The Consistent Hashing Solution:
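- We partition domains deterministically across workers by hashing the domain name. The simplest form is modulo sharding: `domain_shard = hash(domain) % num_workers`. Plain modulo remaps most domains whenever `num_workers` changes; a consistent-hash ring limits reassignment to roughly 1/N of the domains when a worker joins or leaves.
- For instance:
  - Worker 0: assigned to [google.com, stackoverflow.com, ...]
  - Worker 1: assigned to [wikipedia.org, amazon.com, ...]
  - ...
  - Worker 999: assigned to domains hashing to 999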
- Each worker maintains its own local URL frontier and Back Queues for its assigned subset of domains.
- Zero-Lock Coordination: Because a domain is only ever processed by a single writer node, no distributed locks or inter-process rate limiters are required to guarantee politeness!
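A minimal consistent-hash ring sketch for assigning domains to workers. The `HashRing` class, the 100 virtual nodes per worker, and the MD5-based hash are illustrative assumptions. The property to notice is that adding a worker only moves domains onto the new worker; every other assignment stays put.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, workers, vnodes: int = 100):
        # Each worker owns `vnodes` points on the ring to even out load.
        self._ring = sorted(
            (_hash(f"{worker}#{i}"), worker)
            for worker in workers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def worker_for(self, domain: str) -> str:
        # First ring point clockwise from the domain's hash (with wraparound).
        idx = bisect.bisect(self._points, _hash(domain)) % len(self._ring)
        return self._ring[idx][1]


domains = [f"site{i}.example" for i in range(1000)]
before = HashRing([f"worker-{i}" for i in range(10)])
after = HashRing([f"worker-{i}" for i in range(11)])

moved = [d for d in domains if before.worker_for(d) != after.worker_for(d)]
print(f"{len(moved)} of {len(domains)} domains moved after adding a worker")
```

With modulo sharding, going from 10 to 11 workers would reassign roughly 90% of domains, and each reassignment means handing over that domain's back queue and last-fetch timestamp.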
Data Ingestion and Failover via Kafka:
- Ingestion Pipeline: When any worker discovers new URLs, it publishes them to the Kafka topic `discovered-urls` with `key = domain`. Kafka hashes the key to a partition, so every URL for a given domain reaches the worker responsible for that domain.
- The worker consumes URLs for its assigned domains, checks the shared seen-URLs Bloom filter (Redis-backed or local with periodic sync), and adds each URL to the local frontier if it is brand new.
- Failover and Rebalancing: If a worker (e.g. Worker 42) crashes, the Kafka consumer group triggers a rebalance. Another worker inherits its partition, retrieves the last crawl state and timestamps from Redis, and resumes crawling that domain's queue within 30 seconds.
- Mitigating Hot Domains: If a massive site (like `wikipedia.org`) overloads a single worker, we split the domain into paths (e.g. `wikipedia.org/wiki/A-M` vs `wikipedia.org/wiki/N-Z`) and hash the path prefixes as separate domains. Each path shard must then be given a fraction of the domain's rate budget so the combined request rate still respects the crawl delay.
Advantages over Centralized Frontier:
- ✓ No Single Point of Failure: Masterless architecture ensures there is no centralized coordinator/master node to fail.
- ✓ Linear Scalability: Adding more workers/machines dynamically scales the number of domains handled.
- ✓ Politeness by Design: Guaranteed by a single writer per domain.
- ✗ Load Imbalances (Hot Domains): Hot domains (e.g., `google.com`) are assigned to one worker, creating uneven loads. Mitigated by path-based sub-crawler hashing.
BFS vs DFS vs Priority-Based Crawling
Selecting the path traversal model changes which pages are crawled first and impacts memory consumption:
- Breadth-First Search (BFS):
- Crawls all root seeds, then all depth-1 pages, followed by depth-2 pages.
- ✓ Pros: Discovers crucial landing pages early (homepage → navigation sections).
- ✗ Cons: Tends to waste valuable network bandwidth on low-value pages at the same depth level, and cannot prioritize important content.
- Depth-First Search (DFS):
- Follows links straight down a site structure before backtracking.
- ✓ Pros: Low memory consumption since only the active page traversal path is stored.
- ✗ Cons: Easily trapped in deep recursive page paths, and takes a long time to discover major sections of other domains.
- Priority-Based Crawling (⭐ Recommended):
- Scores each URL based on relative importance and crawls the highest-scored URLs first.
- Scoring Equation: `Score = α × PageRank(domain) + β × depth_penalty + γ × freshness_need`
- ✓ Pros: Guarantees the highest-value internet pages are discovered and cached first.
- ✓ Adaptive Recrawling: Can reprioritize dynamic schedulers based on real-time change discovery.
- ✗ Cons: Requires managing complex distributed priority queues.
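The scoring equation can be sketched as a small function. Everything numeric here is an assumption for illustration: the weights, the normalization of inputs to the range 0 to 1, and the definition of `depth_penalty` as a value from 0 (seed) down to -1 (at `MAX_DEPTH`).

```python
MAX_DEPTH = 15  # assumed cap, matching the depth limit used for trap defense


def priority_score(domain_pagerank: float, depth: int, freshness_need: float,
                   alpha: float = 0.6, beta: float = 0.2,
                   gamma: float = 0.2) -> float:
    """Score = alpha * PageRank + beta * depth_penalty + gamma * freshness.

    domain_pagerank and freshness_need are assumed normalized to [0, 1].
    depth_penalty is 0 for seeds and approaches -1 at MAX_DEPTH.
    """
    depth_penalty = -min(depth, MAX_DEPTH) / MAX_DEPTH
    return (alpha * domain_pagerank
            + beta * depth_penalty
            + gamma * freshness_need)


shallow = priority_score(domain_pagerank=0.9, depth=1, freshness_need=0.8)
deep = priority_score(domain_pagerank=0.9, depth=10, freshness_need=0.8)
print(shallow > deep)  # deeper pages on the same domain score lower
```

The resulting score can be bucketed into the front-queue priority levels rather than used as a globally sorted key, which avoids maintaining one giant distributed priority queue.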
Practical Hybrid Implementation: Start with BFS from seeds to quickly map a website's core architecture, switch to a priority-based scheduler for daily operations (focusing on important pages), and use DFS locally to walk sitemap hierarchies efficiently.
URL Normalization: Why It Matters
Websites often serve the exact same page content across multiple distinct URL formats. Normalization prevents wasting network and storage bandwidth on redundant crawls.
The Redundancy Problem:
Without normalization, the following 7 URLs would be treated as completely different entities, causing the crawler to fetch the same page 7 separate times (wasted bandwidth):
- `https://example.com/page` (Canonical template)
- `https://Example.COM/page` → lowercase host
- `https://example.com/page/` → trailing slash removal
- `https://example.com/page?a=1&b=2` → query parameter sorting
- `https://example.com/page?b=2&a=1` → same params, different order
- `http://example.com/page` → scheme normalization
- `https://example.com/./page/../page` → relative path segment resolution
Normalization Rules (Applied before Seen Bloom Filter check):
- Lowercase the scheme and host components.
- Remove default protocol ports (e.g. `:80` for HTTP and `:443` for HTTPS).
- Strip trailing slashes (except for root domains).
- Sort all query parameters alphabetically.
- Remove tracking parameters (e.g., `utm_source`, `fbclid`, `gclid`).
- Resolve path references (e.g., collapse `/./` and `/../` segments).
- Decode unreserved percent-encoded hex codes (e.g. `%41` → `A`).
- Extract and honor canonical tags (`<link rel="canonical" href="...">`) where available.
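A sketch of most of these rules using only `urllib.parse` and `posixpath` from the standard library. It is deliberately partial: it does not unify `http` and `https`, does not decode percent-encoded path characters, and cannot honor canonical tags (those require the fetched HTML). The `TRACKING_PARAMS` set and the `normalize_url` name are illustrative.

```python
import posixpath
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()      # hostname drops the port
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"          # keep only non-default ports

    # Resolve "." and ".." segments; normpath also strips a trailing slash.
    path = posixpath.normpath(parts.path) if parts.path else "/"

    # Drop tracking parameters, then sort the rest for a stable order.
    params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k not in TRACKING_PARAMS]
    query = urlencode(sorted(params))

    return urlunsplit((scheme, host, path, query, ""))  # "" drops fragment


variants = [
    "https://Example.COM/page",
    "https://example.com/page/",
    "https://example.com:443/page",
    "https://example.com/./page/../page",
    "https://example.com/page?utm_source=newsletter#section",
]
print({normalize_url(u) for u in variants})  # a single normalized URL
```

Normalization must run before the seen-URL Bloom filter check; otherwise each variant hashes to different bits and is crawled separately.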
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1 — MVP (single-process BFS)
Single crawler process, in-memory URL queue, SQLite for seen URLs. Sequential fetch with 1 req/sec global rate. No robots.txt (allow-all). Serves discovery for 1M pages.
Key components: Single process · In-memory queue · SQLite dedup · Basic HTTP fetcher
Move to next phase when: Queue exceeds memory; need parallel fetch; sites start blocking (403 rate >10%)
Phase 2 — Distributed with politeness
100 worker nodes. Redis-backed URL frontier partitioned by domain. Per-domain rate limiter (token bucket). robots.txt cache. DNS cache layer. SimHash near-dedup. Kafka output to indexer pipeline.
Key components: Worker fleet · Redis frontier · Rate limiter · robots.txt cache · SimHash dedup · Kafka
Move to next phase when: 1B pages/month target; 100M domains; need freshness scheduling
Phase 3 — Priority crawl at scale
Dual frontier: discovery (BFS) + freshness (re-crawl by PageRank/freshness score). Domain-level crawl budget allocator. Headless renderer for top-1M JS-heavy pages (separate slow lane). Global DNS resolver fleet. Crawl metadata DB for index pipeline coordination.
Key components: Dual frontier · Budget allocator · JS render lane · DNS fleet · Crawl metadata DB
Move to next phase when: Index pipeline backpressure; or adversarial crawl trap at scale
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Crawl throughput | 400 pages/sec sustained | 1B pages/month budget |
| Politeness violation rate | 0% | One violation can get entire IP range blocked |
| robots.txt compliance | 100% | Legal and ethical requirement |
| Near-dup detection recall | > 95% | Duplicates waste index storage and degrade search quality |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| IP range blocked by major domain (Amazon, Google) | 403 rate for domain >50%; zero successful fetches for 30 min | Immediate stop crawling that domain; rotate to backup IP pool; review recent rate changes; contact domain via robots.txt contact; resume at 0.1 req/sec after 24h |
| Crawler trap domain consumes 50% of crawl budget | Single domain >500K pages queued; pages/sec for other domains drops | Emergency domain cap (1000 pages); add URL pattern to trap detector; purge frontier entries for domain; backfill budget to other domains |
| DNS resolver outage causes cascade of fetch failures | DNS cache miss rate 100%; fetch error rate >90%; all workers idle | Failover to secondary resolver pool; extend DNS cache TTL to 24h temporarily; serve from stale cache; scale resolver fleet |
Cost Drivers (Staff lens)
- Download bandwidth: 400 pages/sec × 50 KB = 20 MB/sec ≈ 52 TB/month
- Worker fleet: 100 nodes × 4 CPU (mostly I/O wait) — not CPU-bound
- Frontier storage: 10B URLs × 200 bytes = 2 TB Redis/Cassandra
- SimHash + seen-URL store: 1B entries × 100 bytes ≈ 100 GB Cassandra
Multi-Region & DR
Crawl workers in multiple regions for geo-distributed content and IP diversity. URL frontier is global (single source of truth) — domain partition ensures no duplicate fetch across regions. robots.txt cache replicated globally. DNS cache regional (TTL respected). Output (page content + metadata) streamed to indexer via Kafka with region tag.