Advanced Java Web Crawler Techniques: Distributed Crawling and Rate Limiting
1) High-level architecture (recommended components)
- URL Frontier (Kafka / RabbitMQ): partition by hostname to keep host URLs together.
- Scheduler: enforces politeness, dequeues from frontier, pushes allowed jobs to downloaders.
- Rate limiter (Redis): fast host-level state and atomic checks (Lua scripts).
- Downloader workers (Java): HTTP client pool, retry/backoff, respect If-Modified-Since/ETag.
- Parser & Extractor: HTML parser (jsoup), link normalization, content hashing for dedupe.
- Metadata store (Cassandra / RocksDB): URL state, last-crawl, ETag, status.
- Object storage (S3): raw payloads, large assets.
- Monitoring & DLQ: metrics, per-host error tracking, dead-letter queue for persistent failures.
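One concrete piece of the downloader component above is building conditional requests so revalidation can return 304 Not Modified instead of a full body. A minimal sketch using `java.net.http` — the `etag` and `lastModified` values stand in for hypothetical cached metadata from the metadata store, and the User-Agent string is an example:

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

public class ConditionalFetch {
    // Build a revalidation request: send the cached ETag and last-fetch
    // time so the server can answer 304 Not Modified instead of the body.
    // The etag / lastModified arguments are hypothetical cached metadata.
    static HttpRequest buildRequest(String url, String etag, String lastModified) {
        HttpRequest.Builder b = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .header("User-Agent", "MyCrawler/1.0 (+https://example.com/bot)");
        if (etag != null) b.header("If-None-Match", etag);
        if (lastModified != null) b.header("If-Modified-Since", lastModified);
        return b.build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildRequest("https://example.com/page",
                "\"abc123\"", "Wed, 01 Jan 2025 00:00:00 GMT");
        System.out.println(req.headers().firstValue("If-None-Match").orElse(""));
        // A pooled HttpClient (HTTP/2, shared executor) would then send it and
        // branch on 304 vs 200 before handing the payload to the parser.
    }
}
```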
2) Host-based partitioning (why and how)
- Partition the frontier by hostname hash so one scheduler instance serializes a host’s URLs.
- Benefits: simplifies distributed politeness, reduces cross-node coordination, improves cache locality for robots.txt and last-crawl checks.
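The partitioning rule itself is a one-liner: hash the hostname, not the full URL, so every URL for a host maps to the same frontier partition (with Kafka, passing the hostname as the record key achieves the same thing via the default partitioner). A minimal sketch — `partitionFor` is an illustrative helper, not a library API:

```java
import java.net.URI;

public class HostPartitioner {
    // Map a URL to a frontier partition by hashing its hostname only,
    // so all URLs for one host land on the same partition and a single
    // scheduler instance can serialize that host's fetches.
    static int partitionFor(String url, int numPartitions) {
        String host = URI.create(url).getHost().toLowerCase();
        return Math.floorMod(host.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int p1 = partitionFor("https://example.com/a", 16);
        int p2 = partitionFor("https://example.com/b?x=1", 16);
        System.out.println(p1 == p2); // same host -> same partition: true
    }
}
```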
3) Rate limiting & politeness strategies
- Per-host delays: honor robots.txt crawl-delay and default minimum (e.g., 1–2s).
- Atomic check-and-set: use Redis Lua script to read last-access and update if allowed (prevents races).
- Token-bucket / leaky-bucket: for smoother throughput where hosts allow bursts—keep per-host token buckets in Redis.
- Jitter: add small random delays to avoid thundering-herd when windows open.
- Backoff on errors: exponential backoff for 5xx responses and timeouts; apply longer penalties for 429/403 responses.
- Global politeness: per IP/proxy quotas to avoid overloading the proxy or getting IP blocks.
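The token-bucket variant can be sketched in plain Java. In a distributed deployment this per-host state would live in Redis and the acquire step would run as a Lua script so the read-refill-decrement sequence is atomic across workers; the in-process version below (class and method names are illustrative) shows the same algorithm:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HostTokenBucket {
    // Per-host token bucket: up to `capacity` burst tokens, refilled at
    // ratePerSec. In a distributed crawler this state lives in Redis and
    // tryAcquire runs as a Lua script for atomicity; this is the
    // single-process analogue.
    private static final class Bucket {
        double tokens; long lastRefillNanos;
        Bucket(double tokens, long now) { this.tokens = tokens; this.lastRefillNanos = now; }
    }

    private final double capacity, ratePerSec;
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    public HostTokenBucket(double capacity, double ratePerSec) {
        this.capacity = capacity; this.ratePerSec = ratePerSec;
    }

    public synchronized boolean tryAcquire(String host) {
        long now = System.nanoTime();
        Bucket b = buckets.computeIfAbsent(host, h -> new Bucket(capacity, now));
        // Refill proportionally to elapsed time, capped at capacity.
        b.tokens = Math.min(capacity, b.tokens + (now - b.lastRefillNanos) / 1e9 * ratePerSec);
        b.lastRefillNanos = now;
        if (b.tokens >= 1.0) { b.tokens -= 1.0; return true; }
        return false; // caller should requeue with delay (plus jitter)
    }

    public static void main(String[] args) {
        HostTokenBucket limiter = new HostTokenBucket(2, 1); // burst of 2, 1 req/s refill
        System.out.println(limiter.tryAcquire("example.com")); // true
        System.out.println(limiter.tryAcquire("example.com")); // true
        System.out.println(limiter.tryAcquire("example.com")); // false: bucket drained
        System.out.println(limiter.tryAcquire("other.org"));   // true: independent bucket
    }
}
```

A denied acquire should send the job back to the delay queue with jitter rather than spin, so workers are never blocked waiting on a slow host.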
4) Implementation notes in Java
- HTTP client: use HttpClient (java.net.http) or Apache HttpClient with connection pooling, timeouts, and HTTP/2 support.
- robots.txt: fetch once per domain, cache in Redis with validation via If-Modified-Since. Use a robust parser (or implement standard rules + crawl-delay).
- Concurrency: downloader thread pools + async non-blocking IO for scale.
- Deduplication: store URL hash (SHA-256) and content fingerprint (SimHash/MD5) in metadata DB.
- Scheduling: the scheduler checks the Redis atomic lock; if the host is still inside its delay window, requeue the job with a delay or place it in a delay queue (a Kafka delay topic, or a Redis sorted set keyed by next-allowed-fetch timestamp).
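The delay-queue idea in the scheduling note can be illustrated in-process with `java.util.concurrent.DelayQueue`: a job stays invisible to `take()` until its ready timestamp has passed, which is the same contract a Redis sorted set keyed by timestamp provides across nodes. A minimal sketch (the `Job` record is an illustrative type, not from any library):

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

public class CrawlDelayQueue {
    // A delayed crawl job: hidden from poll()/take() until readyAtMillis
    // has passed -- the in-process analogue of a Redis sorted set keyed
    // by next-allowed-fetch timestamp.
    record Job(String url, long readyAtMillis) implements Delayed {
        public long getDelay(TimeUnit unit) {
            return unit.convert(readyAtMillis - System.currentTimeMillis(),
                    TimeUnit.MILLISECONDS);
        }
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.MILLISECONDS),
                    other.getDelay(TimeUnit.MILLISECONDS));
        }
    }

    public static void main(String[] args) throws InterruptedException {
        DelayQueue<Job> queue = new DelayQueue<>();
        long now = System.currentTimeMillis();
        queue.put(new Job("https://example.com/slow", now + 200)); // politeness delay pending
        queue.put(new Job("https://example.com/now", now));        // ready immediately
        System.out.println(queue.take().url()); // the ready job comes out first
        System.out.println(queue.take().url()); // blocks ~200 ms, then the delayed job
    }
}
```

In the distributed version, workers would instead poll the sorted set with `ZRANGEBYSCORE` up to the current timestamp and atomically remove what they claim.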