This is Part 5 of the series Tor Protocol Full Analysis. In Part 4 we covered why Tor + P2P is risky. Here we build a crawler that routes all traffic through Tor, isolates circuits per target, and avoids DNS leaks.
If you missed Part 4, start here: /posts/tor-protocol-part-4
Legal & policy notice: This post is for lawful research and defensive use only. Always follow local laws and the target site's terms/robots policy.
Tor helps when you need source IP diversity or privacy while researching public content. It is not a free pass to be noisy or ignore crawl etiquette. Expect more latency, more blocks, and stricter rate limits.
1) Seeds, normalization, and onion-only guard
The API normalizes incoming URLs, rejects non-onion seeds, and either crawls synchronously or enqueues work.
```python
@app.post("/crawl")
def crawl(request: CrawlRequest, sync: bool = Query(False)) -> dict:
    urls = [normalize_url(url) for url in request.urls]
    non_onion = [url for url in urls if not is_onion_url(url)]
    if non_onion:
        raise HTTPException(status_code=400, detail="Only .onion URLs are allowed")
    if sync:
        return crawl_and_index(urls)
    for url in urls:
        enqueue_urls([url], depth=0, scope_host=scope_host(url))
    # Acknowledge the queued work (response shape assumed; the original snippet is truncated here).
    return {"enqueued": len(urls)}
```
Normalization also strips tracking params and rejects trap-like URLs (excessive length, overly deep paths, query-string abuse).
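The post doesn't show the normalization rules themselves; here is a minimal sketch of what such a guard could look like. The thresholds and the tracking-parameter list are illustrative, not the crawler's actual values.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

# Illustrative values; the real crawler's list and thresholds aren't shown in the post.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content", "fbclid"}

def normalize_url(href: str, base_url: str | None = None) -> str | None:
    """Resolve relative links, lowercase the host, strip tracking params, reject trap-like URLs."""
    if not href:
        return None
    url = urljoin(base_url, href) if base_url else href
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return None
    # Trap heuristics: very long URLs, very deep paths, huge query strings.
    if len(url) > 2000 or parts.path.count("/") > 12 or len(parts.query) > 500:
        return None
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query) if k.lower() not in TRACKING_PARAMS])
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path or "/", query, ""))
```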
2) Queue and scheduler loop
The queue is a SQLite table. The scheduler claims due tasks, marks them in progress, and runs them concurrently.
```python
rows = conn.execute(
    "SELECT url, depth, scope_host FROM crawl_queue "
    "WHERE status = 'pending' AND next_run_at <= ?",
    (now,),
).fetchall()
for row in rows:
    conn.execute(
        "UPDATE crawl_queue SET status = 'in_progress' WHERE url = ?",
        (row["url"],),
    )
```
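The table definition isn't shown in the post. A schema consistent with the queries above might look like the following sketch; the bookkeeping columns beyond those queried (attempts, created_at) are assumptions.

```python
import sqlite3

# Hypothetical schema consistent with the scheduler queries above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS crawl_queue (
    url         TEXT PRIMARY KEY,
    depth       INTEGER NOT NULL DEFAULT 0,
    scope_host  TEXT,
    status      TEXT NOT NULL DEFAULT 'pending',
    next_run_at REAL NOT NULL,
    attempts    INTEGER NOT NULL DEFAULT 0,
    created_at  REAL NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_queue_due ON crawl_queue (status, next_run_at);
"""

def init_db(path: str = "crawler.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row  # lets the scheduler use row["url"]
    conn.executescript(SCHEMA)
    return conn
```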
3) Fetching over Tor (SOCKS)
All HTTP fetches use an HTTPX client created with a SOCKS proxy. Proxy candidates come from TOR_SOCKS_PROXY and TOR_SOCKS_PROXIES.
```python
def _get_proxy_list() -> list[str]:
    return TOR_SOCKS_PROXIES or [TOR_SOCKS_PROXY]
```
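`_build_client` is referenced later but never shown. A minimal sketch, assuming the `httpx[socks]` extra is installed: recent httpx versions take a `proxy=` keyword (older ones use `proxies=`), and whether DNS is resolved through the proxy depends on the proxy URL scheme and your httpx/httpcore version, so verify there is no local DNS resolution before trusting it against leaks. The header values are placeholders.

```python
import httpx

# Hypothetical helper; the post references _build_client but doesn't show its body.
def _build_client(proxy_url: str) -> httpx.Client:
    return httpx.Client(
        proxy=proxy_url,              # e.g. "socks5h://127.0.0.1:9050"
        timeout=httpx.Timeout(60.0),  # onion services are slow; be generous
        headers={"User-Agent": "research-crawler (contact: you@example.com)"},
        follow_redirects=False,       # the fetch loop opts in per request
    )
```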
3b) Proxy pool and failover logic
Each task picks a random primary proxy from the pool, then creates a candidate list that falls back to other proxies if the primary fails.
```python
from random import choice

def _select_proxies() -> list[str]:  # wrapper name is illustrative; the post shows only the body
    proxies = _get_proxy_list()
    primary = choice(proxies)
    return _proxy_candidates(primary, proxies)
```
```python
def _proxy_candidates(primary: str, proxy_list: list[str]) -> list[str]:
    if len(proxy_list) <= 1:
        return [primary]
    return [primary] + [proxy for proxy in proxy_list if proxy != primary]
```
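For example, with a two-proxy pool (addresses are illustrative) the randomly chosen primary stays first and the rest become fallbacks:

```python
>>> _proxy_candidates("socks5://127.0.0.1:9052",
...                   ["socks5://127.0.0.1:9050", "socks5://127.0.0.1:9052"])
['socks5://127.0.0.1:9052', 'socks5://127.0.0.1:9050']
```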
The fetch loop retries retryable HTTP statuses on the same proxy with exponential backoff, then fails over to the next proxy candidate.
```python
retryable_statuses = {429, 500, 502, 503, 504}
for proxy_url in proxy_candidates:
    with _build_client(proxy_url) as client:
        for attempt in range(CRAWL_RETRIES + 1):
            response = client.get(url, follow_redirects=True)
            if response.status_code in retryable_statuses and attempt < CRAWL_RETRIES:
                _retry_sleep(attempt)
                continue
            return response.status_code, ...
```
4) Parsing, discovery, and scope control
Only text/html responses are parsed. Link discovery is optional and scoped by host/allowlist, with depth and per-page caps.
```python
if discover and html and status < 400:
    discovered = _extract_links(html, url, scope, allow_external, allowed_domains, max_links)
```
```python
for node in tree.css("a[href], link[href]"):
    normalized = normalize_url(node.attributes.get("href"), base_url)
    if not _within_scope(normalized, scope, allow_external, allowed_domains):
        continue
```
Robots.txt compliance is enforced with a cached parser.
```python
if respect_robots and not _allowed_by_robots(url, proxy_candidates, robots_ttl):
    return _make_document(url, 0, "", "", "robots disallow")
```
5) Store, index, and search
Each fetch is recorded in SQLite and indexed into Meilisearch with a stable doc_id. Search uses a custom query parser for advanced syntax.
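The indexing step itself isn't shown. A sketch of the stable doc_id idea, assuming the official `meilisearch` Python client; the endpoint, key, index name, and document fields are placeholders.

```python
import hashlib

import meilisearch

client = meilisearch.Client("http://127.0.0.1:7700", "masterKey")  # endpoint and key are placeholders

def doc_id_for(url: str) -> str:
    # A stable doc_id lets re-crawls overwrite the same document instead of duplicating it.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def index_document(url: str, title: str, text: str, status: int) -> None:
    client.index("pages").add_documents([{
        "doc_id": doc_id_for(url),
        "url": url,
        "title": title,
        "text": text,
        "status": status,
    }], primary_key="doc_id")
```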
A single Tor SOCKS proxy can become a bottleneck and is vulnerable to slow or bad exits. Running multiple Tor proxies gives you parallel circuits and better resilience. Envoy provides one stable SOCKS endpoint for the crawler while it load-balances and fails over across the Tor proxy pool.
Tor itself is the bottleneck (roughly 1 MB/s per proxy), so the crawler is I/O bound. The parsing workload is tiny (title and meta tags), so CPU is rarely the limiting factor even at higher bandwidth. That makes Python a good tradeoff for iteration speed and parsing quality. You can add proxies and balance them with Envoy, but there are hard limits on a shared, volunteer-run network.
Crawling images or videos over Tor can carry significant legal risk and is strongly discouraged. Tor crawling may surface higher quality sites than conventional crawling, but it also makes it easier to go deeper into malicious or harmful content. I built this out of curiosity and still chose to stop for that reason.
Part 6: Building a controlled onion service to test your crawler end-to-end. /posts/tor-protocol-part-6