This is Part 5 of the series Tor Protocol Full Analysis. In Part 4 we covered why Tor + P2P is risky. Here we build a crawler that routes all traffic through Tor, isolates circuits per target, and avoids DNS leaks.
If you missed Part 4, start here: /posts/tor-protocol-part-4
Legal & policy notice: This post is for lawful research and defensive use only. Always follow local laws and the target site's terms/robots policy.
Tor helps when you need source IP diversity or privacy while researching public content. It is not a free pass to be noisy or ignore crawl etiquette. Expect more latency, more blocks, and stricter rate limits.
1) Seeds, normalization, and onion-only guard
The API normalizes incoming URLs, rejects non-onion seeds, and either crawls synchronously or enqueues work.
```python
@app.post("/crawl")
def crawl(request: CrawlRequest, sync: bool = Query(False)) -> dict:
    urls = [normalize_url(url) for url in request.urls]
    non_onion = [url for url in urls if not is_onion_url(url)]
    if non_onion:
        raise HTTPException(status_code=400, detail="Only .onion URLs are allowed")
    if sync:
        return crawl_and_index(urls)
    for url in urls:
        enqueue_urls([url], depth=0, scope_host=scope_host(url))
    # Acknowledge the queued work (response shape assumed; the original snippet is truncated here).
    return {"enqueued": len(urls)}
```
Normalization also strips tracking params and rejects trap-like URLs (excessive length, overly deep paths, query-string abuse).
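The post doesn't show the normalization rules themselves; here is a minimal sketch of what such a guard could look like. The thresholds and the tracking-parameter list are illustrative, not the crawler's actual values.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

# Illustrative values; the real crawler's list and thresholds aren't shown in the post.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content", "fbclid"}

def normalize_url(href: str, base_url: str | None = None) -> str | None:
    """Resolve relative links, lowercase the host, strip tracking params, reject trap-like URLs."""
    if not href:
        return None
    url = urljoin(base_url, href) if base_url else href
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return None
    # Trap heuristics: very long URLs, very deep paths, huge query strings.
    if len(url) > 2000 or parts.path.count("/") > 12 or len(parts.query) > 500:
        return None
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query) if k.lower() not in TRACKING_PARAMS])
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path or "/", query, ""))
```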
2) Queue and scheduler loop
The queue is a SQLite table. The scheduler claims due tasks, marks them in progress, and runs them concurrently.
```python
rows = conn.execute(
    "SELECT url, depth, scope_host FROM crawl_queue "
    "WHERE status = 'pending' AND next_run_at <= ?",
    (now,),
).fetchall()
for row in rows:
    conn.execute(
        "UPDATE crawl_queue SET status = 'in_progress' WHERE url = ?",
        (row["url"],),
    )
```
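The table definition isn't shown in the post. A schema consistent with the queries above might look like the following sketch; the bookkeeping columns beyond those queried (attempts, created_at) are assumptions.

```python
import sqlite3

# Hypothetical schema consistent with the scheduler queries above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS crawl_queue (
    url         TEXT PRIMARY KEY,
    depth       INTEGER NOT NULL DEFAULT 0,
    scope_host  TEXT,
    status      TEXT NOT NULL DEFAULT 'pending',
    next_run_at REAL NOT NULL,
    attempts    INTEGER NOT NULL DEFAULT 0,
    created_at  REAL NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_queue_due ON crawl_queue (status, next_run_at);
"""

def init_db(path: str = "crawler.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row  # lets the scheduler use row["url"]
    conn.executescript(SCHEMA)
    return conn
```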
3) Fetching over Tor (SOCKS)
All HTTP fetches use an HTTPX client created with a SOCKS proxy. Proxy candidates come from TOR_SOCKS_PROXY and TOR_SOCKS_PROXIES.
```python
def _get_proxy_list() -> list[str]:
    return TOR_SOCKS_PROXIES or [TOR_SOCKS_PROXY]
```
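`_build_client` is referenced later but never shown. A minimal sketch, assuming the `httpx[socks]` extra is installed: recent httpx versions take a `proxy=` keyword (older ones use `proxies=`), and whether DNS is resolved through the proxy depends on the proxy URL scheme and your httpx/httpcore version, so verify there is no local DNS resolution before trusting it against leaks. The header values are placeholders.

```python
import httpx

# Hypothetical helper; the post references _build_client but doesn't show its body.
def _build_client(proxy_url: str) -> httpx.Client:
    return httpx.Client(
        proxy=proxy_url,              # e.g. "socks5h://127.0.0.1:9050"
        timeout=httpx.Timeout(60.0),  # onion services are slow; be generous
        headers={"User-Agent": "research-crawler (contact: you@example.com)"},
        follow_redirects=False,       # the fetch loop opts in per request
    )
```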
3b) Proxy pool and failover logic
Each task picks a random primary proxy from the pool, then creates a candidate list that falls back to other proxies if the primary fails.
```python
from random import choice

def _select_proxies() -> list[str]:  # wrapper name is illustrative; the post shows only the body
    proxies = _get_proxy_list()
    primary = choice(proxies)
    return _proxy_candidates(primary, proxies)
```
```python
def _proxy_candidates(primary: str, proxy_list: list[str]) -> list[str]:
    if len(proxy_list) <= 1:
        return [primary]
    return [primary] + [proxy for proxy in proxy_list if proxy != primary]
```
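For example, with a two-proxy pool (addresses are illustrative) the randomly chosen primary stays first and the rest become fallbacks:

```python
>>> _proxy_candidates("socks5://127.0.0.1:9052",
...                   ["socks5://127.0.0.1:9050", "socks5://127.0.0.1:9052"])
['socks5://127.0.0.1:9052', 'socks5://127.0.0.1:9050']
```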
The fetch loop retries retryable HTTP statuses on the same proxy with exponential backoff, then fails over to the next proxy candidate.
```python
retryable_statuses = {429, 500, 502, 503, 504}
for proxy_url in proxy_candidates:
    with _build_client(proxy_url) as client:
        for attempt in range(CRAWL_RETRIES + 1):
            response = client.get(url, follow_redirects=True)
            if response.status_code in retryable_statuses and attempt < CRAWL_RETRIES:
                _retry_sleep(attempt)
                continue
            return response.status_code, ...
```
4) Parsing, discovery, and scope control
Only text/html responses are parsed. Link discovery is optional and scoped by host/allowlist, with depth and per-page caps.
```python
if discover and html and status < 400:
    discovered = _extract_links(html, url, scope, allow_external, allowed_domains, max_links)
```
```python
for node in tree.css("a[href], link[href]"):
    normalized = normalize_url(node.attributes.get("href"), base_url)
    if not _within_scope(normalized, scope, allow_external, allowed_domains):
        continue
```
Robots.txt compliance is enforced with a cached parser.
```python
if respect_robots and not _allowed_by_robots(url, proxy_candidates, robots_ttl):
    return _make_document(url, 0, "", "", "robots disallow")
```
5) Store, index, and search
Each fetch is recorded in SQLite and indexed into Meilisearch with a stable doc_id. Search uses a custom query parser for advanced syntax.
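The indexing step itself isn't shown. A sketch of the stable doc_id idea, assuming the official `meilisearch` Python client; the endpoint, key, index name, and document fields are placeholders.

```python
import hashlib

import meilisearch

client = meilisearch.Client("http://127.0.0.1:7700", "masterKey")  # endpoint and key are placeholders

def doc_id_for(url: str) -> str:
    # A stable doc_id lets re-crawls overwrite the same document instead of duplicating it.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def index_document(url: str, title: str, text: str, status: int) -> None:
    client.index("pages").add_documents([{
        "doc_id": doc_id_for(url),
        "url": url,
        "title": title,
        "text": text,
        "status": status,
    }], primary_key="doc_id")
```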
A single Tor SOCKS proxy can become a bottleneck and is vulnerable to slow or bad exits. Running multiple Tor proxies gives you parallel circuits and better resilience. Envoy provides one stable SOCKS endpoint for the crawler while it load-balances and fails over across the Tor proxy pool.
Tor itself is the bottleneck (roughly 1 MB/s per proxy), so the crawler is I/O bound. The parsing workload is tiny (title and meta tags), so CPU is rarely the limiting factor even at higher bandwidth. That makes Python a good tradeoff for iteration speed and parsing quality. You can add proxies and balance them with Envoy, but there are hard limits on a shared, volunteer-run network.
Crawling images or videos over Tor can carry significant legal risk and is strongly discouraged. Tor crawling may surface higher quality sites than conventional crawling, but it also makes it easier to go deeper into malicious or harmful content. I built this out of curiosity and still chose to stop for that reason.
Part 6: Building a controlled onion service to test your crawler end-to-end. /posts/tor-protocol-part-6