HTTP Module¶
Warning: pre-1.0.0 - APIs and contracts may change.
The `wxpath.http` module provides the HTTP client infrastructure.
Submodules¶
| Module | Description |
|---|---|
| crawler | HTTP crawlers (Crawler, BaseCrawler, PlaywrightCrawler) |
| TODO: stats | Crawler statistics |
| TODO: policy | Retry, robots, throttling policies |
Quick Import¶
```python
from wxpath.http.client import Crawler, Request, Response
from wxpath.http.client.cache import get_cache_backend
from wxpath.http.stats import CrawlerStats
```
Architecture¶
```
                      ┌─────────────────┐
                      │  WXPathEngine   │<────────────────────┐
                      └────────┬────────┘                     │
                               │                              │
                               ▼                              │
                      ┌─────────────────┐                     │
                      │     Crawler     │                     │
                      │  (BaseCrawler)  │                     │
                      └────────┬────────┘                     │
                               │                              │
          ┌────────────────────┼────────────────────┐         │
          │                    │                    │         │
          ▼                    ▼                    ▼         │
   ┌─────────────┐      ┌─────────────┐      ┌─────────────┐  │
   │  Throttler  │      │  RobotsTxt  │      │ RetryPolicy │  │
   │             │      │   Policy    │      │             │  │
   └─────────────┘      └─────────────┘      └─────────────┘  │
          │                    │                    │         │
          └────────────────────┼────────────────────┘         │
                               │                              │
                               ▼                              │
                      ┌─────────────────┐                     │
                      │ aiohttp Session │                     │
                      │    (+ cache)    │ ────> Response >────┘
                      └─────────────────┘
```
Crawler Types¶
Crawler (aiohttp)¶
Standard HTTP crawler using aiohttp. Best for most use cases.
```python
from wxpath.http.client import Crawler

crawler = Crawler(
    concurrency=16,       # global limit on concurrent requests
    per_host=4,           # limit on concurrent requests per domain
    respect_robots=True,  # honor robots.txt before fetching
)
```
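As a usage sketch only: the snippet below assumes `Crawler` can be used as an async context manager and exposes a `fetch()` coroutine that takes a `Request` and returns a `Response` (names taken from the Quick Import above); the actual method names, attributes, and lifecycle may differ.

```python
import asyncio

from wxpath.http.client import Crawler, Request

async def main():
    # Assumed API: the context manager owns the aiohttp session, and fetch()
    # routes the request through the robots/throttle/retry policies.
    async with Crawler(concurrency=16, per_host=4, respect_robots=True) as crawler:
        response = await crawler.fetch(Request(url="https://example.com"))
        print(response.status)  # assumed attribute

asyncio.run(main())
```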
MORE TO COME!
Request/Response Flow¶
1. Engine submits `Request` to Crawler
2. Crawler checks robots.txt policy (if enabled)
3. Throttler delays request if needed (if enabled)
4. Request sent via aiohttp session
5. Response cached (if enabled)
6. `Response` returned to engine
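To make the steps concrete, here is an illustrative pipeline sketch (not the wxpath source). The `robots`, `throttler`, and `cache` arguments stand in for the policy objects from the architecture diagram, and their methods (`allowed`, `wait`, `store`) are placeholders:

```python
# Illustrative pipeline only -- not the wxpath implementation.
async def handle(request, session, robots=None, throttler=None, cache=None):
    # Steps 1-2: engine has submitted the Request; robots.txt policy may reject it.
    if robots is not None and not await robots.allowed(request.url):
        raise PermissionError(f"blocked by robots.txt: {request.url}")

    # Step 3: throttler delays the request if the host is being hit too fast.
    if throttler is not None:
        await throttler.wait(request.url)

    # Step 4: request sent via the aiohttp session.
    async with session.get(request.url) as resp:
        body = await resp.read()

    # Steps 5-6: response cached (if enabled) and returned to the engine.
    response = (resp.status, body)
    if cache is not None:
        cache.store(request.url, response)
    return response
```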
Concurrency Control¶
Two-level semaphore system:

- Global semaphore: limits total concurrent requests
- Per-host semaphores: limit concurrent requests per domain
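This is the standard asyncio pattern. A minimal, self-contained sketch (independent of wxpath's internals) of a two-level limiter looks roughly like this, with `concurrency` and `per_host` mirroring the Crawler parameters shown earlier:

```python
import asyncio
from urllib.parse import urlsplit

class TwoLevelLimiter:
    """Caps total in-flight requests and in-flight requests per host."""

    def __init__(self, concurrency: int = 16, per_host: int = 4):
        self._global = asyncio.Semaphore(concurrency)
        self._per_host_limit = per_host
        self._hosts: dict[str, asyncio.Semaphore] = {}

    def _host_sem(self, url: str) -> asyncio.Semaphore:
        # Lazily create one semaphore per domain.
        host = urlsplit(url).netloc
        if host not in self._hosts:
            self._hosts[host] = asyncio.Semaphore(self._per_host_limit)
        return self._hosts[host]

    async def run(self, url: str, coro_factory):
        # Acquire the global slot first, then the per-host slot.
        async with self._global, self._host_sem(url):
            return await coro_factory()
```

Acquiring the global slot before the per-host slot keeps the total cap authoritative even when a single host dominates the queue.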