Engine¶
Warning: pre-1.0.0 - APIs and contracts may change.
This model defines the execution engine, which coordinates crawling and extraction.
Runtime flow:
- Parses wxpath expressions, retrieving an AST.
- Processes the AST by dispatching to operation handlers.
- Gather processing tasks internally known as "intents" from operation handlers. Depending on the intent, the engine may:
- Crawl, (
CrawlIntent) by communicating to the Crawler by enqueuing CrawlTasks. - Extract, (
ExtractIntent) by dispatching to operation handlers that extract data. - Yield, (
DataIntent) by yielding data to the caller. - Process, (
ProcessIntent) by dispatching to operation handlers, effectively bridging XPath and wxpath expressions. - InfiniteCrawl, (
InfiniteCrawlIntent) by enqueuing infinite crawl tasks, which communicates to the engine that it should produce crawl intents AND process intents or extract intents or data intents.
- Crawl, (
This engine evolved from an earlier, monolithic wxpath engine that had the crawler tightly coupled to the engine. While coupling between the crawler and engine still exists, it does so in a much more modular way, by dispatching crawl tasks to the crawler's queue.
Location¶
from wxpath.core.runtime import WXPathEngine
from wxpath.core.runtime.engine import HookedEngineBase
WXPathEngine¶
Main class for executing wxpath expressions.
class WXPathEngine(HookedEngineBase):
def __init__(
self,
crawler: Crawler | None = None,
concurrency: int = 16,
per_host: int = 8,
respect_robots: bool = True,
allowed_response_codes: set[int] = None,
allow_redirects: bool = True
)
Parameters¶
| Name | Type | Default | Description |
|---|---|---|---|
crawler |
Crawler | None | Pre-configured crawler instance |
concurrency |
int | 16 | Global concurrent fetches |
per_host |
int | 8 | Per-host concurrent fetches |
respect_robots |
bool | True | Honor robots.txt directives |
allowed_response_codes |
set[int] | {200} | Accepted HTTP status codes |
allow_redirects |
bool | True | Follow HTTP redirects |
Attributes¶
| Name | Type | Description |
|---|---|---|
seen_urls |
set[str] | URLs already processed (deduplication) |
crawler |
Crawler | HTTP crawler instance |
allowed_response_codes |
set[int] | Accepted status codes |
allow_redirects |
bool | Redirect following enabled |
Methods¶
run¶
async def run(
self,
expression: str,
max_depth: int,
progress: bool = False,
yield_errors: bool = False
) -> AsyncGenerator[Any, None]
Execute a wxpath expression concurrently and yield results.
Parameters:
| Name | Type | Description |
|---|---|---|
expression |
str | wxpath expression to evaluate |
max_depth |
int | Maximum crawl depth |
progress |
bool | Show tqdm progress bar |
yield_errors |
bool | Yield error dicts for failed requests |
Yields: Extracted values (HtmlElement, WxStr, dict, etc.)
Example¶
import asyncio
from wxpath.core.runtime import WXPathEngine
from wxpath.http.client import Crawler
crawler = Crawler(
concurrency=8,
per_host=2,
headers={'User-Agent': 'my-bot/1.0'}
)
engine = WXPathEngine(
crawler=crawler,
allowed_response_codes={200, 301, 302}
)
async def main():
async for item in engine.run("url('https://example.com')//a/@href", max_depth=1):
print(item)
asyncio.run(main())
HookedEngineBase¶
Base class providing hook invocation methods.
class HookedEngineBase:
async def post_fetch_hooks(
self, body: bytes | str, task: CrawlTask
) -> bytes | str | None
async def post_parse_hooks(
self, elem: HtmlElement | None, task: CrawlTask
) -> HtmlElement | None
async def post_extract_hooks(
self, value: Any
) -> Any | None
Hook Methods¶
post_fetch_hooks¶
Run registered post_fetch hooks over a fetched response body.
Parameters:
- body - Raw response body bytes
- task - The CrawlTask that produced the response
Returns: Transformed body, or None if dropped
post_parse_hooks¶
Run registered post_parse hooks on a parsed DOM element.
Parameters:
- elem - Parsed lxml element
- task - The originating CrawlTask
Returns: Transformed element, or None if dropped
post_extract_hooks¶
Run registered post_extract hooks on extracted values.
Parameters:
- value - The extracted datum
Returns: Transformed value, or None if dropped
Execution Flow¶
With the above Hooks system, wxpath's execution flow is as follows:
- Parse expression into segments
- Create seed task with initial URL(s)
- Process tasks via BFS queue:
- Fetch URL
- Run post_fetch hooks
- Parse HTML
- Run post_parse hooks
- Execute remaining segments
- Emit intents (crawl, extract, data)
- Yield extracted data through post_extract hooks
- Continue until queue exhausted or max_depth reached