Hooks (Experimental)¶
Warning: pre-1.0.0 - APIs and contracts may change.
wxpath provides a pluggable hook system for customizing crawl and extraction behavior.
Hook Lifecycle¶
Hooks can intercept the crawl at three points:
- post_fetch - After HTTP response, before parsing
- post_parse - After HTML parsing, before extraction
- post_extract - After value extraction, before yielding
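The three interception points can be pictured as stages that a page passes through in order. The sketch below is a self-contained model and not wxpath's implementation: `run_stages`, `parse`, and `extract` are hypothetical stand-ins, and `TraceHook` is a plain class that is not registered with the library.

```python
class TraceHook:
    """Records the order in which the three hook points fire."""

    def __init__(self):
        self.calls = []

    def post_fetch(self, ctx, html_bytes):
        self.calls.append("post_fetch")
        return html_bytes

    def post_parse(self, ctx, elem):
        self.calls.append("post_parse")
        return elem

    def post_extract(self, value):
        self.calls.append("post_extract")
        return value


def run_stages(hook, ctx, body, parse, extract):
    """Hypothetical driver: fetch -> parse -> extract, with a hook after each."""
    body = hook.post_fetch(ctx, body)
    if body is None:
        return []  # response dropped before parsing
    elem = hook.post_parse(ctx, parse(body))
    if elem is None:
        return []  # parsed document dropped before extraction
    results = []
    for value in extract(elem):
        value = hook.post_extract(value)
        if value is not None:
            results.append(value)
    return results


hook = TraceHook()
values = run_stages(
    hook,
    ctx=None,                            # stand-in for a FetchContext
    body=b"alpha beta",
    parse=lambda b: b.decode("utf-8"),   # stand-in for HTML parsing
    extract=lambda text: text.split(),   # stand-in for value extraction
)
```

In this model `post_fetch` and `post_parse` run once per page, while `post_extract` runs once per extracted value.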
Creating Hooks¶
Basic Hook¶
from wxpath import hooks

@hooks.register
class MyHook:
    def post_fetch(self, ctx, html_bytes):
        """Transform or filter response body."""
        # Return bytes to continue, None to drop
        return html_bytes

    def post_parse(self, ctx, elem):
        """Transform or filter parsed element."""
        # Return element to continue, None to drop
        return elem

    def post_extract(self, value):
        """Transform or filter extracted value."""
        # Return value to yield, None to drop
        return value
Async Hooks¶
from wxpath import hooks

@hooks.register
class AsyncHook:
    async def post_fetch(self, ctx, html_bytes):
        # Async operations supported
        return html_bytes

    async def post_parse(self, ctx, elem):
        return elem

    async def post_extract(self, value):
        return value
Note: All hooks in a project should use the same style (sync or async). Mixing is not supported.
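Because mixed styles are not supported, it can help to check a set of hook classes before registering them. The following is a minimal sketch using only the standard library; `hook_style` and `assert_same_style` are hypothetical helpers, not part of wxpath.

```python
import inspect

HOOK_METHODS = ("post_fetch", "post_parse", "post_extract")


def hook_style(hook_cls):
    """Return 'sync', 'async', 'mixed', or 'none' for one hook class."""
    kinds = {
        "async" if inspect.iscoroutinefunction(getattr(hook_cls, name)) else "sync"
        for name in HOOK_METHODS
        if hasattr(hook_cls, name)
    }
    if not kinds:
        return "none"
    return kinds.pop() if len(kinds) == 1 else "mixed"


def assert_same_style(hook_classes):
    """Raise TypeError if the classes do not all share one style."""
    styles = {hook_style(cls) for cls in hook_classes} - {"none"}
    if "mixed" in styles or len(styles) > 1:
        raise TypeError(f"hooks mix sync and async methods: {sorted(styles)}")
    return styles.pop() if styles else "none"


class SyncHook:
    def post_extract(self, value):
        return value


class AsyncHook:
    async def post_extract(self, value):
        return value


style = assert_same_style([SyncHook])
try:
    assert_same_style([SyncHook, AsyncHook])
except TypeError as exc:
    mixed_error = str(exc)
```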
FetchContext¶
The post_fetch and post_parse methods receive a FetchContext with crawl metadata (post_extract receives only the extracted value):
from wxpath.hooks.registry import FetchContext
# FetchContext fields:
# - url: str - Current URL being processed
# - backlink: str - URL that linked to this page
# - depth: int - Current crawl depth
# - segments: list - Remaining expression segments
# - user_data: dict - Custom data storage
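To exercise a hook in isolation, a stand-in object with the same fields is usually enough. The sketch below assumes only the field names and types listed above; `FakeFetchContext` and `DepthLimit` are hypothetical names, the defaults are assumptions, and the hook is left unregistered so the example runs without wxpath.

```python
from dataclasses import dataclass, field


@dataclass
class FakeFetchContext:
    """Test stand-in mirroring the documented FetchContext fields."""
    url: str
    backlink: str = ""                              # default is an assumption
    depth: int = 0
    segments: list = field(default_factory=list)
    user_data: dict = field(default_factory=dict)


class DepthLimit:
    """Example hook: drop responses fetched deeper than max_depth."""

    def __init__(self, max_depth):
        self.max_depth = max_depth

    def post_fetch(self, ctx, html_bytes):
        return html_bytes if ctx.depth <= self.max_depth else None


shallow = FakeFetchContext(url="https://example.com/", depth=1)
deep = FakeFetchContext(
    url="https://example.com/a/b/c",
    backlink="https://example.com/a/b",
    depth=5,
)

hook = DepthLimit(max_depth=2)
kept = hook.post_fetch(shallow, b"<html></html>")
dropped = hook.post_fetch(deep, b"<html></html>")
```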
Example Hooks¶
Language Filter¶
from wxpath import hooks

@hooks.register
class OnlyEnglish:
    def post_parse(self, ctx, elem):
        # Keep pages whose <html lang> starts with "en" or is not set
        lang = elem.xpath('string(/html/@lang)').lower()[:2]
        return elem if lang in ("en", "") else None
URL Logger¶
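from wxpath import hooks

@hooks.register
class URLLogger:
    def post_fetch(self, ctx, html_bytes):
        # Log the page URL and the URL that linked to it, then pass the body through
        print(f"Fetched: {ctx.url} (from {ctx.backlink})")
        return html_bytes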
Value Transformer¶
from wxpath import hooks

@hooks.register
class CleanText:
    def post_extract(self, value):
        # Strip surrounding whitespace from strings; leave other types untouched
        if isinstance(value, str):
            return value.strip()
        return value
Content Filter¶
from wxpath import hooks

@hooks.register
class SkipErrors:
    def post_fetch(self, ctx, html_bytes):
        # Case-insensitive substring check on the raw response body
        if b'error' in html_bytes.lower():
            return None  # Drop this response
        return html_bytes
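A substring check like this is deliberately blunt: it also drops ordinary pages that merely mention the word. The sketch below calls the hook directly, without registration, to show that behavior, and compares it with a narrower variant that inspects only the `<title>` element. `TitleErrorFilter` is a hypothetical alternative, not a built-in hook.

```python
import re


class SkipErrors:
    def post_fetch(self, ctx, html_bytes):
        if b"error" in html_bytes.lower():
            return None
        return html_bytes


class TitleErrorFilter:
    """Hypothetical variant: drop only when the <title> mentions an error."""

    _TITLE = re.compile(rb"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

    def post_fetch(self, ctx, html_bytes):
        match = self._TITLE.search(html_bytes)
        if match and b"error" in match.group(1).lower():
            return None
        return html_bytes


ok_page = b"<html><title>Home</title><body>Welcome</body></html>"
error_page = b"<html><title>500 Internal Server Error</title></html>"
article = b"<html><title>Guide</title><body>How to fix a common error</body></html>"

broad = [SkipErrors().post_fetch(None, p) is not None for p in (ok_page, error_page, article)]
narrow = [TitleErrorFilter().post_fetch(None, p) is not None for p in (ok_page, error_page, article)]
```

Here `broad` keeps only the first page, while `narrow` also keeps the article whose body mentions an error.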
Built-in Hooks¶
SerializeXPathMapAndNodeHook¶
Converts XPathMap and XPathNode objects to plain Python types:
from wxpath.hooks import SerializeXPathMapAndNodeHook
from wxpath import hooks
hooks.register(SerializeXPathMapAndNodeHook)
JSONLWriter / NDJSONWriter¶
Writes extracted data to a newline-delimited JSON file.
The output file can be configured via the WXPATH_OUT environment variable.
Features:
- Non-blocking writes via background thread
- Queue-based buffering
- Automatic JSON serialization
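The general pattern behind these features can be sketched with the standard library alone. This is an illustration of queue-backed background writing, not the source of wxpath's JSONLWriter: the class name, the `close` method, and the `out.jsonl` fallback are all assumptions; only the WXPATH_OUT variable comes from the documentation above.

```python
import json
import os
import queue
import tempfile
import threading


class BackgroundJSONLWriter:
    """Illustrative queue-backed writer; one JSON document per line."""

    _STOP = object()  # sentinel that tells the worker thread to finish

    def __init__(self, path=None):
        # WXPATH_OUT is documented; the "out.jsonl" fallback is an assumption.
        self.path = path or os.environ.get("WXPATH_OUT", "out.jsonl")
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def post_extract(self, value):
        # Enqueue and return immediately; disk I/O happens on the worker thread.
        self._queue.put(value)
        return value

    def _drain(self):
        with open(self.path, "a", encoding="utf-8") as fh:
            while True:
                item = self._queue.get()
                if item is self._STOP:
                    break
                # default=str is a simple fallback for non-JSON-native types
                fh.write(json.dumps(item, default=str) + "\n")

    def close(self):
        """Flush everything queued so far and stop the worker."""
        self._queue.put(self._STOP)
        self._thread.join()


out_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
writer = BackgroundJSONLWriter(out_path)
writer.post_extract({"title": "Example", "depth": 1})
writer.post_extract("plain string")
writer.close()

with open(out_path, encoding="utf-8") as fh:
    lines = [json.loads(line) for line in fh]
```

Returning the value unchanged from `post_extract` lets a writer of this kind sit anywhere in the chain without affecting what later hooks see.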
Hook Registration¶
Decorator Style¶
Instance Registration¶
Class Registration¶
Accessing Registered Hooks¶
Execution Order¶
Hooks execute in registration order. Each hook can:
- Transform the value and pass it to the next hook
- Drop the value by returning None, which stops the pipeline so later hooks are not called
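The chaining rule can be modeled in a few lines. This sketch is self-contained and hypothetical: `run_chain` stands in for whatever wxpath does internally, and the three hook classes are plain, unregistered objects whose list order plays the role of registration order.

```python
def run_chain(hooks, value):
    """Pass value through each hook in order; a None result stops the chain."""
    for hook in hooks:
        value = hook.post_extract(value)
        if value is None:
            return None
    return value


class Strip:
    def post_extract(self, value):
        return value.strip() if isinstance(value, str) else value


class DropEmpty:
    def post_extract(self, value):
        return value if value else None


class Upper:
    def __init__(self):
        self.seen = []  # values that actually reached this hook

    def post_extract(self, value):
        self.seen.append(value)
        return value.upper()


upper = Upper()
chain = [Strip(), DropEmpty(), upper]

kept = run_chain(chain, "  hi ")      # stripped, kept, upper-cased
dropped = run_chain(chain, "   ")     # stripped to "", dropped before Upper

# Reordering changes the outcome: DropEmpty sees "   " before it is stripped.
reordered = run_chain([DropEmpty(), Strip(), Upper()], "   ")
```

In the reordered chain the whitespace-only string passes `DropEmpty` and comes out as an empty string, which is why order matters when one hook filters on the output of another.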