Operations¶
Warning: pre-1.0.0 - APIs and contracts may change.
Operation handlers that execute wxpath segments. This module follows a dispatcher pattern, where each segment signature (wxpath function name or segment type, and its argument types) is mapped to a handler function.
This module, along with the parser, could be tightened up in the following ways:
- Better type checking.
  - Specifically, check that next segments are of the correct type.
- Fewer intents (ProcessIntent may be unnecessary).
- More intuitive error messages.
Location¶
get_operator¶
Retrieve the handler function for an AST node type.
Parameters:
- binary_or_segment - AST node to find handler for
Returns: Handler function
OPS_REGISTER¶
Global dictionary mapping segment signatures to handlers.
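To make the lookup concrete, here is a minimal sketch of how a signature-keyed registry and a get_operator-style lookup could fit together. The node classes (`Xpath`, `String`, `Call`), the `registry` argument, and the `signature_of` and `get_operator_sketch` names are stand-ins invented for this example; the real OPS_REGISTER and AST classes may be shaped differently.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical stand-ins for AST nodes; not the real wxpath classes.
@dataclass
class Xpath:
    value: str


@dataclass
class String:
    value: str


@dataclass
class Call:
    """Hypothetical node for a function-style segment such as url(...)."""
    func: str
    args: List[object] = field(default_factory=list)


def signature_of(node) -> tuple:
    """Build the lookup key: (type,) for bare nodes, (name, arg types) for calls."""
    if isinstance(node, Call):
        return (node.func, tuple(type(a) for a in node.args))
    return (type(node),)


def get_operator_sketch(node, registry: Dict[tuple, Callable]) -> Callable:
    """Return the handler registered under the node's signature."""
    key = signature_of(node)
    try:
        return registry[key]
    except KeyError:
        # Name the missing signature so the failure is easy to diagnose.
        raise LookupError(f"no handler registered for signature {key!r}") from None
```

Keying on both the function name and the argument types is what lets `url` with a `String` argument and `url` with an `Xpath` argument resolve to different handlers.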
Handler Registration¶
Handlers are registered with the @register decorator:
from wxpath.core.ops import register
from wxpath.core.parser import Xpath, String

@register(Xpath)
def handle_xpath(elem, segments, depth):
    # Execute XPath on element
    ...
    return [DataIntent(value=result)]

@register('url', (String,))
def handle_url_literal(elem, segments, depth):
    # Fetch literal URL
    url = segments[0].args[0].value
    return [CrawlIntent(url=url, next_segments=segments[1:])]
TODO: Converge on a common function parameter type for the register decorator. Right now it allows for AST node type OR string.
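The two registration forms above (a bare AST node type, or a function name plus argument types) can be handled by a single decorator that normalizes both into a tuple key. The following is only a sketch of that idea, not the library's implementation: `register_sketch`, `SKETCH_REGISTER`, and the empty `Xpath` and `String` classes are hypothetical names used for illustration.

```python
from typing import Callable, Dict, Optional, Tuple, Union


class Xpath: ...   # hypothetical stand-in for the parser's Xpath node
class String: ...  # hypothetical stand-in for the parser's String node


SKETCH_REGISTER: Dict[tuple, Callable] = {}  # stand-in for OPS_REGISTER


def register_sketch(
    name_or_type: Union[str, type],
    arg_types: Optional[Tuple[type, ...]] = None,
):
    """Accept either an AST node type or a function name plus argument types."""
    if isinstance(name_or_type, str):
        # Function-style segment: key on the name and the argument types.
        key = (name_or_type, tuple(arg_types or ()))
    else:
        # Bare node type: key on the type alone.
        key = (name_or_type,)

    def decorator(fn: Callable) -> Callable:
        SKETCH_REGISTER[key] = fn
        return fn

    return decorator


@register_sketch(Xpath)
def handle_xpath(elem, segments, depth):
    return []


@register_sketch("url", (String,))
def handle_url_literal(elem, segments, depth):
    return []
```

With this shape, the keys line up with the signatures listed for the registered handlers, such as `(Xpath,)` and `('url', (String,))`.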
Registered Handlers¶
XPath Handler¶
Signature: (Xpath,)
Executes XPath expressions on elements.
URL Literal Handler¶
Signature: ('url', (String,))
Yields a CrawlIntent for a literal URL. This signal eventually reaches the crawler.
URL XPath Handler¶
Signature: ('url', (Xpath,))
Yields CrawlIntents for URLs extracted by XPath.
URL Query Handler¶
Signature: ('//url', ...)
Yields CrawlIntents for URLs extracted by XPath.
URL Crawl Handler¶
Signature: ('///url', (Xpath,))
Recursive deep crawling.
URL Crawl with Extraction Handler¶
Signature: ('///url', (Xpath, str))
Deep crawl with inline extraction. Yields InfiniteCrawlIntent.
Binary (Map) Handler¶
Signature: (Binary, ...)
Handles the map operator (!). (More to come...)
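For orientation only: in XPath 3.1, the simple map operator `!` evaluates its right-hand expression once per item of the left-hand sequence and concatenates the results in order. The sketch below shows that semantics on plain Python lists. Whether the wxpath handler follows it exactly is an assumption, and `simple_map` is a hypothetical helper, not part of the library.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def simple_map(items: Iterable[T], right: Callable[[T], Iterable[U]]) -> List[U]:
    """Evaluate `right` once per item and concatenate the results in order."""
    out: List[U] = []
    for item in items:
        out.extend(right(item))
    return out


# Example: map each relative link to an absolute URL (example.com is a placeholder).
links = ["/a", "/b"]
absolute = simple_map(links, lambda href: ["https://example.com" + href])
```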
Handler Return Values¶
Handlers return an iterable of intents:
def my_handler(elem, segments, depth) -> Iterable[Intent]:
    return (
        CrawlIntent(url="...", next_segments=...),
        DataIntent(value={"key": "value"}),
        ProcessIntent(elem=elem, next_segments=...),
    )
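To illustrate what a caller might do with a mixed batch of intents like the one above, here is a small sketch that sorts them by type. The intent classes are hypothetical stand-ins whose fields mirror the example (`url`, `value`, `elem`, `next_segments`), and `route` is an invented helper; the real engine's handling may differ.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List, Tuple


# Hypothetical stand-ins for the wxpath intent classes.
@dataclass
class Intent:
    pass


@dataclass
class CrawlIntent(Intent):
    url: str
    next_segments: list


@dataclass
class DataIntent(Intent):
    value: Any


@dataclass
class ProcessIntent(Intent):
    elem: Any
    next_segments: list


def route(intents: Iterable[Intent]) -> Tuple[List[str], List[Any], List[Any]]:
    """Split intents into URLs to fetch, values to emit, and elements to keep processing."""
    to_crawl, results, to_process = [], [], []
    for intent in intents:
        if isinstance(intent, CrawlIntent):
            to_crawl.append(intent.url)
        elif isinstance(intent, DataIntent):
            results.append(intent.value)
        elif isinstance(intent, ProcessIntent):
            to_process.append(intent.elem)
        else:
            raise TypeError(f"unexpected intent: {intent!r}")
    return to_crawl, results, to_process
```

Because the function only iterates over its input, it works equally well with a list, a tuple, or a generator of intents.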
RuntimeSetupError¶
Raised when handler registration fails (e.g., duplicate signature).