Declarative web graph traversal with XPath¶
Warning: pre-1.0.0 - APIs and contracts may change.
NEW: TUI - Interactive terminal interface (powered by Textual) for testing wxpath expressions and extracting data.
wxpath is a declarative, deterministic (read: not powered by LLMs), async web crawler where traversal is expressed directly in XPath. Instead of writing imperative crawl loops, wxpath lets you describe what to follow and what to extract in a single expression.
Quick Start¶
import wxpath
expr = "url('https://quotes.toscrape.com')//a/@href"
for link in wxpath.wxpath_async_blocking_iter(expr):
    print(link)
Key Features¶
- Declarative Traversal - Express web crawling logic in XPath-like syntax
- RAG-Ready Output - Extract clean, structured JSON hierarchies directly from the graph
- Concurrent Execution - Async-first design with automatic concurrency management
- XPath 3.1 Support - Full XPath 3.1 features, including maps and arrays, via elementpath
- Polite Crawling - Built-in robots.txt respect and adaptive throttling
- NEW: Persistent Crawls - Optional SQLite or Redis backends for persistent crawl results
- NEW: TUI - Interactive terminal interface for testing wxpath expressions
Installation¶
pip install wxpath
# Optional extras:
pip install wxpath[tui] # Interactive Terminal UI
pip install wxpath[cache-sqlite] # Persistence (SQLite)
pip install wxpath[cache-redis] # Persistence (Redis)
Core Concepts¶
The wxpath DSL extends the familiar XPath syntax with additional operators for web traversal and data extraction.
The url(...) Operator¶
The url(...) operator fetches content from a URL and returns it as an lxml.html.HtmlElement for further XPath processing.
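For example, a minimal sketch composing a url(...) expression against the same demo site used in the Quick Start. The fetch is deferred: nothing is downloaded until the expression is handed to an iterator such as wxpath.wxpath_async_blocking_iter. The helper name print_titles is hypothetical, not part of the wxpath API.

```python
# The url(...) step fetches the page and parses it into an
# lxml.html.HtmlElement; the trailing XPath runs against that tree.
expr = "url('https://quotes.toscrape.com')//title/text()"

def print_titles():
    # Hypothetical helper; requires network access when called.
    import wxpath
    for title in wxpath.wxpath_async_blocking_iter(expr):
        print(title)
```

Because evaluation is lazy, expressions can be built and composed as plain strings before any crawling happens.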
Deep Crawling¶
///url(...)¶
The ///url(...) syntax enables recursive crawling up to a specified max_depth:
path_expr = """
url('https://quotes.toscrape.com')
///url(//a/@href)
//a/@href
"""
for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
    print(item)
url('...', follow=...)¶
The follow parameter lets you specify a follow path for recursive crawling at the root node. Although this may look like a duplicate of the ///url syntax, it is not: follow initiates a recursive crawl while also letting you extract data from the root node itself. This is useful when you want to paginate through search pages.
path_expr = """
url('https://quotes.toscrape.com/tag/humor/', follow=//li[@class='next']/a/@href)
//div[@class='quote']
/map{
'author': (./span/small/text())[1],
'text': (./span[@class='text']/text())[1]
}
"""
XPath 3.1 Maps¶
Extract structured data using XPath 3.1 map syntax:
path_expr = """
url('https://example.com')
/map{
'title': //title/text() ! string(.),
'url': string(base-uri(.))
}
"""
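Map-producing expressions stream through the same iterator as any other expression. A sketch combining the quote-site markup from the earlier examples with a map constructor; the dump_quotes helper is hypothetical, and the assumption that each map arrives as a key/value record follows from the map syntax above.

```python
# Sketch: one XPath 3.1 map per matched quote element.
page_expr = """
url('https://quotes.toscrape.com')
//div[@class='quote']
/map{
    'author': (./span/small/text())[1],
    'text': (./span[@class='text']/text())[1]
}
"""

def dump_quotes():
    # Hypothetical helper; requires network access when called.
    import wxpath
    for record in wxpath.wxpath_async_blocking_iter(page_expr):
        print(record)
```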
Next Steps¶
- Getting Started - Detailed setup and first crawl
- Language Design - Understanding wxpath expressions
- API Reference - Complete API documentation
- Examples - More usage examples
- NEW: TUI Quickstart - Interactive terminal interface
- NEW: RAG Integrations - Integrations with LangChain