                            __  __  
 _      ___  ______  ____ _/ /_/ /_ 
| | /| / / |/_/ __ \/ __ `/ __/ __ \
| |/ |/ />  </ /_/ / /_/ / /_/ / / /
|__/|__/_/|_/ .___/\__,_/\__/_/ /_/ 
           /_/

Declarative web graph traversal with XPath

Warning: pre-1.0.0 - APIs and contracts may change.

NEW: TUI - an interactive terminal interface (powered by Textual) for testing wxpath expressions and extracting data.

wxpath is a declarative, deterministic (read: not powered by LLMs), async web crawler where traversal is expressed directly in XPath. Instead of writing imperative crawl loops, wxpath lets you describe what to follow and what to extract in a single expression.

Quick Start

import wxpath

# Fetch the page, then yield every href found on it
expr = "url('https://quotes.toscrape.com')//a/@href"

for link in wxpath.wxpath_async_blocking_iter(expr):
    print(link)

Key Features

  • Declarative Traversal - Express web crawling logic in XPath-like syntax
  • RAG-Ready Output - Extract clean, structured JSON hierarchies directly from the graph
  • Concurrent Execution - Async-first design with automatic concurrency management
  • XPath 3.1 Support - Full XPath 3.1 features including maps and arrays via elementpath
  • Polite Crawling - Built-in robots.txt respect and adaptive throttling
  • NEW: Persistent Crawls - Optional SQLite or Redis backends for persistent crawl results
  • NEW: TUI - Interactive terminal interface for testing wxpath expressions

Installation

pip install wxpath

# Optional extras:
pip install wxpath[tui]           # Interactive Terminal UI
pip install wxpath[cache-sqlite]  # Persistence (SQLite)
pip install wxpath[cache-redis]   # Persistence (Redis)

Core Concepts

The wxpath DSL extends the familiar XPath syntax with additional operators for web traversal and data extraction.

The url(...) Operator

The url(...) operator fetches content from a URL and returns it as an lxml.html.HtmlElement for further XPath processing:

# Fetch a page and extract all links
"url('https://example.com')//a/@href"

Deep Crawling

///url(...)

The ///url(...) syntax enables recursive crawling up to a specified max_depth:

path_expr = """
url('https://quotes.toscrape.com')
  ///url(//a/@href)
    //a/@href
"""

# max_depth caps how deep ///url(...) may recurse while following links
for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
    print(item)

url('...', follow=...)

The follow parameter lets you specify a follow path for recursive crawling at the root node. This may look like a duplicate of the ///url syntax, but it is not: follow initiates a recursive crawl while also letting you extract data from the root node itself. This is useful when you want to paginate through search pages, scraping each page along the way, as in the expression below and the driver sketch that follows it.

path_expr = """
url('https://quotes.toscrape.com/tag/humor/', follow=//li[@class='next']/a/@href)
  //div[@class='quote']
    /map{
      'author': (./span/small/text())[1],
      'text': (./span[@class='text']/text())[1]
      }
"""

XPath 3.1 Maps

Extract structured data using XPath 3.1 map syntax:

path_expr = """
url('https://example.com')
  /map{
    'title': //title/text() ! string(.),
    'url': string(base-uri(.))
  }
"""

Next Steps