                            __  __  
 _      ___  ______  ____ _/ /_/ /_ 
| | /| / / |/_/ __ \/ __ `/ __/ __ \
| |/ |/ />  </ /_/ / /_/ / /_/ / / /
|__/|__/_/|_/ .___/\__,_/\__/_/ /_/ 
           /_/

Declarative web graph traversal with XPath

Warning: pre-1.0.0 - APIs and contracts may change.

NEW: TUI - an interactive terminal interface (powered by Textual) for testing wxpath expressions and extracting data.

wxpath is a declarative, deterministic (read: not powered by LLMs), async web crawler where traversal is expressed directly in XPath. Instead of writing imperative crawl loops, wxpath lets you describe what to follow and what to extract in a single expression.

Quick Start

import wxpath

# Fetch the page, then yield every href found on it
expr = "url('https://quotes.toscrape.com')//a/@href"

for link in wxpath.wxpath_async_blocking_iter(expr):
    print(link)

Key Features

  • Declarative Traversal - Express web crawling logic in XPath-like syntax
  • RAG-Ready Output - Extract clean, structured JSON hierarchies directly from the graph
  • Concurrent Execution - Async-first design with automatic concurrency management
  • XPath 3.1 Support - Full XPath 3.1 features including maps and arrays via elementpath
  • Polite Crawling - Built-in robots.txt respect and adaptive throttling
  • NEW: Persistent Crawls - Optional SQLite or Redis backends for persistent crawl results
  • NEW: TUI - Interactive terminal interface for testing wxpath expressions

Installation

pip install wxpath

# Optional extras:
pip install wxpath[tui]           # Interactive Terminal UI
pip install wxpath[cache-sqlite]  # Persistence (SQLite)
pip install wxpath[cache-redis]   # Persistence (Redis)

Core Concepts

The wxpath DSL extends the familiar XPath syntax with additional operators for web traversal and data extraction.

The url(...) Operator

The url(...) operator fetches content from a URL and returns it as an lxml.html.HtmlElement for further XPath processing:

# Fetch a page and extract all links
"url('https://example.com')//a/@href"

Deep Crawling

///url(...)

The ///url(...) syntax enables recursive crawling up to a specified max_depth:

path_expr = """
url('https://quotes.toscrape.com')
  ///url(//a/@href)
    //a/@href
"""

# max_depth caps how deep ///url(...) may recurse while following links
for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
    print(item)

url('...', follow=...)

The follow parameter lets you specify a follow path for recursive crawling at the root node. This may look like a duplicate of the ///url syntax, but it is not: follow initiates a recursive crawl while also letting you extract data from the root node itself. This is useful when you want to paginate through search pages, scraping each page along the way, as in the expression below and the driver sketch that follows it.

path_expr = """
url('https://quotes.toscrape.com/tag/humor/', follow=//li[@class='next']/a/@href)
  //div[@class='quote']
    /map{
      'author': (./span/small/text())[1],
      'text': (./span[@class='text']/text())[1]
      }
"""

XPath 3.1 Maps

Extract structured data using XPath 3.1 map syntax:

path_expr = """
url('https://example.com')
  /map{
    'title': //title/text() ! string(.),
    'url': string(base-uri(.))
  }
"""

Next Steps