Language Design¶
Warning: pre-1.0.0 - APIs and contracts may change.
wxpath extends XPath with URL-fetching operators, enabling declarative web traversal expressed directly in path syntax.
Core Operators¶
url(literal)¶
Fetches content from a literal URL string:
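For example, a minimal sketch that fetches a single page (the URL is illustrative):
"""
url('https://example.com')
"""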
Returns an lxml.html.HtmlElement representing the parsed HTML document.
url(xpath)¶
Fetches content from URLs extracted by an XPath expression:
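A sketch, with an illustrative seed URL and final extraction step:
"""
url('https://example.com')
/url(//a/@href)
//title/text()
"""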
The inner XPath (//a/@href) extracts URLs from the current document, which are then fetched.
///url(xpath) - Deep Crawl (starting from descendants)¶
Indicates recursive* crawling. The engine follows links extracted by the XPath expression up to max_depth:
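For instance, a sketch that follows every on-page link and extracts each page's title (the seed URL and extraction step are illustrative):
"""
url('https://example.com')
///url(//a/@href)
//title/text()
"""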
* The term "recursive" is used informally in this document; I do not mean DFS-style recursion. wxpath uses a FIFO queue, which is expanded by visiting the URLs extracted from the current document.
url('...', follow=...) - Deep Crawl (starting from root)¶
The follow parameter lets you specify a follow path for recursive crawling at the root node. Although this may look like a duplicate of the ///url syntax, it is not: follow initiates a recursive crawl, but it also lets you extract data from the root node itself. This is useful when you want to paginate through search pages.
path_expr = """
url('https://quotes.toscrape.com/tag/humor/', follow=//li[@class='next']/a/@href)
//div[@class='quote']
/map{
'author': (./span/small/text())[1],
'text': (./span[@class='text']/text())[1]
}
"""
url('...', follow=..., depth=...) - Deep Crawl (starting from root) with depth limit¶
The depth parameter sets the maximum depth of the crawl, which is useful when you want to bound the crawl to a specific number of levels.
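A sketch based on the follow example above, bounded to an arbitrary three levels of pagination:
"""
url('https://quotes.toscrape.com/tag/humor/', follow=//li[@class='next']/a/@href, depth=3)
//div[@class='quote']
/map{
    'author': (./span/small/text())[1],
    'text': (./span[@class='text']/text())[1]
}
"""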
Expression Structure¶
A wxpath expression is a sequence of segments:
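In outline, based on the segment types listed below (the recursive segment is optional):
url(start) ///url(follow) extraction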
Where:
- url(start) - Initial URL(s) to fetch
- ///url(follow) - Optional recursive crawl pattern
- extraction - XPath to extract data from fetched pages
Segment Types¶
| Segment | Description |
|---|---|
| `url('...')` | Fetch from literal URL |
| `/url(xpath)` | Fetch from XPath-extracted URLs |
| `//url(xpath)` | One-hop link following |
| `///url(xpath)` | Recursive deep crawl |
| XPath | Standard XPath 3.1 expression |
XPath 3.1 Features¶
wxpath uses elementpath for XPath 3.1 support, enabling:
Maps¶
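Map constructors group several extracted values into a single result and are used throughout the examples on this page. A minimal sketch (the URL is illustrative):
"""
url('https://example.com')
//a/map{
    'text': string(.),
    'href': string(@href)
}
"""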
Arrays¶
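Array constructors collect a variable-length sequence into a single value, for example one array of link targets per page. A sketch (the URL and page structure are illustrative):
"""
url('https://example.com')
/map{
    'title': (//h1/text())[1],
    'links': array { //a/@href ! string(.) }
}
"""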
Built-in Functions¶
Standard XPath 3.1 functions plus wxpath-specific:
| Function | Description |
|---|---|
| `base-uri(.)` | Current document URL |
| `wx:current-url()` | Current document URL |
| `wx:backlink(.)` | URL that linked to the current page |
| `wx:depth(.)` | Current crawl depth |
| `wx:fetch-time()` | Fetch time |
| `wx:elapsed()` | Fetch time |
| `wx:status-code()` | Current HTTP status code |
| `wx:elem()` | Current HTML element (Warning: memory will explode with deep crawls) |
| `wx:internal-links()` | Internal links |
| `wx:external-links()` | External links |
| `wx:main-article-text()` | Main article text (reimplementation of eatiht) |
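A sketch combining several of these functions during a deep crawl (the seed URL is illustrative):
"""
url('https://example.com')
///url(//a/@href)
/map{
    'url': wx:current-url(),
    'depth': wx:depth(.),
    'status': wx:status-code()
}
"""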
Future Potential Functions¶
| Function | Description |
|---|---|
| `post(...)` | HTTP POST |
| `wx:limit()` | Cap results at expression level |
| `wx:clean-text()` | Normalizes whitespace and newlines |
| `wx:status-text()` | Current HTTP status text |
| `wx:delay()` | Sleep for the specified number of seconds |
| `wx:headers()` | HTTP headers |
Examples¶
Extract Links with Context¶
"""
url('https://example.com')
//a/map{
'text': string(.),
'href': @href,
'source': string(base-uri(.))
}
"""
Filtered Deep Crawl¶
"""
url('https://en.wikipedia.org/wiki/Main_Page')
///url(
//a/@href[
starts-with(., '/wiki/')
and not(contains(., ':'))
]
)
/map{
'title': //h1/span/text() ! normalize-space(.),
'url': string(base-uri(.))
}
"""
Pagination Pattern¶
"""
url('https://example.com/page/1')
///url(//a[@class='next']/@href)
//div[@class='item']/text()
"""
Execution Model¶
- Parse - Expression is parsed into an AST of segments
- Seed - Initial `url('...')` creates the first crawl tasks
- BFS-ish Traversal - Tasks are processed breadth-first with deduplication (see the sketch below)
- Extraction - XPath segments extract data from fetched documents
- Streaming - Results are yielded as soon as available
The engine handles:
- URL deduplication (best-effort, per-crawl)
- Concurrent fetching with configurable limits
- robots.txt compliance
- Adaptive throttling
- Retry policies
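To make the traversal order concrete, here is a minimal Python sketch of a BFS-style crawl loop with per-crawl deduplication. It illustrates the model above rather than wxpath's actual implementation; fetch and extract_links are hypothetical callables standing in for the engine's fetching and link-selection steps.

from collections import deque

def crawl(seed_urls, fetch, extract_links, max_depth):
    # Per-crawl, best-effort deduplication of URLs (illustrative sketch).
    seen = set(seed_urls)
    # FIFO queue => breadth-first ("BFS-ish") traversal.
    queue = deque((url, 0) for url in seed_urls)
    while queue:
        url, depth = queue.popleft()
        doc = fetch(url)                      # fetch and parse the page
        yield url, depth, doc                 # stream results as soon as available
        if depth < max_depth:
            for link in extract_links(doc):   # links selected by the follow XPath
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))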
Best Practices¶
- Use XPath predicates to filter links early and avoid traversal explosion
- Set an appropriate `max_depth` to bound crawl scope
- Add `User-Agent` headers for polite crawling
- Use caching for development and resumable crawls