Configuration

Warning: pre-1.0.0 - APIs and contracts may change.

wxpath provides hierarchical configuration through the SETTINGS object.

Settings Structure

from wxpath.settings import SETTINGS, CRAWLER_SETTINGS, CACHE_SETTINGS
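
CRAWLER_SETTINGS and CACHE_SETTINGS are shorthand references to the crawler and cache sections of SETTINGS (the full paths are given below). A minimal sketch reading the same value both ways, assuming the aliases point at the same underlying objects:

# Read one crawler setting through the alias and through the full path;
# both should yield the same value (16 is the documented default).
print(CRAWLER_SETTINGS.concurrency)
print(SETTINGS.http.client.crawler.concurrency)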

Crawler Settings

Access via CRAWLER_SETTINGS or SETTINGS.http.client.crawler:

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| concurrency | int | 16 | Global concurrent requests |
| per_host | int | 8 | Per-host concurrent requests |
| timeout | int | 15 | Request timeout in seconds |
| headers | dict | {...} | Default HTTP headers |
| proxies | dict | None | Per-host proxy mapping |
| respect_robots | bool | True | Honor robots.txt |
| auto_throttle_target_concurrency | float | None | Target concurrent requests for throttler |
| auto_throttle_start_delay | float | 0.25 | Initial throttle delay |
| auto_throttle_max_delay | float | 10.0 | Maximum throttle delay |
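
For example, the timeout and auto-throttle fields can be tuned before a crawl starts. This is a sketch using the field names from the table above; that assigning before the crawl begins is sufficient is an assumption:

from wxpath.settings import CRAWLER_SETTINGS

CRAWLER_SETTINGS.timeout = 30
CRAWLER_SETTINGS.auto_throttle_target_concurrency = 4.0
CRAWLER_SETTINGS.auto_throttle_max_delay = 5.0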

Cache Settings

Access via CACHE_SETTINGS or SETTINGS.http.client.cache:

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | False | Enable response caching |
| expire_after | timedelta | timedelta(days=7) | Cache time-to-live |
| allowed_methods | tuple | ("GET", "HEAD") | HTTP methods to cache |
| allowed_codes | tuple | (200, 203, 301, 302, 307, 308) | Status codes to cache |
| ignored_params | list | ["utm_*", "fbclid"] | Query params to ignore in cache key |
| backend | str | "sqlite" | Cache backend ("sqlite" or "redis") |
| sqlite | dict | {...} | SQLite backend settings |
| redis | dict | {...} | Redis backend settings |

For the SQLite backend:

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| cache_name | str | "cache.db" | SQLite cache name |

For the Redis backend:

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| address | str | "redis://localhost:6379/0" | Redis connection URL |
| cache_name | str | "wxpath:" | Redis cache name |
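
Putting the cache tables together, here is a sketch that enables caching and adjusts both shared and backend-specific fields; that the nested sqlite dict is assignable with the same dot notation is an assumption based on the AttrDict behavior described below:

from datetime import timedelta
from wxpath.settings import CACHE_SETTINGS

CACHE_SETTINGS.enabled = True
CACHE_SETTINGS.expire_after = timedelta(days=1)   # TTL is a timedelta, not seconds
CACHE_SETTINGS.sqlite.cache_name = "my_cache.db"  # cache name for the SQLite backend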

Configuration Examples

Setting Headers

from wxpath.settings import CRAWLER_SETTINGS

CRAWLER_SETTINGS.headers = {
    'User-Agent': 'my-crawler/1.0 (contact: you@example.com)',
    'Accept-Language': 'en-US,en;q=0.9'
}

Enabling Caching

from wxpath.settings import CACHE_SETTINGS

# SQLite backend (default)
CACHE_SETTINGS.enabled = True

# Redis backend
CACHE_SETTINGS.enabled = True
CACHE_SETTINGS.backend = "redis"
CACHE_SETTINGS.redis.address = "redis://localhost:6379/0"

Custom Concurrency

from wxpath.settings import CRAWLER_SETTINGS

CRAWLER_SETTINGS.concurrency = 32
CRAWLER_SETTINGS.per_host = 4

Proxy Configuration

from wxpath.settings import CRAWLER_SETTINGS
from collections import defaultdict

# Per-host proxies
CRAWLER_SETTINGS.proxies = {
    'example.com': 'http://proxy1:8080',
    'api.example.com': 'http://proxy2:8080'
}

# Default proxy for all hosts
CRAWLER_SETTINGS.proxies = defaultdict(lambda: 'http://default-proxy:8080')
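
The two approaches can be combined to give specific hosts their own proxies with a fallback for everything else; this sketch assumes the crawler looks up proxies by request host, as the per-host mapping above implies:

proxies = defaultdict(lambda: 'http://default-proxy:8080')
proxies['example.com'] = 'http://proxy1:8080'
CRAWLER_SETTINGS.proxies = proxies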

Engine Configuration

For fine-grained control, configure the engine and crawler directly:

from wxpath import wxpath_async_blocking_iter
from wxpath.core.runtime import WXPathEngine
from wxpath.http.client import Crawler
from wxpath.http.policy.retry import RetryPolicy
from wxpath.http.policy.throttler import AutoThrottler
from wxpath.settings import CRAWLER_SETTINGS

# Global default headers
CRAWLER_SETTINGS.headers = {'User-Agent': 'my-app/1.0 (contact: you@example.com)'}

# Custom retry policy
retry_policy = RetryPolicy(
    max_retries=3,
    retry_statuses={500, 502, 503, 504}
)

# Custom throttler
throttler = AutoThrottler(
    target_concurrency=2.0,
    start_delay=1.0,
    max_delay=30.0
)

# Create crawler
crawler = Crawler(
    concurrency=8,
    per_host=2,
    timeout=15,
    headers={'User-Agent': 'my-app/1.0 (contact: you@example.com)'},
    retry_policy=retry_policy,
    throttler=throttler,
    respect_robots=True
)

# Create engine
engine = WXPathEngine(
    crawler=crawler,
    allowed_response_codes={200, 301, 302},
    allow_redirects=True
)

path_expr = """
url('https://quotes.toscrape.com/tag/humor/', follow=//li[@class='next']/a/@href)
  //div[@class='quote']
    /map{
      'author': (./span/small/text())[1],
      'text': (./span[@class='text']/text())[1]
      }
"""

# Use engine
for item in wxpath_async_blocking_iter(path_expr, max_depth=1, engine=engine):
    print(item)
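
Arguments passed directly to Crawler and WXPathEngine configure that instance; presumably they take precedence over the corresponding SETTINGS defaults while the engine is in use.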

AttrDict

Settings use AttrDict for dot-notation access:

from wxpath.settings import SETTINGS

# Both work
SETTINGS['http']['client']['crawler']['concurrency']
SETTINGS.http.client.crawler.concurrency
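
Assignment works through either form as well (the earlier examples rely on this); a short sketch, assuming dict-style and dot-notation writes are symmetric:

SETTINGS.http.client.crawler.per_host = 4
SETTINGS['http']['client']['crawler']['per_host'] = 4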