Fetchers

Yosoi fetches HTML before selector discovery and extraction. Every fetch goes through an HTMLFetcher instance. The default is auto: Yosoi starts with the simple HTTP tier when that is enough, then promotes to browser-backed fetching when the page needs rendered DOM evidence.

Fetcher Types

Fetcher	CLI flag	Python value	Description
Auto	`--fetcher auto`	`'auto'`	Default. Starts with simple HTTP and promotes to browser tiers when needed.
Simple	`--fetcher simple`	`'simple'`	Plain HTTP with realistic headers. Fast, no Chrome dependency.
Waterfall	`--fetcher waterfall`	`'waterfall'`	Simple → Headless → Headful. Adapts to the page automatically.
Headless	`--fetcher headless`	`'headless'`	Headless Chrome via VoidCrawl.
Headful	`--fetcher headful`	`'headful'`	Visible Chrome via VoidCrawl. Best bot evasion.

Auto Fetcher (default)

auto is the default because it keeps static pages cheap without making JavaScript-rendered pages a separate workflow. It begins with simple HTTP and uses the same browser-backed tiers as the waterfall when the probe or page result shows that static HTML is not enough.

uvx yosoi --url https://qscrape.dev/l1/news --contract NewsArticle

import yosoi as ys
rows = await ys.scrape(url, Contract, policy=ys.Policy.from_env())

When to use: most workflows. Use an explicit fetcher only when you need a fixed tier for reproducibility, debugging, or deployment constraints.

Simple Fetcher

Sends HTTP requests with randomized user-agent headers and realistic browser fingerprints. Works for any page whose content is already present in the server response without JavaScript rendering, which covers the majority of sites.

When to use: static sites, news portals, most product catalogues, anything where Cmd+U / Ctrl+U in a browser shows the content you want to extract.

Waterfall Fetcher

Tries three tiers in order and stops at the first that succeeds:

1. SimpleFetcher (plain HTTP)
   ├── success, no JS detected → return immediately
   └── failure or JS-rendered page → continue

2. HeadlessFetcher (headless Chrome + DOMLoader)
   ├── success → return immediately
   └── failure → continue

3. HeadfulFetcher (visible Chrome + DOMLoader)
   └── return result regardless (best-effort final tier)
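
The tier order above can be sketched in a few lines of Python. This is an illustration only, not Yosoi's implementation: FetchResult, the needs_js flag, and the three tier callables are hypothetical names introduced here to show the control flow.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class FetchResult:
    """Hypothetical result type, not part of Yosoi's API."""
    ok: bool                # did this tier return usable HTML?
    html: str = ""
    needs_js: bool = False  # simple tier only: page looks JavaScript-rendered


# Hypothetical tier signature: takes a URL, returns a FetchResult.
Tier = Callable[[str], FetchResult]


def waterfall(url: str, simple: Tier, headless: Tier, headful: Tier) -> Tuple[str, FetchResult]:
    """Return (winning tier name, result), trying tiers in the documented order."""
    result = simple(url)
    if result.ok and not result.needs_js:
        return "simple", result      # success, no JS detected
    result = headless(url)
    if result.ok:
        return "headless", result    # success on the second tier
    # Final tier: its result is returned whether or not it succeeded.
    return "headful", headful(url)
```

Each tier runs only when the previous one fails, so a static page never pays the cost of launching Chrome.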

The winning tier for each domain is cached in .yosoi/fetch/. On the next run, the waterfall is skipped entirely — Yosoi jumps straight to the cached tier.

uvx yosoi --url https://finance.yahoo.com --contract NewsArticle --fetcher waterfall

policy = ys.Policy.cascade(
    ys.Policy.from_env(),
    ys.Policy(scrape=ys.ScrapePolicy(fetcher_type='waterfall')),
)
rows = await ys.scrape(url, Contract, policy=policy)

When to use: mixed workloads, sites you haven’t tested yet, or anywhere you want Yosoi to adapt without configuring which tier to use.

Headless and Headful Fetchers

Both tiers use VoidCrawl^△ to drive a Chrome instance and DOMLoader to bring the page to a fully-loaded state. The difference is visibility:

Headless: Chrome runs without a window. Faster, suitable for most dynamic sites.
Headful: Chrome shows a visible window. Harder for anti-bot systems to distinguish from a real user. Use when headless gets blocked.

Both require VoidCrawl to be installed:

uv add voidcrawl

See DOMLoader^○ for how the page-loading behavior tree works.

Detecting Which Tier Is Needed

You don’t need to figure this out yourself when using the waterfall. Before the first full fetch, it runs a HEAD probe and checks the response for signs that plain HTTP will not be enough:

Content-Length under 5,000 bytes on an HTML response
Framework headers (X-Powered-By: Next.js, server headers for Vercel/Netlify)
Hard-block status codes (403, 429, 503)
Chunked transfer with no Content-Length (streaming SSR)

When the HEAD probe finds any of these signals, the waterfall skips Simple HTTP and starts from headless Chrome.
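
As a rough sketch, the four signals could be checked against a HEAD response like this. The function name, the exact header matching, and the assumption that any single signal is enough are illustrative choices made here, not Yosoi's actual probe logic.

```python
from typing import Mapping

HARD_BLOCK_STATUSES = {403, 429, 503}
SMALL_BODY_BYTES = 5_000


def probe_suggests_browser(status: int, headers: Mapping[str, str]) -> bool:
    """Hypothetical helper: True if a HEAD response shows any listed signal."""
    h = {k.lower(): v.lower() for k, v in headers.items()}

    # Hard-block status codes.
    if status in HARD_BLOCK_STATUSES:
        return True

    # Framework headers (the matching strings are assumptions).
    if "next.js" in h.get("x-powered-by", ""):
        return True
    server = h.get("server", "")
    if "vercel" in server or "netlify" in server:
        return True

    length = h.get("content-length")
    if length is None:
        # Chunked transfer with no Content-Length (streaming SSR).
        return "chunked" in h.get("transfer-encoding", "")

    # Small Content-Length on an HTML response.
    is_html = "text/html" in h.get("content-type", "")
    return is_html and length.isdigit() and int(length) < SMALL_BODY_BYTES
```

For example, probe_suggests_browser(200, {"Content-Type": "text/html", "Content-Length": "1200"}) is true because the body is under 5,000 bytes, while the same response with a Content-Length of 48000 is not flagged.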

For a deeper explanation of what makes a site require JavaScript, see Understanding the Web^◑.

Strategy Cache

The waterfall stores the winning fetcher tier per domain in .yosoi/fetch/. This avoids re-running the three-tier waterfall on every request for a domain Yosoi has already learned about.

.yosoi/
  fetch/
    fetch_finance_yahoo_com.json
    fetch_shop_example_com.json

Each file records the tier and the highest selector strategy level that worked:

{
  "domain": "finance.yahoo.com",
  "fetcher": "headless",
  "selector_level": "css",
  "discovered_at": "2026-05-23T14:00:00Z"
}
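
To check a cached tier from a script, a small helper along these lines may be enough. It assumes the filename is fetch_ plus the domain with dots replaced by underscores, which is inferred from the two example filenames above rather than from a documented rule; the function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def fetch_cache_path(domain: str, root: Path = Path(".yosoi")) -> Path:
    """Build the cache path for a domain (naming rule is an assumption)."""
    return root / "fetch" / f"fetch_{domain.replace('.', '_')}.json"


def cached_fetcher(domain: str, root: Path = Path(".yosoi")) -> Optional[str]:
    """Return the cached tier name, or None if the domain has no cache file."""
    path = fetch_cache_path(domain, root)
    if not path.exists():
        return None
    record = json.loads(path.read_text(encoding="utf-8"))
    return record.get("fetcher")
```

With the example file above in place, cached_fetcher("finance.yahoo.com") would return "headless".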

To re-run the waterfall for a specific domain (e.g. after a site redesign), delete its fetch file:

rm .yosoi/fetch/fetch_finance_yahoo_com.json

Or pass --force to also re-run selector discovery:

uvx yosoi --url https://finance.yahoo.com --contract NewsArticle --fetcher waterfall --force

Using a Shared Fetcher Across URLs

When processing multiple URLs sequentially, passing a single fetcher instance avoids the overhead of creating and closing a client per URL. The waterfall’s Chrome instances start lazily on first need and stay open for the batch.

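from yosoi.core.fetcher import create_fetcher

# `pipeline` and `urls` are assumed to exist already: a configured pipeline and a list of URL strings.
# One fetcher serves the whole batch; the context manager closes it on exit.
async with create_fetcher('waterfall') as fetcher:
    for url in urls:
        await pipeline.process_url(url, fetcher=fetcher)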

process_urls() already does this internally — it creates one shared fetcher for the entire batch.

FAQs

How do I know which fetcher a URL actually needed?

Check .yosoi/fetch/ after the waterfall runs for a domain. The JSON file records which tier won. Run with --debug to also save the HTML Yosoi received from that tier.

Does concurrent processing (workers > 1) work with browser fetchers?

Yes. Each concurrent worker creates its own fetcher instance. For the waterfall, Chrome starts lazily per worker on first need. The per-domain strategy cache is shared across workers via the filesystem.
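
The per-worker pattern can be pictured with plain asyncio. StubFetcher below is a hypothetical stand-in that only counts simulated browser launches; it is not Yosoi's fetcher class, and the queue-based worker loop is one possible arrangement rather than the library's own.

```python
import asyncio
from typing import Dict, List


class StubFetcher:
    """Hypothetical stand-in: 'launches' a browser lazily on first fetch."""
    launches = 0  # class-wide count of simulated browser starts

    def __init__(self) -> None:
        self._started = False

    async def fetch(self, url: str) -> str:
        if not self._started:        # lazy start on first need
            StubFetcher.launches += 1
            self._started = True
        await asyncio.sleep(0)       # yield control, as real I/O would
        return f"<html>{url}</html>"


async def worker(queue: "asyncio.Queue[str]", results: Dict[str, str]) -> None:
    fetcher = StubFetcher()          # each worker owns its own fetcher
    while True:
        try:
            url = queue.get_nowait()
        except asyncio.QueueEmpty:
            return
        results[url] = await fetcher.fetch(url)


async def run(urls: List[str], workers: int = 2) -> Dict[str, str]:
    queue: "asyncio.Queue[str]" = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results: Dict[str, str] = {}
    await asyncio.gather(*(worker(queue, results) for _ in range(workers)))
    return results
```

In this sketch the number of browser launches is bounded by the number of workers, not by the number of URLs.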

Can I pass custom Chrome arguments to the browser fetchers?

Not directly through the Pipeline API today. The VoidCrawl BrowserConfig is constructed inside the fetcher with headless, stealth, and no_sandbox options. Pass no_sandbox=True when running inside Docker or other sandboxed environments.

References

△ VoidCrawl. Cascading Labs. Rust-native CDP browser automation for Python via PyO3. https://github.com/CascadingLabs/VoidCrawl

○ DOMLoader. Cascading Labs. Behavior-tree page loader for Yosoi browser fetchers. /guides/dom-loader/

◑ Understanding the Web. Cascading Labs. How HTML, the DOM, and JavaScript frameworks affect what Yosoi can see. /guides/understanding-the-web/