DOMLoader

Yosoi’s static HTML fetcher handles most sites, where the content is already in the server response and CSS selectors work immediately. For pages that need JavaScript to render content (single-page apps, infinite scroll feeds, accordion-gated data), Yosoi ships a browser-backed fetcher tier that drives the page to a fully loaded state before discovery begins.

The component that manages this is DOMLoader.

What DOMLoader Does

DOMLoader sits inside the browser-based fetcher and runs after the initial page load. It works through a behavior tree — a priority-ordered sequence of probes and actions — that clears obstacles (cookie banners, modals) and exhausts content triggers (load-more buttons, pagination, infinite scroll) before capturing the final HTML.

navigate(url)
DOMLoader.run(tab)
clear obstacles → exhaust content triggers → wait for DOM stability
tab.content() → raw HTML
HTMLCleaner → LLM discovery

The behavior tree restarts from the root after every successful action. It stops when every branch returns FAILURE, meaning there is nothing left to do, or when the max_cycles limit is reached.
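The restart rule can be sketched as a small loop. This is a minimal illustration, not Yosoi's implementation: Status, run_until_exhausted, and the toy tick function are hypothetical names, and the cap simply mirrors the max_cycles default listed under Tuning DOMLoader.

```python
from enum import Enum
from typing import Callable, List


class Status(Enum):
    SUCCESS = "success"
    FAILURE = "failure"


def run_until_exhausted(tick: Callable[[], Status], max_cycles: int = 20) -> int:
    """Tick the tree from the root until it fails or the cycle cap is hit.

    Returns the number of successful actions performed.
    """
    actions = 0
    for _ in range(max_cycles):
        if tick() is Status.FAILURE:
            break  # every branch failed: nothing left to do
        actions += 1  # an action ran, so start again from the root
    return actions


# Toy page state (hypothetical): three "load more" clicks are available.
pending_clicks: List[str] = ["load more", "load more", "load more"]


def toy_tick() -> Status:
    """Stand-in for one full pass over the tree."""
    if not pending_clicks:
        return Status.FAILURE
    pending_clicks.pop()
    return Status.SUCCESS


clicks = run_until_exhausted(toy_tick)
print(clicks)
```

The cap matters for pages that never run out of triggers: without it, a feed that always offers more content would keep the loop alive indefinitely.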

How the Behavior Tree Works

The tree is built from two composite node types. A Selector tries each child in order and returns SUCCESS on the first one that succeeds. A Sequence checks a condition and then runs an action, returning FAILURE if the condition is false. The root is a Selector whose branches are Sequences:

Selector
├── Sequence(HasOverlay, Selector(Sequence(HasCloseButton, ClickClose), Skip))
├── Sequence(HasTrigger(LOAD_MORE), ClickTrigger)
├── Sequence(HasTrigger(ACCORDION), ClickTrigger)
├── Sequence(HasTrigger(TAB), ClickTrigger)
├── Sequence(HasTrigger(PAGINATION), ClickTrigger)
└── Sequence(HasTrigger(INFINITE_SCROLL), Scroll)
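To make the Selector and Sequence semantics concrete, here is a minimal sketch using plain functions. The helper names (selector, sequence, condition, action) and the dictionary that stands in for page state are assumptions made for illustration; Yosoi's real nodes operate on a browser tab.

```python
from enum import Enum
from typing import Any, Callable, Dict, List


class Status(Enum):
    SUCCESS = "success"
    FAILURE = "failure"


Page = Dict[str, Any]
Node = Callable[[Page], Status]


def selector(*children: Node) -> Node:
    """Succeed on the first child that succeeds; fail if none do."""
    def tick(page: Page) -> Status:
        for child in children:
            if child(page) is Status.SUCCESS:
                return Status.SUCCESS
        return Status.FAILURE
    return tick


def sequence(*children: Node) -> Node:
    """Run children in order; fail as soon as one fails."""
    def tick(page: Page) -> Status:
        for child in children:
            if child(page) is Status.FAILURE:
                return Status.FAILURE
        return Status.SUCCESS
    return tick


def condition(predicate: Callable[[Page], bool]) -> Node:
    return lambda page: Status.SUCCESS if predicate(page) else Status.FAILURE


def action(effect: Callable[[Page], None]) -> Node:
    def tick(page: Page) -> Status:
        effect(page)
        return Status.SUCCESS
    return tick


log: List[str] = []


def dismiss_overlay(page: Page) -> None:
    page["overlay"] = False
    log.append("dismiss_overlay")


def click_load_more(page: Page) -> None:
    page["load_more"] -= 1
    log.append("click_load_more")


# Two branches in priority order: the overlay branch comes first.
tree = selector(
    sequence(condition(lambda p: p["overlay"]), action(dismiss_overlay)),
    sequence(condition(lambda p: p["load_more"] > 0), action(click_load_more)),
)

page: Page = {"overlay": True, "load_more": 2}
while tree(page) is Status.SUCCESS:
    pass  # restart from the root after each successful action

print(log)
```

Because the root is a Selector, only one branch acts per tick, which is what lets the overlay branch win whenever an overlay is present.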

Priority order — obstacles fire first so a cookie banner never blocks a load-more probe:

| Kind | What it finds |
| --- | --- |
| Cookie | Consent banners with accept buttons |
| Popup | Modal dialogs with close buttons |
| Age Gate | Age verification screens |
| Load More | “Load more”, “Show more” buttons |
| Accordion | Collapsed [aria-expanded="false"] sections |
| Tab | Unselected [role="tab"] panels |
| Pagination | Next-page links and a[rel="next"] |
| Infinite Scroll | Bottom-of-page scroll when content count is divisible by 10 |
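The priority order can be modelled as an ordered enumeration, where the lowest value is handled first. This is only a sketch: the trigger names follow the tree above, while the obstacle member names, next_kind, and may_have_more are hypothetical. The may_have_more function is one possible reading of the divisible-by-10 rule, assuming feeds load in fixed-size batches.

```python
from enum import IntEnum
from typing import Optional, Set


class Kind(IntEnum):
    # Obstacles (member names are assumptions)
    COOKIE = 0
    POPUP = 1
    AGE_GATE = 2
    # Content triggers (names as they appear in the tree)
    LOAD_MORE = 3
    ACCORDION = 4
    TAB = 5
    PAGINATION = 6
    INFINITE_SCROLL = 7


def next_kind(detected: Set[Kind]) -> Optional[Kind]:
    """Pick the highest-priority kind among those detected on the page."""
    return min(detected, default=None)


def may_have_more(item_count: int, batch_size: int = 10) -> bool:
    """Assumed heuristic: a non-empty count that is an exact multiple of
    the batch size suggests the last batch was full, so more may follow."""
    return item_count > 0 and item_count % batch_size == 0


print(next_kind({Kind.LOAD_MORE, Kind.COOKIE}).name)
print(may_have_more(20), may_have_more(17))
```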

DOM Stability

Each action ends with WaitForDOMStable — a MutationObserver that resolves after quiet_ms milliseconds of DOM silence. This means the next tree tick starts from a fully-settled page, not from a mid-render state.

Because WaitForDOMStable reacts to actual DOM activity instead of waiting for a fixed asyncio.sleep, it returns quickly on fast renders and keeps waiting on slow ones.
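The quiet-window idea can be illustrated in pure Python. In this sketch an asyncio.Queue stands in for the stream of mutation records that the browser's MutationObserver would deliver; wait_for_quiet and the timeout_ms safety cap are assumptions added for illustration and are not part of Yosoi's documented API.

```python
import asyncio


async def wait_for_quiet(
    mutations: asyncio.Queue, quiet_ms: int = 800, timeout_ms: int = 10_000
) -> bool:
    """Return True once no mutation arrives for quiet_ms.

    Returns False if timeout_ms runs out before a full quiet window
    can be observed.
    """
    loop = asyncio.get_running_loop()
    quiet_s = quiet_ms / 1000
    deadline = loop.time() + timeout_ms / 1000
    while True:
        if deadline - loop.time() < quiet_s:
            return False  # not enough time left for a full quiet window
        try:
            await asyncio.wait_for(mutations.get(), timeout=quiet_s)
        except asyncio.TimeoutError:
            return True  # a full quiet window passed with no mutations
        # A mutation arrived: loop again, which restarts the quiet timer.


async def demo() -> bool:
    mutations: asyncio.Queue = asyncio.Queue()

    async def render() -> None:
        # Simulated page: three bursts of DOM changes, 10 ms apart.
        for _ in range(3):
            await asyncio.sleep(0.01)
            mutations.put_nowait("childList")

    task = asyncio.create_task(render())
    stable = await wait_for_quiet(mutations, quiet_ms=50, timeout_ms=2000)
    await task
    return stable


stable = asyncio.run(demo())
print(stable)
```

Each arriving mutation resets the timer, so the wait lasts only as long as the page keeps changing plus one quiet window.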

Using the Browser Fetchers

Browser-backed fetching is available through three fetcher types. Pass --fetcher on the CLI or fetcher_type= in Python:

uv run yosoi --url https://example.com --fetcher waterfall
uv run yosoi --url https://example.com --fetcher headless
uv run yosoi --url https://example.com --fetcher headful

async for item in pipeline.scrape(url, fetcher_type='waterfall'):
    ...

| Fetcher | Description |
| --- | --- |
| simple | Plain HTTP: fast, no browser, works for static HTML |
| waterfall | Simple → Headless → Headful (tries each tier in order) |
| headless | Headless Chrome via VoidCrawl |
| headful | Visible Chrome via VoidCrawl (best bot evasion) |

waterfall is the recommended choice for mixed workloads: it uses simple HTTP for static pages and escalates to Chrome only when needed.

The Waterfall Fetcher

JSFetcher (the waterfall) runs three tiers in order and stops at the first that succeeds:

1. SimpleFetcher (plain HTTP)
   ├── success, no JS → return
   └── fail or requires_js → continue
2. HeadlessFetcher (headless Chrome + DOMLoader)
   ├── success → return
   └── fail → continue
3. HeadfulFetcher (visible Chrome + DOMLoader)
   └── return regardless (best-effort)

The winning tier for each domain is cached in .yosoi/fetch/. Subsequent runs skip the waterfall entirely and jump straight to the cached tier.

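# Run inside an async function; ys, Pipeline, and the Product contract must already be imported or defined.
pipeline = Pipeline(ys.auto_config(), contract=Product)
async for item in pipeline.scrape('https://finance.yahoo.com/news', fetcher_type='waterfall'):
    print(item.get('headline'))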
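The tier escalation and per-domain caching described in this section can be sketched with stub fetchers. Everything named here (FetchResult, waterfall_fetch, the stub tier functions) is hypothetical, and an in-memory dictionary stands in for the on-disk cache under .yosoi/fetch/. The source does not say whether a failed best-effort result is cached, so this sketch leaves it uncached.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple
from urllib.parse import urlparse


@dataclass
class FetchResult:
    ok: bool
    html: str = ""
    requires_js: bool = False


Tier = Tuple[str, Callable[[str], FetchResult]]


def waterfall_fetch(
    url: str, tiers: List[Tier], cache: Dict[str, str]
) -> Tuple[str, FetchResult]:
    """Try tiers in order; remember the winning tier per domain."""
    domain = urlparse(url).netloc
    if domain in cache:
        # A previous run already found the winning tier: skip the waterfall.
        name = cache[domain]
        return name, dict(tiers)[name](url)
    for index, (name, fetch) in enumerate(tiers):
        result = fetch(url)
        if result.ok and not result.requires_js:
            cache[domain] = name
            return name, result
        if index == len(tiers) - 1:
            return name, result  # last tier: best-effort, not cached here
    raise ValueError("tiers must not be empty")


calls: List[str] = []


def simple(url: str) -> FetchResult:
    calls.append("simple")
    # Stub: the server sent an empty app shell, so JavaScript is needed.
    return FetchResult(ok=True, html="<div id='app'></div>", requires_js=True)


def headless(url: str) -> FetchResult:
    calls.append("headless")
    return FetchResult(ok=True, html="<ul><li>item</li></ul>")


def headful(url: str) -> FetchResult:
    calls.append("headful")
    return FetchResult(ok=True, html="<ul><li>item</li></ul>")


tiers: List[Tier] = [("simple", simple), ("headless", headless), ("headful", headful)]
cache: Dict[str, str] = {}

first_tier, _ = waterfall_fetch("https://example.com/feed", tiers, cache)
second_tier, _ = waterfall_fetch("https://example.com/other", tiers, cache)
print(first_tier, second_tier, calls)
```

The second call goes straight to the headless stub because the first call cached that tier for the domain.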

Tuning DOMLoader

DOMLoader exposes several parameters through HeadlessFetcher and HeadfulFetcher:

| Parameter | Default | Effect |
| --- | --- | --- |
| max_cycles | 20 | Maximum behavior tree restarts before stopping |
| quiet_ms | 800 | Milliseconds of DOM silence that counts as stable |
| max_click_cycles | 50 | Maximum clicks per trigger before exhaustion |
| max_scroll_cycles | 10 | Maximum scroll iterations for infinite scroll |

For most sites the defaults work well. Raise max_scroll_cycles for feeds with many pages; lower quiet_ms for fast-rendering SPAs.
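One way to reason about these knobs is to keep them together in a small settings object. LoaderSettings is a hypothetical container, not a Yosoi class; the defaults come from the table above, and the tuned values (40 scroll cycles, 300 ms) are illustrative only. The estimate assumes one stability wait per tree cycle, which the source does not state explicitly.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LoaderSettings:
    """Hypothetical bundle of the DOMLoader parameters and their defaults."""
    max_cycles: int = 20
    quiet_ms: int = 800
    max_click_cycles: int = 50
    max_scroll_cycles: int = 10


DEFAULTS = LoaderSettings()
LONG_FEED = replace(DEFAULTS, max_scroll_cycles=40)  # illustrative value
FAST_SPA = replace(DEFAULTS, quiet_ms=300)           # illustrative value


def min_quiet_wait_seconds(settings: LoaderSettings) -> float:
    """Lower bound on time spent waiting for stability if every cycle
    is used, assuming one quiet window per cycle."""
    return settings.max_cycles * settings.quiet_ms / 1000


print(min_quiet_wait_seconds(DEFAULTS), min_quiet_wait_seconds(FAST_SPA))
```

Under that assumption, lowering quiet_ms shortens the worst case roughly in proportion, at the cost of a higher chance of capturing a page that is still rendering.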

Requirements

Browser fetchers require VoidCrawl:

uv add voidcrawl

VoidCrawl is a Rust-native Chrome DevTools Protocol client exposed to Python via PyO3. See VoidCrawl for more detail.

FAQs

When should I use the waterfall instead of the simple fetcher?

Use waterfall when you’re scraping a mix of static and dynamic pages and don’t want to think about which is which. Static pages return from the plain HTTP tier with no browser involved. Dynamic pages pay for one plain HTTP attempt before escalating to Chrome, and only on the first run for a domain, because the winning tier is cached. For known dynamic sites, use headless or headful directly.

My page loads content but DOMLoader doesn't find it. What's wrong?

Run with --debug to save the HTML that Yosoi sees after DOMLoader finishes. If the content is present, the issue is with selector discovery — not loading. If the content is absent, the page likely uses a trigger pattern not covered by the current catalogues (catalogues.py). Check which patterns DOMLoader probes for and compare against what the page actually uses.
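The check described above can be automated once the debug HTML is in hand. This sketch takes the saved HTML as a string (where --debug writes the file is not covered here) and looks for a piece of text you expect on the page. The diagnose function and its return labels are hypothetical helpers built only on Python's standard library.

```python
from html.parser import HTMLParser
from typing import List


class _VisibleText(HTMLParser):
    """Collect text nodes, ignoring script and style contents."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def diagnose(saved_html: str, expected_text: str) -> str:
    """Return "discovery" if the expected text is in the rendered HTML
    (so loading worked), otherwise "loading"."""
    parser = _VisibleText()
    parser.feed(saved_html)
    parser.close()
    # Collapse whitespace so line breaks in the markup do not hide a match.
    visible = " ".join(" ".join(parser.chunks).split()).lower()
    needle = " ".join(expected_text.split()).lower()
    return "discovery" if needle in visible else "loading"


present = "<ul><li>Widget   Pro</li></ul>"
absent = "<body><script>var name = 'Widget Pro';</script><h1>Catalog</h1></body>"
print(diagnose(present, "Widget Pro"), diagnose(absent, "Widget Pro"))
```

Skipping script contents avoids a false positive when the text exists only inside embedded JavaScript data rather than in the rendered DOM.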

Can I use DOMLoader without the full waterfall?

Yes — fetcher_type='headless' or fetcher_type='headful' use DOMLoader directly without the Simple HTTP tier.

Does DOMLoader interact with authenticated pages?

Not currently. DOMLoader probes for publicly observable DOM patterns (cookie banners, load-more buttons). It does not handle login forms, session cookies, or authentication flows. If the target requires login, obtain the HTML through your own authenticated session and pass that pre-authenticated HTML to Yosoi.

References

VoidCrawl. Cascading Labs. Rust-native CDP browser automation for Python via PyO3. https://github.com/CascadingLabs/VoidCrawl

MutationObserver. MDN Web Docs. Web API for watching for changes to the DOM tree. https://developer.mozilla.org/en-US/docs/Web/API/MutationObserver

Behavior Trees. Wikipedia. Computational model used in robotics and game AI for task selection. https://en.wikipedia.org/wiki/Behavior_tree_(artificial_intelligence,_robotics_and_control)