DOMLoader
Yosoi’s static HTML fetcher handles most sites — content is in the server response, and CSS selectors work immediately. For pages that require JavaScript to render content (single-page apps, infinite scroll feeds, accordion-gated data), Yosoi ships a browser-backed fetcher tier that drives the page to a fully-loaded state before discovery begins.
The component that manages this is DOMLoader.
What DOMLoader Does
DOMLoader sits inside the browser-based fetcher and runs after the initial page load. It works through a behavior tree — a priority-ordered sequence of probes and actions — that clears obstacles (cookie banners, modals) and exhausts content triggers (load-more buttons, pagination, infinite scroll) before capturing the final HTML.
    navigate(url)
        ↓
    DOMLoader.run(tab)
        ↓  clear obstacles → exhaust content triggers → wait for DOM stability
        ↓
    tab.content() → raw HTML
        ↓
    HTMLCleaner → LLM discovery

The behavior tree restarts after every successful action. It stops only when every node in the tree returns FAILURE, meaning there is nothing left to do.
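The restart rule can be sketched in a few lines of Python. This is an illustrative model rather than Yosoi's actual code: `run_until_idle` and `tick` are hypothetical names, and the `max_cycles` budget is taken from the tuning table later on this page.

```python
from enum import Enum
from typing import Callable


class Status(Enum):
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"


def run_until_idle(tick: Callable[[], Status], max_cycles: int = 20) -> int:
    """Re-tick the tree from the root after every successful action.

    `tick` is a hypothetical callable that runs one pass over the tree.
    Returns the number of successful passes. The loop ends when a pass
    reports FAILURE (nothing left to do) or the budget is exhausted.
    """
    cycles = 0
    while cycles < max_cycles and tick() is Status.SUCCESS:
        cycles += 1
    return cycles


# Toy tree: three actions are available, then every probe fails.
remaining = [3]


def fake_tick() -> Status:
    if remaining[0] > 0:
        remaining[0] -= 1
        return Status.SUCCESS
    return Status.FAILURE


completed = run_until_idle(fake_tick)
```

Restarting from the root after each success means higher-priority probes are re-checked first, so an overlay that reappears after a click is handled before the next content trigger.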
How the Behavior Tree Works
The tree is built from two composite node types. A Selector tries each child in order and returns SUCCESS on the first that works, or FAILURE if none do. A Sequence checks a condition and then runs an action, returning FAILURE if the condition is false. The root is a Selector whose branches are Sequences.
    Selector
    ├── Sequence(HasOverlay, Selector(Sequence(HasCloseButton, ClickClose), Skip))
    ├── Sequence(HasTrigger(LOAD_MORE), ClickTrigger)
    ├── Sequence(HasTrigger(ACCORDION), ClickTrigger)
    ├── Sequence(HasTrigger(TAB), ClickTrigger)
    ├── Sequence(HasTrigger(PAGINATION), ClickTrigger)
    └── Sequence(HasTrigger(INFINITE_SCROLL), Scroll)

Priority order (obstacles fire first, so a cookie banner never blocks a load-more probe):
| Kind | What it finds |
|---|---|
| Cookie | Consent banners with accept buttons |
| Popup | Modal dialogs with close buttons |
| Age Gate | Age verification screens |
| Load More | “Load more”, “Show more” buttons |
| Accordion | Collapsed [aria-expanded="false"] sections |
| Tab | Unselected [role="tab"] panels |
| Pagination | Next-page links and a[rel="next"] |
| Infinite Scroll | Bottom-of-page scroll when content count is divisible by 10 |
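The Selector and Sequence semantics described above can be modelled in plain Python. This is a minimal sketch, not Yosoi's implementation: the `Leaf` wrapper, the dictionary standing in for a browser tab, and the probe and action functions are all assumptions made for illustration.

```python
from enum import Enum


class Status(Enum):
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"


class Leaf:
    """Wraps a function(page) -> bool as a condition or action node."""

    def __init__(self, fn):
        self.fn = fn

    def tick(self, page) -> Status:
        return Status.SUCCESS if self.fn(page) else Status.FAILURE


class Sequence:
    """Runs children in order and fails as soon as one child fails."""

    def __init__(self, *children):
        self.children = children

    def tick(self, page) -> Status:
        for child in self.children:
            if child.tick(page) is Status.FAILURE:
                return Status.FAILURE
        return Status.SUCCESS


class Selector:
    """Tries children in order and succeeds as soon as one child succeeds."""

    def __init__(self, *children):
        self.children = children

    def tick(self, page) -> Status:
        for child in self.children:
            if child.tick(page) is Status.SUCCESS:
                return Status.SUCCESS
        return Status.FAILURE


# Hypothetical probes and actions operating on a toy page state.
def has_overlay(page):
    return page["overlay"]


def close_overlay(page):
    page["overlay"] = False
    page["log"].append("close overlay")
    return True


def has_load_more(page):
    return page["load_more_left"] > 0


def click_load_more(page):
    page["load_more_left"] -= 1
    page["log"].append("click load more")
    return True


tree = Selector(
    Sequence(Leaf(has_overlay), Leaf(close_overlay)),      # obstacle branch first
    Sequence(Leaf(has_load_more), Leaf(click_load_more)),  # then content triggers
)

page = {"overlay": True, "load_more_left": 2, "log": []}
statuses = [tree.tick(page) for _ in range(4)]
```

In this toy run the first tick closes the overlay, the next two click the load-more button, and the fourth returns FAILURE because no probe matches any longer.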
DOM Stability
Each action ends with WaitForDOMStable — a MutationObserver that resolves after quiet_ms milliseconds of DOM silence. This means the next tree tick starts from a fully-settled page, not from a mid-render state.
WaitForDOMStable uses MutationObserver so it responds to actual DOM activity. Unlike a fixed asyncio.sleep, it adapts to fast and slow renders equally.
Using the Browser Fetchers
Browser-backed fetching is available through three fetcher types. Pass --fetcher on the CLI or fetcher_type= in Python:
    uv run yosoi --url https://example.com --fetcher waterfall
    uv run yosoi --url https://example.com --fetcher headless
    uv run yosoi --url https://example.com --fetcher headful

    async for item in pipeline.scrape(url, fetcher_type='waterfall'):
        ...

| Fetcher | Description |
|---|---|
| simple | Plain HTTP: fast, no browser, works for static HTML |
| waterfall | Simple → Headless → Headful (tries each tier in order) |
| headless | Headless Chrome via VoidCrawl |
| headful | Visible Chrome via VoidCrawl (best bot evasion) |
waterfall is the recommended choice for mixed workloads: it uses simple HTTP for static pages and escalates to Chrome only when needed.
The Waterfall Fetcher
JSFetcher (the waterfall) runs three tiers in order and stops at the first that succeeds:
    1. SimpleFetcher (plain HTTP)
       ├── success, no JS → return
       └── fail or requires_js → continue
    2. HeadlessFetcher (headless Chrome + DOMLoader)
       ├── success → return
       └── fail → continue
    3. HeadfulFetcher (visible Chrome + DOMLoader)
       └── return regardless (best-effort)

The winning tier for each domain is cached in .yosoi/fetch/. Subsequent runs skip the waterfall entirely and jump straight to the cached tier.
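The tier escalation and per-domain caching can be modelled as follows. This is a sketch under stated assumptions, not JSFetcher's real code: `FetchResult`, `waterfall`, and the stub tier functions are hypothetical, and an in-memory dict stands in for the on-disk cache in .yosoi/fetch/.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple
from urllib.parse import urlparse


@dataclass
class FetchResult:
    ok: bool
    html: str = ""
    requires_js: bool = False  # assumed to be set only by the plain HTTP tier


Fetch = Callable[[str], FetchResult]


def waterfall(
    url: str, tiers: List[Tuple[str, Fetch]], cache: Dict[str, str]
) -> Tuple[str, FetchResult]:
    """Try tiers in order; remember the winning tier per domain."""
    domain = urlparse(url).netloc
    by_name = dict(tiers)
    cached = cache.get(domain)
    if cached in by_name:
        # Known domain: skip the waterfall and go straight to the cached tier.
        return cached, by_name[cached](url)

    name, result = "", FetchResult(ok=False)
    for name, fetch in tiers:
        result = fetch(url)
        if result.ok and not result.requires_js:
            cache[domain] = name
            break
    # If no tier won, the last tier's result is returned as best effort.
    return name, result


calls: List[str] = []


def simple(url: str) -> FetchResult:
    calls.append("simple")
    return FetchResult(ok=True, html="<div id='app'></div>", requires_js=True)


def headless(url: str) -> FetchResult:
    calls.append("headless")
    return FetchResult(ok=True, html="<ul><li>item</li></ul>")


def headful(url: str) -> FetchResult:
    calls.append("headful")
    return FetchResult(ok=True, html="<ul><li>item</li></ul>")


tiers = [("simple", simple), ("headless", headless), ("headful", headful)]
cache: Dict[str, str] = {}
first_tier, _ = waterfall("https://example.com/feed", tiers, cache)
second_tier, _ = waterfall("https://example.com/other", tiers, cache)
```

In this run the first call pays for one plain HTTP attempt before the headless tier wins. The second call to the same domain goes directly to the headless tier.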
    pipeline = Pipeline(ys.auto_config(), contract=Product)
    async for item in pipeline.scrape('https://finance.yahoo.com/news', fetcher_type='waterfall'):
        print(item.get('headline'))

Tuning DOMLoader
DOMLoader exposes several parameters through HeadlessFetcher and HeadfulFetcher:
| Parameter | Default | Effect |
|---|---|---|
| max_cycles | 20 | Maximum behavior tree restarts before stopping |
| quiet_ms | 800 | Milliseconds of DOM silence that counts as stable |
| max_click_cycles | 50 | Maximum clicks per trigger before exhaustion |
| max_scroll_cycles | 10 | Maximum scroll iterations for infinite scroll |
For most sites the defaults work well. Raise max_scroll_cycles for feeds with many pages; lower quiet_ms for fast-rendering SPAs.
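To get a feel for how these values trade off, the defaults can be put into a small container and used for a rough estimate. `DOMLoaderSettings` below is a hypothetical class written for this illustration and is not part of Yosoi. The estimate assumes that every cycle performs one action and that each action waits for at least one full quiet window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DOMLoaderSettings:
    """Hypothetical holder for the documented defaults (not a Yosoi class)."""

    max_cycles: int = 20
    quiet_ms: int = 800
    max_click_cycles: int = 50
    max_scroll_cycles: int = 10

    def min_quiet_seconds(self) -> float:
        """Lower bound on time spent waiting for DOM silence
        when the full max_cycles budget is used."""
        return self.max_cycles * self.quiet_ms / 1000


defaults = DOMLoaderSettings()
fast_spa = DOMLoaderSettings(quiet_ms=300)

print(defaults.min_quiet_seconds())  # 16.0
print(fast_spa.min_quiet_seconds())  # 6.0
```

Under these assumptions, a page that uses all 20 cycles spends at least 16 seconds waiting for stability with the defaults. This is why lowering quiet_ms pays off on fast-rendering SPAs.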
Requirements
Browser fetchers require VoidCrawl:
    uv add voidcrawl

VoidCrawl is a Rust-native Chrome DevTools Protocol client exposed to Python via PyO3. See VoidCrawl for more detail.
FAQs
When should I use the waterfall instead of the simple fetcher?
Use waterfall when you’re scraping a mix of static and dynamic pages and don’t want to think about which is which. Static pages return from the plain HTTP tier with no browser involved. Dynamic pages pay for one plain HTTP attempt before escalating to Chrome, and only until the winning tier for that domain is cached. For known dynamic sites, use headless or headful directly.
My page loads content but DOMLoader doesn't find it. What's wrong?
Run with --debug to save the HTML that Yosoi sees after DOMLoader finishes. If the content is present, the issue is with selector discovery — not loading. If the content is absent, the page likely uses a trigger pattern not covered by the current catalogues (catalogues.py). Check which patterns DOMLoader probes for and compare against what the page actually uses.
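The triage step above can be scripted once the debug HTML is available as a string. The helper below is a hypothetical sketch that uses only the Python standard library. It does not read Yosoi's debug output itself, and the function name `diagnose` is an assumption.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text nodes, ignoring <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def diagnose(html: str, expected_text: str) -> str:
    """Return 'discovery' if the expected text is in the captured HTML
    (loading worked, selectors are the problem), else 'loading'."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    # Normalize whitespace and case so layout differences do not matter.
    text = " ".join(" ".join(parser.chunks).split()).lower()
    needle = " ".join(expected_text.split()).lower()
    return "discovery" if needle in text else "loading"


loaded = "<div><p>Price:   $10</p></div>"
not_loaded = "<script>var s = 'Price: $10';</script><div id='app'></div>"
```

Here `diagnose(loaded, "Price: $10")` points at selector discovery, while `diagnose(not_loaded, "Price: $10")` points at loading, because the text exists only inside a script and was never rendered into the DOM.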
Can I use DOMLoader without the full waterfall?
Yes — fetcher_type='headless' or fetcher_type='headful' use DOMLoader directly without the Simple HTTP tier.
Does DOMLoader interact with authenticated pages?
Not currently. DOMLoader probes for publicly-observable DOM patterns (cookie banners, load-more buttons). It does not handle login forms, session cookies, or authentication flows. Pass pre-authenticated HTML to Yosoi if the target requires login.
References
△ VoidCrawl. Cascading Labs. Rust-native CDP browser automation for Python via PyO3. https://github.com/CascadingLabs/VoidCrawl
○ MutationObserver. MDN Web Docs. Web API for watching for changes to the DOM tree. https://developer.mozilla.org/en-US/docs/Web/API/MutationObserver
◑ Behavior Trees. Wikipedia. Computational model used in robotics and game AI for task selection. https://en.wikipedia.org/wiki/Behavior_tree_(artificial_intelligence,_robotics_and_control)