Debugging
When something goes wrong, Yosoi gives you a few tools to figure out what happened and where. This guide covers the --debug flag, common failure modes, and how to recover.
The --debug Flag
Pass --debug (or -d) to the CLI, or set OutputPolicy(debug_html=True), to save a snapshot of the HTML that was sent to the LLM:
    uvx yosoi --url https://qscrape.dev --contract Product --debug

    policy = ys.Policy.cascade(
        ys.Policy.from_env(),
        ys.Policy(output=ys.OutputPolicy(debug_html=True)),
    )
    pipeline = ys.Pipeline(policy=policy, contract=Product)

Snapshots are saved to .yosoi/debug_html/ as plain HTML files, named by domain and timestamp. Open them in a browser to see exactly what the LLM was working with.
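When several runs have left snapshots behind, it helps to find the newest one quickly. The following is a minimal sketch using only the standard library. The helper name latest_snapshot is hypothetical and not part of Yosoi, and it sorts by modification time because the exact filename pattern is not assumed here.

```python
from pathlib import Path
from typing import Optional


def latest_snapshot(debug_dir: str = ".yosoi/debug_html") -> Optional[Path]:
    """Return the most recently modified file in debug_dir, or None if there is none."""
    directory = Path(debug_dir)
    if not directory.is_dir():
        return None
    files = [p for p in directory.iterdir() if p.is_file()]
    # Sort by modification time rather than by name, since the exact
    # filename pattern (domain plus timestamp) is not assumed here.
    return max(files, key=lambda p: p.stat().st_mtime, default=None)
```

Calling latest_snapshot() from the project root returns a path you can open in a browser, or None if no snapshot has been written yet.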
Common Failure Modes
1. API Key Invalid or Missing
Symptoms: An authentication error from the LLM provider immediately after discovery starts.
Fix: Check your .env file. The key name must match the provider format: GROQ_KEY, GEMINI_KEY, OPENAI_KEY, etc. Run with a simple test URL to confirm the key works before debugging anything else.
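A quick way to rule out a missing key is to scan the .env file before running anything. This sketch is a hypothetical helper, not part of Yosoi. It assumes plain KEY=VALUE lines and only checks the three key names listed above. It confirms that a value is present, not that the provider accepts it.

```python
from pathlib import Path

# Key names taken from the guide; extend this tuple for other providers.
PROVIDER_KEYS = ("GROQ_KEY", "GEMINI_KEY", "OPENAI_KEY")


def keys_in_env_file(env_path: str = ".env") -> dict[str, bool]:
    """Map each known provider key to True if it has a non-empty value in env_path."""
    values: dict[str, str] = {}
    path = Path(env_path)
    if path.is_file():
        for line in path.read_text().splitlines():
            line = line.strip()
            # Skip blank lines, comments, and anything that is not an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, _, value = line.partition("=")
            name = name.strip().removeprefix("export ").strip()
            values[name] = value.strip().strip("'\"")
    return {key: bool(values.get(key)) for key in PROVIDER_KEYS}
```

If every entry comes back False, the key is missing or misnamed, and the authentication error is expected.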
2. Target URL Is Inaccessible
Symptoms: Empty or near-empty debug HTML. The LLM returns selectors that don’t match anything.
Fix: Open the URL in a browser and compare page source with the inspected DOM. If JavaScript is required, try --fetcher auto, --fetcher headless, or --fetcher headful before reaching for external browser automation. Run with --debug again to confirm the HTML snapshot contains the content you expect.
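To judge whether a snapshot is "near-empty" without opening it, you can measure how much visible text it contains. The sketch below uses the standard library's html.parser. The names visible_text and _TextCollector are hypothetical, and what counts as too little text is left to you.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text that sits outside <script>, <style>, and <noscript> elements."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside the skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the page's visible text, joined with single spaces."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

A snapshot whose visible text is empty or only a few words long usually means the page renders its content with JavaScript, which is the case the --fetcher options address.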
3. Wrong Root Selector
Symptoms: Multi-item extraction yields one giant item (root too broad) or many items with mostly None fields (root too narrow).
Fix: Pin the root on your contract after inspecting the page source:
    class Product(ys.Contract):
        root = ys.css('article.product')  # pin the correct container
        name: str = ys.Title()
        price: float = ys.Price()

See E-Commerce Catalogue: Automatic vs. Pinned Root for a detailed walkthrough.
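Before pinning a root, it is worth counting how many elements the candidate selector matches in a debug snapshot: one match suggests the root is too broad, while far more matches than visible items suggests it is too narrow. The sketch below is a deliberately limited, hypothetical helper that only understands simple tag.class selectors such as article.product. It is not how Yosoi evaluates selectors, and a real CSS engine would be needed for anything more complex.

```python
from html.parser import HTMLParser


class _MatchCounter(HTMLParser):
    """Count start tags with a given tag name and CSS class."""

    def __init__(self, tag: str, css_class: str) -> None:
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # The class attribute may be absent or valueless, hence the "or ''".
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.count += 1


def count_matches(html: str, selector: str) -> int:
    """Count elements matching a simple 'tag.class' selector."""
    tag, sep, css_class = selector.partition(".")
    if not tag or not sep or not css_class or "." in css_class or " " in selector:
        raise ValueError(f"only simple 'tag.class' selectors are supported, got {selector!r}")
    counter = _MatchCounter(tag.lower(), css_class)
    counter.feed(html)
    counter.close()
    return counter.count
```

If the page shows twenty products, a good root candidate should report a count of about twenty.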
4. Stale Selector Cache
Symptoms: Extraction used to work but now returns None for some or all fields. The target site has been redesigned.
Fix: Force re-discovery to clear the cached selectors for that domain:
    uvx yosoi --url https://qscrape.dev --contract Product --force

    policy = ys.Policy.cascade(
        ys.Policy.from_env(),
        ys.Policy(scrape=ys.ScrapePolicy(force=True)),
    )
    pipeline = ys.Pipeline(policy=policy, contract=Product)

You can also manually delete the cache file from .yosoi/selectors/ and run again.
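The manual deletion can be scripted. This sketch assumes the cache file is named after the URL's hostname with a .json suffix, as in the .yosoi/selectors/qscrape.dev.json path used elsewhere in this guide. Whether Yosoi normalizes hostnames (for example by stripping www.) is not known here, so check the directory listing if nothing is removed. The function name clear_selector_cache is hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse


def clear_selector_cache(url: str, cache_dir: str = ".yosoi/selectors") -> bool:
    """Delete the cached selector file for the URL's domain.

    Returns True if a file was removed, False if there was nothing to delete.
    """
    domain = urlparse(url).hostname
    if not domain:
        raise ValueError(f"could not extract a domain from {url!r}")
    # Assumed naming scheme: <cache_dir>/<domain>.json
    cache_file = Path(cache_dir) / f"{domain}.json"
    if cache_file.is_file():
        cache_file.unlink()
        return True
    return False
```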
5. Context Window Overflow
Symptoms: The LLM returns truncated, garbled, or missing selectors. Large pages with deeply nested HTML are most susceptible.
Fix: There is no built-in mitigation yet (see Smart Batching on the roadmap). Current workarounds:
- Use a model with a larger context window (e.g. Gemini◑ models with 1M+ tokens)
- Trim the HTML yourself before passing it in
- Target a more specific page (e.g. a category page instead of the homepage)
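For the second workaround, one rough way to shrink a page is to drop content that rarely matters for selector discovery. The sketch below removes comments, script blocks, and style blocks with regular expressions and collapses whitespace. It is a crude approach with a hypothetical trim_html helper, not an HTML-aware cleaner, and it also flattens whitespace inside elements such as pre. How the trimmed HTML is then handed to Yosoi is not covered here.

```python
import re

# Non-greedy matches so each block is removed on its own.
_COMMENTS = re.compile(r"<!--.*?-->", re.DOTALL)
_DROP_BLOCKS = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
_WHITESPACE = re.compile(r"\s+")


def trim_html(html: str) -> str:
    """Strip comments, <script>, and <style> blocks, then collapse whitespace."""
    html = _COMMENTS.sub("", html)
    html = _DROP_BLOCKS.sub("", html)
    return _WHITESPACE.sub(" ", html).strip()
```

Comparing len(trim_html(html)) with len(html) gives a quick sense of how much of the page was markup the LLM did not need.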
6. Coercion / Validation Errors
Symptoms: Pydantic◇ ValidationError with field-level details. The selector matched, but the extracted text couldn’t be coerced to the field’s type.
Fix: Run with --debug to see the raw extracted values. Common causes:
- A float field receives text like "$12.99" but the coercion type doesn’t strip currency. Use ys.Price() instead of a bare float field.
- A field expecting a single value receives a list (or vice versa). Check whether the field should be list[str] or str.
- A custom type’s coerce function doesn’t handle edge cases. Add a descriptive ValueError to surface the issue.
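As an illustration of the last point, here is what a coerce function with a descriptive ValueError might look like. The function coerce_weight_kg and the weight format it accepts are hypothetical, and how a custom type is registered with Yosoi is not shown. The point is only that the error message includes the raw value and the expected shape.

```python
import re

# Accepts text such as "1.2 kg" or "500 g"; anything else is rejected.
_WEIGHT = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(kg|g)\s*$", re.IGNORECASE)


def coerce_weight_kg(raw: str) -> float:
    """Convert a weight string to kilograms, failing loudly on unexpected input."""
    match = _WEIGHT.match(raw)
    if match is None:
        # Include the offending text so the validation error is actionable.
        raise ValueError(
            f"cannot parse weight from {raw!r}; expected text like '1.2 kg' or '500 g'"
        )
    value, unit = float(match.group(1)), match.group(2).lower()
    return value / 1000 if unit == "g" else value
```

With a message like this, the field-level detail in the validation error shows exactly which extracted text broke the coercion.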
Inspecting the Selector Cache
Cached selectors live in .yosoi/selectors/ as JSON files, one per domain. You can read them directly to see what the LLM discovered:
    cat .yosoi/selectors/qscrape.dev.json | python -m json.tool

Each entry maps a field name to a selector string. If a selector looks wrong, you have two options:
- Edit the cache file directly and re-run extraction (no LLM call needed)
- Pin the selector on the contract with yosoi_selector and force re-discovery
    from pydantic import Field

    class Product(ys.Contract):
        name: str = Field(
            description='Product name',
            json_schema_extra={'yosoi_selector': {'primary': 'h2.product-title'}},
        )

Checklist
When something breaks, work through this in order:
- Run with --debug and inspect .yosoi/debug_html/
- Open the target URL in a browser. Is the content visible without JavaScript?
- Check .yosoi/selectors/. Do the cached selectors look reasonable?
- Try --force to re-discover from scratch
- Pin the root if multi-item extraction is off
- Switch to a larger-context model if the page is very large
References
△ Playwright. Microsoft. Browser automation library for end-to-end testing and web scraping. https://playwright.dev/python/
○ Selenium. OpenQA. Browser automation framework for web testing. https://www.selenium.dev/
◑ Gemini. Google. Large language model family with extended context windows. https://ai.google.dev/
◇ Pydantic. Pydantic Services Inc. Data validation library for Python. https://docs.pydantic.dev/