
Configuration

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| GROQ_KEY | One of these | Groq API key |
| GEMINI_KEY | One of these | Google Gemini API key |
| OPENAI_KEY | One of these | OpenAI API key |
| CEREBRAS_KEY | One of these | Cerebras API key |
| OPENROUTER_KEY | One of these | OpenRouter API key |
| YOSOI_MODEL | Optional | Default model in provider:model format (e.g. groq:llama-3.3-70b-versatile). Read by Policy.from_env(). |
| YOSOI_FORCE | Optional | Truthy value forces rediscovery instead of replaying the cached contract. Read into ScrapePolicy.force. |
| YOSOI_FETCHER_TYPE | Optional | Default fetch tier: auto, simple, headless, headful, waterfall. Read into ScrapePolicy.fetcher_type. |
| YOSOI_SELECTOR_LEVEL | Optional | Default selector ceiling (e.g. all, css, xpath, role). Defaults to all; read into ScrapePolicy.selector_level. |
| YOSOI_DISCOVERY_MODE | Optional | Discovery mode: auto, static, mcp. Read into DiscoveryPolicy.mode. |
| YOSOI_CROSS_ORIGIN_DOM | Optional | Truthy value opts browser fetchers into cross-origin DOM access (see Cross-origin DOM access). Read into ScrapePolicy.cross_origin_dom. Default off. |
| YOSOI_SEARCH_BACKEND | Optional | Default DDGS backend string, such as google,bing,brave. Read into SearchPolicy.backend. |
| YOSOI_SEARCH_REGION | Optional | Default search region, such as us-en. Read into SearchPolicy.region. |
| YOSOI_SEARCH_SAFESEARCH | Optional | Search safesearch setting: on, moderate, or off. Read into SearchPolicy.safesearch. |
| YOSOI_SEARCH_MAX_RESULTS | Optional | Default result limit for ys.search and yosoi search. Read into SearchPolicy.max_results. |
| YOSOI_SEARCH_PAGE | Optional | Default search results page. Read into SearchPolicy.page. |
| YOSOI_SEARCH_TIMELIMIT | Optional | Default DDGS time limit such as d, w, m, or y. Read into SearchPolicy.timelimit. |
| YOSOI_LOG_LEVEL | Optional | Logging level: DEBUG, INFO, WARNING, ERROR, ALL (default: DEBUG). |
| YOSOI_SESSION_ID | Optional | Override the auto-generated Langfuse session id for this process. Equivalent to the --session-id CLI flag. |
| LANGFUSE_PUBLIC_KEY | Optional | Langfuse project public key. Enables observability when set together with the secret key. |
| LANGFUSE_SECRET_KEY | Optional | Langfuse project secret key. |
| LANGFUSE_BASE_URL | Optional | Langfuse host. Defaults to https://cloud.langfuse.com. Set to http://localhost:3000 for the bundled self-hosted stack. Read into TelemetryPolicy. |

These are the most commonly used provider keys. Yosoi supports 25+ providers — each with its own environment variable. You only need one.

Local Storage

Yosoi stores all state in .yosoi/ in your project root (gitignored by default):

.yosoi/
    selectors/     # Cached selector JSON per domain
    logs/          # Run logs (run_YYYYMMDD_HHMMSS.log)
    debug_html/    # Extracted HTML snapshots (--debug only)
    content/       # Extracted output files (JSON, CSV, etc.)
    stats.json     # Cumulative LLM call and usage statistics
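
The log naming scheme in the layout above makes it easy to order runs by time. The following is a minimal sketch, assuming only the run_YYYYMMDD_HHMMSS.log pattern shown in the tree; the helper name list_run_logs is hypothetical and not part of Yosoi.

```python
from datetime import datetime
from pathlib import Path

LOG_PATTERN = "run_*.log"
STAMP_FORMAT = "run_%Y%m%d_%H%M%S"  # mirrors run_YYYYMMDD_HHMMSS


def list_run_logs(root: Path) -> list[tuple[datetime, Path]]:
    """Return (timestamp, path) pairs for run logs under root/logs, oldest first."""
    runs = []
    for path in (root / "logs").glob(LOG_PATTERN):
        try:
            stamp = datetime.strptime(path.stem, STAMP_FORMAT)
        except ValueError:
            continue  # skip files that do not follow the naming scheme
        runs.append((stamp, path))
    return sorted(runs)


if __name__ == "__main__":
    # Hypothetical usage: list the runs recorded in the current project.
    for stamp, path in list_run_logs(Path(".yosoi")):
        print(stamp.isoformat(), path.name)
```

Parsing the timestamp from the file name avoids relying on filesystem modification times, which can change when logs are copied or archived.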

Observability

Yosoi ships first-class Langfuse integration. Set LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY (plus optional LANGFUSE_BASE_URL) to start exporting traces. Without them, observability is a silent no-op and the pipeline runs unchanged.

The mapping is deliberate: one process = one session, one URL = one trace, the (sub)domain = the user_id. So filtering by user in the Langfuse UI gives you “everything we’ve ever scraped on shop.example.com”, and filtering by session narrows it to a single run. Subdomains are intentionally distinct — shop.example.com does not roll up into example.com.
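
To make the user_id rule concrete, here is a small sketch of deriving a per-host identifier from a URL. It only illustrates the "subdomains stay distinct" behaviour described above; the function name trace_user_id is hypothetical, and whether Yosoi normalizes case or drops the port in the same way is an assumption.

```python
from urllib.parse import urlsplit


def trace_user_id(url: str) -> str:
    """Return the full hostname of a URL, keeping any subdomain intact."""
    # urlsplit(...).hostname is already lowercased and excludes the port.
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"URL has no hostname: {url!r}")
    return host


if __name__ == "__main__":
    print(trace_user_id("https://shop.example.com/item/1"))
    print(trace_user_id("https://example.com/"))
```

Because the full hostname is used rather than the registrable domain, the two URLs in the usage example yield different identifiers.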

For the full picture (boot script, Python config, span tree, eval tagging) see the Observability section.

Policy

ys.Policy is the public configuration surface. It is a frozen, serializable tree that can include model selection, scrape behavior, search defaults, discovery settings, telemetry, output formats, downloads, crawl policy, and atom trust settings. Raw secrets are not stored in the policy artifact. Use ys.SecretRef.env(...) to point at a secret and let Yosoi resolve it into a runtime-only ResolvedRunSpec.

import yosoi as ys

policy = ys.Policy.cascade(
    ys.Policy.from_env(),
    ys.Policy(
        model=ys.ModelPolicy.from_string(
            'groq:llama-3.3-70b-versatile',
            credential_ref=ys.SecretRef.env('GROQ_KEY'),
        ),
        scrape=ys.ScrapePolicy(
            force=False,
            fetcher_type='auto',
            selector_level=ys.SelectorLevel.XPATH,
        ),
        discovery=ys.DiscoveryPolicy(max_concurrent=3),
        search=ys.SearchPolicy(backend='google,bing,brave', max_results=10),
        telemetry=ys.TelemetryPolicy(
            langfuse_public_key_ref=ys.SecretRef.env('LANGFUSE_PUBLIC_KEY'),
            langfuse_secret_key_ref=ys.SecretRef.env('LANGFUSE_SECRET_KEY'),
            langfuse_host='http://localhost:3000',
        ),
        output=ys.OutputPolicy(formats=('jsonl',), quiet=False),
    ),
)
rows = await ys.scrape(url, YourContract, policy=policy)

Policy.from_env() reads the environment variables above, including YOSOI_MODEL, YOSOI_FORCE, YOSOI_DISCOVERY_MODE, YOSOI_SEARCH_*, and the Langfuse settings, as well as YOSOI_ATOM_READS and YOSOI_ATOM_TRUST, which the table does not list. Policy.cascade(...) merges layers from lowest to highest precedence: later arguments override earlier ones, so a call-site policy can override env defaults without mutating global state.
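
The layering idea can be illustrated without Yosoi at all. The sketch below is a generic model of a lowest-to-highest precedence merge over frozen objects, not Yosoi's implementation: the Layer class, its fields, and the convention that None means "not set" are all assumptions made for the example.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class Layer:
    """Hypothetical stand-in for a policy layer; None means 'not set'."""
    model: Optional[str] = None
    force: Optional[bool] = None
    max_results: Optional[int] = None


def cascade(*layers: Layer) -> Layer:
    """Merge layers left to right; later layers win wherever they set a value."""
    merged = Layer()
    for layer in layers:
        overrides = {
            f.name: getattr(layer, f.name)
            for f in fields(layer)
            if getattr(layer, f.name) is not None
        }
        # replace() builds a new frozen instance, so no input layer is mutated.
        merged = replace(merged, **overrides)
    return merged


env_defaults = Layer(model="groq:llama-3.3-70b-versatile", max_results=10)
call_site = Layer(force=True, max_results=5)
merged = cascade(env_defaults, call_site)
```

Here merged keeps the model from env_defaults and takes force and max_results from call_site, while env_defaults itself is left unchanged.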

Providing the API key

There are two ways to give a model its key, and neither stores the raw secret in the policy — it never appears in model_dump(), repr(), or policy_hash:

  • Env-resolved (recommended for deployments) — point at an environment variable with ys.SecretRef.env('GROQ_KEY'). The key is read only when you call resolve_run_spec():

    policy = ys.Policy(
        model=ys.ModelPolicy.from_string(
            'groq:llama-3.3-70b-versatile',
            credential_ref=ys.SecretRef.env('GROQ_KEY'),
        ),
    )
    spec = policy.resolve_run_spec()  # reads GROQ_KEY from os.environ
  • Direct (for a key you already hold) — pass api_key= to from_string(...) (or any provider helper such as ys.groq(...)). It is kept runtime-only, so resolve_run_spec() needs no environment mapping:

    policy = ys.Policy(
        model=ys.ModelPolicy.from_string('groq:llama-3.3-70b-versatile', api_key=my_key),
    )
    spec = policy.resolve_run_spec()  # no env dict needed

    Passing a raw {'GROQ_KEY': ...} mapping to resolve_run_spec() is reserved for tests and tooling that need to resolve against a synthetic environment — prefer one of the two forms above in application code.
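
The reference-versus-value distinction can be sketched in plain Python. This is an illustration of the pattern only, assuming nothing about how ys.SecretRef is built: the EnvRef class and its resolve method are hypothetical names. The object stores the variable name, so its repr cannot leak the secret, and the lookup happens only when resolve is called.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EnvRef:
    """Hypothetical reference to a secret; stores only the variable name."""
    name: str

    def resolve(self, env: Optional[Mapping[str, str]] = None) -> str:
        """Look the secret up at call time, defaulting to os.environ."""
        source = os.environ if env is None else env
        try:
            return source[self.name]
        except KeyError:
            raise LookupError(f"environment variable {self.name} is not set") from None


ref = EnvRef("GROQ_KEY")
# A synthetic mapping, as a test might pass instead of the real environment.
synthetic_env = {"GROQ_KEY": "test-key-123"}
resolved = ref.resolve(synthetic_env)
```

Accepting an optional mapping mirrors the testing use case mentioned above: application code calls resolve() with no arguments, while tests supply a synthetic environment.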

Cross-origin DOM access

By default, browser fetchers cannot run JavaScript inside a cross-origin iframe that Chrome isolates out-of-process (e.g. an embedded google.com frame). Set ScrapePolicy.cross_origin_dom=True (or the YOSOI_CROSS_ORIGIN_DOM env var) to launch Chrome with site-isolation field trials disabled so frame-scoped evaluation can reach those origins. Requires VoidCrawl ≥ 0.3.5.

policy = ys.Policy(
    model=ys.ModelPolicy.from_string('groq:llama-3.3-70b-versatile', api_key=my_key),
    scrape=ys.ScrapePolicy(fetcher_type='headless', cross_origin_dom=True),
)
rows = await ys.scrape(url, YourContract, policy=policy)

This is opt-in and off by default because it weakens the browser’s security isolation for the whole session — only enable it when you actually need to read or drive an isolated cross-origin frame. The simple (non-browser) tier ignores it.

Discovery concurrency

Per-field LLM fan-out within one URL is capped by an asyncio.Semaphore. The default cap is 5; tune it with DiscoveryPolicy:

policy = ys.Policy.cascade(
    ys.Policy.from_env(),
    ys.Policy(discovery=ys.DiscoveryPolicy(max_concurrent=3)),
)
pipeline = ys.Pipeline(policy=policy, contract=YourContract)

| Field | Type | Range | Default | Effect |
| --- | --- | --- | --- | --- |
| DiscoveryPolicy.max_concurrent | int | 1-50 | 5 | Caps how many per-field LLM calls fan out concurrently within one URL via asyncio.gather + asyncio.Semaphore. Increase for higher throughput on small contracts; decrease if you’re hitting LLM rate limits or want more deterministic ordering. |
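
The asyncio.gather plus asyncio.Semaphore pattern named above looks roughly like this in isolation. It is a sketch of the general technique, not Yosoi's code: discover_fields is a hypothetical function, and the asyncio.sleep call stands in for a real LLM request.

```python
import asyncio


async def discover_fields(field_names, max_concurrent=5):
    """Run one simulated call per field, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0  # highest number of calls observed running at once

    async def discover(name):
        nonlocal in_flight, peak
        async with semaphore:  # blocks while max_concurrent calls are active
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # placeholder for an LLM call
            in_flight -= 1
            return name, f"selector-for-{name}"

    # gather schedules every field at once; the semaphore throttles them.
    results = await asyncio.gather(*(discover(n) for n in field_names))
    return dict(results), peak


if __name__ == "__main__":
    selectors, peak = asyncio.run(
        discover_fields(["title", "price", "rating", "sku"], max_concurrent=3)
    )
    print(selectors, peak)
```

A lower cap trades throughput for fewer simultaneous requests, which is the lever to reach for when a provider starts returning rate-limit errors.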

For the four-dimension concurrency model (cross-session / inter-URL / intra-URL / per-domain write), see Instrumenting pipelines — Concurrency.

FAQs

What happens if I set multiple provider keys?

Yosoi picks one based on a built-in fallback order (Groq, Gemini, Cerebras, OpenAI, OpenRouter). To control which provider and model are used, set YOSOI_MODEL to a provider:model string (e.g. groq:llama-3.3-70b-versatile).
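
The selection rule described in this answer can be sketched as follows. The order and the variable names come from this page; everything else is an assumption for illustration, including the function name pick_provider and the provider prefixes other than groq. It is not Yosoi's actual resolution logic.

```python
import os
from typing import Mapping, Optional

# Order taken from the answer above; provider prefixes other than "groq" are assumed.
FALLBACK_ORDER = [
    ("groq", "GROQ_KEY"),
    ("gemini", "GEMINI_KEY"),
    ("cerebras", "CEREBRAS_KEY"),
    ("openai", "OPENAI_KEY"),
    ("openrouter", "OPENROUTER_KEY"),
]


def pick_provider(env: Mapping[str, str]) -> Optional[str]:
    """Return the provider named by YOSOI_MODEL, else the first one with a key set."""
    model = env.get("YOSOI_MODEL", "")
    # Split on the first colon only, so model names may contain colons themselves.
    provider, sep, _model_name = model.partition(":")
    if sep and provider:
        return provider
    for name, variable in FALLBACK_ORDER:
        if env.get(variable):
            return name
    return None


if __name__ == "__main__":
    print(pick_provider(os.environ))
```

Setting YOSOI_MODEL explicitly removes any dependence on the fallback order, which is the safer choice when several keys are present in the same environment.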

Can I change the .yosoi/ storage location?

Not currently. The directory is always created in the working directory where Yosoi is run.

Is .yosoi/ safe to commit to version control?

The selector cache is safe to commit if you want to share discovered selectors across a team. The logs/, debug_html/, and content/ subdirectories are noisy and should stay gitignored.

How do I enable debug HTML snapshots?

Pass --debug when running the CLI. Snapshots are saved to .yosoi/debug_html/ and are useful for diagnosing extraction failures.

Can multiple CLI invocations share one Langfuse session?

Yes. Pass --session-id <id> (or set YOSOI_SESSION_ID=<id> in the environment) so every invocation under that orchestrator rolls up into one logical session in the Langfuse UI.

References

Groq API. Groq, Inc. Low-latency LLM inference. https://console.groq.com/docs/

Gemini API. Google. Gemini language model API. https://ai.google.dev/gemini-api/docs

OpenAI API. OpenAI. GPT model API. https://platform.openai.com/docs/

Cerebras API. Cerebras Systems. High-speed LLM inference on wafer-scale hardware. https://inference-docs.cerebras.ai/

OpenRouter. OpenRouter. Unified API for LLM providers. https://openrouter.ai/docs

Langfuse. Langfuse. Open-source LLM observability for production AI. https://langfuse.com/docs