
Classes

Generated from yosoi v0.0.2a19. Only symbols in __all__ are listed.

BrowserProfilePolicy

ClaudeSDKModel

pydantic-ai model backed by the Claude Agent SDK CLI transport.

request

request(messages: list[ModelMessage], _model_settings: ModelSettings | None, model_request_parameters: ModelRequestParameters) -> ModelResponse

Run one Claude Agent SDK request.

Contract

Base class for user-defined scraping contracts.

action_fields

action_fields() -> dict[str, dict[str, Any]]

Return {field_name: action_config} for fields annotated with yosoi_action.

These fields are excluded from CSS selector discovery and verification — their values are captured by running the action during fetch.

coerce_field

coerce_field(name: str, value: object, source_url: str = '') -> object

Coerce + validate a single field’s value the way the full model would.

Runs the same per-field pipeline as _apply_validators_and_coerce for one field: the inner Validators transform, Yosoi semantic-type coercion, then the field’s Pydantic type + Annotated validators (via a TypeAdapter). Raises pydantic.ValidationError / ValueError on a type or validator failure.

This is the single per-field value oracle reused by JS discovery (reject a script whose output the declared type rejects) and scrape-time enforcement, so a ys.js() field is validated by its declared type — not a heuristic.

define

define(name: str) -> ContractBuilder

Start a fluent ContractBuilder for the given contract name.

discovery_field_names

discovery_field_names() -> set[str]

Return the set of flattened field names used for discovery and cache keys.

Non-Contract fields keep their original name; nested Contract fields are expanded to {parent}_{child} keys. This matches the key format used by snapshots, field_descriptions(), and get_selector_overrides(). Action fields (yosoi_action) are excluded — they have no CSS selector.
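The flattening rule above can be sketched with plain dicts. This is an illustration only, not Yosoi's implementation: flatten_field_names is a hypothetical helper, a nested dict stands in for a child Contract, and only one level of nesting is handled because that is all the description above states.

```python
from typing import Any


def flatten_field_names(
    fields: dict[str, Any],
    action_fields: frozenset[str] = frozenset(),
) -> set[str]:
    """Hypothetical stand-in for the {parent}_{child} flattening rule."""
    names: set[str] = set()
    for name, spec in fields.items():
        if name in action_fields:
            # Action fields have no CSS selector, so they are skipped.
            continue
        if isinstance(spec, dict):
            # A nested dict stands in for a Contract-typed field.
            names.update(f"{name}_{child}" for child in spec)
        else:
            # Non-Contract fields keep their original name.
            names.add(name)
    return names


product = {
    "title": "str",
    "price": {"amount": "float", "currency": "str"},
    "screenshot": "bytes",
}
flat = flatten_field_names(product, action_fields=frozenset({"screenshot"}))
```

Here flat holds title, price_amount, and price_currency; the screenshot action field is left out.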

field_default

field_default(name: str) -> object

Return a field’s default value (or None when it has no static default).

field_descriptions

field_descriptions() -> dict[str, str]

Return a mapping of field name to description, excluding selector overrides.

Nested Contract-typed fields are expanded to flat {parent}_{child} keys. When the child contract has a pinned root, the description includes a scoping hint. When the child has root = ys.discover(), a co-location hint is added.

file_fields

file_fields() -> dict[str, dict[str, Any]]

Return {field_name: action_config} for ys.File download fields.

A subset of action_fields filtered to type == 'file'. These run on the live browser tab during fetch and resolve to the view chosen by the field’s declared type (DownloadRecord / path / bytes / text / parsed structure).

from_spec

from_spec(spec: ContractSpec | dict[str, Any]) -> type[Contract]

Rehydrate a Contract class from a ContractSpec or raw dict.

frozen_fields

frozen_fields() -> set[str]

Return the set of field names marked frozen=True (yosoi_frozen).

A frozen field with a cached selector is never re-discovered, even when drift is detected — it replays the cached selector unchanged (CAS-123).
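As a rough sketch of the replay rule above: the helper name needs_rediscovery and its arguments are hypothetical, and the behavior for a frozen field that has no cached selector yet (it is discovered once) is an assumption, since the text only covers the cached case.

```python
def needs_rediscovery(
    field: str,
    cached_selectors: dict[str, str],
    frozen: frozenset[str],
    drift_detected: bool,
) -> bool:
    """Hypothetical decision helper for the frozen-field rule."""
    if field not in cached_selectors:
        # Assumption: with nothing cached there is nothing to replay.
        return True
    if field in frozen:
        # Frozen + cached: replay the cached selector unchanged, even on drift.
        return False
    # Ordinary fields are re-discovered only when drift is detected.
    return drift_detected


cache = {"price": "span.price", "title": "h1"}
frozen = frozenset({"price"})
price_again = needs_rediscovery("price", cache, frozen, drift_detected=True)
title_again = needs_rediscovery("title", cache, frozen, drift_detected=True)
```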

generate_manifest

generate_manifest() -> str

Return a markdown table documenting all contract fields and their config.

get_root

get_root() -> SelectorEntry | None

Return the root selector if explicitly set on the contract class.

Returns: SelectorEntry | None. The SelectorEntry for the repeating container element, or None.

get_selector_overrides

get_selector_overrides() -> dict[str, dict[str, str]]

Return selector overrides defined on fields via yosoi_selector.

Returns: dict[str, dict[str, str]]. Mapping of field name to selector dict (e.g. {"primary": "h1.title"}). Nested contract overrides use flat {parent}_{child} keys.

is_grouped

is_grouped() -> bool

Return True if the contract explicitly configures multi-item mode.

list_fields

list_fields() -> dict[str, type]

Return {field_name: inner_type} for fields annotated as list[T].

nested_contracts

nested_contracts() -> dict[str, type[Contract]]

Return a mapping of field name → child Contract class for Contract-typed fields.

to_model

to_model(base: type[BaseModel] = BaseModel, name: str | None = None, include: set[str] | None = None, exclude: set[str] | None = None, **extra_fields: Any) -> type[BaseModel]

Project this contract’s fields onto an arbitrary pydantic base.

The single blessed contract→model/ODM export path: pass base=beanie.Document for Mongo, a Django-Ninja Schema for an API surface, or the default BaseModel — no hand-written dict -> model adapter, no plugin, and beanie/pymongo are never imported by Yosoi (the base is caller-injected, keeping the lazy import graph clean). The field types, yosoi_type and descriptions ride along automatically via each field’s json_schema_extra, so they survive into model_json_schema.

This does NOT replace a consumer’s semantic layer (status enums, run_id stamping, granularity de-biasing); those are genuine consumer decisions, correctly outside Yosoi’s fail-fast/no-decide charter. It deletes only the boilerplate field-restating half: the consumer base declares the extra envelope fields and inherits the extraction fields.

Args:

  • base type[BaseModel] — Base class for the generated model (BaseModel / beanie.Document / Ninja Schema / …). Caller owns its import.
  • name str | None — Class name for the generated model. Defaults to f'{cls.__name__}Model'.
  • include set[str] | None — If given, only project these contract field names.
  • exclude set[str] | None — Field names to drop (applied after include). Use this to skip names that collide with ODM internals (id, revision_id).
  • **extra_fields Any — Caller’s envelope fields, in pydantic create_model shape (run_id=(str, ...), captured_at=(datetime, None)).

Returns: type[BaseModel] — A new pydantic model subclassing base with the projected fields.

Raises:

  • ValueError — If include names an unknown field, or an extra_fields name collides with a projected contract field (ambiguous — the caller must rename or exclude the contract field first).

to_selector_model

to_selector_model() -> type[BaseModel]

Generate a Pydantic model mapping each contract field to FieldSelectors.

This ensures that the LLM agent knows exactly which fields to find selectors for, preserving any descriptions or hints provided in the contract. Fields with a yosoi_selector override are excluded — their selectors are provided directly and do not require AI discovery. Nested Contract-typed fields are expanded to flat {parent}_{child} entries.

to_spec

to_spec() -> ContractSpec

Reflect this contract into a serializable ContractSpec.

undiscovered_action_fields

undiscovered_action_fields() -> dict[str, str]

Return {field_name: description} for JS action fields with no pre-authored script.

These fields require LLM-driven JS discovery (CAS-92) before they can be evaluated on a live browser tab.

variant

variant(name: str, description: str) -> type[Contract]

Declare a redundant sibling contract differing ONLY by NL intent.

Two near-identical contracts that share a field set but mean different things (for example a sponsored/ad result vs an organic one, both {url, title}) used to collide: the contract signature ignored the docstring, so they shared one per-domain cache slot and the second discovery clobbered the first (the failure nimbal’s serp_contracts.py had to abandon). With the docstring now folded into contract_signature, declaring the variants gives each a DISTINCT signature, hence a distinct cached selector on the SAME domain. The docstring is also threaded into the discovery agent so it can actually pick the ad-rail container vs the organic list.

Example:

OrganicLink = Link.variant('OrganicLink', 'A free/organic search result link.')
AdLink = Link.variant('AdLink', 'A paid/sponsored result link.')

Args:

  • name str — Class name for the new contract. Must be unique — the global _CONTRACT_REGISTRY is __name__-keyed, so a duplicate name would clobber a sibling and make resolve_contract ambiguous.
  • description str — The disambiguating NL intent (becomes the class docstring).

Returns: type[Contract], a new Contract subclass inheriting cls’s fields, with its own docstring.

Raises:

  • ValueError — If name is empty, equals cls.__name__, or is already registered (would clobber the registry).
  • TypeError — If description is empty (the whole point of a variant).
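To see why folding the docstring into the signature separates the cache slots, here is a small sketch. It assumes nothing about how contract_signature actually hashes: signature_sketch, the separator bytes, the truncated SHA-256, the example.com domain, and the selectors are all hypothetical.

```python
import hashlib


def signature_sketch(field_names: list[str], docstring: str = "") -> str:
    """Hypothetical signature: sorted field names plus the NL intent."""
    payload = "\x1f".join(sorted(field_names)) + "\x1e" + docstring.strip()
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


fields = ["url", "title"]
organic = signature_sketch(fields, "A free/organic search result link.")
ad = signature_sketch(fields, "A paid/sponsored result link.")

# Same domain, same field set, but two distinct cache slots.
selector_cache = {
    ("example.com", organic): "ol.results a",
    ("example.com", ad): "aside.ads a",
}
```

If the docstring were left out of the payload, both variants would hash to the same key and the second cache write would overwrite the first.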

CrawlBudget

CrawlPolicy

effective_allowed_hosts

effective_allowed_hosts(seeds: tuple[str, ...] = ...) -> tuple[str, ...]

to_runtime_config

to_runtime_config(seeds: tuple[str, ...] = ...) -> CrawlRuntimeConfig

CrawlRunSummary

content_type_counts

content_type_counts() -> dict[str, int]

path_prefix_counts

path_prefix_counts(depth: int = ...) -> dict[str, int]

representative_urls

representative_urls(limit: int | None = ..., html_only: bool = ...) -> list[str]

scrape_target_urls

scrape_target_urls(limit: int | None = ..., html_only: bool = ...) -> list[str]

CrawlRuntimeConfig

CrawlSafety

CrawlTarget

DiscoveryPolicy

DownloadPolicy

DownloadRecord

Provenance for one downloaded file (and the value of a ys.DownloadRecord field).

Treat path as a quarantined location: the bytes have passed the allowed_types gate but should still be handled as untrusted input.

EscalationPolicy

FingerprintPolicy

JobPosting

Contract for job listing pages.

MapHost

Host inventory derived from discovered map URLs.

MapRequest

Canonical request for ys.map / yosoi map.

MapResult

Machine-readable sitemap inventory.

MapSitemap

One sitemap probe and its outcome.

MapUrl

One URL discovered from a sitemap.

ModelPolicy

from_string

from_string(model: str, api_key: str | None = ..., kwargs: Any = {}) -> ModelPolicy

NewsArticle

Default contract matching the original 5-field behavior.

OpenCodeModel

pydantic-ai model backed by a running OpenCode server.

preflight

preflight() -> None

Fail fast with an actionable error when the OpenCode server is unreachable.

request

request(messages: list[ModelMessage], _model_settings: ModelSettings | None, model_request_parameters: ModelRequestParameters) -> ModelResponse

Run one OpenCode request.

OutputPolicy

Controls human and file output for a run.

Use quiet=False for examples and demos where Yosoi should show progress, selected URLs, tables, and scrape results. Keep the default quiet=True for library use where callers consume returned Python values. The formats option chooses persisted output shapes in SQLite; flat_files additionally mirrors them to .yosoi/content files for workflows that need file artifacts. json_output and plain_output switch the terminal output shape for automation.

PageAcquisition

Fetch, clean, and observe a page without owning crawl or scrape semantics.

acquire

acquire(url: str, fetcher: Any, action_scripts: Mapping[str, str] | None = None, download_specs: Mapping[str, DownloadSpec] | None = None) -> PageSnapshot

Acquire one page through the provided fetcher.

PageFingerprint

A page’s structural identity — compute ONCE from HTML, then compare cheaply.

The clean surface for the whole fingerprint: PageFingerprint.of(html) extracts the layer feature sets once; a.matches(b) / a.similarity(b) compare them. Adding a layer (L3 network) is a new field + one term in similarity, so it is generalizable by construction.

Matching is CONJUNCTIVE and fail-closed: two pages are the same shape only if EVERY layer clears its threshold, so a coarse layer can never force a merge (on real Yahoo, L2 rates a different template ~0.9, but the skeleton ~0.4 vetoes it). A match only PROPOSES a fingerprint-sourced reuse, which the strict trust policy quarantines by default — the fingerprint proposes, the trust policy decides what is served.

Waterfall-aware: a fingerprint carries layers from whatever fetch tier produced it — static HTML gives skeleton/semantic/identity; a browser tier adds the rendered AX spine (L2); a CDP tier will add the network layer (L3). Matching compares only the layers SUBSTANTIVELY PRESENT IN BOTH (a too-thin or absent optional layer abstains — neither vetoes nor vacuously merges).

KNOWN LIMITATIONS (not yet resolved — both safe today because nothing compares cross-tier on the read path yet, and the optional-layer thresholds are PROVISIONAL):

  1. Cross-tier compare (rich seed vs thin replay) silently falls back to the common layers, so the seed’s high-trust layers go unchecked. The intended invariant — “a replay thinner than the seed must ABSTAIN, not match on absence” — needs explicit per-fingerprint carriage tracking and lands with the read-path wiring (see the waterfall plan).
  2. The optional layers can vacuously AGREE on FRAMEWORK-GLOBAL features (e.g. data-mw on every MediaWiki page, or main/navigation roles on every page): such features clear the thinness floor yet carry no template-DISCRIMINATING signal, so identity/ax can score ~1.0 and rubber-stamp a structural near-merge instead of vetoing it. Cardinality is not a trust proxy. The real fix (a framework-global stop-set / IDF-style down-weighting) needs real L2 data to tune; until then these layers can refine but are NOT trusted to authorize a match — which is exactly why a fingerprint-sourced reuse stays strict-quarantined.

matches

matches(other: PageFingerprint, skeleton_threshold: float = SKELETON_SIMILARITY_THRESHOLD, semantic_threshold: float = SEMANTIC_SIMILARITY_THRESHOLD, identity_threshold: float = IDENTITY_SIMILARITY_THRESHOLD, ax_threshold: float = AX_SIMILARITY_THRESHOLD, network_threshold: float = NETWORK_SIMILARITY_THRESHOLD, endpoint_threshold: float = ENDPOINT_SIMILARITY_THRESHOLD) -> bool

Whether two pages are the same shape (conjunctive, fail-closed).

of

of(html: str, ax_snapshot: Any = None, headers: Any = None, endpoints: Any = None) -> PageFingerprint

Compute a page’s fingerprint from its HTML (do this once per page).

Optional richer layers populate only when their fetch-tier signal is supplied, so a static fetch fingerprints on L1 alone (the waterfall principle): pass ax_snapshot (rendered accessibility tree, browser tiers) for the L2 AX-spine layer, headers (the response header map) for the L3-lite network layer, and endpoints (VoidCrawl’s PII-safe PageResponse.endpoints) for the L3 endpoint-path skeleton.

similarity

similarity(other: PageFingerprint, skeleton_threshold: float = SKELETON_SIMILARITY_THRESHOLD, semantic_threshold: float = SEMANTIC_SIMILARITY_THRESHOLD, identity_threshold: float = IDENTITY_SIMILARITY_THRESHOLD, ax_threshold: float = AX_SIMILARITY_THRESHOLD, network_threshold: float = NETWORK_SIMILARITY_THRESHOLD, endpoint_threshold: float = ENDPOINT_SIMILARITY_THRESHOLD) -> PageSimilarity

Per-layer Jaccard plus the conjunctive same-shape verdict against other.

Thresholds default to the tuned operating point but are overridable — bring your own. A degenerate fingerprint on either side forces same_shape=False (fail closed). The optional layers (identity, rendered AX, network) are conjunctive ONLY when both pages carry them substantively (the waterfall “compare on the common layer” rule).
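The conjunctive, fail-closed rule with abstaining optional layers can be sketched as follows. This is a simplified illustration, not PageFingerprint itself: the layer dicts, the threshold values, the MIN_OPTIONAL_FEATURES thinness floor, and the same_shape helper are all assumptions, and the real feature extraction is not shown.

```python
Layers = dict[str, frozenset[str]]

REQUIRED_LAYERS = ("skeleton", "semantic")
OPTIONAL_LAYERS = ("identity", "ax", "network")
MIN_OPTIONAL_FEATURES = 3  # hypothetical thinness floor
THRESHOLDS = {  # hypothetical values, not the library's tuned constants
    "skeleton": 0.8, "semantic": 0.8, "identity": 0.7, "ax": 0.7, "network": 0.7,
}


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Jaccard similarity |A & B| / |A | B|, defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def same_shape(a: Layers, b: Layers, thresholds: dict[str, float] = THRESHOLDS) -> bool:
    for layer in REQUIRED_LAYERS:
        fa, fb = a.get(layer, frozenset()), b.get(layer, frozenset())
        if not fa or not fb:
            return False  # degenerate fingerprint: fail closed
        if jaccard(fa, fb) < thresholds[layer]:
            return False  # any single layer can veto
    for layer in OPTIONAL_LAYERS:
        fa, fb = a.get(layer, frozenset()), b.get(layer, frozenset())
        if min(len(fa), len(fb)) < MIN_OPTIONAL_FEATURES:
            continue  # absent or too thin on either side: abstain
        if jaccard(fa, fb) < thresholds[layer]:
            return False
    return True


semantic = frozenset({"nav", "main", "article", "h1"})
article = {"skeleton": frozenset("abcde"), "semantic": semantic}
listing = {"skeleton": frozenset("abxyz"), "semantic": semantic}
rendered = {**article, "ax": frozenset({"main"})}

vetoed = same_shape(article, listing)       # semantic agrees, skeleton vetoes
common_only = same_shape(article, rendered)  # thin AX layer abstains
```

Note that common_only matching on the shared layers is the cross-tier behavior listed under the known limitations; the sketch reproduces it rather than fixing it.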

PagePolicy

to_runtime_config

to_runtime_config() -> PageRuntimeConfig

PageRuntimeConfig

PageSnapshot

Acquired page data before crawl/scrape-specific interpretation.

Pipeline

Main pipeline for discovering and saving CSS selectors with retry logic.

Fetches HTML, cleans it, runs LLM-based selector discovery, then verifies and stores the selectors. Behavior is split across focused mixin modules; the public API (scrape, process_url, process_urls) lives here.

process_url

process_url(url: str, force: bool | None = None, max_fetch_retries: int = 2, max_discovery_retries: int = 3, skip_verification: bool = False, fetcher_type: str = 'auto', output_format: str | list[str] | None = None, fetcher: Any | None = None) -> None

Process a single URL: discover, verify, and save selectors.

process_urls

process_urls(urls: list[str], force: bool | None = None, skip_verification: bool = False, fetcher_type: str = 'auto', max_fetch_retries: int = 2, max_discovery_retries: int = 3, output_format: str | list[str] | None = None, workers: int = 1, on_complete: Callable[[str, bool, float], Awaitable[None]] | None = None, on_start: Callable[[str], Awaitable[None]] | None = None, origin: Literal['cli', 'script'] = 'script') -> dict[str, list[str]]

Process multiple URLs and collect results.

scrape

scrape(url: str, force: bool | None = None, max_fetch_retries: int = 2, max_discovery_retries: int = 3, skip_verification: bool = False, fetcher_type: str = 'auto', output_format: str | list[str] | None = None, fetcher: Any | None = None) -> AsyncIterator[ContentMap]

Async generator yielding individual content items from a URL.

Policy

allows_source

allows_source(source: str) -> bool

cascade

cascade(layers: _Policy | None = ()) -> Policy

check_crawl

check_crawl(seeds: tuple[str, ...] = ...) -> PolicyCheck

for_crawl

for_crawl(preset: CrawlPresetName | None = ..., overrides: Any = {}) -> Policy

from_env

from_env(env: Mapping[str, str] | None = ...) -> Policy

output_trust

output_trust(source: str) -> _Trust

page_runtime

page_runtime(scrape: _ScrapePolicy | None = ..., crawl: _CrawlPolicy | None = ...) -> PageRuntimeConfig

require_crawl

require_crawl() -> CrawlPolicy

resolve_run_spec

resolve_run_spec(env: Mapping[str, str] | None = ...) -> ResolvedRunSpec

source_trust

source_trust(source: str) -> _Trust

PolicyCheck

ResolvedRunSpec

SchedulerPolicy

ScrapePolicy

SearchHit

Normalized web search hit.

SearchPolicy

SearchRequest

Canonical request for ys.search / yosoi search.

from_policy

from_policy(query: str, policy: Policy | None = None, kind: SearchKind | None = None, provider: SearchProvider | None = None, backend: str | None = None, region: str | None = None, safesearch: SafeSearch | None = None, max_results: int | None = None, page: int | None = None, timelimit: str | None = None) -> SearchRequest

Build a search request from effective policy plus explicit call-site overrides.

SearchResult

Machine-readable search result envelope.

SecretRef

env

env(name: str) -> SecretRef

resolve

resolve(env: Mapping[str, str] | None = ...) -> str | None

SnapshotStatus

Operational health state for a cached field snapshot.

TelemetryPolicy

Video

Contract for video pages (YouTube-style).