Classes
Generated from yosoi v0.0.2a19. Only symbols in __all__ are listed.
BrowserProfilePolicy
ClaudeSDKModel
pydantic-ai model backed by the Claude Agent SDK CLI transport.
request
request(messages: list[ModelMessage], _model_settings: ModelSettings | None, model_request_parameters: ModelRequestParameters) -> ModelResponse
Run one Claude Agent SDK request.
Contract
Base class for user-defined scraping contracts.
action_fields
action_fields() -> dict[str, dict[str, Any]]
Return {field_name: action_config} for fields annotated with yosoi_action.
These fields are excluded from CSS selector discovery and verification — their values are captured by running the action during fetch.
coerce_field
coerce_field(name: str, value: object, source_url: str = '') -> object
Coerce + validate a single field’s value the way the full model would.
Runs the same per-field pipeline as :meth:_apply_validators_and_coerce
for one field: the inner Validators transform, Yosoi semantic-type
coercion, then the field’s Pydantic type + Annotated validators (via a
TypeAdapter). Raises pydantic.ValidationError / ValueError on a
type or validator failure.
This is the single per-field value oracle reused by JS discovery (reject a
script whose output the declared type rejects) and scrape-time enforcement,
so a ys.js() field is validated by its declared type — not a heuristic.
define
define(name: str) -> ContractBuilder
Start a fluent ContractBuilder for the given contract name.
discovery_field_names
discovery_field_names() -> set[str]
Return the set of flattened field names used for discovery and cache keys.
Non-Contract fields keep their original name; nested Contract fields are
expanded to {parent}_{child} keys. This matches the key format used
by snapshots, field_descriptions(), and get_selector_overrides().
Action fields (yosoi_action) are excluded — they have no CSS selector.
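The flattening rule above can be illustrated with a small standalone sketch. It does not use Yosoi itself: a contract is modelled here as a plain dict, `ACTION` is a hypothetical marker for an action field, and `flat_field_names` is an illustrative helper, not part of the library. Recursing into more than one level of nesting is an assumption; the text above only describes the {parent}_{child} form.

```python
from typing import Any

# Hypothetical marker for an action field (stands in for a yosoi_action annotation).
ACTION = object()


def flat_field_names(fields: dict[str, Any], prefix: str = "") -> set[str]:
    """Return flattened names: leaves keep their name, nested dicts expand to parent_child."""
    names: set[str] = set()
    for name, value in fields.items():
        key = f"{prefix}{name}"
        if value is ACTION:
            # Action fields have no CSS selector, so they are skipped.
            continue
        if isinstance(value, dict):
            # A nested "contract": expand its children under the parent's name.
            names |= flat_field_names(value, prefix=f"{key}_")
        else:
            names.add(key)
    return names


# Illustrative contract shape: two leaves, one nested contract, one action field.
product = {
    "title": "str",
    "price": "float",
    "seller": {"name": "str", "rating": "float"},
    "screenshot": ACTION,
}
flat = flat_field_names(product)
```

In this sketch `flat` contains `title`, `price`, `seller_name`, and `seller_rating`; `screenshot` is dropped because it is an action field.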
field_default
field_default(name: str) -> object
Return a field’s default value (or None when it has no static default).
field_descriptions
field_descriptions() -> dict[str, str]
Return a mapping of field name to description, excluding selector overrides.
Nested Contract-typed fields are expanded to flat {parent}_{child} keys.
When the child contract has a pinned root, the description includes a scoping hint.
When the child has root = ys.discover(), a co-location hint is added.
file_fields
file_fields() -> dict[str, dict[str, Any]]
Return {field_name: action_config} for ys.File download fields.
A subset of :meth:action_fields filtered to type == 'file'. These run on
the live browser tab during fetch and resolve to the view chosen by the field’s
declared type (DownloadRecord / path / bytes / text / parsed structure).
from_spec
from_spec(spec: ContractSpec | dict[str, Any]) -> type[Contract]
Rehydrate a Contract class from a :class:ContractSpec or raw dict.
frozen_fields
frozen_fields() -> set[str]
Return the set of field names marked frozen=True (yosoi_frozen).
A frozen field with a cached selector is never re-discovered, even when drift is detected — it replays the cached selector unchanged (CAS-123).
generate_manifest
generate_manifest() -> str
Return a markdown table documenting all contract fields and their config.
get_root
get_root() -> SelectorEntry | None
Return the root selector if explicitly set on the contract class.
Returns: SelectorEntry | None — SelectorEntry for the repeating container element, or None.
get_selector_overrides
get_selector_overrides() -> dict[str, dict[str, str]]
Return selector overrides defined on fields via yosoi_selector.
Returns: dict[str, dict[str, str]] — Mapping of field name to selector dict (e.g. {"primary": "h1.title"}). Nested contract overrides use flat {parent}_{child} keys.
is_grouped
is_grouped() -> bool
Return True if the contract explicitly configures multi-item mode.
list_fields
list_fields() -> dict[str, type]
Return {field_name: inner_type} for fields annotated as list[T].
nested_contracts
nested_contracts() -> dict[str, type[Contract]]
Return a mapping of field name → child Contract class for Contract-typed fields.
to_model
to_model(base: type[BaseModel] = BaseModel, name: str | None = None, include: set[str] | None = None, exclude: set[str] | None = None, **extra_fields: Any) -> type[BaseModel]
Project this contract’s fields onto an arbitrary pydantic base.
The single blessed contract→model/ODM export path: pass
base=beanie.Document for Mongo, a Django-Ninja Schema for an API
surface, or the default BaseModel — no hand-written dict -> model
adapter, no plugin, and beanie/pymongo are never imported by Yosoi (the
base is caller-injected, keeping the lazy import graph clean). The field
types, yosoi_type and descriptions ride along automatically via each
field’s json_schema_extra, so they survive into model_json_schema.
This does NOT replace a consumer’s semantic layer (status enums, run_id stamping, granularity de-biasing) — those are genuine consumer decisions, correctly outside Yosoi’s fail-fast/no-decide charter. It deletes only the boilerplate field-restating half: the consumer base declares the extra envelope fields and inherits the extraction fields.
Args:
base: type[BaseModel] — Base class for the generated model (BaseModel / beanie.Document / Ninja Schema / …). Caller owns its import.
name: str | None — Class name for the generated model. Defaults to f'{cls.__name__}Model'.
include: set[str] | None — If given, only project these contract field names.
exclude: set[str] | None — Field names to drop (applied after include). Use this to skip names that collide with ODM internals (id, revision_id).
**extra_fields: Any — Caller’s envelope fields, in pydantic create_model shape (run_id=(str, ...), captured_at=(datetime, None)).
Returns: type[BaseModel] — A new pydantic model subclassing base with the projected fields.
Raises:
ValueError — If include names an unknown field, or an extra_fields name collides with a projected contract field (ambiguous — the caller must rename or exclude the contract field first).
to_selector_model
to_selector_model() -> type[BaseModel]
Generate a Pydantic model mapping each contract field to FieldSelectors.
This ensures that the LLM agent knows exactly which fields to find selectors for,
preserving any descriptions or hints provided in the contract.
Fields with a yosoi_selector override are excluded — their selectors are
provided directly and do not require AI discovery.
Nested Contract-typed fields are expanded to flat {parent}_{child} entries.
to_spec
to_spec() -> ContractSpec
Reflect this contract into a serializable :class:ContractSpec.
undiscovered_action_fields
undiscovered_action_fields() -> dict[str, str]
Return {field_name: description} for JS action fields with no pre-authored script.
These fields require LLM-driven JS discovery (CAS-92) before they can be evaluated on a live browser tab.
variant
variant(name: str, description: str) -> type[Contract]
Declare a redundant sibling contract differing ONLY by NL intent.
Two near-identical contracts that share a field set but mean different
things — a sponsored/ad result vs an organic one, both {url, title} —
used to collide: the contract signature ignored the docstring, so they
shared one per-domain cache slot and the second discovery clobbered the
first (the failure nimbal’s serp_contracts.py had to abandon). With
the docstring now folded into :func:contract_signature, declaring the
variants gives each a DISTINCT signature, hence a distinct cached selector
on the SAME domain, and the docstring is threaded into the discovery agent
so it can actually pick the ad-rail container vs the organic list.
Example::
    OrganicLink = Link.variant('OrganicLink', 'A free/organic search result link.')
    AdLink = Link.variant('AdLink', 'A paid/sponsored result link.')
Args:
name: str — Class name for the new contract. Must be unique — the global _CONTRACT_REGISTRY is __name__-keyed, so a duplicate name would clobber a sibling and make resolve_contract ambiguous.
description: str — The disambiguating NL intent (becomes the class docstring).
Returns: type[Contract] — A new Contract subclass inheriting cls’s fields, with its own docstring.
Raises:
ValueError — If name is empty, equals cls.__name__, or is already registered (would clobber the registry).
TypeError — If description is empty (the whole point of a variant).
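Why folding the docstring into the signature separates the cache slots can be shown with a standalone sketch. The `signature` function below is a hypothetical stand-in for contract_signature (its real hashing scheme is not documented here), and the cache layout and selectors are purely illustrative.

```python
import hashlib


def signature(fields: dict[str, str], doc: str = "") -> str:
    """Hypothetical signature: hash of the sorted field set plus the docstring."""
    payload = "|".join(f"{name}:{type_}" for name, type_ in sorted(fields.items()))
    return hashlib.sha256(f"{payload}\n{doc}".encode()).hexdigest()[:16]


link_fields = {"url": "str", "title": "str"}

# Old behaviour (docstring ignored): both variants hash to the same value.
old_organic = signature(link_fields)
old_ad = signature(link_fields)

# New behaviour (docstring included): each variant gets its own signature.
sig_organic = signature(link_fields, "A free/organic search result link.")
sig_ad = signature(link_fields, "A paid/sponsored result link.")

# Illustrative per-domain cache keyed by (domain, signature).
cache: dict[tuple[str, str], str] = {}
cache[("example.com", sig_organic)] = "#results a.organic"
cache[("example.com", sig_ad)] = "#ad-rail a.sponsored"
```

With the docstring in the hash, the two entries coexist on the same domain; with the old scheme the second write would have overwritten the first.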
CrawlBudget
CrawlPolicy
effective_allowed_hosts
effective_allowed_hosts(seeds: tuple[str, ...] = ...) -> tuple[str, ...]
to_runtime_config
to_runtime_config(seeds: tuple[str, ...] = ...) -> CrawlRuntimeConfig
CrawlRunSummary
content_type_counts
content_type_counts() -> dict[str, int]
path_prefix_counts
path_prefix_counts(depth: int = ...) -> dict[str, int]
representative_urls
representative_urls(limit: int | None = ..., html_only: bool = ...) -> list[str]
scrape_target_urls
scrape_target_urls(limit: int | None = ..., html_only: bool = ...) -> list[str]
CrawlRuntimeConfig
CrawlSafety
CrawlTarget
DiscoveryPolicy
DownloadPolicy
DownloadRecord
Provenance for one downloaded file (and the value of a ys.DownloadRecord field).
Treat path as a quarantined location: the bytes have passed the
allowed_types gate but should still be handled as untrusted input.
EscalationPolicy
FingerprintPolicy
JobPosting
Contract for job listing pages.
MapHost
Host inventory derived from discovered map URLs.
MapRequest
Canonical request for ys.map / yosoi map.
MapResult
Machine-readable sitemap inventory.
MapSitemap
One sitemap probe and its outcome.
MapUrl
One URL discovered from a sitemap.
ModelPolicy
from_string
from_string(model: str, api_key: str | None = ..., kwargs: Any = {}) -> ModelPolicy
NewsArticle
Default contract matching the original 5-field behavior.
OpenCodeModel
pydantic-ai model backed by a running OpenCode server.
preflight
preflight() -> None
Fail fast with an actionable error when the OpenCode server is unreachable.
request
request(messages: list[ModelMessage], _model_settings: ModelSettings | None, model_request_parameters: ModelRequestParameters) -> ModelResponse
Run one OpenCode request.
OutputPolicy
Controls human and file output for a run.
Use quiet=False for examples and demos where Yosoi should show progress,
selected URLs, tables, and scrape results. Keep the default quiet=True for
library use where callers consume returned Python values. formats chooses
persisted output shapes in SQLite; flat_files additionally mirrors them to
.yosoi/content files for workflows that need file artifacts. json_output/
plain_output switch terminal shape for automation.
PageAcquisition
Fetch, clean, and observe a page without owning crawl or scrape semantics.
acquire
acquire(url: str, fetcher: Any, action_scripts: Mapping[str, str] | None = None, download_specs: Mapping[str, DownloadSpec] | None = None) -> PageSnapshot
Acquire one page through the provided fetcher.
PageFingerprint
A page’s structural identity — compute ONCE from HTML, then compare cheaply.
The clean surface for the whole fingerprint: PageFingerprint.of(html) extracts the
layer feature sets once; a.matches(b) / a.similarity(b) compare them. Adding a
layer (L3 network) is a new field + one term in :meth:similarity — generalizable by
construction.
Matching is CONJUNCTIVE and fail-closed: two pages are the same shape only if EVERY layer
clears its threshold, so a coarse layer can never force a merge (on real Yahoo, L2 rates
a different template ~0.9, but the skeleton ~0.4 vetoes it). A match only PROPOSES a
fingerprint-sourced reuse, which the strict trust policy quarantines by default — the
fingerprint proposes, the trust policy decides what is served.
Waterfall-aware: a fingerprint carries layers from whatever fetch tier produced it — static HTML gives skeleton/semantic/identity; a browser tier adds the rendered AX spine (L2); a CDP tier will add the network layer (L3). Matching compares only the layers SUBSTANTIVELY PRESENT IN BOTH (a too-thin or absent optional layer abstains — neither vetoes nor vacuously merges).
KNOWN LIMITATIONS (not yet resolved — both safe today because nothing compares cross-tier on the read path yet, and the optional-layer thresholds are PROVISIONAL):
- Cross-tier compare (rich seed vs thin replay) silently falls back to the common layers, so the seed’s high-trust layers go unchecked. The intended invariant — “a replay thinner than the seed must ABSTAIN, not match on absence” — needs explicit per-fingerprint carriage tracking and lands with the read-path wiring (see the waterfall plan).
- The optional layers can vacuously AGREE on FRAMEWORK-GLOBAL features (e.g. data-mw on every MediaWiki page, or main/navigation roles on every page): such features clear the thinness floor yet carry no template-DISCRIMINATING signal, so identity/ax can score ~1.0 and rubber-stamp a structural near-merge instead of vetoing it. Cardinality is not a trust proxy. The real fix (a framework-global stop-set / IDF-style down-weighting) needs real L2 data to tune; until then these layers can refine but are NOT trusted to authorize a match — which is exactly why a fingerprint-sourced reuse stays strict-quarantined.
matches
matches(other: PageFingerprint, skeleton_threshold: float = SKELETON_SIMILARITY_THRESHOLD, semantic_threshold: float = SEMANTIC_SIMILARITY_THRESHOLD, identity_threshold: float = IDENTITY_SIMILARITY_THRESHOLD, ax_threshold: float = AX_SIMILARITY_THRESHOLD, network_threshold: float = NETWORK_SIMILARITY_THRESHOLD, endpoint_threshold: float = ENDPOINT_SIMILARITY_THRESHOLD) -> bool
Whether two pages are the same shape (conjunctive, fail-closed).
of
of(html: str, ax_snapshot: Any = None, headers: Any = None, endpoints: Any = None) -> PageFingerprint
Compute a page’s fingerprint from its HTML (do this once per page).
Optional richer layers populate only when their fetch-tier signal is supplied, so a static
fetch fingerprints on L1 alone (the waterfall principle): pass ax_snapshot (rendered
accessibility tree, browser tiers) for the L2 AX-spine layer, headers (the response
header map) for the L3-lite network layer, and endpoints (VoidCrawl’s PII-safe
PageResponse.endpoints) for the L3 endpoint-path skeleton.
similarity
similarity(other: PageFingerprint, skeleton_threshold: float = SKELETON_SIMILARITY_THRESHOLD, semantic_threshold: float = SEMANTIC_SIMILARITY_THRESHOLD, identity_threshold: float = IDENTITY_SIMILARITY_THRESHOLD, ax_threshold: float = AX_SIMILARITY_THRESHOLD, network_threshold: float = NETWORK_SIMILARITY_THRESHOLD, endpoint_threshold: float = ENDPOINT_SIMILARITY_THRESHOLD) -> PageSimilarity
Per-layer Jaccard plus the conjunctive same-shape verdict against other.
Thresholds default to the tuned operating point but are overridable — bring your own.
A degenerate fingerprint on either side forces same_shape=False (fail closed). The
optional layers (identity, rendered AX, network) are conjunctive ONLY when both pages carry
them substantively (the waterfall “compare on the common layer” rule).
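The conjunctive, fail-closed rule with abstaining optional layers can be sketched in plain Python. Everything here is an assumption made for illustration: fingerprints are modelled as dicts of feature sets, the layer names follow the prose above, and the threshold values and `MIN_FEATURES` thinness floor are invented, not Yosoi's tuned constants.

```python
def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Jaccard similarity |a & b| / |a | b| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


REQUIRED = ("skeleton", "semantic")        # assumed always-present layers
OPTIONAL = ("identity", "ax", "network")   # compared only when present in both
MIN_FEATURES = 3                           # hypothetical thinness floor
THRESHOLDS = {layer: 0.7 for layer in REQUIRED + OPTIONAL}  # illustrative values


def same_shape(a: dict[str, frozenset[str]], b: dict[str, frozenset[str]]) -> bool:
    for layer in REQUIRED:
        fa, fb = a.get(layer, frozenset()), b.get(layer, frozenset())
        if not fa or not fb:
            return False  # degenerate fingerprint: fail closed
        if jaccard(fa, fb) < THRESHOLDS[layer]:
            return False  # any required layer can veto
    for layer in OPTIONAL:
        fa, fb = a.get(layer, frozenset()), b.get(layer, frozenset())
        if len(fa) < MIN_FEATURES or len(fb) < MIN_FEATURES:
            continue  # too thin or absent on one side: abstain
        if jaccard(fa, fb) < THRESHOLDS[layer]:
            return False
    return True


# Semantic layer agrees strongly (0.9) but the skeleton does not (0.4).
page_a = {
    "skeleton": frozenset({"s1", "s2", "s3", "s4", "s5", "s6", "s7"}),
    "semantic": frozenset(f"m{i}" for i in range(10)),
}
page_b = {
    "skeleton": frozenset({"s1", "s2", "s3", "s4", "x1", "x2", "x3"}),
    "semantic": frozenset(f"m{i}" for i in range(9)),
}
verdict = same_shape(page_a, page_b)
```

In this sketch `verdict` is `False`: the skeleton layer's 0.4 vetoes the match even though the semantic layer scores 0.9, which is the behaviour the class description attributes to conjunctive matching.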
PagePolicy
to_runtime_config
to_runtime_config() -> PageRuntimeConfig
PageRuntimeConfig
PageSnapshot
Acquired page data before crawl/scrape-specific interpretation.
Pipeline
Main pipeline for discovering and saving CSS selectors with retry logic.
Fetches HTML, cleans it, runs LLM-based selector discovery, then verifies
and stores the selectors. Behavior is split across focused mixin modules; the
public API (scrape, process_url, process_urls) lives here.
process_url
process_url(url: str, force: bool | None = None, max_fetch_retries: int = 2, max_discovery_retries: int = 3, skip_verification: bool = False, fetcher_type: str = 'auto', output_format: str | list[str] | None = None, fetcher: Any | None = None) -> None
Process a single URL: discover, verify, and save selectors.
process_urls
process_urls(urls: list[str], force: bool | None = None, skip_verification: bool = False, fetcher_type: str = 'auto', max_fetch_retries: int = 2, max_discovery_retries: int = 3, output_format: str | list[str] | None = None, workers: int = 1, on_complete: Callable[[str, bool, float], Awaitable[None]] | None = None, on_start: Callable[[str], Awaitable[None]] | None = None, origin: Literal['cli', 'script'] = 'script') -> dict[str, list[str]]
Process multiple URLs and collect results.
scrape
scrape(url: str, force: bool | None = None, max_fetch_retries: int = 2, max_discovery_retries: int = 3, skip_verification: bool = False, fetcher_type: str = 'auto', output_format: str | list[str] | None = None, fetcher: Any | None = None) -> AsyncIterator[ContentMap]
Async generator yielding individual content items from a URL.
Policy
allows_source
allows_source(source: str) -> bool
cascade
cascade(layers: _Policy | None = ()) -> Policy
check_crawl
check_crawl(seeds: tuple[str, ...] = ...) -> PolicyCheck
for_crawl
for_crawl(preset: CrawlPresetName | None = ..., overrides: Any = {}) -> Policy
from_env
from_env(env: Mapping[str, str] | None = ...) -> Policy
output_trust
output_trust(source: str) -> _Trust
page_runtime
page_runtime(scrape: _ScrapePolicy | None = ..., crawl: _CrawlPolicy | None = ...) -> PageRuntimeConfig
require_crawl
require_crawl() -> CrawlPolicy
resolve_run_spec
resolve_run_spec(env: Mapping[str, str] | None = ...) -> ResolvedRunSpec
source_trust
source_trust(source: str) -> _Trust
PolicyCheck
ResolvedRunSpec
SchedulerPolicy
ScrapePolicy
SearchHit
Normalized web search hit.
SearchPolicy
SearchRequest
Canonical request for ys.search / yosoi search.
from_policy
from_policy(query: str, policy: Policy | None = None, kind: SearchKind | None = None, provider: SearchProvider | None = None, backend: str | None = None, region: str | None = None, safesearch: SafeSearch | None = None, max_results: int | None = None, page: int | None = None, timelimit: str | None = None) -> SearchRequest
Build a search request from effective policy plus explicit call-site overrides.
SearchResult
Machine-readable search result envelope.
SecretRef
env
env(name: str) -> SecretRef
resolve
resolve(env: Mapping[str, str] | None = ...) -> str | None
SnapshotStatus
Operational health state for a cached field snapshot.
TelemetryPolicy
Video
Contract for video pages (YouTube-style).