Page Identity and Reuse

Page identity reuse lets Yosoi discover a selector once, save it as a field atom, and reuse it on other pages that share the same page shape. The current implementation is intentionally conservative: reuse is keyed by an exact page-shape bucket and guarded by a policy object. If the atom index cannot answer every requested field unambiguously, Yosoi falls back to normal discovery.

This is the first integration slice from the old generalization work. The fuzzy fingerprinting scorer exists for measurement and diagnostics, but it is not the read key yet.

The reuse path has four stages:

  • Discover (selectors verified on a page): Normal discovery still owns the first selector facts for a contract and page.
  • Store (field atoms): Each atom is keyed by page_shape, region_role, field_name, and yosoi_type.
  • Gate (policy and trust): Atom reads must be enabled, unambiguous, and allowed by the trust tier.
  • Fallback (discovery remains the escape path): Any ambiguity, missing field, or disallowed source falls back to normal discovery.

Page identity reuse is exact-bucket and policy-gated today. Fuzzy fingerprints are evidence, not the read key.
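
The all-or-nothing read can be sketched in a few lines. This is a minimal illustration, not Yosoi's implementation: the AtomIndex type, the resolve_from_atoms function, and the shape and selector strings are hypothetical, and the key is simplified to page shape plus field name (the documented key also includes region role and Yosoi type).

```python
from typing import Optional

# Hypothetical in-memory atom index: (page_shape, field_name) -> candidate selectors.
AtomIndex = dict[tuple[str, str], list[str]]


def resolve_from_atoms(
    index: AtomIndex, page_shape: str, fields: list[str]
) -> Optional[dict[str, str]]:
    """Return one selector per field, or None to signal fallback to discovery."""
    resolved: dict[str, str] = {}
    for name in fields:
        # Exact-bucket lookup: no fuzzy matching on the page shape.
        candidates = index.get((page_shape, name), [])
        if len(candidates) != 1:
            # Missing (0) or ambiguous (>1): refuse to serve a partial answer.
            return None
        resolved[name] = candidates[0]
    return resolved


index: AtomIndex = {
    ("shape-a", "title"): ["h1.title"],
    ("shape-a", "price"): ["span.price", "div.cost"],  # two candidates: ambiguous
}

print(resolve_from_atoms(index, "shape-a", ["title"]))           # {'title': 'h1.title'}
print(resolve_from_atoms(index, "shape-a", ["title", "price"]))  # None
print(resolve_from_atoms(index, "shape-b", ["title"]))           # None
```

Returning None for the whole request, rather than a partial mapping, mirrors the rule that atoms must answer every requested field before they can serve a scrape.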

What Gets Reused

Yosoi stores per-field atoms instead of whole selector-cache files:

Unit             Purpose
ContractSpec     A serializable contract shape that can be fingerprinted, stored, or passed inline
PageFingerprint  A content-free page-shape signature for comparing HTML structure
FieldAtom        A reusable field selector keyed by page shape, region role, field name, and Yosoi type
Policy           The trust boundary for atom reads and source tiers

An atom is domain-aware but not domain-owned. domains_seen is provenance; the read key is the page shape plus field identity. For rootless fields, the region fallback includes the contract name, so those atoms are intentionally less general.
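
To make the distinction between read key and provenance concrete, here is a small sketch. The class names AtomKey and AtomRecord, the region_role_for helper, the "rootless:<contract>" format, and all example values are assumptions made for illustration; only the four key components and domains_seen come from the description above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class AtomKey:
    """The read key: page shape plus field identity. No domain appears here."""
    page_shape: str
    region_role: str
    field_name: str
    yosoi_type: str


@dataclass
class AtomRecord:
    key: AtomKey
    selector: str
    # Provenance only: records where the atom was observed, never used for lookup.
    domains_seen: set[str] = field(default_factory=set)


def region_role_for(root_role: Optional[str], contract_name: str) -> str:
    """Rootless fields fall back to a region that embeds the contract name.

    The "rootless:<contract>" format is an assumption made for this sketch.
    """
    return root_role if root_role is not None else f"rootless:{contract_name}"


# Rooted field: the same key serves any domain with this page shape.
rooted = AtomKey("shape-a", region_role_for("product-card", "Quote"), "price", "price")
record = AtomRecord(rooted, "span.price", {"example.com"})
record.domains_seen.add("example.org")  # provenance grows; the key is unchanged

# Rootless field: keys differ per contract, so these atoms are less general.
key_quote = AtomKey("shape-a", region_role_for(None, "Quote"), "price", "price")
key_listing = AtomKey("shape-a", region_role_for(None, "Listing"), "price", "price")
print(key_quote == key_listing)  # False
```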

Policy Controls

Atom reads are off unless the policy enables them. The default trust tier is strict.

YOSOI_ATOM_READS=1
YOSOI_ATOM_TRUST=strict

Strict trust accepts selectors whose source is verified, manual, or LLM-discovered. The fingerprint source is classified as quarantined evidence: strict serving rejects it, while yellow trust may serve it as quarantined output.
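
The tier rules can be read as a small decision function. The sketch below is illustrative only: the Decision enum, the gate function, and the source labels "verified", "manual", and "llm" are hypothetical names, and it assumes yellow trust also accepts everything strict trust accepts.

```python
from enum import Enum


class Decision(Enum):
    SERVE = "serve"
    SERVE_QUARANTINED = "serve_quarantined"
    REJECT = "reject"


# Assumed string labels for the sources strict trust accepts.
STRICT_SOURCES = frozenset({"verified", "manual", "llm"})


def gate(source: str, trust_tier: str) -> Decision:
    """Decide whether a selector from `source` may be served under `trust_tier`."""
    if trust_tier not in {"strict", "yellow"}:
        raise ValueError(f"unknown trust tier: {trust_tier!r}")
    if source in STRICT_SOURCES:
        # Assumption: yellow trust also accepts everything strict accepts.
        return Decision.SERVE
    if source == "fingerprint" and trust_tier == "yellow":
        # Quarantined evidence is served only under yellow trust, and flagged.
        return Decision.SERVE_QUARANTINED
    return Decision.REJECT


print(gate("verified", "strict"))     # Decision.SERVE
print(gate("fingerprint", "strict"))  # Decision.REJECT
print(gate("fingerprint", "yellow"))  # Decision.SERVE_QUARANTINED
```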

from yosoi.policy import Policy

# Enable atom reads while keeping the default strict trust tier.
policy = Policy(atom_reads=True, trust_tier='strict')

The resolver accepts a policy value directly. Deeper resolver code should not read environment variables itself; environment parsing belongs at the boundary.
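
One way to keep environment parsing at the boundary is sketched below. AtomPolicy is a hypothetical stand-in and not yosoi.policy.Policy; policy_from_env and may_read_atoms are likewise invented for this example, and treating only the literal "1" as enabled is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AtomPolicy:
    """Hypothetical stand-in for a policy value; not yosoi.policy.Policy."""
    atom_reads: bool = False
    trust_tier: str = "strict"


def policy_from_env(env: Mapping[str, str]) -> AtomPolicy:
    """Parse environment settings once, at the program boundary."""
    # Assumption: only the literal "1" enables atom reads.
    atom_reads = env.get("YOSOI_ATOM_READS", "").strip() == "1"
    trust_tier = env.get("YOSOI_ATOM_TRUST", "strict").strip().lower()
    if trust_tier not in {"strict", "yellow"}:
        raise ValueError(f"unsupported YOSOI_ATOM_TRUST: {trust_tier!r}")
    return AtomPolicy(atom_reads=atom_reads, trust_tier=trust_tier)


def may_read_atoms(policy: AtomPolicy) -> bool:
    """Deeper code receives the policy as an argument and never reads os.environ."""
    return policy.atom_reads


if __name__ == "__main__":
    policy = policy_from_env(os.environ)  # the only place the environment is read
    print(may_read_atoms(policy))
```

Because policy_from_env takes any mapping, tests can pass a plain dictionary instead of mutating the process environment.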

Inline Contract Specs

resolve_contract() now accepts a registered contract name, a ContractSpec, or a dictionary that validates as a ContractSpec.

import yosoi as ys
from yosoi.utils.contracts import resolve_contract


class Quote(ys.Contract):
    name: str = ys.Field(description='company name')
    price: str = ys.Field(description='current share price')


# Serialize the contract to a ContractSpec, then resolve it inline.
spec = Quote.to_spec()
resolved = resolve_contract(spec)

If an inline spec matches an existing contract fingerprint, Yosoi returns the registered contract. Otherwise, it builds a runtime contract from the spec.
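
The match-or-build decision can be illustrated with a toy fingerprint. Yosoi's actual fingerprinting method is not described here, so the sketch swaps in a SHA-256 hash over canonical JSON; spec_fingerprint, registry, register, resolve_inline, and the spec dictionary layout are all hypothetical.

```python
import hashlib
import json
from typing import Any


def spec_fingerprint(spec: dict[str, Any]) -> str:
    """Hash a canonical JSON encoding so key order does not change the result."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical registry: fingerprint -> name of a registered contract.
registry: dict[str, str] = {}


def register(name: str, spec: dict[str, Any]) -> None:
    registry[spec_fingerprint(spec)] = name


def resolve_inline(spec: dict[str, Any]) -> tuple[str, str]:
    """Return ("registered", name) on a fingerprint match, else ("runtime", id)."""
    fingerprint = spec_fingerprint(spec)
    if fingerprint in registry:
        return ("registered", registry[fingerprint])
    return ("runtime", f"runtime-{fingerprint[:8]}")


quote_spec = {"fields": {"name": "company name", "price": "current share price"}}
register("Quote", quote_spec)

# Same content, different key order: still matches the registered contract.
reordered = {"fields": {"price": "current share price", "name": "company name"}}
print(resolve_inline(reordered))  # ('registered', 'Quote')

# A different shape builds a runtime contract instead.
print(resolve_inline({"fields": {"title": "headline"}})[0])  # runtime
```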

Expected Behavior

Use this checklist when testing page identity reuse:

  1. Discover on one page with atom reads enabled.
  2. Scrape another page with the same shape.
  3. Confirm the second page can resolve from atoms without rediscovering selectors.
  4. Change the target to a structurally different page and confirm Yosoi falls back to discovery.
  5. Run with strict trust first; use yellow trust only for experiments where fingerprint-tier reuse is acceptable.

Current Limits

  • Fuzzy PageFingerprint.similarity() is not wired into reads.
  • Reuse is exact-bucket only.
  • There is no verify-on-reuse step yet.
  • Drift health and signal-lane queues are tracked follow-up work.
  • Network endpoint signatures need VoidCrawl endpoint capture before they can become part of the page identity model.

FAQs

Is page identity reuse the same thing as fuzzy fingerprint matching?

No. Page identity reuse is the selector-serving path. Today it uses exact page-shape buckets and policy checks. Fuzzy fingerprinting is useful diagnostic evidence, but it is not the read key.

What happens when an atom read is ambiguous?

Yosoi falls back to normal discovery. Atom reads must answer every requested field unambiguously before they can serve a scrape.

Why are atoms field-level instead of contract-level?

Field atoms let compatible contracts share individual selector facts without pretending the whole contract is the same. A contract with title and url can share those facts with a larger contract that also asks for snippet.
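
A rough sketch of that sharing, assuming atoms for one page shape were stored while scraping the larger contract. The atoms dictionary, the coverage helper, the selectors, and the Yosoi type labels ("title", "url", "text") are hypothetical.

```python
# Hypothetical atoms for one page shape, keyed by (field_name, yosoi_type).
# Assume they were stored while discovering a contract with title, url, and snippet.
atoms = {
    ("title", "title"): "h2.result-title",
    ("url", "url"): "a.result-link",
    ("snippet", "text"): "p.result-snippet",
}


def coverage(requested):
    """Split requested field identities into served selectors and missing fields."""
    served = {ident: atoms[ident] for ident in requested if ident in atoms}
    missing = [ident for ident in requested if ident not in atoms]
    return served, missing


# A smaller contract asking only for title and url is fully covered by shared atoms.
served, missing = coverage([("title", "title"), ("url", "url")])
print(len(served), missing)  # 2 []

# A contract asking for an unseen field is not fully covered, so it would fall back.
served, missing = coverage([("title", "title"), ("author", "text")])
print(missing)  # [('author', 'text')]
```

Nothing in the lookup mentions a contract name, which is what lets two different contracts reuse the same title and url facts.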

Should I use yellow trust in production?

Start with strict trust. Yellow trust is for experiments where serving quarantined fingerprint-tier evidence is acceptable and you are watching the output closely.