Fingerprinting

Yosoi fingerprinting helps answer whether selector knowledge from one fetched page is relevant to another fetched page when URLs, text, ads, counts, or site chrome change.

Today, weighted fingerprints are primarily auditable similarity evidence. They do not automatically cause fuzzy selector reuse. The current read path is still conservative: exact page-shape buckets and policy decide whether selector evidence may serve a scrape.

Pipeline overview:

  1. Capture (fetched page): HTML plus optional browser, header, and endpoint evidence.
  2. Reduce (content-free layers): skeleton, semantic, identity, AX, network, endpoint.
  3. Compare (weighted Jaccard): generic chrome has less influence; discriminating template evidence has more.
  4. Decide (same shape?): every available layer with enough signal must pass its threshold.

Fingerprinting compares page shape without storing page content or authorizing selector reuse by itself.

What It Does

A page fingerprint compares page shape without keeping page content.

  • Content-free means values are stripped out. Yosoi keeps signals such as tag paths, landmark names, schema types, data-* keys, accessibility roles, header names, and endpoint skeletons.
  • Layer means one family of signals, such as DOM skeleton, semantic landmarks, identity attributes, accessibility roles, network headers, or endpoints.
  • Carried layer means both pages have enough signal for that layer to vote. Optional layers with too little signal abstain.
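As a rough illustration of the content-free idea, the sketch below collects only data-* attribute names from HTML and discards their values. It uses Python's standard html.parser; DataKeyCollector and identity_features are hypothetical names for this example, not Yosoi APIs, and the real identity layer may apply additional rules.

```python
from html.parser import HTMLParser


class DataKeyCollector(HTMLParser):
    """Collects data-* attribute names and discards their values."""

    def __init__(self):
        super().__init__()
        self.keys = set()

    def handle_starttag(self, tag, attrs):
        for name, _value in attrs:
            if name.startswith("data-"):
                # The "data:" prefix mirrors the feature naming used in this page's examples.
                self.keys.add(f"data:{name}")


def identity_features(html):
    """Hypothetical helper: returns the set of data-* keys found in the HTML."""
    collector = DataKeyCollector()
    collector.feed(html)
    return collector.keys


# Same keys, different values and text: the content-free features are identical.
page_a = '<div data-testid="price" data-state="open">19.99</div>'
page_b = '<div data-testid="title" data-state="closed">Sold out</div>'
print(identity_features(page_a) == identity_features(page_b))  # True
```

Because values and text never enter the feature set, two pages that differ only in content reduce to the same features.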

The shape decision asks whether every available layer with enough signal clears its threshold. This prevents shared navigation, footers, or framework chrome from forcing unrelated pages to look similar.

Page Layers

PageFingerprint.of(...) builds these feature sets:

| Layer | Fingerprint field | Similarity evidence field | Source | Keeps | Ignores |
| --- | --- | --- | --- | --- | --- |
| Template skeleton | skeleton | skeleton | HTML | depth-2 tag paths, identity presence, first two non-hash class tokens per node | text values, repeated row counts, CSS hash noise |
| Static semantics | semantic | semantic | HTML | landmarks, roles, heading count bands, schema.org types | heading text and values |
| Identity attributes | identity | identity | HTML | data-* keys | data-* values, raw id values |
| Rendered AX spine | ax_spine | ax | browser AX snapshot | distinct accessibility roles | accessible names, counts, coordinates |
| Header network signature | network | network | response headers | header names and cookie names | header values and cookie values |
| Endpoint skeleton | endpoints | endpoint | endpoint list | host plus path with numeric, UUID, hex, redacted, or digit-heavy segments collapsed | endpoint values supplied only in query strings, duplicate calls |

Endpoint inputs are expected to arrive from the producer already stripped of query strings, fragments, and secrets. Alphabetic path IDs such as ticker symbols are a known current limitation: the current normalizer does not collapse them.
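The segment collapsing described for the endpoint layer might look roughly like the sketch below. This is not Yosoi's normalizer: collapse_segment, endpoint_skeleton, the {id} placeholder, the 8-character hex minimum, and the "more than half digits" rule are all assumptions made for illustration, and redacted segments are not handled here.

```python
import re

# Assumed patterns for this sketch only.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)
HEX_RE = re.compile(r"^[0-9a-f]{8,}$", re.I)


def collapse_segment(segment):
    """Replace ID-like path segments with a placeholder; keep everything else."""
    if segment.isdigit() or UUID_RE.match(segment) or HEX_RE.match(segment):
        return "{id}"
    digits = sum(ch.isdigit() for ch in segment)
    if segment and digits / len(segment) > 0.5:  # digit-heavy segment
        return "{id}"
    return segment


def endpoint_skeleton(host, path):
    """Host plus path with ID-like segments collapsed."""
    segments = [collapse_segment(s) for s in path.split("/") if s]
    return host + "/" + "/".join(segments)


print(endpoint_skeleton("api.example.com", "/v1/orders/48213/items"))
# api.example.com/v1/orders/{id}/items
print(endpoint_skeleton("api.example.com", "/quotes/AAPL"))
# api.example.com/quotes/AAPL
```

The second call shows the limitation noted above: a purely alphabetic ID such as a ticker symbol passes through unchanged, so /quotes/AAPL and /quotes/MSFT would remain distinct features.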

The public shape check is:

from yosoi import PageFingerprint

# seed_html and candidate_html are HTML strings from two fetched pages.
seed = PageFingerprint.of(seed_html)
candidate = PageFingerprint.of(candidate_html)
similarity = seed.similarity(candidate)
same_shape = similarity.same_shape

Richer fetch tiers can pass optional evidence:

seed = PageFingerprint.of(seed_html, ax_snapshot=seed_ax, headers=seed_headers, endpoints=seed_endpoints)
candidate = PageFingerprint.of(
    candidate_html,
    ax_snapshot=candidate_ax,
    headers=candidate_headers,
    endpoints=candidate_endpoints,
)

ax_snapshot comes from rendered browser accessibility data, headers from response metadata, and endpoints from browser or CDP network capture.

The Score

Raw Jaccard counts overlap:

raw Jaccard = size(A shared with B) / size(A or B)

Weighted Jaccard uses the same overlap idea, but each feature contributes its layer-specific weight:

weighted Jaccard =
sum(weight(feature) for feature shared by A and B)
/
sum(weight(feature) for feature in A or B)

In code, this is weighted_jaccard(a, b, layer=...) in yosoi/generalization/fingerprint.py.

numerator = sum(_feature_weight(feature, layer=layer) for feature in a & b)
denominator = sum(_feature_weight(feature, layer=layer) for feature in a | b)
score = numerator / denominator

Raw Jaccard and containment are still reported for debugging. The carried-layer decision uses the weighted score.

Worked Example: Semantic Chrome

Features prefixed with lm: are HTML landmarks. Features prefixed with schema: are schema.org structured-data types.

An article page and a product page might share the same site header, nav, and footer while advertising different schema types:

shared_chrome = {'lm:header', 'lm:nav', 'lm:footer'}
product = shared_chrome | {'schema:Product'}
article = shared_chrome | {'schema:Article'}

Raw Jaccard says the pages overlap:

intersection = {lm:header, lm:nav, lm:footer} = 3 features
union = {lm:header, lm:nav, lm:footer, schema:Product, schema:Article} = 5 features
raw Jaccard = 3 / 5 = 0.60

Weighted Jaccard says the shared evidence is weak, because header/nav/footer are generic chrome and schema types are discriminating.

Semantic weights in current code:

lm:header = 0.35
lm:nav = 0.35
lm:footer = 0.35
schema:* = 3.00

The weighted score is:

intersection weight = 0.35 + 0.35 + 0.35 = 1.05
union weight = 1.05 + 3.00 + 3.00 = 7.05
weighted Jaccard = 1.05 / 7.05 ≈ 0.149

That is below the default semantic threshold of 0.50, so the semantic layer vetoes a same-shape match even though raw Jaccard looked high.
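The arithmetic above can be reproduced with a small standalone sketch. It does not import Yosoi: semantic_weight and weighted_jaccard_sketch are hypothetical stand-ins for the library's internal weighting, only the weights quoted in this example are used, and the 1.0 fallback weight and the zero-denominator behavior are assumptions.

```python
LANDMARK_WEIGHTS = {"lm:header": 0.35, "lm:nav": 0.35, "lm:footer": 0.35}
SCHEMA_WEIGHT = 3.00
SEMANTIC_THRESHOLD = 0.50


def semantic_weight(feature):
    """Weight lookup limited to the values quoted in this example."""
    if feature.startswith("schema:"):
        return SCHEMA_WEIGHT
    # Assumed fallback for features this example does not cover.
    return LANDMARK_WEIGHTS.get(feature, 1.0)


def weighted_jaccard_sketch(a, b):
    """Shared weight divided by total weight across both sets."""
    denominator = sum(semantic_weight(f) for f in a | b)
    if denominator == 0:
        return 0.0  # assumed behavior for empty inputs
    return sum(semantic_weight(f) for f in a & b) / denominator


shared_chrome = {"lm:header", "lm:nav", "lm:footer"}
product = shared_chrome | {"schema:Product"}
article = shared_chrome | {"schema:Article"}

raw = len(product & article) / len(product | article)
weighted = weighted_jaccard_sketch(product, article)

print(f"raw={raw:.2f} weighted={weighted:.3f}")  # raw=0.60 weighted=0.149
print(weighted >= SEMANTIC_THRESHOLD)  # False: the semantic layer vetoes
```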

Worked Example: Partial Identity

Identity features use data-* keys, not values.

page A identity = {data:data-testid, data:data-component, data:data-state}
page B identity = {data:data-testid, data:data-component, data:data-flag}

Raw Jaccard:

intersection = {data:data-testid, data:data-component} = 2
union = {data:data-testid, data:data-component, data:data-state, data:data-flag} = 4
raw Jaccard = 2 / 4 = 0.50

Weighted Jaccard gives testid and component more influence:

data:data-testid = 2.00
data:data-component = 2.00
data:data-state = 1.00
data:data-flag = 1.00
intersection weight = 2.00 + 2.00 = 4.00
union weight = 4.00 + 1.00 + 1.00 = 6.00
weighted Jaccard = 4.00 / 6.00 = 2 / 3

This explains the audit evidence: the pages share the higher-value identity keys but not the whole identity namespace. Because matching uses >=, an identity threshold of exactly 2 / 3 passes, while any threshold above 2 / 3 vetoes. The default identity threshold is 0.40, so this layer would pass by default.
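The boundary behavior of >= can be checked with exact arithmetic. The sketch below uses Python's fractions module purely to avoid floating-point rounding in the illustration; it is not how Yosoi stores scores, and the passes helper is hypothetical.

```python
from fractions import Fraction

# Identity weights quoted in this example, as exact fractions.
weights = {
    "data:data-testid": Fraction(2),
    "data:data-component": Fraction(2),
    "data:data-state": Fraction(1),
    "data:data-flag": Fraction(1),
}

page_a = {"data:data-testid", "data:data-component", "data:data-state"}
page_b = {"data:data-testid", "data:data-component", "data:data-flag"}

score = sum(weights[f] for f in page_a & page_b) / sum(
    weights[f] for f in page_a | page_b
)


def passes(threshold):
    """A layer passes when its score is greater than or equal to the threshold."""
    return score >= threshold


print(score)                                       # 2/3
print(passes(Fraction(2, 5)))                      # True: default 0.40
print(passes(Fraction(2, 3)))                      # True: exactly at the threshold
print(passes(Fraction(2, 3) + Fraction(1, 1000)))  # False: just above 2/3
```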

Worked Example: Containment

Containment is reported as evidence for subset cases. It does not replace the weighted Jaccard gate.

base = {lm:main, schema:Product}
enriched = {lm:main, schema:Product, lm:aside, h2:mid}

Raw Jaccard:

2 shared / 4 total = 0.50

Semantic weighted Jaccard with current weights:

lm:main = 1.50
schema:Product = 3.00
lm:aside = 1.00
h2:mid = 1.25
intersection weight = 1.50 + 3.00 = 4.50
union weight = 4.50 + 1.00 + 1.25 = 6.75
weighted Jaccard = 4.50 / 6.75 = 2 / 3

Containment asks whether the smaller set is contained in the larger one:

2 shared / 2 smaller-side features = 1.00

This is useful when one page has an extra rail, ad module, or recommendation block. It explains why two pages may still be related, while the weighted layer score remains the decision score.
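A minimal sketch of the containment calculation, following the unweighted arithmetic shown above. The function name containment_sketch and the 0.0 result for an empty set are assumptions for illustration; the library's own reporting may differ in detail.

```python
def containment_sketch(a, b):
    """Shared features divided by the size of the smaller set."""
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0  # assumed behavior when one side is empty
    return len(a & b) / smaller


base = {"lm:main", "schema:Product"}
enriched = {"lm:main", "schema:Product", "lm:aside", "h2:mid"}

raw = len(base & enriched) / len(base | enriched)

print(raw)                                 # 0.5
print(containment_sketch(base, enriched))  # 1.0
```

Containment is symmetric in its arguments here because it always divides by the smaller side, so the order of the two pages does not matter.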

Decision Rule

matches() is conjunctive and fail-closed:

  1. A degenerate page never matches. Thin or blank pages can otherwise produce misleading perfect overlaps.
  2. Skeleton and semantic layers always vote on non-degenerate pages.
  3. Optional layers vote only when both pages carry at least 3 features.
  4. Optional layers with fewer than 3 features abstain. They do not pass and they do not veto.
  5. Every voting layer must clear its threshold using weighted Jaccard.

Current defaults:

| Layer | Threshold |
| --- | --- |
| Skeleton | 0.40 |
| Semantic | 0.50 |
| Identity | 0.40 |
| AX | 0.50 |
| Network | 0.50 |
| Endpoint | 0.50 |

Thresholds can be overridden by callers, but the degenerate-page guard still wins. Treat threshold changes as a way to inspect or experiment with evidence, not as the main way to force reuse.
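The rule can be sketched as a conjunctive, fail-closed check. This is an illustrative reconstruction from the description on this page, not the body of matches(): same_shape_sketch, its argument shapes, and the way degeneracy is passed in as a flag are all assumptions.

```python
DEFAULT_THRESHOLDS = {
    "skeleton": 0.40,
    "semantic": 0.50,
    "identity": 0.40,
    "ax": 0.50,
    "network": 0.50,
    "endpoint": 0.50,
}
ALWAYS_VOTING = ("skeleton", "semantic")
MIN_OPTIONAL_FEATURES = 3


def same_shape_sketch(scores, feature_counts, degenerate=False, thresholds=None):
    """scores: layer -> weighted Jaccard.
    feature_counts: layer -> (count on page A, count on page B).
    """
    if degenerate:
        return False  # the degenerate-page guard wins over any threshold
    thresholds = thresholds or DEFAULT_THRESHOLDS
    for layer, threshold in thresholds.items():
        if layer not in ALWAYS_VOTING:
            count_a, count_b = feature_counts.get(layer, (0, 0))
            if min(count_a, count_b) < MIN_OPTIONAL_FEATURES:
                continue  # abstain: neither pass nor veto
        # A missing score counts as 0.0, so the check fails closed.
        if scores.get(layer, 0.0) < threshold:
            return False
    return True


scores = {"skeleton": 0.80, "semantic": 0.70, "identity": 0.10}
print(same_shape_sketch(scores, {"identity": (2, 9)}))  # True: identity abstains
print(same_shape_sketch(scores, {"identity": (3, 9)}))  # False: identity votes and vetoes
print(same_shape_sketch(scores, {}, degenerate=True))   # False
```

Lowering a threshold in this sketch can change which layers pass, but it cannot turn an abstaining layer into a vote or bypass the degenerate guard.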

Contract And Field Identity

Page fingerprints only decide whether two fetched pages have the same shape. They do not replace contract identity or serving policy.

Yosoi keeps three identities separate:

| Identity | Code | Purpose | Key material |
| --- | --- | --- | --- |
| Contract | ContractSpec.fingerprint | separates different user asks | name, doc, schema version, sorted field names, selected per-field identity fields, root, nested specs, validators |
| Page shape | page_shape_fp() / PageFingerprint | compares fetched page templates | structural HTML, semantics, optional AX/header/endpoint layers |
| Field selector atom | FieldAtom.key | reuses one selector fact | page shape, root/region, field name, Yosoi type |

Contract fingerprints exclude per-field descriptions. The per-field fingerprint subset is yosoi_type, pinned selector, delimiter, frozen flag, python type, and action type.

That separation matters:

NewsArticle(title, url, author, published_at)
!=
NewsArticle(title, url, author, published_at, summary)

Those contracts have different fingerprints. But if both contracts ask for url inside the same page shape and root region, Yosoi can identify compatible field-level evidence for possible reuse, subject to policy.
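To make the contract-level difference concrete, the sketch below hashes a reduced subset of the key material (name, doc, schema version, and sorted field names). It is only an illustration: the real ContractSpec.fingerprint covers more material, and the use of JSON and SHA-256 here, along with the function name contract_fingerprint_sketch, are assumptions.

```python
import hashlib
import json


def contract_fingerprint_sketch(name, fields, doc="", schema_version=1):
    """Hash a reduced, order-independent view of a contract."""
    material = {
        "name": name,
        "doc": doc,
        "schema_version": schema_version,
        "fields": sorted(fields),  # sorted, so declaration order does not matter
    }
    payload = json.dumps(material, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


short = contract_fingerprint_sketch(
    "NewsArticle", ["title", "url", "author", "published_at"]
)
longer = contract_fingerprint_sketch(
    "NewsArticle", ["title", "url", "author", "published_at", "summary"]
)
reordered = contract_fingerprint_sketch(
    "NewsArticle", ["url", "title", "published_at", "author"]
)

print(short != longer)     # True: adding summary changes the fingerprint
print(short == reordered)  # True: field order alone does not
```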

Field Atoms

A field atom is one reusable selector fact:

on page_shape S,
inside region/root R,
field F of type T
is extracted by selector X

Its key is:

(page_shape, region_role, field_name, yosoi_type)

The domain is provenance, not identity. Contract names are provenance for rooted fields, but rootless fields fall back to region_role = name:<contract>, so the contract name can become part of atom identity when there is no root region.

Domain-independent identity does not mean domain-independent serving. Reuse policy still decides whether a selector may be used.

Example:

Contract A: OrganicResult(title, url)
Contract B: SearchResult(title, url, snippet)

If both use the same page shape and root region for url, the compatible url evidence may be represented by the same atom. SearchResult only needs additional evidence for snippet.
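The key and its rootless fallback can be sketched with a named tuple. AtomKey and atom_key_sketch are hypothetical names rather than the FieldAtom API, and the page-shape string, the "listitem" region role, and the "url" type label are placeholder values.

```python
from typing import NamedTuple, Optional


class AtomKey(NamedTuple):
    page_shape: str
    region_role: str
    field_name: str
    yosoi_type: str


def atom_key_sketch(
    page_shape: str,
    field_name: str,
    yosoi_type: str,
    contract_name: str,
    root_role: Optional[str] = None,
) -> AtomKey:
    """Build the atom key; rootless fields fall back to name:<contract>."""
    region_role = root_role if root_role is not None else f"name:{contract_name}"
    return AtomKey(page_shape, region_role, field_name, yosoi_type)


# Rooted: the contract name is provenance only, so both contracts share one key.
a = atom_key_sketch("shape-1", "url", "url", "OrganicResult", root_role="listitem")
b = atom_key_sketch("shape-1", "url", "url", "SearchResult", root_role="listitem")
print(a == b)  # True

# Rootless: the contract name leaks into region_role, so the keys differ.
c = atom_key_sketch("shape-1", "url", "url", "OrganicResult")
d = atom_key_sketch("shape-1", "url", "url", "SearchResult")
print(c == d)  # False
```

A shared key only means the evidence is addressable as the same atom; whether it may be served is still a policy decision.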

Serving Policy Caveat

The fingerprint pieces exist, but selector serving remains conservative:

  • ContractSpec.fingerprint exists.
  • PageFingerprint and page_shape_fp() exist.
  • FieldAtom and AtomStore exist.
  • fingerprint is classified as a quarantined source in the trust lattice.
  • Strict serving rejects fingerprint-sourced atoms by default; yellow trust may serve them as quarantined evidence.
  • The read path does not fully rely on fuzzy page-fingerprint similarity for serving selectors yet.

So fingerprinting already produces weighted, auditable similarity evidence. Reuse policy still decides whether that evidence is allowed to serve selectors.

FAQs

Does a high fingerprint score automatically reuse selectors?

No. Fingerprinting produces auditable shape evidence. Selector serving is still controlled by exact page-shape buckets and policy.

Why not use raw Jaccard for every layer?

Raw Jaccard treats every feature equally. That can overvalue shared site chrome such as headers, nav, and footers. Weighted Jaccard gives more influence to features that better distinguish templates.

Why do optional layers abstain?

Thin optional layers can produce misleading perfect overlap. If either page carries fewer than 3 features for identity, AX, network, or endpoint evidence, that layer abstains instead of passing or vetoing.

Where should I look for selector-serving behavior?

Read Page Identity and Reuse for the current serving path. This page explains the fingerprint evidence that can inform reuse decisions.