Operations API
The current public surface is operation-first. Python helpers and CLI commands compile user input into typed request models, execute the same runtime, and return typed result envelopes that are safe to serialize.
| Operation | Python helper | Request | Result | CLI |
|---|---|---|---|---|
| Scrape URLs with contracts | ys.scrape(...) | ys.ScrapeRequest | ys.ScrapeResult | yosoi scrape |
| Search the web for source URLs | ys.search(...) | ys.SearchRequest | ys.SearchResult | yosoi search |
| Crawl seed URLs | ys.crawl(...) or ys.run_crawl(...) | ys.CrawlRequest | ys.CrawlResult or CrawlRunSummary | yosoi crawl |
| Map sitemap URLs or subdomains | ys.map(...) | ys.MapRequest | ys.MapResult | yosoi map |
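The operation-first shape described above can be pictured with a small, library-independent sketch. Every name below (DemoRequest, DemoResult, compile_request, execute, from_cli) is a hypothetical stand-in and not part of Yosoi. The sketch only illustrates the idea of two front ends compiling input into the same typed request, running one shared runtime, and returning an envelope that serializes cleanly.

```python
import argparse
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class DemoRequest:
    """Hypothetical typed request (not a Yosoi class)."""
    operation: str
    targets: tuple[str, ...]
    limit: int = 10


@dataclass
class DemoResult:
    """Hypothetical result envelope built only from JSON-friendly fields."""
    operation: str
    status: str
    items: list[str] = field(default_factory=list)


def compile_request(operation, targets, limit=10):
    # Accept one target or many and normalize to a tuple.
    if isinstance(targets, str):
        targets = [targets]
    return DemoRequest(operation=operation, targets=tuple(targets), limit=limit)


def execute(request):
    # The single shared runtime: both front ends end up here.
    items = list(request.targets)[: request.limit]
    return DemoResult(operation=request.operation, status="ok", items=items)


def from_cli(argv):
    # CLI front end: parse arguments, then reuse the same compile and execute path.
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("operation")
    parser.add_argument("targets", nargs="+")
    parser.add_argument("--limit", type=int, default=10)
    args = parser.parse_args(argv)
    return execute(compile_request(args.operation, args.targets, args.limit))


python_result = execute(compile_request("scrape", "https://example.com/"))
cli_result = from_cli(["scrape", "https://example.com/"])
print(json.dumps(asdict(python_result)))
print(python_result == cli_result)
```

Because both entry points build the same request type, their results can be compared or serialized without caring which front end produced them.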
Scrape
ys.scrape(...) accepts one URL or many URLs and one contract or many contracts. It returns a ScrapeResult envelope with one ScrapeUnitResult per URL and contract pair.
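To estimate how many units a call will produce, the URL and contract inputs can be treated as a Cartesian product. The helper below is a hypothetical illustration in plain Python, not Yosoi code; the contract names are placeholder strings, and the order in which the library actually emits units is not specified here.

```python
from itertools import product


def expand_pairs(urls, contracts):
    """Normalize 'one or many' inputs and return every (url, contract) pair."""
    url_list = [urls] if isinstance(urls, str) else list(urls)
    if isinstance(contracts, (list, tuple)):
        contract_list = list(contracts)
    else:
        contract_list = [contracts]
    return list(product(url_list, contract_list))


# Two URLs and two placeholder contract names yield four pairs.
pairs = expand_pairs(
    ["https://qscrape.dev/l1/news/articles/", "https://qscrape.dev/"],
    ["NewsArticle", "OtherContract"],
)
for url, contract in pairs:
    print(url, contract)
```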
    import yosoi as ys

    result = await ys.scrape(
        'https://qscrape.dev/l1/news/articles/',
        ys.NewsArticle,
        model='groq:llama-3.3-70b-versatile',
    )

    for unit in result.results:
        print(unit.url, unit.contract, unit.status, unit.record_count)
        print(unit.records)

Stable fields on each unit include selector_source, cache_decision, llm_used, llm_reason, quality_status, quality_issues, expected_record_count, record_count, records, and error.
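One way to use these fields is a small triage pass over the units. The sketch below runs on stand-in objects rather than real ScrapeUnitResult instances, and it assumes semantics the text above does not confirm: that error is empty on success, that quality_issues is a list, and that expected_record_count may be None. The flag_units helper and its reason labels are hypothetical.

```python
from types import SimpleNamespace


def flag_units(units):
    """Return (url, contract, reason) for units that look incomplete."""
    flagged = []
    for unit in units:
        if unit.error:
            flagged.append((unit.url, unit.contract, "error"))
        elif unit.quality_issues:
            flagged.append((unit.url, unit.contract, "quality"))
        elif (
            unit.expected_record_count is not None
            and unit.record_count < unit.expected_record_count
        ):
            flagged.append((unit.url, unit.contract, "short"))
    return flagged


# Stand-in units with made-up values, shaped like the documented field names.
demo_units = [
    SimpleNamespace(url="https://a.example/", contract="NewsArticle", error=None,
                    quality_issues=[], expected_record_count=10, record_count=10),
    SimpleNamespace(url="https://b.example/", contract="NewsArticle", error="timeout",
                    quality_issues=[], expected_record_count=None, record_count=0),
    SimpleNamespace(url="https://c.example/", contract="NewsArticle", error=None,
                    quality_issues=[], expected_record_count=10, record_count=4),
]

for row in flag_units(demo_units):
    print(row)
```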
CLI equivalent:
    uvx yosoi scrape https://qscrape.dev/l1/news/articles/ \
      --contract @NewsArticle \
      --model groq:llama-3.3-70b-versatile \
      --json

Use --request request.json when an agent or another service already produced a ScrapeRequest JSON document. Use --dump-request to inspect what the CLI will execute.
Search
ys.search(...) wraps DDGS search and normalizes provider rows into ranked hits plus a URL list. Policy can provide durable defaults for backend, region, safesearch, page, time limit, and result count.
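The normalization step can be pictured as follows. This is a minimal sketch, not Yosoi's implementation: it assumes provider rows are dictionaries with title, href, and body keys, and the Hit dataclass and normalize_rows function are hypothetical names. Dropping rows without a URL and de-duplicating by URL are also assumptions about what "normalizes" might involve.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical ranked hit (not the library's own type)."""
    rank: int
    title: str
    url: str
    snippet: str


def normalize_rows(rows, limit=None):
    """Turn raw provider rows into ranked hits plus a flat URL list."""
    hits = []
    seen = set()
    for row in rows:
        url = (row.get("href") or row.get("url") or "").strip()
        if not url or url in seen:
            continue  # skip rows with no URL and repeated URLs
        seen.add(url)
        hits.append(
            Hit(
                rank=len(hits) + 1,  # ranks start at 1 and follow input order
                title=(row.get("title") or "").strip(),
                url=url,
                snippet=(row.get("body") or "").strip(),
            )
        )
        if limit is not None and len(hits) >= limit:
            break
    return hits, [hit.url for hit in hits]


demo_rows = [
    {"title": " First ", "href": "https://a.example/", "body": "alpha"},
    {"title": "Duplicate", "href": "https://a.example/", "body": "again"},
    {"title": "No URL"},
    {"title": "Second", "href": "https://b.example/", "body": "beta"},
]
hits, urls = normalize_rows(demo_rows, limit=5)
for hit in hits:
    print(hit.rank, hit.title, hit.url)
```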
    import yosoi as ys

    result = await ys.search(
        'Cascading Labs Yosoi selector discovery',
        limit=5,
        backend='google,bing,brave',
    )

    for hit in result.hits:
        print(hit.rank, hit.title, hit.url)

CLI equivalent:

    uvx yosoi search "Cascading Labs Yosoi selector discovery" \
      --limit 5 \
      --backend google,bing,brave \
      --json

Crawl
ys.crawl(...) runs the crawl coordinator over one or more seeds. For machine-safe output, build a CrawlRequest and call ys.run_crawl(...).
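To make the idea of a page budget concrete, here is a toy traversal over an in-memory link graph. It is only a sketch under stated assumptions: the real coordinator's traversal order and budget semantics are not described in this section, breadth-first order is a choice made for the example, and crawl_with_budget is a hypothetical name.

```python
from collections import deque


def crawl_with_budget(seeds, links, max_pages):
    """Visit pages breadth-first from the seeds, stopping at max_pages."""
    unique_seeds = list(dict.fromkeys(seeds))  # drop repeated seeds, keep order
    seen = set(unique_seeds)
    queue = deque(unique_seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# A tiny made-up site: each key lists the pages it links to.
demo_links = {
    "/": ["/news", "/about"],
    "/news": ["/news/1", "/news/2"],
    "/about": ["/news"],
}
print(crawl_with_budget(["/"], demo_links, max_pages=3))
```

With a budget of 3, the traversal stops after the seed and its two direct links, even though more pages are reachable.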
    import yosoi as ys

    request = ys.CrawlRequest.from_axes(
        ['https://qscrape.dev/'],
        limit=25,
        compact=True,
        policy=ys.Policy(crawl=ys.CrawlPolicy(budget=ys.CrawlBudget(max_pages=25))),
    )
    result = await ys.run_crawl(request)
    print(result.status)
    print(result.summary)

CLI equivalent:

    uvx yosoi crawl https://qscrape.dev/ \
      --limit 25 \
      --compact \
      --json

Use --stress --run-id ID to store crawl-run metrics and emit compact, stress-friendly output.
Map
ys.map(...) discovers sitemap URLs from robots.txt, default sitemap locations, and nested sitemap indexes. With subdomain discovery enabled, it shells out to subfinder and returns a normalized host inventory.
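The sitemap side of this discovery can be sketched with the standard library alone. The code below is an illustration, not Yosoi's implementation: it reads Sitemap lines from a robots.txt body, tells a sitemap index from a URL set, and walks nested indexes up to a URL cap. The function names and the fetch callable (a stand-in that maps a sitemap URL to its XML text) are hypothetical, and no network access is performed.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt):
    """Collect the values of 'Sitemap:' lines from a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def parse_sitemap(xml_text):
    """Return ('index' or 'urlset', list of <loc> values)."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == f"{SITEMAP_NS}sitemapindex" else "urlset"
    locs = [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]
    return kind, locs


def collect_urls(start_sitemaps, fetch, max_urls=200):
    """Walk sitemaps and nested indexes, returning at most max_urls page URLs."""
    urls, seen = [], set()
    pending = list(start_sitemaps)
    while pending and len(urls) < max_urls:
        sitemap = pending.pop(0)
        if sitemap in seen:
            continue
        seen.add(sitemap)
        kind, locs = parse_sitemap(fetch(sitemap))
        if kind == "index":
            pending.extend(locs)  # nested sitemaps are queued for later
        else:
            urls.extend(locs[: max_urls - len(urls)])
    return urls


NS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
DEMO_DOCS = {  # made-up documents standing in for HTTP responses
    "https://example.com/sitemap_index.xml": (
        f"<sitemapindex {NS}><sitemap><loc>https://example.com/pages.xml</loc>"
        "</sitemap></sitemapindex>"
    ),
    "https://example.com/pages.xml": (
        f"<urlset {NS}><url><loc>https://example.com/a</loc></url>"
        "<url><loc>https://example.com/b</loc></url></urlset>"
    ),
}
DEMO_ROBOTS = "User-agent: *\nDisallow: /private\nSitemap: https://example.com/sitemap_index.xml\n"

starts = sitemaps_from_robots(DEMO_ROBOTS)
print(collect_urls(starts, DEMO_DOCS.__getitem__, max_urls=200))
```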
    import yosoi as ys

    site = await ys.map('qscrape.dev', max_urls=200)
    print(site.status, site.root_host)
    for row in site.urls[:10]:
        print(row.url)

CLI equivalent:

    uvx yosoi map qscrape.dev --max-urls 200 --json
    uvx yosoi map qscrape.dev --subdomains --json

ys.MapResult includes sitemaps, urls, hosts, subdomains, and errors, so it is a good source-discovery step before crawl or scrape planning.