Scaling

Yosoi scales today by combining local selector caches, concurrent workers, browser fetchers, and observability. Larger distributed caches and external queue integrations are planned, but they are not required for ordinary batch scraping.

Current Support

  • Use Pipeline.process_urls(..., workers=N) or the CLI --workers flag for concurrent URL processing.
  • Use .yosoi/selectors/ as the local selector cache and .yosoi/fetch/ as the learned fetcher strategy cache.
  • Use Observability when you need traces for discovery, extraction, and model behavior.
  • Keep one shared fetcher instance per batch when you are writing Python orchestration code.
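The shared-fetcher advice in the last bullet can be sketched in plain Python. This is only an illustration of the pattern and does not use Yosoi's API: StubFetcher and run_batch are hypothetical names, and the thread pool is an assumed stand-in for whatever workers=N does internally.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


class StubFetcher:
    """Hypothetical stand-in for a fetcher; not a Yosoi class."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.calls = 0

    def fetch(self, url: str) -> str:
        # A real fetcher would reuse connections or a browser session here.
        with self._lock:
            self.calls += 1
        return f"<html>{url}</html>"


def run_batch(urls: list[str], workers: int) -> list[str]:
    """Process urls concurrently with one fetcher shared by all workers."""
    fetcher = StubFetcher()  # created once per batch, not once per URL
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input URLs.
        return list(pool.map(fetcher.fetch, urls))
```

Creating the fetcher once per batch means any expensive state it holds is set up a single time and reused by every worker.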

Planned Integrations

The integrations below are planned or in progress:

Integration   Role
Redis         Distributed selector cache and job state
RabbitMQ      URL queue and worker coordination
Prefect       Workflow orchestration and scheduling
Langfuse      LLM observability and prompt tracing
Persistence   Durable result storage across runs
Turso         Embedded distributed SQLite for selector snapshots

Treat these as roadmap items until a guide documents a concrete configuration.

FAQs

What is the first scaling knob I should use?

Start with --workers or Pipeline.process_urls(..., workers=N). Increase slowly while watching target-site rate limits and LLM-provider limits.
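One way to make "increase slowly" concrete is to pick the next batch's worker count from how often the previous batch was throttled. The helper below is a hypothetical sketch, not part of Yosoi: the name next_worker_count, the 5% threshold, and the cap of 32 are all assumptions, and the caller is expected to count throttled responses (for example HTTP 429) itself.

```python
def next_worker_count(
    current: int,
    throttled: int,
    total: int,
    max_workers: int = 32,      # assumed ceiling, tune per deployment
    threshold: float = 0.05,    # assumed tolerated share of throttled requests
) -> int:
    """Suggest a worker count for the next batch.

    throttled: requests in the last batch that signalled rate limiting.
    total:     all requests in the last batch.
    """
    if total <= 0:
        # Nothing was measured, so keep the current setting.
        return current
    if throttled / total > threshold:
        # Back off sharply when the target site or LLM provider pushes back.
        return max(1, current // 2)
    # Otherwise grow by one worker at a time, up to the ceiling.
    return min(max_workers, current + 1)
```

The result would then be passed as the workers value for the next run. Growing by one and halving on pressure keeps the ramp conservative.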

Do I need Redis or RabbitMQ to run batches?

No. Current concurrent processing runs inside one Python process. External queues are planned for multi-machine orchestration.

How do I share selector discoveries across machines today?

Share the .yosoi/selectors/ directory through your deployment artifact or storage layer. Native distributed selector storage is still roadmap work.
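As a rough sketch of the artifact approach, the standard library can pack the selector directory on one machine and restore it on another. The function names pack_selectors and unpack_selectors are hypothetical, and the code makes no assumption about the files Yosoi stores inside .yosoi/selectors/; it copies whatever is there.

```python
import shutil
from pathlib import Path


def pack_selectors(project_dir: Path, out_base: Path) -> Path:
    """Archive <project_dir>/.yosoi/selectors/ into <out_base>.tar.gz."""
    selectors = project_dir / ".yosoi" / "selectors"
    # make_archive appends the extension and returns the archive path.
    archive = shutil.make_archive(str(out_base), "gztar", root_dir=selectors)
    return Path(archive)


def unpack_selectors(archive: Path, project_dir: Path) -> Path:
    """Restore an archive into <project_dir>/.yosoi/selectors/."""
    target = project_dir / ".yosoi" / "selectors"
    target.mkdir(parents=True, exist_ok=True)
    # The archive format is inferred from the .tar.gz extension.
    shutil.unpack_archive(str(archive), extract_dir=target)
    return target
```

The archive can travel with a build artifact or sit in shared storage. Keep out_base outside the selectors directory so the archive does not try to include itself.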

References

Redis. Redis Ltd. In-memory data structure store used as a database, cache, and message broker. https://redis.io/docs/

RabbitMQ. Broadcom. Open-source message broker supporting multiple messaging protocols. https://www.rabbitmq.com/docs/

Prefect. Prefect Technologies. Workflow orchestration platform for data and ML pipelines. https://docs.prefect.io/

Langfuse. Langfuse. Open-source LLM observability, tracing, and analytics platform. https://langfuse.com/docs

Turso. ChiselStrike. Embedded distributed SQLite built on libSQL. https://docs.turso.tech/