Scaling
Yosoi scales today by combining local selector caches, concurrent workers, browser fetchers, and observability. Larger distributed caches and external queue integrations are planned, but they are not required for ordinary batch scraping.
Current Support
- Use Pipeline.process_urls(..., workers=N) or the CLI --workers flag for concurrent URL processing.
- Use .yosoi/selectors/ as the local selector cache and .yosoi/fetch/ as the learned fetcher strategy cache.
- Use Observability when you need traces for discovery, extraction, and model behavior.
- Keep one shared fetcher instance per batch when you are writing Python orchestration code.
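The shared-fetcher advice in the last item can be illustrated without relying on Yosoi internals. The following is a minimal sketch, assuming a hypothetical SharedFetcher stand-in (not a Yosoi class) and a plain thread pool in place of Pipeline.process_urls. It only shows the shape of the pattern: the fetcher is built once per batch and handed to every worker instead of being constructed per URL.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


class SharedFetcher:
    """Hypothetical stand-in for a fetcher; not part of Yosoi's API."""

    instances = 0  # counts how many fetchers were constructed

    def __init__(self):
        SharedFetcher.instances += 1
        self._lock = threading.Lock()
        self.calls = 0

    def fetch(self, url):
        # A real fetcher would do network I/O here; this only records the call.
        with self._lock:
            self.calls += 1
        return f"<html>{url}</html>"


def process_batch(urls, workers=4):
    """Process URLs concurrently while reusing a single fetcher instance."""
    fetcher = SharedFetcher()  # built once per batch, not once per URL
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        pages = list(pool.map(fetcher.fetch, urls))
    return fetcher, pages


if __name__ == "__main__":
    urls = [f"https://example.com/{i}" for i in range(8)]
    fetcher, pages = process_batch(urls, workers=4)
    print(len(pages), fetcher.calls, SharedFetcher.instances)
```

Reusing one instance lets any connection pool, browser session, or learned strategy held by the fetcher be shared across the batch. Whether a given fetcher is safe to share across threads depends on its implementation, so the lock here is part of the illustration rather than a statement about Yosoi.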
Planned Integrations
The integrations below are planned or in progress:
| Integration | Role |
|---|---|
| Redis△ | Distributed selector cache and job state |
| RabbitMQ○ | URL queue and worker coordination |
| Prefect◑ | Workflow orchestration and scheduling |
| Langfuse◇ | LLM observability and prompt tracing |
| Persistence | Durable result storage across runs |
| Turso★ | Embedded distributed SQLite for selector snapshots |
Treat these as roadmap items until a guide documents a concrete configuration.
FAQs
What is the first scaling knob I should use?
Start with --workers or Pipeline.process_urls(..., workers=N). Raise the worker count gradually while watching target-site rate limits and LLM-provider limits.
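One way to make that ramp-up systematic is an additive-increase, multiplicative-decrease rule applied between batches. The helper below is a hypothetical sketch and not part of Yosoi. It assumes you count rate-limited responses yourself (for example HTTP 429 from the target site or from the LLM provider), and the default cap and tolerance are illustrative values only.

```python
def next_worker_count(current, throttled, total, max_workers=32, tolerance=0.02):
    """Suggest a worker count for the next batch (hypothetical helper).

    throttled: responses in the last batch that signalled rate limiting.
    total:     all responses observed in that batch.
    """
    if total <= 0:
        return current  # nothing observed, so change nothing
    if throttled / total > tolerance:
        return max(1, current // 2)  # back off quickly when throttled
    return min(max_workers, current + 1)  # otherwise grow by one worker
```

The value returned would then be passed as workers=N (or --workers) for the next batch.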
Do I need Redis or RabbitMQ to run batches?
No. Current concurrent processing runs inside one Python process. External queues are planned for multi-machine orchestration.
How do I share selector discoveries across machines today?
Share the .yosoi/selectors/ directory through your deployment artifact or storage layer. Native distributed selector storage is still roadmap work.
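As a rough illustration of the artifact approach, the sketch below archives the .yosoi/selectors/ directory on one machine and restores it on another using only the Python standard library. The function names are hypothetical, the directory contents are treated as opaque files, and how the archive travels between machines (object storage, container image, CI artifact) is left to your deployment.

```python
import shutil
from pathlib import Path


def pack_selector_cache(project_dir, archive_base):
    """Archive <project_dir>/.yosoi/selectors/ into <archive_base>.tar.gz."""
    project_dir = Path(project_dir).resolve()
    cache_dir = project_dir / ".yosoi" / "selectors"
    if not cache_dir.is_dir():
        raise FileNotFoundError(f"no selector cache at {cache_dir}")
    # root_dir/base_dir keep the .yosoi/selectors/ prefix inside the archive.
    return shutil.make_archive(
        str(Path(archive_base).resolve()),
        "gztar",
        root_dir=str(project_dir),
        base_dir=str(Path(".yosoi") / "selectors"),
    )


def unpack_selector_cache(archive_path, project_dir):
    """Restore an archived cache under another project directory."""
    project_dir = Path(project_dir)
    project_dir.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive_path), str(project_dir))
    return project_dir / ".yosoi" / "selectors"
```

Unpacking over an existing cache overwrites files with the same name, so this suits a one-way flow from a discovery machine to workers better than merging caches from several writers.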
References
△ Redis. Redis Ltd. In-memory data structure store used as a database, cache, and message broker. https://redis.io/docs/
○ RabbitMQ. Broadcom. Open-source message broker supporting multiple messaging protocols. https://www.rabbitmq.com/docs/
◑ Prefect. Prefect Technologies. Workflow orchestration platform for data and ML pipelines. https://docs.prefect.io/
◇ Langfuse. Langfuse. Open-source LLM observability, tracing, and analytics platform. https://langfuse.com/docs
★ Turso. ChiselStrike. Embedded distributed SQLite built on libSQL. https://docs.turso.tech/