- Release: DocPull 5.0.0
- Published: June 23, 2026
- Default path: Local-first evidence
- Budget policy: --budget 0 blocks paid-capable routes
TL;DR
DocPull 5.0 is live on PyPI and GitHub. The release turns DocPull from page capture into a local-first evidence workflow for agents: collect source-grounded context, preserve the files behind the answer, and make every provider or cloud escalation explicit.
The operating order is simple: local first, open signals second, bring your own key third. In v5, that order is enforced across the CLI, Python SDK, policy files, providers, rendering routes, benchmarks, and MCP tools.
Budget boundary
The headline feature is practical: a zero-dollar run should stay zero-dollar.
docpull <URL> --budget 0 -o ./evidence
With --budget 0, paid-capable provider and cloud routes are blocked before execution. Local routes still run where they can: cache, direct HTTP, sitemap discovery, local extraction, indexing, pack analysis, monitors, and local browser rendering.
Tavily, Exa, Parallel, Vercel Sandbox, and E2B remain available as escalation paths. Cost control stops being a convention spread across scripts and becomes a policy DocPull checks before execution.

| Route | Use when | Notes |
|---|---|---|
| DocPull local | Need inspectable evidence without spending. | Default $0 route: cache, HTTP, discovery, local render, packs. |
| Tavily | Need broad live-web search or answer discovery. | Paid-capable provider; use after open signals. |
| Exa | Need semantic source discovery. | Paid-capable provider; useful for adjacent pages. |
| Parallel | Need hosted research with source trails. | Paid-capable provider; use as an escalation path. |
| Vercel Sandbox | Need isolated cloud execution or build context. | Paid-capable cloud route; useful for reproducible runs. |
| E2B | Need code or browser work in a hosted sandbox. | Paid-capable cloud route; useful for dynamic pages. |
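To make the pre-execution check concrete, here is a minimal sketch of a budget gate over the routes in the table above. It is not DocPull's implementation: the `Route` class, the `check_route` function, the `BudgetBlocked` exception, and the lowercase route keys are all hypothetical names chosen for illustration. Only the route list and the paid-capable split come from the table.

```python
from dataclasses import dataclass


class BudgetBlocked(Exception):
    """Raised when a route is refused before any request is made (hypothetical)."""


@dataclass(frozen=True)
class Route:
    name: str
    paid_capable: bool


# Route names follow the table above; the flags mirror its Notes column.
ROUTES = {
    route.name: route
    for route in [
        Route("local", paid_capable=False),
        Route("tavily", paid_capable=True),
        Route("exa", paid_capable=True),
        Route("parallel", paid_capable=True),
        Route("vercel-sandbox", paid_capable=True),
        Route("e2b", paid_capable=True),
    ]
}


def check_route(route_name: str, budget_usd: float) -> Route:
    """Return the route if policy allows it; otherwise raise before execution."""
    route = ROUTES[route_name]
    if route.paid_capable and budget_usd <= 0:
        raise BudgetBlocked(
            f"{route.name} is paid-capable and the budget is ${budget_usd:g}"
        )
    return route
```

The design point is that the check runs before any network call, so a blocked route costs nothing and can be recorded as a policy decision rather than discovered on an invoice.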
Discover before search
Many sites already publish useful structure. They may expose an llms.txt file, RSS or Atom feeds, OpenAPI references, sitemap indexes, or public documentation trees on GitHub.
docpull discover scan https://docs.example.com -o ./packs/discovery
The scanner writes candidate_sources.ndjson, the same source-candidate contract used by provider imports. Local discovery and BYOK provider discovery can feed the same downstream workflow.
Start with the structure a site already publishes, review the candidate list, and pay for broader discovery only when open signals are not enough.
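Reviewing the candidate list can be as simple as reading the NDJSON file line by line. The sketch below assumes each record is a JSON object with a `url` key; that key name is a guess for illustration, not the documented candidate_sources.ndjson schema, so check a real file before relying on it.

```python
import json


def parse_ndjson(lines):
    """Parse one JSON object per line, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def unique_by_url(records):
    """Keep the first record seen for each URL (the "url" key is an assumption)."""
    seen = set()
    unique = []
    for record in records:
        url = record.get("url")
        if url is None or url in seen:
            continue
        seen.add(url)
        unique.append(record)
    return unique
```

Because `parse_ndjson` accepts any iterable of lines, an open file handle works directly, for example `unique_by_url(parse_ndjson(open(path, encoding="utf-8")))`, where `path` points at wherever the scan wrote its output.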
Evidence formats
A hosted response may answer a question today. A local evidence pack can still be inspected tomorrow.
DocPull produces durable artifacts across several output shapes: Markdown with frontmatter, streamed or chunked NDJSON, SQLite with FTS5 search, Google Open Knowledge Format bundles, cached archives and mirrors, source metadata, manifests, indexes, and agent-ready context packs.
| Output shape | Preserves | Use when |
|---|---|---|
| Markdown + frontmatter | Readable source snapshots | Review, cite, diff, commit |
| NDJSON / JSONL | Page, chunk, candidate, and run records | RAG, agents, warehouses, evals |
| SQLite + FTS5 | Searchable local corpus | Offline search, QA, debugging |
| Google OKF | Portable Markdown/YAML bundle | Agent and human knowledge exchange |
| Archives + mirrors | Cached pages and snapshots | Reports, refreshes, stale checks |
| Metadata + manifests | URLs, route steps, timestamps, hashes | Provenance, accounting, audits |
| Context packs + skills | Curated sources and retrieval hints | Codex, MCP, agent workflows |
| Downstream outputs | Small exports with source trails | Sheets, n8n, Vercel AI, CrewAI, warehouses, launch media |
The same source pack can also feed downstream production work: Sheets CSV/TSV for review, warehouse NDJSON or Parquet for analytics, app-facing context for Vercel AI and CrewAI, and source-grounded launch media like the clips in this article.
Those files can be read by an agent, reviewed by a person, indexed by a RAG system, cited in a report, compared with a newer capture, audited for stale sources, and refreshed when the underlying pages change.
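Comparing a capture with a newer one usually comes down to comparing content hashes per URL. The following is a small sketch of that idea using SHA-256. The `{url: hash}` mapping shape and the function names are assumptions for illustration; DocPull's manifest layout may differ.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return the SHA-256 hex digest of a page's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_captures(old: dict, new: dict) -> dict:
    """Compare two {url: hash} mappings and report what moved between them."""
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(url for url in shared if old[url] != new[url]),
    }
```

Pages listed under `changed` are the natural candidates for a refresh, while `removed` pages may indicate stale sources worth auditing.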
Keep the receipt
Runs involving a budget or paid-capable route can write an accounting artifact: run.accounting.json.
It records non-secret route and cost metadata: budget limits, estimated and actual paid cost when known, paid request counts, local browser seconds, HTTP request counts, cache hits, blocked actions, and route steps.
The source files tell you what evidence was collected. The accounting file tells you how the collection happened. Together, they give teams the trail they need after a surprising answer, a failed refresh, or a surprising bill.
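As a rough illustration of how a team might read the receipt, the sketch below summarizes an accounting payload. Every key name here (`budget_usd`, `actual_paid_cost_usd`, `blocked_actions`, `route_steps`, `route`) is hypothetical, chosen to match the categories listed above rather than the real run.accounting.json schema.

```python
import json


def summarize_accounting(raw: str) -> dict:
    """Summarize an accounting payload. All key names are assumptions."""
    data = json.loads(raw)
    budget = data.get("budget_usd")
    actual = data.get("actual_paid_cost_usd")  # may be absent when cost is unknown
    over_budget = budget is not None and actual is not None and actual > budget
    return {
        "budget_usd": budget,
        "actual_paid_cost_usd": actual,
        "over_budget": over_budget,
        "blocked_actions": len(data.get("blocked_actions", [])),
        "route_steps": [step.get("route") for step in data.get("route_steps", [])],
    }
```

A check like `over_budget` is most useful in CI or a scheduled refresh, where nobody is watching the run as it happens.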

Browser only when needed
Not every page needs a browser. Static and server-rendered pages can often be captured faster through direct HTTP and framework-aware extraction. That remains the default path.
docpull render https://example.com/app --runtime local --budget 0
Some pages only expose useful content after JavaScript runs. For those cases, v5 adds an explicit local renderer through an external agent-browser-compatible CLI, so the base package stays browser-free unless you opt in. Cloud rendering remains available when needed, but it must be chosen explicitly. Under --budget 0, cloud renderers are blocked.
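One way to decide whether a page needs a browser is to measure how much visible text the raw HTML carries. The heuristic below is an illustrative sketch, not DocPull's detection logic: the 200-character threshold and the function name `looks_like_js_shell` are assumptions, and real pages will need tuning.

```python
from html.parser import HTMLParser

_SKIPPED_TAGS = ("script", "style", "noscript")


class _TextCounter(HTMLParser):
    """Counts visible text characters, ignoring script, style, and noscript bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.visible_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in _SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in _SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.visible_chars += len(data.strip())


def looks_like_js_shell(html: str, min_visible_chars: int = 200) -> bool:
    """Return True when the HTML carries too little visible text to be useful."""
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    return counter.visible_chars < min_visible_chars
```

A page that trips this check after a direct HTTP fetch is a reasonable candidate for the local renderer; a page that passes can stay on the faster default path.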
Measure free-first
Free-first is easy to say. The harder question is how often the free path actually completes the task.
DocPull 5.0 adds a zero-dollar benchmark mode:
docpull benchmark quick --zero-dollar --target-set zero-dollar --provider all
Each target lands in an explicit class: complete_for_0, complete_with_local_browser, partial_for_0, requires_provider, requires_cloud_browser, or blocked_by_policy.
That gives the project a concrete improvement target: raise the share of tasks completed locally instead of adding providers for their own sake.
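The improvement target can be expressed as a single ratio. The sketch below computes the share of targets completed locally from a list of class labels. Treating both `complete_for_0` and `complete_with_local_browser` as local completions is an assumption, based on local rendering being allowed under --budget 0; the function name is hypothetical, and how the benchmark actually reports results is not shown here.

```python
from collections import Counter

# Classes assumed to count as "completed locally" for this metric.
LOCAL_COMPLETE = {"complete_for_0", "complete_with_local_browser"}


def local_completion_share(classes) -> float:
    """Return the fraction of targets whose class is a local completion."""
    counts = Counter(classes)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[name] for name in LOCAL_COMPLETE) / total
```

Tracking this number across releases shows whether the free path is improving, independent of how many providers are configured.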

Escalation ladder
Local-first does not mean local-only. Providers and cloud-browser infrastructure are useful when a task genuinely needs them. The point of DocPull 5.0 is to make that boundary visible before money or external infrastructure enters the run.
| Route | Use when | What you keep | Cost / boundary |
|---|---|---|---|
| DocPull core | Public or authorized static/SSR docs, blogs, OpenAPI specs, feeds, filings, vendor pages. | Markdown, NDJSON, metadata, manifests, indexes, evidence packs. | Allowed under --budget 0. No provider key. |
| DocPull local render | Direct HTTP returns a JavaScript shell but local browser rendering works. | Rendered HTML via agent-browser plus DocPull conversion artifacts. | Allowed under --budget 0. Requires local runtime. |
| Tavily | Fast web search, site mapping, broad discovery before local fetch. | Candidate sources or provider context DocPull can normalize. | Paid-capable. Dry-run first; live runs are blocked under --budget 0. |
| Exa | Keyword search misses intent: semantic discovery, similar pages, competitors, long-tail sources. | Relevant URLs and extracts for downstream evidence packs. | Paid-capable. Use when meaning-based discovery earns it. |
| Parallel | Richer live-web search, extract, and context-pack workflows across multiple sources. | Provider context packs, then DocPull scoring, citations, entities, briefs. | Paid-capable. Escalate after local/open discovery. |
| Vercel Sandbox | Local rendering is unsuitable and isolated browser execution through Vercel helps. | Same agent-browser JSON contract as local rendering. | Cloud route. Explicit runtime only; blocked under --budget 0. |
| E2B Sandbox | Need API-keyed sandbox, prebuilt template, or file-based render result transport. | Sandboxed render payload for DocPull conversion and artifacts. | Cloud route. Explicit runtime only; blocked under --budget 0. |
| Full browser automation | Need clicks, login state, app workflows, CAPTCHA walls, or private dashboards. | Export rendered HTML or content, then pass it into the evidence pipeline. | Not a hidden fallback. Choose deliberately for interactive tasks. |
The intended escalation order is:
1. Try open discovery: sitemaps, feeds, llms.txt, OpenAPI references, and public docs repositories.
2. Try local rendering when direct HTTP only returns a shell.
3. Dry-run a BYOK provider route before making a live request.
4. Use a live provider when its discovery or extraction capability is worth the external request.
5. Use cloud rendering only when local rendering or infrastructure is the blocker.
The practical question is not “which provider can I call?” It is: what is the next lowest-cost route that could complete this task while preserving evidence?
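That question can be sketched as a loop over an ordered ladder. The code below is an illustration under stated assumptions, not DocPull's router: the step names, the `run_ladder` function, and the status strings are hypothetical, the dry-run step is omitted for brevity, and `attempt` stands in for whatever actually executes a route.

```python
from typing import Callable, List, Optional, Tuple

# Ordered from lowest to highest cost; the boolean marks paid-capable steps.
LADDER = [
    ("core", False),
    ("local_render", False),
    ("provider", True),
    ("cloud_render", True),
]


def run_ladder(
    attempt: Callable[[str], bool], budget_usd: float
) -> Tuple[Optional[str], List[Tuple[str, str]]]:
    """Try each step in order and return (winning step or None, trail of outcomes)."""
    trail = []
    for step, paid_capable in LADDER:
        if paid_capable and budget_usd <= 0:
            # Refused before execution, and recorded so the trail stays complete.
            trail.append((step, "blocked_by_policy"))
            continue
        succeeded = attempt(step)
        trail.append((step, "completed" if succeeded else "incomplete"))
        if succeeded:
            return step, trail
    return None, trail
```

Returning the trail alongside the result keeps the evidence of how the task was routed, including the steps that policy refused.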
Who it is for
DocPull 5.0 is for teams that need live web context without losing the file trail behind that context.
| Audience | Why it matters |
|---|---|
| Agent builders | Need fresh context with a source trail behind the answer. |
| RAG teams | Want durable, re-indexable source packs outside the vector database. |
| Researchers | Need local corpora they can cite, inspect, compare, and refresh. |
| Cost-conscious teams | Need budget boundaries before execution, not after a bill arrives. |
| Developers | Want local control without rejecting providers when they are actually useful. |
What it does not pretend
DocPull will not complete every website locally. It does not claim to provide a proprietary web-scale index. It does not use stealth scraping or CAPTCHA bypass as a hidden fallback. It does not argue that paid providers and cloud browsers have no value.
Try it
Install or upgrade:
pip install -U docpull
Then run one real URL you already care about:
docpull https://www.python.org/blogs/ --single -o ./python-news
docpull <URL> --budget 0 -o ./evidence
docpull discover scan <URL> -o ./packs/discovery
docpull render <URL> --runtime local --budget 0
docpull benchmark quick --zero-dollar --target-set zero-dollar --provider all
No provider key is required for the first pass. Run with --budget 0, inspect the files DocPull leaves behind, then decide whether a provider or cloud browser is actually worth adding.
Start locally, use open signals first, escalate with evidence, and keep the receipt.