DocPull Field Note · v5.0.0 live · June 23, 2026 · 8 min

Published: June 23, 2026
Release: DocPull 5.0.0
Default path: Local-first evidence
Budget policy: --budget 0 blocks paid-capable routes

TL;DR

DocPull 5.0 is live on PyPI and GitHub. The release turns DocPull from page capture into a local-first evidence workflow for agents: collect source-grounded context, preserve the files behind the answer, and make every provider or cloud escalation explicit.

The operating order is simple: local first, open signals second, bring your own key third. In v5, that order is enforced across the CLI, Python SDK, policy files, providers, rendering routes, benchmarks, and MCP tools.

Fresh web evidence. Local first. $0 enforced.

Launch clip: discover public sources, preserve the trail, and keep the first pass local.

Budget boundary

The headline feature is practical: a zero-dollar run should stay zero-dollar.

docpull <URL> --budget 0 -o ./evidence

With --budget 0, paid-capable provider and cloud routes are blocked before execution. Local routes still run where they can: cache, direct HTTP, sitemap discovery, local extraction, indexing, pack analysis, monitors, and local browser rendering.

Tavily, Exa, Parallel, Vercel Sandbox, and E2B remain available as escalation paths. Cost control stops being a convention spread across scripts and becomes a policy DocPull checks before execution.

Figure: Budget proof. With the run budgeted at zero, local routes remain available while the paid-capable Parallel, Tavily, and Exa routes are blocked before execution.

Route	Use when	Notes
DocPull local	Need inspectable evidence without spending.	Default $0 route: cache, HTTP, discovery, local render, packs.
Tavily	Need broad live-web search or answer discovery.	Paid-capable provider; use after open signals.
Exa	Need semantic source discovery.	Paid-capable provider; useful for adjacent pages.
Parallel	Need hosted research with source trails.	Paid-capable provider; use as an escalation path.
Vercel Sandbox	Need isolated cloud execution or build context.	Paid-capable cloud route; useful for reproducible runs.
E2B	Need code or browser work in a hosted sandbox.	Paid-capable cloud route; useful for dynamic pages.
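
To make the pre-execution check concrete, here is a minimal sketch of a budget gate. It is an illustrative model, not DocPull's implementation: the `Route` class, the `plan_routes` function, and the lowercase route identifiers are hypothetical names; only the route list and the paid-capable split come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    paid_capable: bool  # True for the provider and cloud routes in the table


# Route names follow the table above; the data model itself is hypothetical.
ROUTES = [
    Route("docpull-local", paid_capable=False),
    Route("tavily", paid_capable=True),
    Route("exa", paid_capable=True),
    Route("parallel", paid_capable=True),
    Route("vercel-sandbox", paid_capable=True),
    Route("e2b", paid_capable=True),
]


def plan_routes(routes, budget_usd):
    """Split routes into (allowed, blocked) before anything executes."""
    allowed, blocked = [], []
    for route in routes:
        if route.paid_capable and budget_usd <= 0:
            blocked.append(route.name)
        else:
            allowed.append(route.name)
    return allowed, blocked


allowed, blocked = plan_routes(ROUTES, budget_usd=0)
print("allowed:", allowed)
print("blocked:", blocked)
```

The point of the sketch is the ordering: the decision is made from route metadata alone, so nothing paid-capable is contacted in order to find out that it should have been blocked.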

Discover before search

Many sites already publish useful structure. They may expose an llms.txt file, RSS or Atom feeds, OpenAPI references, sitemap indexes, or public documentation trees on GitHub.

docpull discover scan https://docs.example.com -o ./packs/discovery

The scanner writes candidate_sources.ndjson, the same source-candidate contract used by provider imports. Local discovery and bring-your-own-key (BYOK) provider discovery can therefore feed the same downstream workflow.

Start with the structure a site already publishes, review the candidate list, and pay for broader discovery only when open signals are not enough.
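
Reviewing the candidate list can be as simple as reading the NDJSON file line by line. The sketch below assumes each record has a `url` field and a `signal` field naming where the candidate came from; both field names are assumptions for illustration, so check the real file before relying on them. An in-memory sample stands in for the file on disk.

```python
import io
import json
from collections import defaultdict


def load_candidates(lines):
    """Parse NDJSON: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def group_by_signal(candidates):
    """Group candidate URLs by the (assumed) 'signal' field."""
    groups = defaultdict(list)
    for record in candidates:
        groups[record.get("signal", "unknown")].append(record["url"])
    return dict(groups)


# Stand-in for open("./packs/discovery/candidate_sources.ndjson").
# The field names and values here are hypothetical.
sample = io.StringIO(
    '{"url": "https://docs.example.com/llms.txt", "signal": "llms_txt"}\n'
    '{"url": "https://docs.example.com/sitemap.xml", "signal": "sitemap"}\n'
    "\n"
    '{"url": "https://docs.example.com/feed.xml", "signal": "feed"}\n'
)

candidates = load_candidates(sample)
for signal, urls in group_by_signal(candidates).items():
    print(signal, urls)
```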

Launch clip: open site hints converge into the same candidate source contract used downstream.

Evidence formats

A hosted response may answer a question today. A local evidence pack can still be inspected tomorrow.

DocPull produces durable artifacts across several output shapes: Markdown with frontmatter, streamed or chunked NDJSON, SQLite with FTS5 search, Google Open Knowledge Format bundles, cached archives and mirrors, source metadata, manifests, indexes, and agent-ready context packs.

Output shape	Preserves	Use when
Markdown + frontmatter	Readable source snapshots	Review, cite, diff, commit
NDJSON / JSONL	Page, chunk, candidate, and run records	RAG, agents, warehouses, evals
SQLite + FTS5	Searchable local corpus	Offline search, QA, debugging
Google OKF	Portable Markdown/YAML bundle	Agent and human knowledge exchange
Archives + mirrors	Cached pages and snapshots	Reports, refreshes, stale checks
Metadata + manifests	URLs, route steps, timestamps, hashes	Provenance, accounting, audits
Context packs + skills	Curated sources and retrieval hints	Codex, MCP, agent workflows
Downstream outputs	Small exports with source trails	Sheets, n8n, Vercel AI, CrewAI, warehouses, launch media

The same source pack can also feed downstream production work: Sheets CSV/TSV for review, warehouse NDJSON or Parquet for analytics, app-facing context for Vercel AI and CrewAI, and source-grounded launch media like the clips in this article.

Those files can be read by an agent, reviewed by a person, indexed by a RAG system, cited in a report, compared with a newer capture, audited for stale sources, and refreshed when the underlying pages change.
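
As one example of inspecting a pack offline, the sketch below runs a full-text query against a SQLite FTS5 table using Python's standard `sqlite3` module. The `pages` table and its `url` and `body` columns are a hypothetical schema built in memory for the example; the layout DocPull actually writes may differ. It also assumes the local SQLite build includes FTS5, which is common but not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema; DocPull's real table layout may differ.
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(url, body)")
conn.executemany(
    "INSERT INTO pages (url, body) VALUES (?, ?)",
    [
        ("https://docs.example.com/install", "Install the package with pip."),
        ("https://docs.example.com/budget", "A zero budget blocks paid routes."),
    ],
)


def search(connection, query):
    """Return matching URLs, best match first (FTS5 exposes a 'rank' column)."""
    rows = connection.execute(
        "SELECT url FROM pages WHERE pages MATCH ? ORDER BY rank", (query,)
    )
    return [url for (url,) in rows]


print(search(conn, "budget"))
```

Because the corpus is an ordinary SQLite file, the same query works without a network connection and without the tool that produced it.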

Keep the receipt

Runs involving a budget or paid-capable route can write an accounting artifact: run.accounting.json.

It records non-secret route and cost metadata: budget limits, estimated and actual paid cost when known, paid request counts, local browser seconds, HTTP request counts, cache hits, blocked actions, and route steps.

The source files tell you what evidence was collected. The accounting file tells you how the collection happened. Together, they give teams the trail they need after a surprising answer, a failed refresh, or an unexpected bill.
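
A receipt like this is easy to check in CI or after a run. The sketch below is a hedged example: the JSON keys (`budget_usd`, `actual_paid_cost_usd`, `paid_requests`, and so on) are assumed names chosen to mirror the fields described above, not the documented schema of run.accounting.json, and the values are invented for illustration.

```python
import json

# Illustrative receipt; every key below is an assumed name, not DocPull's schema.
receipt_text = """
{
  "budget_usd": 0,
  "actual_paid_cost_usd": 0,
  "paid_requests": 0,
  "http_requests": 12,
  "cache_hits": 5,
  "blocked_actions": ["tavily.search", "exa.search"],
  "route_steps": ["cache", "http", "local_extract"]
}
"""


def check_receipt(receipt):
    """Return a list of problems; an empty list means the run stayed in budget."""
    problems = []
    budget = receipt.get("budget_usd", 0)
    if receipt.get("actual_paid_cost_usd", 0) > budget:
        problems.append("paid cost exceeded budget")
    if budget == 0 and receipt.get("paid_requests", 0) > 0:
        problems.append("paid requests made under a zero budget")
    return problems


receipt = json.loads(receipt_text)
print("problems:", check_receipt(receipt))
print("blocked:", receipt["blocked_actions"])
```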

Figure: Accounting proof. The evidence pack carries a non-secret receipt, run.accounting.json, for the route that produced it. The fields shown include maximum paid cost, request counts, cache hits, local browser seconds, blocked actions, and route steps.

Browser only when needed

Not every page needs a browser. Static and server-rendered pages can often be captured faster through direct HTTP and framework-aware extraction. That remains the default path.

docpull render https://example.com/app --runtime local --budget 0

Some pages only expose useful content after JavaScript runs. For those cases, v5 adds an explicit local renderer through an external agent-browser-compatible CLI, so the base package stays browser-free unless you opt in. Cloud rendering remains available when needed, but it must be chosen explicitly. Under --budget 0, cloud renderers are blocked.
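
One way to decide whether a page needs rendering at all is to look at what direct HTTP returned. The sketch below is a rough heuristic of my own, not DocPull's detection logic: it treats a response as a JavaScript shell when it contains script tags but very little visible text. The `looks_like_js_shell` name and the 200-character threshold are assumptions.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text outside <script>/<style> and count <script> tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.scripts = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
            if tag == "script":
                self.scripts += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def looks_like_js_shell(html, min_chars=200):
    """Heuristic: scripts present but almost no visible text."""
    parser = VisibleText()
    parser.feed(html)
    visible = " ".join(parser.chunks)
    return parser.scripts > 0 and len(visible) < min_chars


shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
static = "<html><body><h1>Guide</h1><p>" + "Real content. " * 30 + "</p></body></html>"

print(looks_like_js_shell(shell))
print(looks_like_js_shell(static))
```

A check like this only suggests when local rendering is worth trying; it does not replace inspecting the captured output.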

Launch clip: direct HTTP stays first; local rendering is explicit; cloud browsers do not become a hidden fallback.

Measure free-first

Free-first is easy to say. The harder question is how often the free path actually completes the task.

DocPull 5.0 adds a zero-dollar benchmark mode:

docpull benchmark quick --zero-dollar --target-set zero-dollar --provider all

Each target lands in an explicit class: complete_for_0, complete_with_local_browser, partial_for_0, requires_provider, requires_cloud_browser, or blocked_by_policy.

That gives the project a concrete improvement target: raise the share of tasks completed locally instead of adding providers for their own sake.
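
The share of tasks completed locally can be computed directly from the per-target classes. The sketch below uses the six class names listed above; the `local_completion_share` function and the sample labels are hypothetical, and the sample is invented input for the example, not measured benchmark output.

```python
from collections import Counter

# Class names as listed in the benchmark description above.
CLASSES = {
    "complete_for_0",
    "complete_with_local_browser",
    "partial_for_0",
    "requires_provider",
    "requires_cloud_browser",
    "blocked_by_policy",
}
# Assumption: these two classes count as "completed locally".
LOCAL_CLASSES = {"complete_for_0", "complete_with_local_browser"}


def local_completion_share(results):
    """Fraction of targets that finished on a local, zero-dollar route."""
    counts = Counter(results)
    unknown = set(counts) - CLASSES
    if unknown:
        raise ValueError(f"unknown classes: {sorted(unknown)}")
    if not results:
        return 0.0
    return sum(counts[c] for c in LOCAL_CLASSES) / len(results)


# Invented sample labels for illustration only.
results = (
    ["complete_for_0"] * 6
    + ["complete_with_local_browser"] * 2
    + ["partial_for_0", "requires_provider"]
)
print(local_completion_share(results))
```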

Figure: Benchmark proof. One complete zero-dollar run with benchmark score 91, pack score 100, and zero live provider cost. The zero-dollar path is measurable, not just a positioning claim.

Escalation ladder

Local-first does not mean local-only. Providers and cloud-browser infrastructure are useful when a task genuinely needs them. The point of DocPull 5.0 is to make that boundary visible before money or external infrastructure enters the run.

Route	Use when	What you keep	Cost / boundary
DocPull core	Public or authorized static/SSR docs, blogs, OpenAPI specs, feeds, filings, vendor pages.	Markdown, NDJSON, metadata, manifests, indexes, evidence packs.	Allowed under --budget 0. No provider key.
DocPull local render	Direct HTTP returns a JavaScript shell but local browser rendering works.	Rendered HTML via agent-browser plus DocPull conversion artifacts.	Allowed under --budget 0. Requires local runtime.
Tavily	Fast web search, site mapping, broad discovery before local fetch.	Candidate sources or provider context DocPull can normalize.	Paid-capable. Dry-run first; live runs are blocked under --budget 0.
Exa	Keyword search misses intent: semantic discovery, similar pages, competitors, long-tail sources.	Relevant URLs and extracts for downstream evidence packs.	Paid-capable. Use when meaning-based discovery earns it.
Parallel	Richer live-web search, extract, and context-pack workflows across multiple sources.	Provider context packs, then DocPull scoring, citations, entities, briefs.	Paid-capable. Escalate after local/open discovery.
Vercel Sandbox	Local rendering is unsuitable and isolated browser execution through Vercel helps.	Same agent-browser JSON contract as local rendering.	Cloud route. Explicit runtime only; blocked under --budget 0.
E2B Sandbox	Need API-keyed sandbox, prebuilt template, or file-based render result transport.	Sandboxed render payload for DocPull conversion and artifacts.	Cloud route. Explicit runtime only; blocked under --budget 0.
Full browser automation	Need clicks, login state, app workflows, CAPTCHA walls, or private dashboards.	Export rendered HTML or content, then pass it into the evidence pipeline.	Not a hidden fallback. Choose deliberately for interactive tasks.

The intended escalation order is:

1. Try open discovery: sitemaps, feeds, llms.txt, OpenAPI references, and public docs repositories.
2. Try local rendering when direct HTTP only returns a shell.
3. Dry-run a BYOK provider route before making a live request.
4. Use a live provider when its discovery or extraction capability is worth the external request.
5. Use cloud rendering only when local rendering or infrastructure is the blocker.

The practical question is not “which provider can I call?” It is: what is the next lowest-cost route that could complete this task while preserving evidence?
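
That question can be modeled as an ordered fall-through that stops at the first rung that completes the task and records every step along the way. The sketch below is a simplified illustration under stated assumptions: the `escalate` function, the rung names, and the stubbed attempt callables are hypothetical, and the status strings are illustrative labels rather than DocPull output.

```python
def escalate(ladder, budget_usd):
    """Try rungs in order; return (winning rung or None, recorded steps)."""
    steps = []
    for name, paid_capable, attempt in ladder:
        if paid_capable and budget_usd <= 0:
            # Paid-capable rungs are skipped without being called.
            steps.append((name, "blocked_by_policy"))
            continue
        completed = attempt()
        steps.append((name, "completed" if completed else "incomplete"))
        if completed:
            return name, steps
    return None, steps


# Each rung: (name, paid_capable, callable returning True on success).
# The lambdas are stubs standing in for real work.
ladder = [
    ("open_discovery", False, lambda: False),
    ("local_render", False, lambda: False),
    ("byok_provider", True, lambda: True),
    ("cloud_render", True, lambda: True),
]

print(escalate(ladder, budget_usd=0))
print(escalate(ladder, budget_usd=1))
```

With a zero budget the run ends without a winner and the log shows which rungs were blocked; with a positive budget it stops at the first paid rung that succeeds and never reaches the cloud rung.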

Who it is for

DocPull 5.0 is for teams that need live web context without losing the file trail behind that context.

Audience	Why it matters
Agent builders	Need fresh context with a source trail behind the answer.
RAG teams	Want durable, re-indexable source packs outside the vector database.
Researchers	Need local corpora they can cite, inspect, compare, and refresh.
Cost-conscious teams	Need budget boundaries before execution, not after a bill arrives.
Developers	Want local control without rejecting providers when they are actually useful.

What it does not pretend

DocPull will not complete every website locally. It does not claim to provide a proprietary web-scale index. It does not use stealth scraping or CAPTCHA bypass as a hidden fallback. It does not argue that paid providers and cloud browsers have no value.

The promise is narrower and more useful: begin with the least expensive, most inspectable path; preserve the evidence you collect; make escalation visible; and record how the run happened.

Try it

Install or upgrade:

pip install -U docpull

Then run one real URL you already care about:

docpull https://www.python.org/blogs/ --single -o ./python-news

docpull <URL> --budget 0 -o ./evidence

docpull discover scan <URL> -o ./packs/discovery

docpull render <URL> --runtime local --budget 0

docpull benchmark quick --zero-dollar --target-set zero-dollar --provider all

No provider key is required for the first pass. Run with --budget 0, inspect the files DocPull leaves behind, then decide whether a provider or cloud browser is actually worth adding.

Start locally, use open signals first, escalate with evidence, and keep the receipt.