
DocPull 5.0: Local-First Web Evidence for Agents

DocPull 5.0 turns web capture into a local-first evidence workflow: collect source-grounded context, enforce zero-dollar runs, preserve artifacts, and make provider or cloud escalation explicit.

DocPull | Field Note | v5.0.0 live | 8 min
Release: DocPull 5.0.0
Published: June 23, 2026
Default path: Local-first evidence
Budget policy: --budget 0 blocks paid-capable routes

TL;DR

DocPull 5.0 is live on PyPI and GitHub. The release turns DocPull from page capture into a local-first evidence workflow for agents: collect source-grounded context, preserve the files behind the answer, and make every provider or cloud escalation explicit.

The operating order is simple: local first, open signals second, bring your own key third. In v5, that order is enforced across the CLI, Python SDK, policy files, providers, rendering routes, benchmarks, and MCP tools.

Fresh web evidence. Local first. $0 enforced.
Launch clip: discover public sources, preserve the trail, and keep the first pass local.

Budget boundary

The headline feature is practical: a zero-dollar run should stay zero-dollar.

docpull <URL> --budget 0 -o ./evidence

With --budget 0, paid-capable provider and cloud routes are blocked before execution. Local routes still run where they can: cache, direct HTTP, sitemap discovery, local extraction, indexing, pack analysis, monitors, and local browser rendering.

Tavily, Exa, Parallel, Vercel Sandbox, and E2B remain available as escalation paths. Cost control stops being a convention spread across scripts and becomes a policy DocPull checks before execution.
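The idea of checking the budget before anything executes can be sketched in a few lines of Python. This is an illustration only, not DocPull's implementation: the `Route` class, the `plan_routes` function, and the lowercase route identifiers are hypothetical names, and the paid-capable flags simply mirror the route descriptions in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    paid_capable: bool  # True for provider and cloud routes


# Hypothetical route identifiers; flags follow the article's route table.
ROUTES = [
    Route("local", paid_capable=False),
    Route("tavily", paid_capable=True),
    Route("exa", paid_capable=True),
    Route("parallel", paid_capable=True),
    Route("vercel-sandbox", paid_capable=True),
    Route("e2b", paid_capable=True),
]


def plan_routes(routes, budget_usd):
    """Split routes into (allowed, blocked) before any request is made."""
    allowed, blocked = [], []
    for route in routes:
        if route.paid_capable and budget_usd <= 0:
            blocked.append(route.name)
        else:
            allowed.append(route.name)
    return allowed, blocked


allowed, blocked = plan_routes(ROUTES, budget_usd=0)
print("allowed:", allowed)
print("blocked:", blocked)
```

The design point is that the decision is made from route metadata alone, so a blocked route never gets the chance to issue a billable request.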

DocPull budget proof showing local routes available while Parallel, Tavily, and Exa are blocked by a zero-dollar budget.
Budget proof: paid-capable provider routes are blocked before execution when the run is budgeted at zero.
| Route | Use when | Notes |
| --- | --- | --- |
| DocPull local | Need inspectable evidence without spending. | Default $0 route: cache, HTTP, discovery, local render, packs. |
| Tavily | Need broad live-web search or answer discovery. | Paid-capable provider; use after open signals. |
| Exa | Need semantic source discovery. | Paid-capable provider; useful for adjacent pages. |
| Parallel | Need hosted research with source trails. | Paid-capable provider; use as an escalation path. |
| Vercel Sandbox | Need isolated cloud execution or build context. | Paid-capable cloud route; useful for reproducible runs. |
| E2B | Need code or browser work in a hosted sandbox. | Paid-capable cloud route; useful for dynamic pages. |

Evidence formats

A hosted response may answer a question today. A local evidence pack can still be inspected tomorrow.

DocPull produces durable artifacts across several output shapes: Markdown with frontmatter, streamed or chunked NDJSON, SQLite with FTS5 search, Google Open Knowledge Format bundles, cached archives and mirrors, source metadata, manifests, indexes, and agent-ready context packs.

| Output shape | Preserves | Use when |
| --- | --- | --- |
| Markdown + frontmatter | Readable source snapshots | Review, cite, diff, commit |
| NDJSON / JSONL | Page, chunk, candidate, and run records | RAG, agents, warehouses, evals |
| SQLite + FTS5 | Searchable local corpus | Offline search, QA, debugging |
| Google OKF | Portable Markdown/YAML bundle | Agent and human knowledge exchange |
| Archives + mirrors | Cached pages and snapshots | Reports, refreshes, stale checks |
| Metadata + manifests | URLs, route steps, timestamps, hashes | Provenance, accounting, audits |
| Context packs + skills | Curated sources and retrieval hints | Codex, MCP, agent workflows |
| Downstream outputs | Small exports with source trails | Sheets, n8n, Vercel AI, CrewAI, warehouses, launch media |

The same source pack can also feed downstream production work: Sheets CSV/TSV for review, warehouse NDJSON or Parquet for analytics, app-facing context for Vercel AI and CrewAI, and source-grounded launch media like the clips in this article.

Those files can be read by an agent, reviewed by a person, indexed by a RAG system, cited in a report, compared with a newer capture, audited for stale sources, and refreshed when the underlying pages change.
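As a rough illustration of why local artifacts stay useful, the sketch below loads a few NDJSON records into an in-memory SQLite FTS5 table and searches them offline with Python's standard library. The record fields (`url`, `text`), the table name, and the sample content are assumptions made for this example; DocPull's actual NDJSON and SQLite schemas may differ. It also assumes the local SQLite build includes FTS5.

```python
import json
import sqlite3

# Hypothetical chunk records; field names are illustrative, not DocPull's schema.
NDJSON = "\n".join([
    json.dumps({"url": "https://example.com/a",
                "text": "Budget policy blocks paid routes."}),
    json.dumps({"url": "https://example.com/b",
                "text": "Local rendering runs a browser."}),
])


def build_index(ndjson_text):
    """Load one JSON record per line into an in-memory FTS5 table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(url UNINDEXED, text)")
    rows = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]
    conn.executemany(
        "INSERT INTO chunks (url, text) VALUES (?, ?)",
        [(row["url"], row["text"]) for row in rows],
    )
    return conn


def search(conn, query):
    """Return source URLs whose text matches the full-text query."""
    cursor = conn.execute(
        "SELECT url FROM chunks WHERE chunks MATCH ? ORDER BY rank", (query,)
    )
    return [row[0] for row in cursor]


conn = build_index(NDJSON)
print(search(conn, "budget"))
```

Because every hit carries its source URL, a search result can be traced back to the page it came from without calling a hosted service.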

Keep the receipt

Runs involving a budget or paid-capable route can write an accounting artifact: run.accounting.json.

It records non-secret route and cost metadata: budget limits, estimated and actual paid cost when known, paid request counts, local browser seconds, HTTP request counts, cache hits, blocked actions, and route steps.

The source files tell you what evidence was collected. The accounting file tells you how the collection happened. Together, they give teams the trail they need after a surprising answer, a failed refresh, or an unexpected bill.
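A receipt like this can be checked mechanically after a run. The sketch below is a minimal example under stated assumptions: the key names in `SAMPLE_RECEIPT` are invented to match the categories listed above, and the real layout of run.accounting.json may differ. In practice the dictionary would come from `json.load` on the file in the output directory.

```python
# Hypothetical receipt; key names are assumptions, not DocPull's documented schema.
SAMPLE_RECEIPT = {
    "budget_usd": 0,
    "actual_paid_cost_usd": 0,
    "paid_requests": 0,
    "http_requests": 12,
    "cache_hits": 4,
    "local_browser_seconds": 0.0,
    "blocked_actions": ["tavily.search"],
    "route_steps": ["cache", "http", "sitemap"],
}


def zero_dollar_violations(receipt):
    """List the reasons a receipt is inconsistent with a zero-dollar run."""
    violations = []
    if receipt.get("actual_paid_cost_usd", 0) > 0:
        violations.append("paid cost recorded")
    if receipt.get("paid_requests", 0) > 0:
        violations.append("paid requests recorded")
    return violations


print(zero_dollar_violations(SAMPLE_RECEIPT))
```

A check of this kind fits naturally into CI: fail the job if a run that was supposed to be free reports any paid activity.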

DocPull accounting proof showing run.accounting.json fields including maximum paid cost, request counts, cache hits, local browser seconds, blocked actions, and route steps.
Accounting proof: the evidence pack carries a non-secret receipt for the route that produced it.

Browser only when needed

Not every page needs a browser. Static and server-rendered pages can often be captured faster through direct HTTP and framework-aware extraction. That remains the default path.

docpull render https://example.com/app --runtime local --budget 0

Some pages only expose useful content after JavaScript runs. For those cases, v5 adds an explicit local renderer through an external agent-browser-compatible CLI, so the base package stays browser-free unless you opt in. Cloud rendering remains available when needed, but it must be chosen explicitly. Under --budget 0, cloud renderers are blocked.
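One way to decide whether a page needs rendering at all is to check if the HTTP response is a JavaScript shell: script tags present, but almost no visible text. The heuristic below is a sketch written for this article, not DocPull's detection logic; the function name and the 200-character threshold are arbitrary assumptions.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Count visible text characters and <script> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than zero while inside <script> or <style>
        self.text_chars = 0
        self.script_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
            if tag == "script":
                self.script_tags += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def looks_like_js_shell(html, min_text_chars=200):
    """Heuristic: scripts are present but the page has very little visible text."""
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    return counter.script_tags > 0 and counter.text_chars < min_text_chars


shell = ('<html><head><title>App</title></head><body><div id="root"></div>'
         '<script src="/app.js"></script></body></html>')
print(looks_like_js_shell(shell))
```

A check like this keeps direct HTTP as the first attempt and reserves the local renderer for pages where the static response is clearly empty.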

Launch clip: direct HTTP stays first; local rendering is explicit; cloud browsers do not become a hidden fallback.

Measure free-first

Free-first is easy to say. The harder question is how often the free path actually completes the task.

DocPull 5.0 adds a zero-dollar benchmark mode:

docpull benchmark quick --zero-dollar --target-set zero-dollar --provider all

Each target lands in an explicit class: complete_for_0, complete_with_local_browser, partial_for_0, requires_provider, requires_cloud_browser, or blocked_by_policy.

That gives the project a concrete improvement target: raise the share of tasks completed locally instead of adding providers for their own sake.
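The improvement target can be expressed as a single number: the share of targets that finish without a provider or cloud route. The sketch below tallies the six classes named above. The function name and the sample results are hypothetical, and treating complete_with_local_browser as a local completion is an assumption of this example.

```python
from collections import Counter

# The six outcome classes named in the article.
CLASSES = (
    "complete_for_0",
    "complete_with_local_browser",
    "partial_for_0",
    "requires_provider",
    "requires_cloud_browser",
    "blocked_by_policy",
)

# Assumption: both fully local outcomes count toward the local share.
LOCAL_CLASSES = ("complete_for_0", "complete_with_local_browser")


def local_completion_share(results):
    """Fraction of benchmark targets completed without provider or cloud routes."""
    unknown = set(results) - set(CLASSES)
    if unknown:
        raise ValueError(f"unknown classes: {sorted(unknown)}")
    if not results:
        return 0.0
    counts = Counter(results)
    completed = sum(counts[name] for name in LOCAL_CLASSES)
    return completed / len(results)


# Hypothetical results for four targets.
sample = [
    "complete_for_0",
    "complete_with_local_browser",
    "requires_provider",
    "partial_for_0",
]
print(local_completion_share(sample))
```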

DocPull benchmark proof showing one complete zero-dollar run, benchmark score 91, pack score 100, and zero live provider cost.
Benchmark proof: the zero-dollar path is measurable, not just a positioning claim.

Escalation ladder

Local-first does not mean local-only. Providers and cloud-browser infrastructure are useful when a task genuinely needs them. The point of DocPull 5.0 is to make that boundary visible before money or external infrastructure enters the run.

| Route | Use when | What you keep | Cost / boundary |
| --- | --- | --- | --- |
| DocPull core | Public or authorized static/SSR docs, blogs, OpenAPI specs, feeds, filings, vendor pages. | Markdown, NDJSON, metadata, manifests, indexes, evidence packs. | Allowed under --budget 0. No provider key. |
| DocPull local render | Direct HTTP returns a JavaScript shell but local browser rendering works. | Rendered HTML via agent-browser plus DocPull conversion artifacts. | Allowed under --budget 0. Requires local runtime. |
| Tavily | Fast web search, site mapping, broad discovery before local fetch. | Candidate sources or provider context DocPull can normalize. | Paid-capable. Dry-run first; live runs are blocked under --budget 0. |
| Exa | Keyword search misses intent: semantic discovery, similar pages, competitors, long-tail sources. | Relevant URLs and extracts for downstream evidence packs. | Paid-capable. Use when meaning-based discovery earns it. |
| Parallel | Richer live-web search, extract, and context-pack workflows across multiple sources. | Provider context packs, then DocPull scoring, citations, entities, briefs. | Paid-capable. Escalate after local/open discovery. |
| Vercel Sandbox | Local rendering is unsuitable and isolated browser execution through Vercel helps. | Same agent-browser JSON contract as local rendering. | Cloud route. Explicit runtime only; blocked under --budget 0. |
| E2B Sandbox | Need API-keyed sandbox, prebuilt template, or file-based render result transport. | Sandboxed render payload for DocPull conversion and artifacts. | Cloud route. Explicit runtime only; blocked under --budget 0. |
| Full browser automation | Need clicks, login state, app workflows, CAPTCHA walls, or private dashboards. | Export rendered HTML or content, then pass it into the evidence pipeline. | Not a hidden fallback. Choose deliberately for interactive tasks. |

The intended escalation order is:

  1. Try open discovery: sitemaps, feeds, llms.txt, OpenAPI references, and public docs repositories.
  2. Try local rendering when direct HTTP only returns a shell.
  3. Dry-run a BYOK provider route before making a live request.
  4. Use a live provider when its discovery or extraction capability is worth the external request.
  5. Use cloud rendering only when local rendering or infrastructure is the blocker.

The practical question is not “which provider can I call?” It is: what is the next lowest-cost route that could complete this task while preserving evidence?
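That question maps onto a small selection function: walk an ordered ladder and return the first step that has not been tried and is permitted by the budget. The step names, their order, and the assumption that a provider dry run costs nothing are all illustrative choices for this sketch rather than DocPull's internal routing.

```python
# Ordered from cheapest to most expensive: (step name, paid_capable).
# Names are hypothetical; the dry run is assumed to cost nothing.
LADDER = [
    ("direct_http", False),
    ("open_discovery", False),
    ("local_render", False),
    ("provider_dry_run", False),
    ("provider_live", True),
    ("cloud_render", True),
]


def next_route(already_tried, budget_usd):
    """Return the next lowest-cost step the budget allows, or None if exhausted."""
    for name, paid_capable in LADDER:
        if name in already_tried:
            continue
        if paid_capable and budget_usd <= 0:
            continue  # blocked before execution under a zero budget
        return name
    return None


tried = {"direct_http", "open_discovery", "local_render", "provider_dry_run"}
print(next_route(set(), budget_usd=0))
print(next_route(tried, budget_usd=0))
print(next_route(tried, budget_usd=1))
```

Returning `None` under a zero budget is the useful signal here: it tells the caller that the free options are exhausted and that any further step is a deliberate spending decision.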

Who it is for

DocPull 5.0 is for teams that need live web context without losing the file trail behind that context.

| Audience | Why it matters |
| --- | --- |
| Agent builders | Need fresh context with a source trail behind the answer. |
| RAG teams | Want durable, re-indexable source packs outside the vector database. |
| Researchers | Need local corpora they can cite, inspect, compare, and refresh. |
| Cost-conscious teams | Need budget boundaries before execution, not after a bill arrives. |
| Developers | Want local control without rejecting providers when they are actually useful. |

What it does not pretend

DocPull will not complete every website locally. It does not claim to provide a proprietary web-scale index. It does not use stealth scraping or CAPTCHA bypass as a hidden fallback. It does not argue that paid providers and cloud browsers have no value.

The promise is narrower and more useful: begin with the least expensive, most inspectable path; preserve the evidence you collect; make escalation visible; and record how the run happened.

Try it

Install or upgrade:

pip install -U docpull

Then run one real URL you already care about:

docpull https://www.python.org/blogs/ --single -o ./python-news

docpull <URL> --budget 0 -o ./evidence

docpull discover scan <URL> -o ./packs/discovery

docpull render <URL> --runtime local --budget 0

docpull benchmark quick --zero-dollar --target-set zero-dollar --provider all

No provider key is required for the first pass. Run with --budget 0, inspect the files DocPull leaves behind, then decide whether a provider or cloud browser is actually worth adding.

Start locally, use open signals first, escalate with evidence, and keep the receipt.