DocPull | Benchmark | Validated benchmark | Updated June 10, 2026 | 4 min

Targets: 8 web surfaces
Runs: 120 provider runs
Spend: $0.470 incl. credits

Generated: June 9, 2026
Targets: 8
Cases: 40 (120 runs)
Routing gaps: 6

TL;DR

DocPull core crawls pages directly — no paid APIs, no provider spend. In this routing test it was still the most reliable default: best or tied-best on five of eight targets (pass^3 @90 = 88%). It is not a drop-in replacement for the paid paths — direct crawl and search-based discovery are different jobs. Parallel and Exa earn their place on archived docs, pricing pages, and provider-doc targets.

We ran DocPull core, Parallel Search, Parallel Context, Tavily Search + Extract, and Exa Search Contents against eight targets, three times each, under fixed limits. DocPull is the control layer: providers plug in for discovery or extraction, and DocPull normalizes, scores, and records every result in one matrix.

Recommended router policy

The matrix backs up the reasoning, but the operating rule is short:

Default to DocPull core for known static/server-rendered URLs — direct crawl, no provider spend.
Escalate to Parallel Search or Exa when discovery matters: archived docs, pricing pages, and provider-doc targets.
Mark failed or zero-record provider-target pairs as routing gaps until the next scheduled run.
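
The three rules above can be sketched as a small routing function. This is a minimal illustration, not DocPull's actual router: the function name `choose_path`, the `needs_discovery` flag, and the path labels are all assumptions made up for this example.

```python
from typing import Optional, Set, Tuple

# Hypothetical path labels; DocPull's real identifiers may differ.
CORE = "docpull-core"
ESCALATION = ("parallel-search", "exa-search-contents")


def choose_path(
    target: str,
    needs_discovery: bool,
    routing_gaps: Set[Tuple[str, str]],
) -> Optional[str]:
    """Pick a retrieval path for one target.

    routing_gaps holds (target, path) pairs that failed or returned
    zero records in the latest run; they are skipped until the next run.
    """
    # Rule 1: known static/server-rendered URLs try direct crawl first.
    # Rule 2: discovery-heavy targets try the paid search paths first.
    candidates = (*ESCALATION, CORE) if needs_discovery else (CORE, *ESCALATION)
    for path in candidates:
        # Rule 3: skip provider-target pairs marked as routing gaps.
        if (target, path) not in routing_gaps:
            return path
    return None  # every candidate is a known gap


gaps = {("exa_docs", "parallel-search")}
print(choose_path("exa_docs", needs_discovery=False, routing_gaps=gaps))  # docpull-core
print(choose_path("exa_docs", needs_discovery=True, routing_gaps=gaps))   # exa-search-contents
```

Keeping the gaps as plain (target, path) pairs means the set can be rebuilt from each scheduled run without touching the routing logic.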

Best path by target

Target	Best path	Score	Why it won
Parallel docs	DocPull core, Exa	100	Direct crawl and Exa Search Contents both reached canonical docs cleanly with full coverage.
Exa docs	DocPull core	100	Known source URLs crawled cleanly without needing external discovery. Three of the four provider paths are routing gaps here: two failed cells and one zero-records cell.
Tavily docs	Parallel Search, Exa	100	External discovery found stronger canonical pages than direct crawl alone.
Raindrop docs	DocPull core, Parallel Search, Exa	100	Multiple paths reached the same high-quality public source context.
DocPull docs	DocPull core, Exa	86	Same public path and limits as every other target; the score exposed docs-structure gaps rather than receiving special treatment.
Next.js docs	DocPull core, Parallel Search	91	JS-heavy structure reduced extraction quality for some provider paths.
Python 2.7 stdlib	Parallel Search	100	Search-based discovery handled the archived index better than direct crawl alone.
Tavily pricing	Parallel Search, Parallel Context, Exa	100	Provider discovery helped because freshness and URL selection mattered.

Provider × target heatmap

The full provider-by-target scores, for readers who want the raw cells behind the best-path table:

Target	DocPull core	Parallel Search	Parallel Context	Tavily	Exa
Parallel docs	100	95	94 [89–94]	94	100
Exa docs	100	0	no result	no result	90
Tavily docs	94	100	89 [89–94]	85	100
Raindrop docs	100	100	94	94	100
DocPull docs	86	0	no result	no result	86
Next.js docs	91	91	79 [79–80]	79	85
Python 2.7 stdlib	98	100	90 [86–90]	78	92
Tavily pricing	91	100	100	93	100

How to read this

This is a routing test under fixed limits (8 pages, depth 1, 5 search results, 2 extracts) — not a head-to-head provider benchmark. The paths do different jobs: core crawl walks a page graph; provider paths search and extract. The best-path table picks the strongest route, not a provider winner.
Each cell is a 0–100 score across coverage, cleanliness, source fidelity, freshness, and density. 0 = empty pack (zero records returned), “no result” = every run failed, and [min–max] shows the spread across runs. Empty packs and “no result” cells are both treated as routing gaps.
DocPull docs scored 86 because of real docs-structure gaps — not special treatment. It went through the same public path as every other target.
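
The cell notation described above can be reproduced from per-run scores. The sketch below is only an illustration: `render_cell` is a hypothetical helper, and rounding the median to an integer is an assumption. The per-run lists in the example are inferred from the median and range shown in the heatmap, not taken from the run artifacts.

```python
from statistics import median
from typing import Optional, Sequence


def render_cell(run_scores: Optional[Sequence[int]]) -> str:
    """Format one heatmap cell from its per-run scores.

    None or an empty sequence means no run produced a score ("no result").
    An empty pack is a run that completed but scored 0.
    """
    if not run_scores:
        return "no result"
    mid = round(median(run_scores))
    low, high = min(run_scores), max(run_scores)
    if low == high:
        return str(mid)  # no run-to-run variance, so no spread is shown
    return f"{mid} [{low}–{high}]"


print(render_cell([89, 94, 94]))  # 94 [89–94]
print(render_cell([0, 0, 0]))     # 0
print(render_cell(None))          # no result
```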

Reliability

The median hides variance. pass^k is stricter: a cell only passes if every repetition cleared the threshold. For routing, one good run out of three isn't good enough.

Path	pass^3 @80	pass^3 @90
DocPull core	100%	88%
Exa Search Contents	100%	75%
Parallel Search + Context	79%	57%
Tavily Search + Extract	67%	50%

86.1% of cells passed at 80 and 66.7% at 90 (36 qualifying cells; four that failed all three trials are excluded and listed under routing gaps). DocPull core is the most reliable default; Exa is the strongest paid escalation.
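
The pass^3 figures can be checked against the heatmap. The sketch below is a hedged illustration rather than the benchmark's own code: `pass_k` is a hypothetical helper, and the worst-run values are read off the heatmap table (the low end of [min–max] where a spread is shown, otherwise the single score). It assumes "clears the threshold" means greater than or equal to it, which matches the reported percentages.

```python
from typing import Optional, Sequence


def pass_k(worst_runs: Sequence[Optional[int]], threshold: int) -> float:
    """Fraction of qualifying cells whose worst run clears the threshold.

    None marks a cell that failed every repetition; it is left out of
    the denominator, as described above.
    """
    qualifying = [score for score in worst_runs if score is not None]
    if not qualifying:
        raise ValueError("no qualifying cells")
    passed = sum(score >= threshold for score in qualifying)
    return passed / len(qualifying)


# Worst-run score per target, in heatmap row order.
docpull_core = [100, 100, 94, 100, 86, 91, 98, 91]
tavily = [94, None, 85, 94, None, 79, 78, 93]

print(f"{pass_k(docpull_core, 90):.0%}")  # 88%
print(f"{pass_k(tavily, 80):.0%}")        # 67%
```

Only the worst run matters for each cell, so a single weak repetition fails the whole cell even when the median looks strong.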

Cost

USD-billed provider spend (Parallel and Exa) totaled $0.414; adding Tavily credits, converted at the stated per-credit price, brings it to $0.470. Latency is tracked in Raindrop traces but not used to rank paths.

Path	Total (N=3, 8 targets)	Median per cell	Best use
DocPull core	$0.000	$0.000	Direct crawl; no provider spend.
Parallel Search	$0.120	$0.015	Best on discovery-heavy targets and archived docs.
Parallel Context	$0.126	$0.021	Freshness-sensitive pages when URLs are found; 18 of 24 runs billed (two routing-gap cells, three runs each).
Tavily Search + Extract	$0.056	—	Compact search/extract fallback. Converted from 7 extract credits at the stated $0.008/credit policy; billed in credits, not USD.
Exa Search Contents	$0.168	$0.021	Fast search-content retrieval and provider docs.
Total (paid providers)	$0.414	—	USD-billed spend (Parallel and Exa) across 8 targets × N=3 runs; $0.470 including converted Tavily credits.
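
The totals in the table follow from simple arithmetic. The sketch below only re-adds the figures stated above; the variable names are made up for this example, and `Decimal` is used so the sums stay exact to the cent fraction.

```python
from decimal import Decimal

# Figures copied from the cost table above.
usd_billed = {
    "Parallel Search": Decimal("0.120"),
    "Parallel Context": Decimal("0.126"),
    "Exa Search Contents": Decimal("0.168"),
}
tavily_credits = 7
tavily_credit_usd = Decimal("0.008")  # stated per-credit conversion price

usd_total = sum(usd_billed.values(), Decimal("0"))
tavily_usd = tavily_credits * tavily_credit_usd
grand_total = usd_total + tavily_usd

print(usd_total)    # 0.414
print(tavily_usd)   # 0.056
print(grand_total)  # 0.470
```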

Routing gaps

Six provider-target pairs returned no usable context. They stay in the matrix as routing data: the router skips them until the next run.

Target / path	Reason
Exa docs/Parallel Search	Returned zero records (empty pack, score 0).
Exa docs/Parallel Context	All three repetitions failed during extraction.
Exa docs/Tavily Search + Extract	No extractable URLs under the limits.
DocPull docs/Parallel Search	Returned zero records (empty pack, score 0).
DocPull docs/Parallel Context	No URLs available for the extractor under the limits.
DocPull docs/Tavily Search + Extract	No extractable URLs under the limits.

Reliable agents need retrieval they can measure, route, and audit.

Methodology and reproducibility

Reference material: the scoring policy, run settings, trace details, and the exact command to reproduce this matrix. The sections above stand without it.

Method notes

Targets: Five tool docs sites plus three hard targets: JS-heavy docs, archived docs, and pricing.
Providers: DocPull core, Parallel Search, Parallel Context, Tavily Search + Extract, and Exa Search Contents.
Repetition: Each provider-target cell ran N=3 times. The heatmap shows the median across runs; cells with run-to-run variance render the spread as [min–max]. Per-run artifacts live under run-1/, run-2/, run-3/ subdirs in the benchmark output.
Scoring: Weighted score: coverage 30%, cleanliness 20%, source fidelity 20%, freshness 15%, density 15%. The weights are heuristic — the sub-score signals are the load-bearing detail, not the headline number.
Boilerplate detection: Used inside the cleanliness and density dimensions. It is a substring sniff on English navigation phrases ("skip to main content", "on this page", "cookie", etc.) and will under-report localized boilerplate.
Freshness signal: A presence test for target-specific terms in URL, title, or first 5000 characters of body. It is a heuristic freshness signal, not proof that the page is current — it does not check page modification time.
Empty packs: Cells that returned zero records score 0, not the arithmetic average of vacuous high sub-scores. See exa_docs / Parallel Search and docpull_docs / Parallel Search in the routing-gaps table for two real cases.
Reliability (pass^k): Alongside the median, every cell is also scored with pass^k: the fraction of cells whose worst run still clears a threshold. The article reports pass^3 @80 and pass^3 @90. Cells that failed all three trials produce no scores and are excluded from the pass^k denominator (36 of 40 cells qualify); they remain in the routing-gaps list.
Production signals: Each cell also records latency, normalized cost, routing gaps, URL fidelity, and usable context returned per token.
Cost: Live provider spend totaled $0.414 across N=3 repetitions, from Parallel Search, Parallel Context, and Exa. Tavily billed 7 extract credits, shown converted at the stated $0.008/credit policy ($0.056).
Trace policy: Raindrop stores metadata only: provider, target, score, latency, cost, URLs, counts, signal names, and artifact path.
Crawl limits: Caps: 8 pages, depth 1, 5 search results, 2 extracts.
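
The boilerplate and freshness heuristics in the notes above are simple enough to sketch. This is an assumption-laden illustration, not DocPull's implementation: the function names are hypothetical, the phrase list contains only the three phrases quoted above (the real list is longer), and case-insensitive matching is assumed.

```python
from typing import Iterable

# Phrases quoted in the method notes; the real list is longer ("etc.").
NAV_PHRASES = ("skip to main content", "on this page", "cookie")


def boilerplate_hits(text: str, phrases: Iterable[str] = NAV_PHRASES) -> int:
    """Count how many known English navigation phrases occur in the text."""
    lowered = text.lower()
    return sum(phrase in lowered for phrase in phrases)


def has_freshness_signal(
    url: str, title: str, body: str, terms: Iterable[str]
) -> bool:
    """Presence test: any target term in URL, title, or first 5000 body chars."""
    haystack = " ".join((url, title, body[:5000])).lower()
    return any(term.lower() in haystack for term in terms)


print(boilerplate_hits("Skip to main content. On this page: Intro"))  # 2
print(boilerplate_hits("Aller au contenu principal"))                 # 0
```

The second call shows the limitation noted above: localized navigation text produces no hits, so boilerplate on non-English pages is under-reported.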

Score dimensions

Each cell is a weighted 0–100 score. Roughly: 90+ is strong, 80–89 is usable with a fallback, 0 is an empty pack (zero records returned), and “no result” means every repetition failed. Both of the last two count as routing gaps.

Dimension	Weight	Signal
Coverage	30%	Pages that matter included.
Cleanliness	20%	Duplicated nav, boilerplate, and broken text avoided.
Source fidelity	20%	Selected URLs canonical and on-target.
Freshness signal	15%	Heuristic presence of recency markers for freshness-sensitive targets, not a check on page modification time.
Density	15%	Usable context high relative to token budget.
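
The weighting in the table above amounts to a weighted sum with one special case for empty packs. A minimal sketch follows, assuming each sub-score is already on a 0–100 scale and that the result is rounded to an integer; `cell_score` and the sub-score values in the example are hypothetical.

```python
from typing import Mapping

# Weights from the score-dimensions table (they sum to 1.0).
WEIGHTS = {
    "coverage": 0.30,
    "cleanliness": 0.20,
    "source_fidelity": 0.20,
    "freshness": 0.15,
    "density": 0.15,
}


def cell_score(sub_scores: Mapping[str, float], record_count: int) -> int:
    """Combine 0-100 sub-scores into one weighted 0-100 score.

    An empty pack scores 0 outright, so vacuously high sub-scores
    (for example "no boilerplate found") cannot inflate the result.
    """
    if record_count == 0:
        return 0
    total = sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)
    return round(total)


# Made-up sub-scores, for illustration only.
example = {
    "coverage": 90,
    "cleanliness": 80,
    "source_fidelity": 100,
    "freshness": 60,
    "density": 80,
}
print(cell_score(example, record_count=5))  # 84
print(cell_score(example, record_count=0))  # 0
```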

Targets

Target	Kind	URL
Parallel docs	docs	docs.parallel.ai
Exa docs	docs	docs.exa.ai
Tavily docs	docs	docs.tavily.com
Raindrop docs	docs	www.raindrop.ai/docs
DocPull docs	docs	docpull.raintree.technology
Next.js docs	JS-heavy docs	nextjs.org/docs
Python 2.7 stdlib	archived docs	docs.python.org/2.7/library/index.html
Tavily pricing	pricing freshness	www.tavily.com/pricing

Workload settings

The core crawl walks a page graph from a seed URL; provider workflows fetch a fixed number of search results and optionally extract their content.

Workflow	Settings	Median records	Records range
DocPull core	max_pages=8, max_depth=1, max_concurrent=8	8	1–11
Parallel Search	max_search_results=5, mode=advanced	5	0–5
Parallel Context	max_search_results=5, extract_limit=2, mode=advanced	2	2–6
Tavily Search + Extract	max_search_results=5, extract_limit=2	2	2
Exa Search Contents	max_search_results=5	5	1–5

What each path does

The DocPull column is DocPull by itself: crawl, normalize, score, write the context pack. The other columns are provider-assisted paths; DocPull still normalizes, scores, and records every result in the same matrix.

Path	Role	What changed in this run
DocPull core	Direct crawl and normalization	Best default for known static or server-rendered URLs.
Parallel Search	External discovery before DocPull scoring	Helped on Tavily docs, the Python archive, and Tavily pricing.
Parallel Context	External extraction path	Matched the best score on Tavily pricing, but produced no result on DocPull docs.
Exa Search Contents	Fast search-content retrieval	Tied best scores on Tavily docs, Raindrop docs, and Tavily pricing.
Tavily Search + Extract	Compact search plus extraction	Useful when it finds the right URLs; otherwise it should be visible to the router as no result.

Raindrop traces

Raindrop, the observability layer used here, is not the retriever or judge. It records one run event, one tool trace per matrix cell, and signals for cells that need attention — so repeated runs surface drift in quality, routing gaps, cost, and freshness signal.

Event: One run-level event anchors the audit: da5b1f19-460e-465a-8949-f48c03772fae.
Per-cell traces: Expect 40 tool traces, one per provider-target cell, with workflow, median score, median latency, cost, URLs, counts, and artifact path. Each aggregate trace covers the underlying N=3 runs.
Signals: This run emitted 130 metadata signals (14 positive, 116 negative): high-score notes, low-score notes, score-dimension notes, high-cost cells, slow cells, failed cases, and follow-up checks.
Filters: Slice by provider, workflow, target, status, score band, cost, and signal name.
Schedule: Repeat the same matrix weekly so Raindrop can detect drift in quality, routing gaps, cost, and freshness.

Reproduce

Exact run:

docpull benchmark quick --target-set provider-matrix --provider all --trace raindrop --max-pages 8 --max-depth 1 --max-search-results 5 --extract-limit 2 --tavily-credit-usd 0.008 --max-estimated-cost 5.00