- Targets: 8 web surfaces
- Runs: 120 provider runs
- Spend: $0.470 incl. credits
- Generated: June 9, 2026
- Cases: 40 (120 runs)
- Routing gaps: 6
TL;DR
DocPull core crawls pages directly — no paid APIs, no provider spend. In this routing test it was still the most reliable default: best or tied-best on five of eight targets (pass^3 @90 = 88%). It is not a drop-in replacement for the paid paths — direct crawl and search-based discovery are different jobs. Parallel and Exa earn their place on archived docs, pricing pages, and provider-doc targets.
We ran DocPull core, Parallel Search, Parallel Context, Tavily Search + Extract, and Exa Search Contents against eight targets, three times each, under fixed limits. DocPull is the control layer: providers plug in for discovery or extraction, and DocPull normalizes, scores, and records every result in one matrix.
Recommended router policy
The matrix backs up the reasoning, but the operating rule is short:
- Default to DocPull core for known static/server-rendered URLs — direct crawl, no provider spend.
- Escalate to Parallel Search or Exa when discovery matters: archived docs, pricing pages, and provider-doc targets.
- Mark failed or zero-record provider-target pairs as routing gaps until the next scheduled run.
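The three rules above can be sketched as a small routing function. This is a minimal illustration under stated assumptions, not DocPull's actual router: `choose_paths`, the `needs_discovery` flag, and the path labels are hypothetical names introduced here.

```python
from typing import List, Set, Tuple

# Hypothetical path labels; the article names the paths but not any identifiers.
CORE = "docpull_core"
ESCALATIONS = ["parallel_search", "exa_search_contents"]


def choose_paths(
    target: str,
    needs_discovery: bool,
    routing_gaps: Set[Tuple[str, str]],
) -> List[str]:
    """Return candidate paths in priority order for one target.

    needs_discovery is an assumed caller-supplied flag for targets such as
    archived docs, pricing pages, or provider docs.
    routing_gaps holds (target, path) pairs that failed or returned zero
    records in the last scheduled run.
    """
    if needs_discovery:
        ordered = ESCALATIONS + [CORE]
    else:
        ordered = [CORE] + ESCALATIONS
    # Skip any pair currently marked as a routing gap.
    return [path for path in ordered if (target, path) not in routing_gaps]


gaps = {("exa_docs", "parallel_search")}
print(choose_paths("exa_docs", needs_discovery=False, routing_gaps=gaps))
print(choose_paths("python_27_stdlib", needs_discovery=True, routing_gaps=gaps))
```

Keeping the gap set as plain data means a scheduled run can replace it wholesale, so a pair that recovers is routable again without code changes.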
Best path by target
| Target | Best path | Score | Why it won |
|---|---|---|---|
| Parallel docs | DocPull core, Exa | 100 | Direct crawl and Exa Search Contents both reached canonical docs cleanly with full coverage. |
| Exa docs | DocPull core | 100 | Known source URLs crawled cleanly without needing external discovery. Provider paths struggled here (one routing gap, two failed cells, and one zero-records cell). |
| Tavily docs | Parallel Search, Exa | 100 | External discovery found stronger canonical pages than direct crawl alone. |
| Raindrop docs | DocPull core, Parallel Search, Exa | 100 | Multiple paths reached the same high-quality public source context. |
| DocPull docs | DocPull core, Exa | 86 | Same public path and limits as every other target; the score exposed docs-structure gaps rather than receiving special treatment. |
| Next.js docs | DocPull core, Parallel Search | 91 | JS-heavy structure reduced extraction quality for some provider paths. |
| Python 2.7 stdlib | Parallel Search | 100 | Search-based discovery handled the archived index better than direct crawl alone. |
| Tavily pricing | Parallel Search, Parallel Context, Exa | 100 | Provider discovery helped because freshness and URL selection mattered. |
Provider × target heatmap
The full provider-by-target scores, for readers who want the raw cells behind the best-path table:
How to read this
- This is a routing test under fixed limits (8 pages, depth 1, 5 search results, 2 extracts) — not a head-to-head provider benchmark. The paths do different jobs: core crawl walks a page graph; provider paths search and extract. The best-path table picks the strongest route, not a provider winner.
- Each cell is a 0–100 score across coverage, cleanliness, source fidelity, freshness, and density. 0 = empty pack, “no result” = routing gap, and [min–max] shows variance across runs.
- DocPull docs scored 86 because of real docs-structure gaps — not special treatment. It went through the same public path as every other target.
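The cell notation described above (median, optional [min–max] spread, and "no result" for a routing gap) can be sketched as follows. This is an illustrative helper, not the report's rendering code: `render_cell` is a hypothetical name, and the scores in the usage lines are made up.

```python
from statistics import median
from typing import Optional, Sequence


def render_cell(run_scores: Optional[Sequence[float]]) -> str:
    """Render one heatmap cell from its per-run scores.

    None or an empty sequence means no repetition produced a score
    (a routing gap). An empty pack is a real run that scored 0.
    """
    if not run_scores:
        return "no result"
    mid = median(run_scores)
    low, high = min(run_scores), max(run_scores)
    if low == high:
        # No run-to-run variance: show the score alone.
        return f"{mid:g}"
    return f"{mid:g} [{low:g}–{high:g}]"


# Hypothetical scores, for illustration only.
print(render_cell([88, 91, 95]))
print(render_cell([100, 100, 100]))
print(render_cell([0, 0, 0]))
print(render_cell(None))
```

Treating an empty pack (score 0) and a routing gap ("no result") as distinct outputs keeps the two failure modes separable downstream.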
Reliability
The median hides variance. pass^k is stricter: a cell only passes if every repetition cleared the threshold. For routing, one good run out of three isn't good enough.
| Path | pass^3 @80 | pass^3 @90 |
|---|---|---|
| DocPull core | 100% | 88% |
| Exa Search Contents | 100% | 75% |
| Parallel Search + Context | 79% | 57% |
| Tavily Search + Extract | 67% | 50% |
86.1% of cells passed at 80 and 66.7% at 90 (36 qualifying cells; four that failed all three trials are excluded and listed under routing gaps). DocPull core is the most reliable default; Exa is the strongest paid escalation.
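A minimal sketch of the pass^k calculation, assuming "clears the threshold" means greater than or equal to it. The function name `pass_k` and the example scores are hypothetical; only the rule itself (worst run must pass, cells with no scores are excluded) comes from the text.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, str]  # (path, target)


def pass_k(cells: Dict[Cell, List[float]], threshold: float) -> float:
    """Fraction of qualifying cells whose worst run clears the threshold.

    Cells with no scores (every trial failed) are left out of the denominator.
    Assumption: a score equal to the threshold counts as a pass.
    """
    qualifying = [scores for scores in cells.values() if scores]
    if not qualifying:
        raise ValueError("no qualifying cells")
    passed = sum(1 for scores in qualifying if min(scores) >= threshold)
    return passed / len(qualifying)


# Hypothetical scores for illustration only; not taken from the run.
example = {
    ("core", "a"): [95, 96, 97],
    ("core", "b"): [92, 85, 93],
    ("search", "a"): [0, 0, 0],  # empty pack: qualifies, then fails
    ("search", "b"): [],         # failed all trials: excluded
}
print(pass_k(example, 80))  # 2 of 3 qualifying cells
print(pass_k(example, 90))  # 1 of 3 qualifying cells
```

As a consistency check on the reported figures: with 36 qualifying cells, 86.1% and 66.7% correspond to 31 and 24 passing cells respectively.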
Cost
Provider spend totaled $0.414. Tavily credits are converted at the stated per-credit price. Latency is tracked in Raindrop traces but not used to rank paths.
| Path | Total (N=3, 8 targets) | Median per cell | Best use |
|---|---|---|---|
| DocPull core | $0.000 | $0.000 | Direct crawl; no provider spend. |
| Parallel Search | $0.120 | $0.015 | Best on discovery-heavy targets and archived docs. |
| Parallel Context | $0.126 | $0.021 | Freshness-sensitive pages when URLs are found; 18 / 24 runs billed (the other six fell in routing gaps). |
| Tavily Search + Extract | $0.056 | — | Compact search/extract fallback. Converted from 7 extract credits at the stated $0.008/credit policy; billed in credits, not USD. |
| Exa Search Contents | $0.168 | $0.021 | Fast search-content retrieval and provider docs. |
| Total (paid providers) | $0.414 | — | USD-billed spend (Parallel and Exa) across 8 targets × N=3 runs; $0.470 including converted Tavily credits. |
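The totals in the table can be re-derived from its rows. The sketch below uses only figures stated above; the variable names are illustrative, and `Decimal` is used so the three-decimal amounts add exactly.

```python
from decimal import Decimal

# USD-billed totals from the cost table.
usd_billed = {
    "Parallel Search": Decimal("0.120"),
    "Parallel Context": Decimal("0.126"),
    "Exa Search Contents": Decimal("0.168"),
}

# Tavily bills in credits; converted at the stated per-credit price.
tavily_credits = 7
usd_per_credit = Decimal("0.008")

provider_total = sum(usd_billed.values(), Decimal("0"))
tavily_converted = tavily_credits * usd_per_credit
grand_total = provider_total + tavily_converted

print(provider_total, tavily_converted, grand_total)
```

This reproduces the $0.414 provider total, the $0.056 Tavily conversion, and the $0.470 headline figure.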
Routing gaps
Six provider-target pairs returned no usable context. They stay in the matrix as routing data: the router skips them until the next run.
| Target / path | Reason |
|---|---|
| Exa docs/Parallel Search | Returned zero records (empty pack, score 0). |
| Exa docs/Parallel Context | All three repetitions failed during extraction. |
| Exa docs/Tavily Search + Extract | No extractable URLs under the limits. |
| DocPull docs/Parallel Search | Returned zero records (empty pack, score 0). |
| DocPull docs/Parallel Context | No URLs available for the extractor under the limits. |
| DocPull docs/Tavily Search + Extract | No extractable URLs under the limits. |
Reliable agents need retrieval they can measure, route, and audit.
Methodology and reproducibility
Reference material: the scoring policy, run settings, trace details, and the exact command to reproduce this matrix. The sections above stand without it.
Method notes
- Targets: Five tool docs sites plus three hard targets: JS-heavy docs, archived docs, and pricing.
- Providers: DocPull core, Parallel Search, Parallel Context, Tavily Search + Extract, and Exa Search Contents.
- Repetition: Each provider-target cell ran N=3 times. The heatmap shows the median across runs; cells with run-to-run variance render the spread as [min–max]. Per-run artifacts live under run-1/, run-2/, run-3/ subdirs in the benchmark output.
- Scoring: Weighted score: coverage 30%, cleanliness 20%, source fidelity 20%, freshness 15%, density 15%. The weights are heuristic; the sub-score signals are the load-bearing detail, not the headline number.
- Boilerplate detection: Used inside the cleanliness and density dimensions. It is a substring sniff on English navigation phrases ("skip to main content", "on this page", "cookie", etc.) and will under-report localized boilerplate.
- Freshness signal: A presence test for target-specific terms in the URL, title, or first 5000 characters of the body. It is a heuristic freshness signal, not proof that the page is current; it does not check page modification time.
- Empty packs: Cells that returned zero records score 0, not the arithmetic average of vacuous high sub-scores. See exa_docs / Parallel Search and docpull_docs / Parallel Search in the routing-gaps table for two real cases.
- Reliability (pass^k): Alongside the median, every cell is also checked against pass^k thresholds; the reported figure is the fraction of cells whose worst run still clears a threshold. The article reports pass^3 @80 and pass^3 @90. Cells that failed all three trials produce no scores and are excluded from the pass^k denominator (36 of 40 cells qualify); they remain in the routing-gaps list.
- Production signals: Each cell also records latency, normalized cost, routing gaps, URL fidelity, and usable context returned per token.
- Cost: Live provider spend totaled $0.414 across N=3 repetitions, from Parallel Search, Parallel Context, and Exa. Tavily billed 7 extract credits, shown converted at the stated $0.008/credit policy ($0.056).
- Trace policy: Raindrop stores metadata only: provider, target, score, latency, cost, URLs, counts, signal names, and artifact path.
- Crawl limits: 8 pages, depth 1, 5 search results, 2 extracts.
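The boilerplate and freshness heuristics in the notes above are simple enough to sketch. This is an approximation under stated assumptions, not the benchmark's code: the function names are hypothetical, the phrase list contains only the three phrases quoted in the notes, and both case-insensitive matching and "any term matches" are assumptions.

```python
from typing import Iterable

# Phrases quoted in the method notes; the real list is longer ("etc.").
NAV_PHRASES = ("skip to main content", "on this page", "cookie")

# Window size from the freshness note.
BODY_WINDOW = 5000


def boilerplate_hits(text: str) -> int:
    """Count how many known English navigation phrases occur in the text."""
    lowered = text.lower()  # case-insensitive matching is an assumption
    return sum(1 for phrase in NAV_PHRASES if phrase in lowered)


def has_freshness_signal(url: str, title: str, body: str, terms: Iterable[str]) -> bool:
    """True if any target-specific term appears in the URL, title, or the
    first 5000 characters of the body. Says nothing about modification time.
    """
    haystack = " ".join((url, title, body[:BODY_WINDOW])).lower()
    return any(term.lower() in haystack for term in terms)


print(boilerplate_hits("Skip to main content. On this page: Install"))
print(boilerplate_hits("Zum Hauptinhalt springen"))  # localized nav is missed
print(has_freshness_signal("https://example.com/pricing", "Plans", "", ["pricing"]))
```

The second call shows the limitation the notes mention: a German skip link scores zero hits, so localized boilerplate goes under-reported.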
Score dimensions
Each cell is a weighted 0–100 score. Roughly: 90+ is strong, 80–89 is usable with a fallback, 0 is an empty pack, and “no result” is a routing gap.
| Dimension | Weight | Signal |
|---|---|---|
| Coverage | 30% | Pages that matter included. |
| Cleanliness | 20% | Duplicated nav, boilerplate, and broken text avoided. |
| Source fidelity | 20% | Selected URLs canonical and on-target. |
| Freshness signal | 15% | Heuristic presence of recency markers for freshness-sensitive targets, not a check on page modification time. |
| Density | 15% | Usable context high relative to token budget. |
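The weighting in the table can be written out as a short function. This is a sketch, assuming each sub-score is already on a 0–100 scale; `cell_score`, the dictionary keys, and the example sub-scores are hypothetical. The zero-record guard reflects the empty-pack rule from the method notes.

```python
from typing import Dict

# Weights from the table above, in percent.
WEIGHTS: Dict[str, int] = {
    "coverage": 30,
    "cleanliness": 20,
    "source_fidelity": 20,
    "freshness": 15,
    "density": 15,
}


def cell_score(sub_scores: Dict[str, float], record_count: int) -> float:
    """Weighted 0-100 score for one run of one cell.

    An empty pack (zero records) scores 0 outright, so vacuously high
    sub-scores cannot average into a passing number.
    """
    if record_count == 0:
        return 0.0
    missing = WEIGHTS.keys() - sub_scores.keys()
    if missing:
        raise KeyError(f"missing sub-scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS) / 100


# Hypothetical sub-scores, for illustration only.
subs = {
    "coverage": 100,
    "cleanliness": 80,
    "source_fidelity": 90,
    "freshness": 100,
    "density": 60,
}
print(cell_score(subs, record_count=5))  # 30 + 16 + 18 + 15 + 9 = 88.0
print(cell_score(subs, record_count=0))  # empty pack
```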
Targets
| Target | Kind | URL |
|---|---|---|
| Parallel docs | docs | docs.parallel.ai |
| Exa docs | docs | docs.exa.ai |
| Tavily docs | docs | docs.tavily.com |
| Raindrop docs | docs | www.raindrop.ai/docs |
| DocPull docs | docs | docpull.raintree.technology |
| Next.js docs | JS-heavy docs | nextjs.org/docs |
| Python 2.7 stdlib | archived docs | docs.python.org/2.7/library/index.html |
| Tavily pricing | pricing freshness | www.tavily.com/pricing |
Workload settings
The core crawl walks a page graph from a seed URL; provider workflows fetch a fixed number of search results and optionally extract their content.
| Workflow | Settings | Median records | Records range |
|---|---|---|---|
| DocPull core | max_pages=8, max_depth=1, max_concurrent=8 | 8 | 1–11 |
| Parallel Search | max_search_results=5, mode=advanced | 5 | 0–5 |
| Parallel Context | max_search_results=5, extract_limit=2, mode=advanced | 2 | 2–6 |
| Tavily Search + Extract | max_search_results=5, extract_limit=2 | 2 | 2 |
| Exa Search Contents | max_search_results=5 | 5 | 1–5 |
What each path does
The DocPull column is DocPull by itself: crawl, normalize, score, write the context pack. The other columns are provider-assisted paths; DocPull still normalizes, scores, and records every result in the same matrix.
| Path | Role | What changed in this run |
|---|---|---|
| DocPull core | Direct crawl and normalization | Best default for known static or server-rendered URLs. |
| Parallel Search | External discovery before DocPull scoring | Helped on Tavily docs, the Python archive, and Tavily pricing. |
| Parallel Context | External extraction path | Matched the best score on Tavily pricing, but produced no result on DocPull docs. |
| Exa Search Contents | Fast search-content retrieval | Tied best scores on Tavily docs, Raindrop docs, and Tavily pricing. |
| Tavily Search + Extract | Compact search plus extraction | Useful when it finds the right URLs; otherwise it should be visible to the router as no result. |
Raindrop traces
Raindrop, the observability layer used here, is not the retriever or judge. It records one run event, one tool trace per matrix cell, and signals for cells that need attention — so repeated runs surface drift in quality, routing gaps, cost, and freshness signal.
- Event: One run-level event anchors the audit: da5b1f19-460e-465a-8949-f48c03772fae.
- Per-cell traces: Expect 40 tool traces, one per provider-target cell, with workflow, median score, median latency, cost, URLs, counts, and artifact path. Each aggregate trace covers the underlying N=3 runs.
- Signals: This run emitted 130 metadata signals (14 positive, 116 negative): high-score notes, low-score notes, score-dimension notes, high-cost cells, slow cells, failed cases, and follow-up checks.
- Filters: Slice by provider, workflow, target, status, score band, cost, and signal name.
- Schedule: Repeat the same matrix weekly so Raindrop can detect drift in quality, routing gaps, cost, and freshness.
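To make the per-cell trace and filter ideas concrete, here is one possible shape for a metadata-only trace record and a score-band filter. Everything named here (`CellTrace`, its fields, `needs_attention`) is hypothetical and is not Raindrop's API or schema; the field list simply mirrors the metadata the article says is stored.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class CellTrace:
    """Metadata-only record for one provider-target cell (assumed shape)."""
    provider: str
    target: str
    median_score: Optional[float]  # None when the cell produced no result
    median_latency_ms: float
    cost_usd: float
    urls: Tuple[str, ...] = ()
    signal_names: Tuple[str, ...] = ()
    artifact_path: str = ""


def needs_attention(traces: Iterable[CellTrace], min_score: float = 80.0) -> List[CellTrace]:
    """Return cells with no result or a median score below min_score."""
    return [t for t in traces if t.median_score is None or t.median_score < min_score]


# Hypothetical records, for illustration only.
traces = [
    CellTrace("docpull_core", "nextjs_docs", 91.0, 1200.0, 0.0),
    CellTrace("parallel_context", "docpull_docs", None, 0.0, 0.0),
]
for trace in needs_attention(traces):
    print(trace.provider, trace.target)
```

Comparing the output of a filter like this across weekly runs is one simple way to spot drift in routing gaps.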
Reproduce
Exact run:
docpull benchmark quick --target-set provider-matrix --provider all --trace raindrop --max-pages 8 --max-depth 1 --max-search-results 5 --extract-limit 2 --tavily-credit-usd 0.008 --max-estimated-cost 5.00