Retrieval Routing Matrix: DocPull, Parallel, Exa, Tavily, and Raindrop Traces

A provider-routing matrix comparing DocPull core with provider-assisted paths across eight targets: one router policy, the best path per target, reliability and cost, with full methodology in the appendix.

DocPull | Benchmark | Validated benchmark | 4 min

Targets: 8 web surfaces
Runs: 120 provider runs
Spend: $0.470 incl. credits

TL;DR

DocPull core crawls pages directly — no paid APIs, no provider spend. In this routing test it was still the most reliable default: best or tied-best on five of eight targets (pass^3 @90 = 88%). It is not a drop-in replacement for the paid paths — direct crawl and search-based discovery are different jobs. Parallel and Exa earn their place on archived docs, pricing pages, and provider-doc targets.

We ran DocPull core, Parallel Search, Parallel Context, Tavily Search + Extract, and Exa Search Contents against eight targets, three times each, under fixed limits. DocPull is the control layer: providers plug in for discovery or extraction, and DocPull normalizes, scores, and records every result in one matrix.

Best path by target

Parallel docs: 100
Best paths: DocPull core, Exa

Direct crawl and Exa Search Contents both reached canonical docs cleanly with full coverage.

Exa docs: 100
Best path: DocPull core

Known source URLs crawled cleanly without needing external discovery. Three provider paths struggled here: Parallel Search returned zero records, Parallel Context failed during extraction, and Tavily found no extractable URLs under the limits.

Tavily docs: 100
Best paths: Parallel Search, Exa

External discovery found stronger canonical pages than direct crawl alone.

Raindrop docs: 100
Best paths: DocPull core, Parallel Search, Exa

Multiple paths reached the same high-quality public source context.

DocPull docs: 86
Best paths: DocPull core, Exa

Same public path and limits as every other target; the score exposed docs-structure gaps rather than receiving special treatment.

Next.js docs: 91
Best paths: DocPull core, Parallel Search

JS-heavy structure reduced extraction quality for some provider paths.

Python 2.7 stdlib: 100
Best path: Parallel Search

Search-based discovery handled the archived index better than direct crawl alone.

Tavily pricing: 100
Best paths: Parallel Search, Parallel Context, Exa

Provider discovery helped because freshness and URL selection mattered.

Provider × target heatmap

The full provider-by-target scores, for readers who want the raw cells behind the best-path table:

| Target | DocPull | P Search | P Context | Tavily | Exa |
| --- | --- | --- | --- | --- | --- |
| Parallel docs | 100 | 95 | 94 [89–94] | 94 | 100 |
| Exa docs | 100 | 0 | no result | no result | 90 |
| Tavily docs | 94 | 100 | 89 [89–94] | 85 | 100 |
| Raindrop docs | 100 | 100 | 94 | 94 | 100 |
| DocPull docs | 86 | 0 | no result | no result | 86 |
| Next.js docs | 91 | 91 | 79 [79–80] | 79 | 85 |
| Python 2.7 stdlib | 98 | 100 | 90 [86–90] | 78 | 92 |
| Tavily pricing | 91 | 100 | 100 | 93 | 100 |

How to read this

  • This is a routing test under fixed limits (8 pages, depth 1, 5 search results, 2 extracts) — not a head-to-head provider benchmark. The paths do different jobs: core crawl walks a page graph; provider paths search and extract. The best-path table picks the strongest route, not a provider winner.
  • Each cell is a 0–100 score across coverage, cleanliness, source fidelity, freshness, and density. 0 = empty pack, “no result” = routing gap, and [min–max] shows variance across runs.
  • DocPull docs scored 86 because of real docs-structure gaps — not special treatment. It went through the same public path as every other target.
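
To make the cell notation concrete, here is a minimal sketch of how three run scores could collapse into one heatmap cell. The function name and the per-run inputs are illustrative assumptions, not DocPull's actual code; only the rendering rules (median, [min–max] when runs differ, 0 for an empty pack, "no result" for a routing gap) come from the notes above.

```python
from statistics import median

def render_cell(run_scores):
    """Render one heatmap cell from the scores of its scored runs.

    run_scores holds one 0-100 score per run that produced a score.
    An empty list models a cell where every run failed (a routing gap).
    """
    if not run_scores:
        return "no result"
    mid = round(median(run_scores))
    low, high = min(run_scores), max(run_scores)
    if low == high:
        # No run-to-run variance: show the median alone.
        return str(mid)
    return f"{mid} [{low}–{high}]"

# Hypothetical per-run scores chosen to match cell shapes seen in the heatmap.
print(render_cell([94, 89, 94]))   # a cell with variance
print(render_cell([0, 0, 0]))      # an empty pack
print(render_cell([]))             # a routing gap
```

An empty pack and a routing gap stay distinct here: the first is a scored run that returned nothing, the second never produced a score at all.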

Reliability

The median hides variance. pass^k is stricter: a cell only passes if every repetition cleared the threshold. For routing, one good run out of three isn't good enough.

| Path | pass^3 @80 | pass^3 @90 |
| --- | --- | --- |
| DocPull core | 100% | 88% |
| Exa Search Contents | 100% | 75% |
| Parallel Search + Context | 79% | 57% |
| Tavily Search + Extract | 67% | 50% |

86.1% of cells passed at 80 and 66.7% at 90 (36 qualifying cells; four that failed all three trials are excluded and listed under routing gaps). DocPull core is the most reliable default; Exa is the strongest paid escalation.
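
The pass^3 figures can be recomputed from the heatmap. The sketch below is an illustration, not the benchmark's own code: it takes the worst-run score per cell (the low end of [min–max] where a spread is shown, otherwise the median, which assumes a cell without a spread had no variance) and treats "clears the threshold" as greater than or equal.

```python
# Worst-run score per cell, in heatmap row order (Parallel docs ... Tavily pricing).
# None marks a "no result" cell, which is excluded from the denominator.
WORST = {
    "DocPull core": [100, 100, 94, 100, 86, 91, 98, 91],
    "Parallel Search": [95, 0, 100, 100, 0, 91, 100, 100],
    "Parallel Context": [89, None, 89, 94, None, 79, 86, 100],
    "Tavily Search + Extract": [94, None, 85, 94, None, 79, 78, 93],
    "Exa Search Contents": [100, 90, 100, 100, 86, 85, 92, 100],
}

def pass_k(worst_scores, threshold):
    """Fraction of qualifying cells whose worst run still clears the threshold."""
    qualifying = [s for s in worst_scores if s is not None]
    return sum(s >= threshold for s in qualifying) / len(qualifying)

# The article reports Parallel Search and Parallel Context as one group.
groups = {
    "DocPull core": WORST["DocPull core"],
    "Exa Search Contents": WORST["Exa Search Contents"],
    "Parallel Search + Context": WORST["Parallel Search"] + WORST["Parallel Context"],
    "Tavily Search + Extract": WORST["Tavily Search + Extract"],
    "All cells": [s for scores in WORST.values() for s in scores],
}

for name, scores in groups.items():
    print(f"{name}: @80 {pass_k(scores, 80):.1%}, @90 {pass_k(scores, 90):.1%}")
```

Under those assumptions the fractions agree with the reported table, for example 7 of 8 DocPull cells at 90 and 31 of 36 qualifying cells at 80.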

Cost

USD-billed provider spend (Parallel and Exa) totaled $0.414; with Tavily credits converted at the stated per-credit price, the total is $0.470. Latency is tracked in Raindrop traces but not used to rank paths.

| Path | Spend | Notes |
| --- | --- | --- |
| DocPull core | $0.000 | Direct crawl; no provider spend. |
| Parallel Search | $0.120 | Best on discovery-heavy targets and archived docs. |
| Parallel Context | $0.126 | Freshness-sensitive pages when URLs are found; 18 of 24 runs billed (the other six fall in its two routing-gap cells). |
| Tavily Search + Extract | $0.056 | Compact search/extract fallback. Billed in credits, not USD: 7 extract credits converted at the stated $0.008/credit. |
| Exa Search Contents | $0.168 | Fast search-content retrieval and provider docs. |
| Total (paid providers) | $0.414 | USD-billed spend (Parallel and Exa) across 8 targets × N=3 runs; $0.470 including converted Tavily credits. |
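
The two totals differ only by the Tavily credit conversion. A short sketch of the arithmetic, using the figures reported above and Python's `Decimal` to avoid float rounding; the variable names are ours, not part of DocPull.

```python
from decimal import Decimal

# USD-billed spend per path, as reported in the cost table.
USD_BILLED = {
    "Parallel Search": Decimal("0.120"),
    "Parallel Context": Decimal("0.126"),
    "Exa Search Contents": Decimal("0.168"),
}

# Tavily bills in credits; the run converts them at the stated policy price.
TAVILY_CREDITS = 7
TAVILY_CREDIT_USD = Decimal("0.008")

usd_total = sum(USD_BILLED.values(), Decimal("0"))
tavily_usd = TAVILY_CREDITS * TAVILY_CREDIT_USD
total_with_credits = usd_total + tavily_usd

print("USD-billed:", usd_total)
print("Tavily (converted):", tavily_usd)
print("Total incl. credits:", total_with_credits)
```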

Routing gaps

Six provider-target pairs returned no usable context. They stay in the matrix as routing data: the router skips them until the next run.

| Target / path | Reason |
| --- | --- |
| Exa docs / Parallel Search | Returned zero records (empty pack, score 0). |
| Exa docs / Parallel Context | All three repetitions failed during extraction. |
| Exa docs / Tavily Search + Extract | No extractable URLs under the limits. |
| DocPull docs / Parallel Search | Returned zero records (empty pack, score 0). |
| DocPull docs / Parallel Context | No URLs available for the extractor under the limits. |
| DocPull docs / Tavily Search + Extract | No extractable URLs under the limits. |
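
One way to picture the skip rule is a lookup over the heatmap medians that drops gaps before picking the top score. This is a hedged sketch: `MATRIX`, `usable_paths`, and `best_paths` are hypothetical names, and DocPull's real router policy may weigh more than the median score.

```python
PATHS = [
    "DocPull core",
    "Parallel Search",
    "Parallel Context",
    "Tavily Search + Extract",
    "Exa Search Contents",
]

# Median scores from the heatmap. None = "no result", 0 = empty pack.
MATRIX = {
    "Parallel docs": [100, 95, 94, 94, 100],
    "Exa docs": [100, 0, None, None, 90],
    "Tavily docs": [94, 100, 89, 85, 100],
    "Raindrop docs": [100, 100, 94, 94, 100],
    "DocPull docs": [86, 0, None, None, 86],
    "Next.js docs": [91, 91, 79, 79, 85],
    "Python 2.7 stdlib": [98, 100, 90, 78, 92],
    "Tavily pricing": [91, 100, 100, 93, 100],
}

def usable_paths(target):
    """Paths that returned usable context; routing gaps and empty packs are skipped."""
    return {
        path: score
        for path, score in zip(PATHS, MATRIX[target])
        if score is not None and score > 0
    }

def best_paths(target):
    """All paths tied for the highest median score on a target."""
    usable = usable_paths(target)
    top = max(usable.values())
    return [path for path, score in usable.items() if score == top]

# Count the skipped provider-target pairs across the whole matrix.
gaps = sum(len(PATHS) - len(usable_paths(t)) for t in MATRIX)
print(gaps, best_paths("Exa docs"), best_paths("Tavily pricing"))
```

Keeping the gaps in the matrix rather than deleting them means the next run can re-test them and restore a path as soon as it starts returning context.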

Reliable agents need retrieval they can measure, route, and audit.

Methodology and reproducibility

Reference material: the scoring policy, run settings, trace details, and the exact command to reproduce this matrix. The sections above stand without it.

Method notes
Targets
Five tool docs sites plus three hard targets: JS-heavy docs, archived docs, and pricing.
Providers
DocPull core, Parallel Search, Parallel Context, Tavily Search + Extract, and Exa Search Contents.
Repetition
Each provider-target cell ran N=3 times. The heatmap shows the median across runs; cells with run-to-run variance render the spread as [min–max]. Per-run artifacts live under run-1/, run-2/, run-3/ subdirs in the benchmark output.
Scoring
Weighted score: coverage 30%, cleanliness 20%, source fidelity 20%, freshness 15%, density 15%. The weights are heuristic — the sub-score signals are the load-bearing detail, not the headline number.
Boilerplate detection
Used inside the cleanliness and density dimensions. It is a substring sniff on English navigation phrases ("skip to main content", "on this page", "cookie", etc.) and will under-report localized boilerplate.
Freshness signal
A presence test for target-specific terms in URL, title, or first 5000 characters of body. It is a heuristic freshness signal, not proof that the page is current — it does not check page modification time.
Empty packs
Cells that returned zero records score 0, not the arithmetic average of vacuous high sub-scores. See exa_docs / Parallel Search and docpull_docs / Parallel Search in the routing-gaps list for two real cases.
Reliability (pass^k)
Alongside the median, every cell is also scored with pass^k: the fraction of cells whose worst run still clears a threshold. The article reports pass^3 @80 and pass^3 @90. Cells that failed all three trials produce no scores and are excluded from the pass^k denominator (36 of 40 cells qualify); they remain in the routing-gaps list.
Production signals
Each cell also records latency, normalized cost, routing gaps, URL fidelity, and usable context returned per token.
Cost
Live provider spend totaled $0.414 across N=3 repetitions, from Parallel Search, Parallel Context, and Exa. Tavily billed 7 extract credits, shown converted at the stated $0.008/credit policy ($0.056).
Trace policy
Raindrop stores metadata only: provider, target, score, latency, cost, URLs, counts, signal names, and artifact path.
Crawl limits
Caps: 8 pages, depth 1, 5 search results, 2 extracts.
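
The boilerplate and freshness heuristics described in these notes are simple enough to sketch. The version below is an assumption-laden illustration: the three phrases are the ones quoted above (the real list is longer, per the "etc."), and the function names, the `terms` argument, and the example inputs are hypothetical.

```python
# Phrases quoted in the method notes; the real list is longer.
BOILERPLATE_PHRASES = ("skip to main content", "on this page", "cookie")

def boilerplate_hits(text):
    """Substring sniff: which known English navigation phrases appear in the text."""
    lowered = text.lower()
    return [phrase for phrase in BOILERPLATE_PHRASES if phrase in lowered]

def has_freshness_signal(url, title, body, terms, body_window=5000):
    """Presence test: does any target-specific term appear in the URL, the title,
    or the first body_window characters of the body?

    This does not check page modification time.
    """
    haystack = " ".join([url, title, body[:body_window]]).lower()
    return any(term.lower() in haystack for term in terms)

print(boilerplate_hits("Skip to main content\nOn this page\nPricing plans"))
print(boilerplate_hits("Zum Hauptinhalt springen"))  # localized nav slips through
print(has_freshness_signal("https://www.tavily.com/pricing", "Plans", "", ["pricing"]))
```

The second call shows the stated limitation: a German "skip to content" link produces no hits, so localized boilerplate is under-reported.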
Score dimensions

Each cell is a weighted 0-100 score. Roughly: 90+ is strong, 80-89 is usable with a fallback, 0 is an empty pack, and “no result” is a routing gap.

| Dimension | Weight | Signal |
| --- | --- | --- |
| Coverage | 30% | Pages that matter included. |
| Cleanliness | 20% | Duplicated nav, boilerplate, and broken text avoided. |
| Source fidelity | 20% | Selected URLs canonical and on-target. |
| Freshness signal | 15% | Heuristic presence of recency markers for freshness-sensitive targets, not a check on page modification time. |
| Density | 15% | Usable context high relative to token budget. |
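
As a worked example of the weighting, the sketch below combines five sub-scores into one cell score and applies the empty-pack rule. The dictionary keys, the function name, the sub-score values, and the final rounding are assumptions for illustration; only the weights and the zero-records rule come from this article.

```python
# Weights in percent, as listed in the score-dimension table.
WEIGHTS = {
    "coverage": 30,
    "cleanliness": 20,
    "source_fidelity": 20,
    "freshness": 15,
    "density": 15,
}

def weighted_score(sub_scores, record_count):
    """Combine 0-100 sub-scores into one 0-100 cell score.

    A pack with zero records scores 0 outright, so vacuously high
    sub-scores on an empty result cannot average into a good cell.
    """
    if record_count == 0:
        return 0
    total = sum(WEIGHTS[dim] * sub_scores[dim] for dim in WEIGHTS)
    return round(total / 100)

# Hypothetical sub-scores for a single run.
example = {
    "coverage": 80,
    "cleanliness": 90,
    "source_fidelity": 100,
    "freshness": 100,
    "density": 80,
}
print(weighted_score(example, record_count=8))
print(weighted_score(example, record_count=0))
```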
Targets
| Target | Kind | URL |
| --- | --- | --- |
| Parallel docs | docs | docs.parallel.ai |
| Exa docs | docs | docs.exa.ai |
| Tavily docs | docs | docs.tavily.com |
| Raindrop docs | docs | www.raindrop.ai/docs |
| DocPull docs | docs | docpull.raintree.technology |
| Next.js docs | JS-heavy docs | nextjs.org/docs |
| Python 2.7 stdlib | archived docs | docs.python.org/2.7/library/index.html |
| Tavily pricing | pricing freshness | www.tavily.com/pricing |
Workload settings

The core crawl walks a page graph from a seed URL; provider workflows fetch a fixed number of search results and optionally extract their content.

| Workflow | Settings | Median records | Records range |
| --- | --- | --- | --- |
| DocPull core | max_pages=8, max_depth=1, max_concurrent=8 | 8 | 1–11 |
| Parallel Search | max_search_results=5, mode=advanced | 5 | 0–5 |
| Parallel Context | max_search_results=5, extract_limit=2, mode=advanced | 2 | 2–6 |
| Tavily Search + Extract | max_search_results=5, extract_limit=2 | 2 | 2 |
| Exa Search Contents | max_search_results=5 | 5 | 1–5 |
What each path does

The DocPull column is DocPull by itself: crawl, normalize, score, write the context pack. The other columns are provider-assisted paths; DocPull still normalizes, scores, and records every result in the same matrix.

| Path | Role | What changed in this run |
| --- | --- | --- |
| DocPull core | Direct crawl and normalization | Best default for known static or server-rendered URLs. |
| Parallel Search | External discovery before DocPull scoring | Helped on Tavily docs, the Python archive, and Tavily pricing. |
| Parallel Context | External extraction path | Matched the best score on Tavily pricing, but produced no result on DocPull docs. |
| Exa Search Contents | Fast search-content retrieval | Tied best scores on Tavily docs, Raindrop docs, and Tavily pricing. |
| Tavily Search + Extract | Compact search plus extraction | Useful when it finds the right URLs; otherwise it should be visible to the router as no result. |
Raindrop traces

Raindrop, the observability layer used here, is not the retriever or judge. It records one run event, one tool trace per matrix cell, and signals for cells that need attention — so repeated runs surface drift in quality, routing gaps, cost, and freshness signal.

Event
One run-level event anchors the audit: da5b1f19-460e-465a-8949-f48c03772fae.
Per-cell traces
Expect 40 tool traces, one per provider-target cell, with workflow, median score, median latency, cost, URLs, counts, and artifact path. Each aggregate trace covers the underlying N=3 runs.
Signals
This run emitted 130 metadata signals (14 positive, 116 negative): high-score notes, low-score notes, score-dimension notes, high-cost cells, slow cells, failed cases, and follow-up checks.
Filters
Slice by provider, workflow, target, status, score band, cost, and signal name.
Schedule
Repeat the same matrix weekly so Raindrop can detect drift in quality, routing gaps, cost, and freshness.
Reproduce

Exact run:

docpull benchmark quick --target-set provider-matrix --provider all --trace raindrop --max-pages 8 --max-depth 1 --max-search-results 5 --extract-limit 2 --tavily-credit-usd 0.008 --max-estimated-cost 5.00