Benchmarking DocPull, Parallel, Tavily, and Exa with Raindrop traces

A provider comparison for agent context packs, with Raindrop tracking speed, score, cost, source quality, and per-case eval telemetry.

Benchmark · 10 min

The comparison problem

Agents do not need another raw search result. They need sourceable context that survives inspection: clean text, stable source indexes, scores, manifests, costs, and enough metadata to explain where the context came from. DocPull already handles the known-site case. The harder question is when to route discovery through Parallel, Tavily, or Exa before normalizing the result back into the same DocPull pack format.

The first run benchmarks one target, docs.parallel.ai, and forces every provider into the same local context-pack shape: documents.ndjson, a corpus manifest, sources.md, source scores, and a pack score. Raindrop sits around the run, not inside the retrieval matrix, so the comparison can become an eval loop instead of a one-off table.
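As a rough illustration of what "forcing every provider into the same shape" can mean, here is a minimal sketch of the normalization step. The record fields (`index`, `url`, `title`, `text`, `provider`) and the input keys are assumptions made for this example; the article does not specify the actual documents.ndjson schema, and the URLs are placeholders.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class PackDocument:
    # Hypothetical record shape; the real DocPull schema may differ.
    index: int
    url: str
    title: str
    text: str
    provider: str


def normalize(provider: str, results: list[dict]) -> list[PackDocument]:
    """Map raw provider results onto one record shape, skipping empty or duplicate URLs."""
    docs: list[PackDocument] = []
    seen: set[str] = set()
    for item in results:
        url = (item.get("url") or "").strip()
        # Providers name the body field differently; accept either key (assumed names).
        text = (item.get("text") or item.get("content") or "").strip()
        if not url or not text or url in seen:
            continue
        seen.add(url)
        title = (item.get("title") or "").strip()
        docs.append(PackDocument(len(docs) + 1, url, title, text, provider))
    return docs


def to_ndjson(docs: list[PackDocument]) -> str:
    """Serialize one JSON object per line."""
    return "".join(json.dumps(asdict(d), ensure_ascii=False) + "\n" for d in docs)


raw = [
    {"url": "https://example.com/docs/a", "title": "A", "content": "Page A text"},
    {"url": "https://example.com/docs/a", "title": "A again", "content": "duplicate"},
    {"url": "", "content": "no url"},
]
docs = normalize("example-provider", raw)
ndjson = to_ndjson(docs)
```

Because every provider passes through the same function, later scoring only ever sees one record shape, which is what makes the scores comparable.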

Methodology

Target: docs.parallel.ai, one documentation site with a known source of truth.
Cases: DocPull core crawl, cached DocPull rerun, Parallel Search, Parallel Context, Tavily Search + Extract, and Exa Search Contents.
Normalization: Every provider output was converted into DocPull context-pack artifacts: documents.ndjson, corpus manifest, sources.md, source scores, and pack score.
Scoring: Providers were scored after normalization, so the comparison is about agent-ready output quality rather than raw API response shape.
Cost guard: The live provider run used a $0.10 maximum estimated cost guard and landed at $0.020 estimated live provider cost.
Trace policy: Raindrop received metadata only: timings, scores, counts, costs, selected URLs, issue counts, artifact paths, and run summary. Scraped page text was not sent.
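The cost guard can be pictured as a simple pre-flight check. The sketch below is an assumption about how such a guard might work, not the harness's actual code: it sums per-case estimates and refuses to proceed above the limit. The $0.10 limit comes from the methodology; the per-case figures are the costs reported in the results table, used here as stand-in estimates. Tavily is left out because its usage is reported in credits rather than dollars.

```python
from decimal import Decimal

MAX_ESTIMATED_COST = Decimal("0.10")  # guard value stated in the methodology


class CostGuardError(RuntimeError):
    """Raised when the summed estimate exceeds the guard."""


def check_cost_guard(
    estimates: dict[str, Decimal],
    limit: Decimal = MAX_ESTIMATED_COST,
) -> Decimal:
    """Return the total estimate, or raise before any live call is made."""
    total = sum(estimates.values(), Decimal("0"))
    if total > limit:
        raise CostGuardError(f"estimated cost ${total} exceeds guard ${limit}")
    return total


# Workflow names as used for trace filtering; dollar amounts from the results table.
estimates = {
    "parallel-search": Decimal("0.005"),
    "parallel-context": Decimal("0.008"),
    "exa-search-contents": Decimal("0.007"),
}
total = check_cost_guard(estimates)
```

Using `Decimal` avoids float rounding noise when adding sub-cent amounts, so the total compares cleanly against the guard.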

First-run results

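| Case | Workflow | Time | Score | Records | Tokens | Cost |
| --- | --- | --- | --- | --- | --- | --- |
| DocPull core | Known-site crawl | 4.634s | 100 | 5 | 5.4K | n/a |
| DocPull cached | Cache rerun | 3.248s | cache skip | 0 fetched / 5 skipped | n/a | n/a |
| Parallel Search | Search pack | 1.548s | 100 | 5 | 1.4K | $0.005 |
| Parallel Context | Search + Extract pack | 2.673s | 100 | 7 | 15.7K | $0.008 |
| Tavily Search + Extract | Search + Extract pack | 0.684s | 100 | 3 | 8.1K | credits only |
| Exa Search Contents | Search with contents | 0.320s | 100 | 5 | 206.9K | $0.007 |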

What changed the ranking

All scored provider cases reached 100/100 after normalization, so the useful comparison is not the score alone. The separation comes from latency, cost representation, traceability, selected URL quality, and how much context each provider places into the pack.
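Since the scores tie, the separation has to come from derived numbers. The short sketch below recomputes a few of them from the first-run results. Token counts are the rounded figures from the results (for example 5.4K taken as 5,400), so the per-record values are approximate.

```python
# (case, wall time in seconds, records, approximate tokens)
rows = [
    ("DocPull core", 4.634, 5, 5_400),
    ("Parallel Search", 1.548, 5, 1_400),
    ("Parallel Context", 2.673, 7, 15_700),
    ("Tavily Search + Extract", 0.684, 3, 8_100),
    ("Exa Search Contents", 0.320, 5, 206_900),
]

# Average context volume each record contributes to the pack.
tokens_per_record = {case: tokens / records for case, _, records, tokens in rows}

fastest = min(rows, key=lambda r: r[1])[0]
largest = max(rows, key=lambda r: r[3])[0]

# How much bigger the largest pack is than the runner-up.
by_tokens = sorted((r[3] for r in rows), reverse=True)
size_ratio = by_tokens[0] / by_tokens[1]

for case, value in sorted(tokens_per_record.items(), key=lambda kv: kv[1]):
    print(f"{case:<26} {value:>10.0f} tokens/record")
print(f"fastest: {fastest}; largest: {largest}; ratio to runner-up: {size_ratio:.1f}x")
```

On these figures the fastest case and the largest pack are the same provider, which is why pack size needs to be tracked alongside latency.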

Provider readout

Exa was fastest, but also the largest pack

Exa returned five scored records in 0.320 seconds and exposed direct cost metadata. It also produced the largest normalized pack by a wide margin, so future runs should track whether that extra volume improves answer quality or just increases downstream context load.

Tavily was fast and compact

Tavily returned three extracted records in 0.684 seconds. It is easy to normalize into DocPull artifacts, but its credit-based usage model needs account-level translation before it can be compared dollar-for-dollar.

Parallel had the clearest workflow metadata

Parallel Context returned explicit Search + Extract metadata, selected URLs, usage buckets, and seven normalized records. That makes it strong when an agent needs a traceable source-selection story, not just fast content.

DocPull remained the control

When the documentation URL is already known, DocPull gives the most auditable path: local crawl, cache behavior, source index, manifest, source scores, and pack score without a live provider dependency.

Case-study architecture

The benchmark matrix stays focused on retrieval and extraction quality. Raindrop is the operating layer around the matrix: it receives the run event, captures one trace per case, makes providers and workflows searchable, and gives repeated runs a place to create signals, issues, and experiments.

Raindrop: Eval telemetry. It records the run event and each provider case as searchable traces so repeated benchmarks can become signals, issues, and experiments.
Parallel: Structured Search and Search + Extract. Strongest first-run fit when workflow metadata, selected URLs, and source-selection auditability matter.
Exa: Search with contents. Fastest first-run provider and clearest direct cost metadata, but it generated the largest context pack.
Tavily: Search + Extract. Fast, compact, and easy to normalize, with cost interpretation mediated through credits.

How Raindrop is used

Raindrop is not judging the content and it is not retrieving sources. It is the trace system that makes the benchmark operational. The run starts as a docpull_benchmark event, each benchmark case is emitted as a tool trace, and the final report links back to the JSON and Markdown artifacts on disk.

Run event: docpull_benchmark captures the target URL, output directory, enabled providers, cost guard, report paths, and summary.
Case traces: Each case is emitted as a tool trace with workflow, output path, duration, estimated cost, RSS delta, artifact size, stats, skips, pack score, and source-score count.
Provider search: Cases can be filtered by provider and workflow: parallel-search, parallel-context, tavily-search-extract, exa-search-contents, or DocPull core.
Signals: A score drop, cost spike, slower wall time, extraction failure, warning count, or weaker selected URL set can become a Raindrop signal.
Issues: Repeated signals can be promoted into provider or harness issues: failed extraction, stale docs, noisy sources, budget drift, or bad normalization.
Experiments: Prompt wording, target set, page cap, search result limit, extraction limit, and provider settings can be compared as named experiment variants.
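To make the run event and case traces concrete, here is a sketch of the payloads as plain dictionaries, with a check that enforces the metadata-only trace policy. It deliberately does not call any Raindrop SDK: the field names, the forbidden-key list, and the paths are all assumptions for illustration, and actually sending the payloads is left out.

```python
# Assumed key names that would carry scraped page text; adjust to the real pack schema.
FORBIDDEN_KEYS = {"text", "content", "html", "markdown"}


def assert_metadata_only(payload: dict) -> dict:
    """Walk the payload and raise if any nested key looks like page content."""
    stack = [payload]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            leaked = FORBIDDEN_KEYS & set(node)
            if leaked:
                raise ValueError(f"page content must not be traced: {sorted(leaked)}")
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
    return payload


def build_run_event(target_url, output_dir, providers, cost_guard_usd, summary):
    """One event per benchmark run (hypothetical field names)."""
    return assert_metadata_only({
        "event": "docpull_benchmark",
        "target_url": target_url,
        "output_dir": output_dir,
        "enabled_providers": list(providers),
        "cost_guard_usd": cost_guard_usd,
        "summary": summary,
    })


def build_case_trace(provider, workflow, duration_s, estimated_cost_usd,
                     pack_score, source_score_count, output_path):
    """One tool trace per benchmark case (hypothetical field names)."""
    return assert_metadata_only({
        "kind": "tool",
        "provider": provider,
        "workflow": workflow,
        "duration_s": duration_s,
        "estimated_cost_usd": estimated_cost_usd,
        "pack_score": pack_score,
        "source_score_count": source_score_count,
        "output_path": output_path,
    })


# Illustrative values; the output path and source-score count are made up.
event = build_run_event("https://docs.parallel.ai", "out/run-001",
                        ["parallel", "tavily", "exa"], 0.10, {"cases": 6})
trace = build_case_trace("exa", "exa-search-contents", 0.320, 0.007,
                         100, 5, "out/run-001/exa-search-contents")
```

Running the policy check inside the builders means a payload that accidentally includes page text fails before it can leave the machine.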

What this becomes over time

One run is evidence, not infrastructure. The Raindrop case study becomes useful when the same harness runs repeatedly and every provider case is traceable by provider, workflow, target, prompt, and settings.

Repeated runs

Run the same matrix on a schedule across fixed targets, fresh targets, and prompt variants so one result becomes a time series.

Regression tracking

Track wall time, pack score, estimated cost, record count, selected URL quality, and source-score drift per provider.

Searchable traces

Emit one Raindrop event for the run and one tool trace per case, tagged by provider, workflow, target, prompt, and settings.

Signals and issues

Raise signals when score drops, cost jumps, extraction fails, selected URLs get worse, or a provider stops returning usable context.
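A signal rule can be as small as a comparison between a baseline case and the latest one. The sketch below is one possible shape, assuming hypothetical signal names and ratio thresholds (1.5x for time and cost) that are not taken from the benchmark. The baseline values are the first-run Exa figures; the later run is invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseMetrics:
    provider: str
    pack_score: int
    wall_time_s: float
    estimated_cost_usd: Optional[float]  # None when cost is not reported in dollars
    records: int


def detect_signals(baseline: CaseMetrics, current: CaseMetrics,
                   time_ratio: float = 1.5, cost_ratio: float = 1.5) -> list[str]:
    """Return signal names for regressions relative to the baseline run."""
    signals: list[str] = []
    if current.records == 0:
        signals.append("no_usable_context")
    if current.pack_score < baseline.pack_score:
        signals.append("score_drop")
    if current.wall_time_s > baseline.wall_time_s * time_ratio:
        signals.append("slower_wall_time")
    if (baseline.estimated_cost_usd and current.estimated_cost_usd
            and current.estimated_cost_usd > baseline.estimated_cost_usd * cost_ratio):
        signals.append("cost_spike")
    return signals


baseline = CaseMetrics("exa", 100, 0.320, 0.007, 5)   # first-run figures
later = CaseMetrics("exa", 92, 0.900, 0.007, 5)       # hypothetical later run
signals = detect_signals(baseline, later)
```

With these inputs the later run raises `score_drop` and `slower_wall_time` but not `cost_spike`, since the cost is unchanged.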

Experiments

Compare prompt wording, target sets, page caps, search result limits, extraction limits, and provider settings as named experiments.
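One way to get named experiments is to expand a settings grid into variants with deterministic names, so the same name always refers to the same configuration across runs. This is a sketch under assumptions: the setting keys and the naming scheme are invented for illustration.

```python
from itertools import product


def experiment_variants(grid: dict[str, list]) -> list[dict]:
    """Expand a settings grid into named variants, one per combination."""
    keys = sorted(grid)  # sorted so names are stable regardless of dict order
    variants = []
    for values in product(*(grid[k] for k in keys)):
        settings = dict(zip(keys, values))
        name = "__".join(f"{k}={settings[k]}" for k in keys)
        variants.append({"name": name, "settings": settings})
    return variants


# Hypothetical grid: two page caps, two result limits, two provider workflows.
grid = {
    "page_cap": [5, 20],
    "search_result_limit": [5, 10],
    "provider": ["tavily-search-extract", "exa-search-contents"],
}
variants = experiment_variants(grid)
```

The grid above expands to eight variants. Tagging each case trace with the variant name would let repeated runs be grouped by configuration rather than by date alone.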

Decision rules

The point is not to crown one provider. The point is to make routing decisions explicit enough that an agent or product workflow can pick the right path for the job.

Use DocPull core when the source URL is known, auditability matters, and repeat crawls should benefit from local cache behavior.
Use Parallel Context when the agent needs an explainable Search + Extract path with selected URLs and rich workflow metadata.
Use Tavily when the task benefits from quick extracted results and a compact normalized pack is enough.
Use Exa when speed and broad content retrieval matter, then watch pack size and downstream token load.
Use Raindrop when the question is no longer about one result but about whether provider quality, speed, cost, and source selection are improving or degrading over repeated runs.
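The retrieval rules above could be written down as a small routing function. This is only a sketch: the `Task` fields and the order in which the rules are checked are assumptions, and the first run does not establish that this precedence is correct. Raindrop is absent on purpose, since it observes runs rather than retrieving anything.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical task flags an agent or product workflow might set.
    url_known: bool
    needs_source_audit: bool = False
    wants_compact_pack: bool = False


def choose_route(task: Task) -> str:
    """Pick a retrieval path; earlier rules take precedence (assumed ordering)."""
    if task.url_known:
        return "docpull-core"            # known source, local crawl and cache
    if task.needs_source_audit:
        return "parallel-context"        # explainable Search + Extract path
    if task.wants_compact_pack:
        return "tavily-search-extract"   # quick results, small normalized pack
    return "exa-search-contents"         # speed and breadth; watch token load


routes = [
    choose_route(Task(url_known=True)),
    choose_route(Task(url_known=False, needs_source_audit=True)),
    choose_route(Task(url_known=False, wants_compact_pack=True)),
    choose_route(Task(url_known=False)),
]
```

Keeping the rules in one function makes the routing decision itself something that can be versioned and compared as an experiment variant.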

What this first run does not prove

This is one target, one day, and a small page cap. It is not a universal ranking of web-search APIs. A vendor comparison gets more useful when it includes multiple documentation stacks, noisy websites, freshness-sensitive queries, pricing-page queries, and repeated runs.

The useful result is the harness. DocPull can compare core crawl, Parallel, Tavily, and Exa through the same artifacts and scoring rules. Raindrop records the metadata needed to make that comparison repeatable: timings, counts, scores, costs, selected URLs, artifact paths, and issue or warning counts.

The product decision

Parallel, Tavily, and Exa belong in the benchmark matrix. DocPull should remain the normalizer, scorer, and durable local context-pack format. Raindrop belongs beside the matrix as evaluation telemetry, so repeat runs can be inspected without treating observability as a retrieval provider.

That keeps the system honest: providers can compete on source discovery, extraction, speed, and cost, while DocPull makes the output inspectable, diffable, cacheable, and useful to agents after the live call is over.