The comparison problem
Agents do not need another raw search result. They need sourceable context that survives inspection: clean text, stable source indexes, scores, manifests, costs, and enough metadata to explain where the context came from. DocPull already handles the known-site case. The harder question is when to route discovery through Parallel, Tavily, or Exa before normalizing the result back into the same DocPull pack format.
The first run benchmarks one target, docs.parallel.ai, and forces every provider into the same local context-pack shape: documents.ndjson, a corpus manifest, sources.md, source scores, and a pack score. Raindrop sits around the run, not inside the retrieval matrix, so the comparison can become an eval loop instead of a one-off table.
Methodology
- Target
- docs.parallel.ai, one documentation site with a known source of truth.
- Cases
- DocPull core crawl, cached DocPull rerun, Parallel Search, Parallel Context, Tavily Search + Extract, and Exa Search Contents.
- Normalization
- Every provider output was converted into DocPull context-pack artifacts: documents.ndjson, corpus manifest, sources.md, source scores, and pack score.
- Scoring
- Providers were scored after normalization, so the comparison is about agent-ready output quality rather than raw API response shape.
- Cost guard
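- The live provider run was capped by a cost guard of $0.10 maximum estimated cost and finished at $0.020 estimated live provider cost, which matches the sum of the dollar-denominated rows in the results table ($0.005 + $0.008 + $0.007).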
- Trace policy
- Raindrop received metadata only: timings, scores, counts, costs, selected URLs, issue counts, artifact paths, and run summary. Scraped page text was not sent.
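The normalization step can be sketched as follows. This is a minimal illustration, not DocPull's actual implementation: the record fields (`source_index`, `sha256`, and so on), the function names, and the assumed provider result shape (a dict with `url` plus `text` or `content`) are all hypothetical.

```python
import hashlib
import json


def normalize_results(provider, results):
    """Convert raw provider results into uniform context-pack records.

    `results` is assumed to be a list of dicts with a "url" key and the page
    text under either "text" or "content" (hypothetical provider shapes).
    """
    records = []
    for item in results:
        text = (item.get("text") or item.get("content") or "").strip()
        if not text:
            continue  # nothing to score or cite, so leave it out of the pack
        records.append({
            "source_index": len(records) + 1,  # 1-based index for the source list
            "provider": provider,
            "url": item["url"],
            "title": item.get("title", ""),
            "text": text,
            # Content hash so repeated runs can detect changed pages.
            "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        })
    return records


def to_ndjson(records):
    """Serialize records as newline-delimited JSON, one document per line."""
    return "".join(json.dumps(record, sort_keys=True) + "\n" for record in records)
```

Because every provider passes through the same function, scoring can run on identical record shapes regardless of how each API structures its response.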
First-run results
| Case | Workflow | Wall time | Pack score | Records | Tokens | Est. cost |
|---|---|---|---|---|---|---|
| DocPull core | Known-site crawl | 4.634s | 100 | 5 | 5.4K | n/a |
| DocPull cached | Cache rerun | 3.248s | cache skip | 0 fetched / 5 skipped | n/a | n/a |
| Parallel Search | Search pack | 1.548s | 100 | 5 | 1.4K | $0.005 |
| Parallel Context | Search + Extract pack | 2.673s | 100 | 7 | 15.7K | $0.008 |
| Tavily Search + Extract | Search + Extract pack | 0.684s | 100 | 3 | 8.1K | credits only |
| Exa Search Contents | Search with contents | 0.320s | 100 | 5 | 206.9K | $0.007 |
What changed the ranking
All scored provider cases reached 100/100 after normalization, so the useful comparison is not the score alone. The separation comes from latency, cost representation, traceability, selected URL quality, and how much context each provider places into the pack.
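To make the pack-size separation concrete, the sketch below derives tokens per record and the total dollar-denominated cost from the scored rows of the results table. The token counts are the table's rounded values, so the per-record figures are approximate; the tuple layout and function names are illustrative only.

```python
# Scored rows copied from the first-run results table:
# (case, seconds, records, tokens, dollar cost).
# Cost is None where the table gives no dollar figure
# (DocPull core: n/a, Tavily: credits only).
ROWS = [
    ("DocPull core", 4.634, 5, 5_400, None),
    ("Parallel Search", 1.548, 5, 1_400, 0.005),
    ("Parallel Context", 2.673, 7, 15_700, 0.008),
    ("Tavily Search + Extract", 0.684, 3, 8_100, None),
    ("Exa Search Contents", 0.320, 5, 206_900, 0.007),
]


def tokens_per_record(rows):
    """Average pack tokens contributed by each record, per case."""
    return {name: tokens / records for name, _secs, records, tokens, _cost in rows}


def total_dollar_cost(rows):
    """Sum only the rows that report a dollar cost."""
    return round(sum(cost for *_rest, cost in rows if cost is not None), 3)


for name, value in tokens_per_record(ROWS).items():
    print(f"{name}: ~{value:,.0f} tokens per record")
print(f"dollar-denominated cost: ${total_dollar_cost(ROWS):.3f}")
```

On these rounded inputs, Exa averages about 41,380 tokens per record against 280 for Parallel Search, which is the gap that latency and score alone do not show.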
Provider readout
Exa was fastest, but also the largest pack
Exa returned five scored records in 0.320 seconds and exposed direct cost metadata. It also produced by far the largest normalized pack: 206.9K tokens, roughly 13 times the next largest (Parallel Context at 15.7K). Future runs should track whether that extra volume improves answer quality or just increases downstream context load.
Tavily was fast and compact
Tavily returned three extracted records in 0.684 seconds. It is easy to normalize into DocPull artifacts, but its credit-based usage model needs account-level translation before it can be compared dollar-for-dollar.
Parallel had the clearest workflow metadata
Parallel Context returned explicit Search + Extract metadata, selected URLs, usage buckets, and seven normalized records. That makes it strong when an agent needs a traceable source-selection story, not just fast content.
DocPull remained the control
When the documentation URL is already known, DocPull gives the most auditable path: local crawl, cache behavior, source index, manifest, source scores, and pack score without a live provider dependency.
Case-study architecture
The benchmark matrix stays focused on retrieval and extraction quality. Raindrop is the operating layer around the matrix: it receives the run event, captures one trace per case, makes providers and workflows searchable, and gives repeated runs a place to create signals, issues, and experiments.
- Raindrop
- Eval telemetry. It records the run event and each provider case as searchable traces so repeated benchmarks can become signals, issues, and experiments.
- Parallel
- Structured Search and Search + Extract. Strongest first-run fit when workflow metadata, selected URLs, and source-selection auditability matter.
- Exa
- Search with contents. Fastest first-run provider and clearest direct cost metadata, but it generated the largest context pack.
- Tavily
- Search + Extract. Fast, compact, and easy to normalize, with cost interpretation mediated through credits.
How Raindrop is used
Raindrop is not judging the content and it is not retrieving sources. It is the trace system that makes the benchmark operational. The run starts as a docpull_benchmark event, each benchmark case is emitted as a tool trace, and the final report links back to the JSON and Markdown artifacts on disk.
- Run event
- docpull_benchmark captures the target URL, output directory, enabled providers, cost guard, report paths, and summary.
- Case traces
- Each case is emitted as a tool trace with workflow, output path, duration, estimated cost, RSS delta, artifact size, stats, skips, pack score, and source-score count.
- Provider search
- Cases can be filtered by provider and workflow: parallel-search, parallel-context, tavily-search-extract, exa-search-contents, or DocPull core.
- Signals
- A score drop, cost spike, slower wall time, extraction failure, warning count, or weaker selected URL set can become a Raindrop signal.
- Issues
- Repeated signals can be promoted into provider or harness issues: failed extraction, stale docs, noisy sources, budget drift, or bad normalization.
- Experiments
- Prompt wording, target set, page cap, search result limit, extraction limit, and provider settings can be compared as named experiment variants.
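One way to enforce the metadata-only trace policy is an allowlist applied before anything leaves the harness. The sketch below only builds the payload; it does not call any Raindrop SDK, and the field names are hypothetical labels for the items listed under case traces.

```python
# Allowlist of metadata fields, mirroring the case-trace description above.
# The key names are assumptions made for this sketch.
ALLOWED_FIELDS = frozenset({
    "provider", "workflow", "output_path", "duration_s", "estimated_cost_usd",
    "rss_delta_bytes", "artifact_bytes", "stats", "skips", "pack_score",
    "source_score_count",
})


def build_case_trace(case):
    """Return only allowlisted metadata; anything else (such as page text) is dropped."""
    return {key: value for key, value in case.items() if key in ALLOWED_FIELDS}


# Usage: scraped text present in the harness result never reaches the trace.
case_result = {
    "provider": "exa-search-contents",
    "workflow": "Search with contents",
    "duration_s": 0.320,
    "estimated_cost_usd": 0.007,
    "pack_score": 100,
    "text": "full scraped page text stays on disk",
}
trace = build_case_trace(case_result)
```

An allowlist fails closed: a new field added to the harness output stays local until someone deliberately adds it to the set.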
What this becomes over time
One run is evidence, not infrastructure. The Raindrop case study becomes strong only when the same harness runs repeatedly and every provider case is traceable by provider, workflow, target, prompt, and settings.
Repeated runs
Run the same matrix on a schedule across fixed targets, fresh targets, and prompt variants so one result becomes a time series.
Regression tracking
Track wall time, pack score, estimated cost, record count, selected URL quality, and source-score drift per provider.
Searchable traces
Emit one Raindrop event for the run and one tool trace per case, tagged by provider, workflow, target, prompt, and settings.
Signals and issues
Raise signals when score drops, cost jumps, extraction fails, selected URLs get worse, or a provider stops returning usable context.
Experiments
Compare prompt wording, target sets, page caps, search result limits, extraction limits, and provider settings as named experiments.
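Regression tracking between two runs of the same case could look like the sketch below. The thresholds, signal names, and dictionary keys are assumptions chosen for illustration; a real harness would tune them per provider.

```python
def detect_signals(previous, current, *, max_slowdown=1.5, max_cost_increase=1.5):
    """Compare two runs of one case and name the regressions found.

    Both arguments are dicts with "pack_score", "duration_s", "records", and
    optionally "estimated_cost_usd" (hypothetical keys for this sketch).
    """
    signals = []
    if current["pack_score"] < previous["pack_score"]:
        signals.append("score_drop")
    if current["duration_s"] > previous["duration_s"] * max_slowdown:
        signals.append("slower_wall_time")
    prev_cost = previous.get("estimated_cost_usd")
    curr_cost = current.get("estimated_cost_usd")
    # Only compare cost when both runs report a dollar figure.
    if prev_cost and curr_cost and curr_cost > prev_cost * max_cost_increase:
        signals.append("cost_spike")
    if current["records"] == 0:
        signals.append("no_usable_context")
    elif current["records"] < previous["records"]:
        signals.append("fewer_records")
    return signals


# Baseline is the first-run Exa row; the second run is invented for the example.
baseline = {"pack_score": 100, "duration_s": 0.320, "records": 5, "estimated_cost_usd": 0.007}
rerun = {"pack_score": 92, "duration_s": 0.900, "records": 5, "estimated_cost_usd": 0.007}
print(detect_signals(baseline, rerun))  # ['score_drop', 'slower_wall_time']
```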
Decision rules
The point is not to crown one provider. The point is to make routing decisions explicit enough that an agent or product workflow can pick the right path for the job.
- Use DocPull core when
- the source URL is known, auditability matters, and repeat crawls should benefit from local cache behavior.
- Use Parallel Context when
- the agent needs an explainable Search + Extract path with selected URLs and rich workflow metadata.
- Use Tavily when
- the task benefits from quick extracted results and a compact normalized pack is enough.
- Use Exa when
- speed and broad content retrieval matter; watch pack size and downstream token load when you do.
- Use Raindrop when
- the question is no longer about a single result but about whether provider quality, speed, cost, and source selection are improving or degrading over repeated runs.
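The retrieval rules above can be written down as an explicit routing function. This is a sketch under stated assumptions: the parameter names, the precedence order, the Exa fallback, and the `docpull-core` identifier are illustrative. Raindrop is deliberately absent because it observes runs rather than retrieving sources.

```python
def choose_route(*, url_known, needs_source_audit=False, prefers_compact_pack=False):
    """Pick a retrieval path from the decision rules.

    The precedence order is an assumption of this sketch: a known URL wins,
    then traceability, then compactness, with Exa as the fallback.
    """
    if url_known:
        return "docpull-core"  # local crawl, cache reuse, no live provider call
    if needs_source_audit:
        return "parallel-context"  # selected URLs and workflow metadata
    if prefers_compact_pack:
        return "tavily-search-extract"  # quick extracted results, small pack
    return "exa-search-contents"  # fast and broad; watch pack size downstream


# Usage: an agent that must explain its source selection for an unknown site.
route = choose_route(url_known=False, needs_source_audit=True)
```

Encoding the rules this way makes the routing policy diffable and testable, so a change in provider behavior shows up as a change to the function rather than as an undocumented habit.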
What this first run does not prove
This is one target, one day, and a small page cap. It is not a universal ranking of web-search APIs. A vendor comparison gets more useful when it includes multiple documentation stacks, noisy websites, freshness-sensitive queries, pricing-page queries, and repeated runs.
The useful result is the harness. DocPull can compare core crawl, Parallel, Tavily, and Exa through the same artifacts and scoring rules. Raindrop records the metadata needed to make that comparison repeatable: timings, counts, scores, costs, selected URLs, artifact paths, and issue or warning counts.
The product decision
Parallel, Tavily, and Exa belong in the benchmark matrix. DocPull should remain the normalizer, scorer, and durable local context-pack format. Raindrop belongs beside the matrix as evaluation telemetry, so repeat runs can be inspected without treating observability as a retrieval provider.
That keeps the system honest: providers can compete on source discovery, extraction, speed, and cost, while DocPull makes the output inspectable, diffable, cacheable, and useful to agents after the live call is over.