Results summary
SerpAPI
100% Success@3 across 30 official-docs tasks. Success@1: 90%. Median latency: 2016 ms.
Brave
97% Success@3 across 30 official-docs tasks. Success@1: 83%. Median latency: 1074 ms.
Tavily
80% Success@3 across 30 official-docs tasks. Success@1: 47%. Median latency: 1623 ms.
| Provider | Success@1 | Success@3 | Success@10 | MRR | Median latency |
|---|---|---|---|---|---|
| SerpAPI | 90% | 100% | 100% | 0.933 | 2016 ms |
| Brave | 83% | 97% | 100% | 0.903 | 1074 ms |
| Tavily | 47% | 80% | 100% | 0.635 | 1623 ms |
Success@k means at least one expected official source appeared in the top k results. MRR is the mean reciprocal rank of the first expected source. Relevance labels were based on URL patterns and accepted official domains, then reviewed before publishing to catch official docs that had moved to a new domain.
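To make the scoring auditable, here is a minimal sketch of how Success@k and MRR can be computed, assuming each task yields an ordered list of result URLs and a predicate that flags an expected official source. The helper names are illustrative, not the exact harness code.

```python
# Minimal scoring sketch (illustrative names, not the production harness).
# `relevant(url)` should return True when a URL matches an expected official
# pattern or accepted official domain for the task.

def success_at_k(ranked_urls, relevant, k):
    """True if at least one expected official source appears in the top k."""
    return any(relevant(url) for url in ranked_urls[:k])

def reciprocal_rank(ranked_urls, relevant):
    """1/rank of the first expected official source, or 0.0 if none appears."""
    for rank, url in enumerate(ranked_urls, start=1):
        if relevant(url):
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(tasks, relevant):
    """Average reciprocal rank over all tasks; each task is a ranked URL list."""
    return sum(reciprocal_rank(urls, relevant) for urls in tasks) / len(tasks)
```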
What this means for agent workflows
- If the agent needs the official source in the first few results, SerpAPI performed best in this run. It found an expected official source in the top 3 for all 30 tasks.
- If latency matters, Brave was materially faster in this sample. Its median response was about 1.1s, compared with about 2.0s for SerpAPI.
- Tavily may still fit answer-style research workflows, but this test was narrower. We measured retrieval of official documentation URLs, not generated answer quality.
- Do not generalize this to all search tasks. Current facts, exact error lookup, legal/compliance research, and source-diversity tasks need separate cohorts.
Method in brief
The task set contains 30 official-documentation queries across AI APIs, browser automation, infrastructure, data stores, and workflow tools. Each provider was called once per task on 12 May 2026. We saved response status, latency, result count, top result URLs, and rank observations.
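For context, a per-task record along the following lines would capture those fields; the schema is illustrative and not the exact structure of the published evidence tables.

```python
from dataclasses import dataclass, field

@dataclass
class TaskObservation:
    # Fields mirror what was saved per call; names are illustrative only.
    provider: str                   # e.g. "SerpAPI", "Brave", "Tavily"
    task_id: str                    # which of the 30 official-docs queries
    response_status: int            # HTTP status of the single provider call
    latency_ms: float               # wall-clock latency of that call
    result_count: int               # number of results returned
    top_urls: list[str] = field(default_factory=list)  # top result URLs, in rank order
    first_expected_rank: int | None = None              # rank of first expected source, if any
```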
The primary relevance signal was whether an expected official URL pattern or accepted official domain appeared in the top 10. This is intentionally simple and auditable. It does not judge snippet quality, generated answer quality, pricing, rate limits, or long-term stability.
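In code, that signal reduces to a small check like the sketch below; the patterns and domains shown are placeholders rather than the real per-task label set.

```python
import re
from urllib.parse import urlparse

# Placeholder labels for a single task; the benchmark defines these per query.
EXPECTED_URL_PATTERNS = [re.compile(r"^https://docs\.example\.com/api/")]
ACCEPTED_OFFICIAL_DOMAINS = {"docs.example.com", "example.dev"}

def is_expected_official(url: str) -> bool:
    """True if the URL matches an expected pattern or an accepted official domain."""
    if any(pattern.match(url) for pattern in EXPECTED_URL_PATTERNS):
        return True
    return urlparse(url).hostname in ACCEPTED_OFFICIAL_DOMAINS

def hit_in_top_10(ranked_urls: list[str]) -> bool:
    """Primary relevance signal: an expected official source in the top 10."""
    return any(is_expected_official(url) for url in ranked_urls[:10])
```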
Download the evidence tables
Limits and next steps
- This is a dated May 2026 cohort, not a timeless ranking.
- The benchmark uses one run per provider; repeated runs would be needed for stability claims.
- Provider categories differ: SerpAPI wraps Google results, Brave exposes Brave Search, and Tavily is more answer/research oriented.
- The next high-value cohort should test exact technical-error lookup or current pricing/version discovery, where agents often fail expensively.
Need this for your stack?
AgentFirstTools can inspect a tool shortlist or agent workflow and produce a narrow, evidence-backed audit before you depend on it in production.