Current benchmark track
Official docs search APIs, May 2026
We tested the Brave Search API, SerpAPI, and Tavily on 30 official-documentation retrieval tasks. The results page reports Success@k, MRR, and latency, and includes downloadable evidence tables.
MCP servers and agent tool surfaces
Future benchmark work should test install friction, auth handling, schema quality, dry-run support, logs, error recovery, and whether agents can complete real workflows without hidden human steps.
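As a rough illustration of what such a benchmark might record per tool, here is a minimal sketch of a result record. Every field name below is an illustrative assumption, not a published schema.

```python
# Hypothetical per-tool record for an MCP server / agent tool benchmark run.
# Field names are illustrative assumptions, not a published schema.
from dataclasses import dataclass, field

@dataclass
class ToolBenchmarkRecord:
    tool_name: str
    install_steps: int             # friction proxy: manual steps before first successful call
    auth_method: str               # e.g. "api_key", "oauth", "none"
    schema_valid: bool             # tool and parameter schemas pass validation
    supports_dry_run: bool
    log_coverage: float            # fraction of calls that produced usable logs (0.0-1.0)
    recovered_errors: int          # errors the agent recovered from without help
    unrecovered_errors: int
    workflow_completed: bool       # end-to-end task finished without hidden human steps
    notes: list[str] = field(default_factory=list)
```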
What we will publish
- Clear category boundaries. Search APIs, SERP wrappers, extraction APIs, CLIs, and MCP servers should not be mixed into one misleading leaderboard.
- Standard retrieval metrics where they fit. Search benchmarks should report metrics such as Success@k, MRR, Precision@k, and NDCG@k, plus latency, errors, and result counts (see the reference sketch after this list).
- Dated evidence. Each result cohort should keep enough traces for readers to understand what was visible at scoring time.
- Plain recommendations. Findings should explain who should use a tool, who should avoid it, and what safeguards are needed in agent workflows.
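For readers who want the retrieval metrics pinned down, the sketch below shows how Success@k, MRR, Precision@k, and NDCG@k are conventionally computed from a ranked list of binary relevance labels per query. It is a generic reference implementation, not the benchmark's actual scoring code.

```python
# Minimal reference implementations of the listed retrieval metrics, assuming
# each query's results are a ranked list of 0/1 relevance labels.
import math

def success_at_k(relevances: list[int], k: int) -> float:
    """1.0 if any relevant result appears in the top k, else 0.0."""
    return 1.0 if any(relevances[:k]) else 0.0

def reciprocal_rank(relevances: list[int]) -> float:
    """1 / rank of the first relevant result; 0.0 if none is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def precision_at_k(relevances: list[int], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    return sum(relevances[:k]) / k if k else 0.0

def ndcg_at_k(relevances: list[int], k: int) -> float:
    """Binary NDCG@k: DCG of the ranking divided by DCG of the ideal ranking."""
    def dcg(rels: list[int]) -> float:
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal_dcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# Benchmark-level scores are the mean over all queries, e.g.:
# mrr = sum(reciprocal_rank(r) for r in per_query_labels) / len(per_query_labels)
```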
Benchmark cost guide
Use the cost guide to plan a realistic pilot, category comparison, or recurring benchmark before spending on providers, LLM judging, evidence handling, and review.
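As a rough sketch of the arithmetic such a plan involves, the example below multiplies tasks, providers, and runs against per-unit costs. The helper name estimate_pilot_cost and every unit price are placeholders for illustration, not rates taken from the guide.

```python
# Back-of-the-envelope pilot cost estimate. All unit prices are placeholders
# to show the arithmetic, not quoted provider or judging rates.
def estimate_pilot_cost(
    tasks: int,
    providers: int,
    runs_per_task: int,
    search_call_cost: float,        # assumed per-query provider cost
    judge_cost_per_run: float,      # assumed LLM-judging cost per scored run
    evidence_cost_per_task: float,  # assumed storage / evidence-handling overhead
    review_hours: float,
    review_hourly_rate: float,
) -> float:
    scored_runs = tasks * providers * runs_per_task
    return (
        scored_runs * search_call_cost
        + scored_runs * judge_cost_per_run
        + tasks * providers * evidence_cost_per_task
        + review_hours * review_hourly_rate
    )

# Example: 30 tasks x 3 providers x 3 runs with placeholder prices.
print(estimate_pilot_cost(30, 3, 3, 0.005, 0.02, 0.01, 8, 60))
```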
Lab note: May 2026 search API pilot
We ran an early pilot with a small task set to exercise the evidence pipeline. We publish it for transparency, but it is not a provider ranking or a buying recommendation.
Get benchmark updates
Join the update list for benchmark releases, tool comparisons, and practical notes on agent-ready infrastructure. No generic AI commentary.