Agent tool benchmarks — AgentFirstTools

Current benchmark track

New cohort

Official docs search APIs, May 2026

We tested Brave Search API, SerpAPI, and Tavily on 30 official-documentation retrieval tasks. The benchmark page reports Success@k, MRR, and latency, and links to downloadable evidence tables.

Read the benchmark

View buyer guide

Candidate track

MCP servers and agent tool surfaces

Future benchmark work should test install friction, auth handling, schema quality, dry-run support, logs, error recovery, and whether agents can complete real workflows without hidden human steps.

What we will publish

Clear category boundaries. Search APIs, SERP wrappers, extraction APIs, CLIs, and MCP servers should not be mixed into one misleading leaderboard.
Standard retrieval metrics where they fit. Search benchmarks should report metrics such as Success@k, MRR, Precision@k, and NDCG@k, plus latency, errors, and result counts.
Dated evidence. Each result cohort should keep enough traces for readers to understand what was visible at scoring time.
Plain recommendations. Findings should explain who should use a tool, who should avoid it, and what safeguards are needed in agent workflows.
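
For readers unfamiliar with the retrieval metrics named above, here is a minimal sketch of how Success@k, MRR, Precision@k, and NDCG@k are commonly computed. It is not the benchmark's actual scoring code. The function names and the input format (one list of relevance grades per task, in ranked order, where 0 means not relevant) are assumptions made for illustration.

```python
import math
from typing import Sequence


def success_at_k(rels: Sequence[float], k: int) -> float:
    """1.0 if at least one of the top-k results is relevant, else 0.0."""
    return 1.0 if any(r > 0 for r in rels[:k]) else 0.0


def reciprocal_rank(rels: Sequence[float]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, r in enumerate(rels, start=1):
        if r > 0:
            return 1.0 / rank
    return 0.0


def precision_at_k(rels: Sequence[float], k: int) -> float:
    """Fraction of the top-k slots filled by relevant results."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for r in rels[:k] if r > 0) / k


def dcg_at_k(rels: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with the common log2(rank + 1) discount."""
    return sum(r / math.log2(rank + 1) for rank, r in enumerate(rels[:k], start=1))


def ndcg_at_k(rels: Sequence[float], k: int) -> float:
    """DCG normalized by the best possible ordering of the same results.

    Assumption: the ideal ranking is built only from the returned results.
    A stricter variant would use every known relevant document for the task.
    """
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0


def mean_reciprocal_rank(tasks: Sequence[Sequence[float]]) -> float:
    """MRR: the average reciprocal rank across all tasks."""
    if not tasks:
        return 0.0
    return sum(reciprocal_rank(t) for t in tasks) / len(tasks)


# Hypothetical relevance judgments for three tasks (illustrative data only).
example_tasks = [
    [0, 1, 0],  # first relevant result at rank 2
    [1, 0, 0],  # first relevant result at rank 1
    [0, 0, 0],  # no relevant result returned
]

if __name__ == "__main__":
    print("MRR:", mean_reciprocal_rank(example_tasks))
    print("Success@1 per task:", [success_at_k(t, 1) for t in example_tasks])
    print("NDCG@3 per task:", [round(ndcg_at_k(t, 3), 4) for t in example_tasks])
```

Success@k and MRR only ask where the first relevant result appears, which suits tasks with a single correct documentation page. Precision@k and NDCG@k matter more when several results can be relevant or relevance is graded.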

Benchmark cost guide

Use the cost guide to plan a realistic pilot, category comparison, or recurring benchmark before spending on providers, LLM judging, evidence handling, and review.

Estimate benchmark costs

Lab note: May 2026 search API pilot

We ran an early pilot with a small task set to test the evidence pipeline. It is useful for transparency, but it is not a provider ranking or buying recommendation.

Read the pilot note

Get benchmark updates

Join the update list for benchmark releases, tool comparisons, and practical notes on agent-ready infrastructure. No generic AI commentary.

Evidence for choosing tools your agents can use.
