Benchmarks

Evidence for choosing tools your agents can use.

AgentFirstTools benchmarks test tool categories against realistic agent tasks. The goal is not a generic score. It is to show which tools fit a workflow, where they fail, and what evidence supports the recommendation.

Current benchmark track

New cohort

Official docs search APIs, May 2026

We tested Brave Search API, SerpAPI, and Tavily on 30 official-documentation retrieval tasks. The page reports Success@k, mean reciprocal rank (MRR), latency, and downloadable evidence tables.
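For readers unfamiliar with the metrics, here is a minimal sketch of the standard definitions of Success@k and MRR scored over a task set. It is illustrative only: the URLs below are made-up placeholders, not results from the benchmark.

```python
def success_at_k(ranked, relevant, k):
    """1.0 if any of the top-k results is relevant, else 0.0."""
    return 1.0 if any(r in relevant for r in ranked[:k]) else 0.0

def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant result, 0.0 if none is found."""
    for i, r in enumerate(ranked, start=1):
        if r in relevant:
            return 1.0 / i
    return 0.0

# Hypothetical task set: (ranked URLs a provider returned, relevant doc URLs).
tasks = [
    (["https://docs.example.com/a", "https://example.com/blog"],
     {"https://docs.example.com/a"}),
    (["https://example.com/blog", "https://docs.example.com/b"],
     {"https://docs.example.com/b"}),
]

# MRR is the mean reciprocal rank; Success@k is the mean per-task indicator.
mrr = sum(reciprocal_rank(r, rel) for r, rel in tasks) / len(tasks)
s_at_1 = sum(success_at_k(r, rel, 1) for r, rel in tasks) / len(tasks)
print(f"MRR={mrr:.2f}  Success@1={s_at_1:.2f}")  # MRR=0.75  Success@1=0.50
```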

Candidate track

MCP servers and agent tool surfaces

Future benchmark work should test install friction, auth handling, schema quality, dry-run support, logs, error recovery, and whether agents can complete real workflows without hidden human steps.
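As a rough illustration of what a per-server result record could look like, here is a minimal sketch. Every field name is a hypothetical stand-in for a criterion from the list above; the actual rubric and scoring scales are still to be defined.

```python
from dataclasses import dataclass, asdict

@dataclass
class ToolSurfaceResult:
    # All fields are hypothetical placeholders for the criteria above.
    server: str
    install_friction: int      # 0 = one command; higher = more manual steps
    auth_handled: bool         # credential flow works without human help
    schema_quality: int        # 0-2: missing, partial, complete typed schemas
    dry_run_supported: bool
    logs_available: bool
    error_recovery: bool       # agent can retry or repair after a failed call
    workflow_completed: bool   # end-to-end task done, no hidden human steps

result = ToolSurfaceResult(
    server="example-mcp-server",  # placeholder name
    install_friction=1,
    auth_handled=True,
    schema_quality=2,
    dry_run_supported=False,
    logs_available=True,
    error_recovery=True,
    workflow_completed=True,
)
print(asdict(result))
```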

What we will publish

Benchmark cost guide

Use the cost guide to plan a realistic pilot, category comparison, or recurring benchmark before spending on providers, LLM judging, evidence handling, and review.
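As a back-of-the-envelope illustration of how those line items combine, the sketch below totals a hypothetical pilot. Every price and rate in it is an assumed placeholder, not a figure from the cost guide.

```python
# All numbers below are hypothetical; substitute real quotes before budgeting.
tasks = 30            # size of the task set
providers = 3         # tools under comparison
runs_per_task = 3     # repeat runs to smooth latency and ranking noise

api_calls = tasks * providers * runs_per_task
provider_cost = api_calls * 0.005   # assumed $ per search API call
judging_cost = api_calls * 0.02     # assumed $ per LLM-judged result
evidence_cost = 10.0                # flat assumption: storage and hosting
review_hours = tasks * 0.1          # assumed 6 minutes of human review per task
review_cost = review_hours * 50.0   # assumed hourly rate

total = provider_cost + judging_cost + evidence_cost + review_cost
print(f"{api_calls} calls, estimated total: ${total:.2f}")
```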

Lab note: May 2026 search API pilot

We ran an early pilot with a small task set to test the evidence pipeline. It is useful for transparency, but it is not a provider ranking or buying recommendation.

Get benchmark updates

Join the update list for benchmark releases, tool comparisons, and practical notes on agent-ready infrastructure. No generic AI commentary.