Benchmarks

Evidence for choosing tools your agents can use.

AgentFirstTools benchmarks test tool categories against realistic agent tasks. The goal is not a generic score but a clear picture of which tools fit a workflow, where they fail, and what evidence supports each recommendation.

Current benchmark track

Interactive chart

SonarSource LLM coding leaderboard

Explore pass rate, generated code volume, issue density, cognitive complexity, and severity metrics from SonarSource’s Java leaderboard data.

New cohort

Official docs search APIs, May 2026

We tested Brave Search API, SerpAPI, and Tavily on 30 official-documentation retrieval tasks. The page reports Success@k, mean reciprocal rank (MRR), and latency, and provides downloadable evidence tables.
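For readers unfamiliar with the two ranking metrics, here is a minimal sketch of how Success@k and MRR are commonly computed. It assumes the standard textbook definitions and a per-task record of the rank of the first correct result (or `None` when no correct result was returned). The function names and sample ranks are hypothetical and are not taken from the benchmark's own pipeline or data.

```python
from typing import Optional, Sequence


def success_at_k(first_hit_ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of tasks whose first correct result appears at rank <= k."""
    if not first_hit_ranks:
        return 0.0
    hits = sum(1 for rank in first_hit_ranks if rank is not None and rank <= k)
    return hits / len(first_hit_ranks)


def mean_reciprocal_rank(first_hit_ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over all tasks; a task with no correct result adds 0."""
    if not first_hit_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in first_hit_ranks if rank is not None)
    return total / len(first_hit_ranks)


# Hypothetical ranks for four tasks (1-indexed); None means no correct result.
example_ranks = [1, 3, None, 2]

success_at_1 = success_at_k(example_ranks, 1)
success_at_3 = success_at_k(example_ranks, 3)
mrr = mean_reciprocal_rank(example_ranks)
```

Success@k only records whether a correct page appeared in the top k, while MRR also rewards placing it higher, so the two can rank providers differently on the same task set.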

Candidate track

MCP servers and agent tool surfaces

Future benchmark work should test install friction, auth handling, schema quality, dry-run support, logs, error recovery, and whether agents can complete real workflows without hidden human steps.

What we will publish

Benchmark planning resources

Use the scoping template and cost guide to turn a vague tool comparison into a decision, a task set, an evidence plan, and a realistic pilot budget. Work through them before spending on providers, LLM judging, evidence handling, and review.

Lab note: May 2026 search API pilot

We ran an early pilot with a small task set to test the evidence pipeline. We publish it for transparency; it is not a provider ranking or a buying recommendation.

Get benchmark updates

Join the update list for benchmark releases, tool comparisons, and practical notes on agent-ready infrastructure. No generic AI commentary.