Current benchmark track
SonarSource LLM coding leaderboard
Explore pass rate, generated code volume, issue density, cognitive complexity, and severity metrics from SonarSource’s Java leaderboard data.
Official docs search APIs, May 2026
We tested Brave Search API, SerpAPI, and Tavily on 30 official-documentation retrieval tasks. The page includes Success@k, MRR, latency, and downloadable evidence tables.
MCP servers and agent tool surfaces
Future benchmark work should test install friction, auth handling, schema quality, dry-run support, logs, error recovery, and whether agents can complete real workflows without hidden human steps.
What we will publish
- Clear category boundaries. Search APIs, SERP wrappers, extraction APIs, CLIs, and MCP servers should not be mixed into one misleading leaderboard.
- Standard retrieval metrics where they fit. Search benchmarks should report metrics such as Success@k, MRR, Precision@k, and NDCG@k, plus latency, errors, and result counts.
- Dated evidence. Each result cohort should keep enough traces for readers to understand what was visible at scoring time.
- Plain recommendations. Findings should explain who should use a tool, who should avoid it, and what safeguards are needed in agent workflows.
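The retrieval metrics named in the list above can be illustrated with a minimal sketch. It assumes each task yields a ranked list of integer relevance labels (0 means not relevant, higher means more relevant), ordered as the provider returned them. The function names are hypothetical and are not taken from any published benchmark code. NDCG here normalizes against the best ordering of the returned results only, which is a simplifying assumption; a full evaluation would normalize against every known relevant document for the task.

```python
import math
from typing import Sequence


def success_at_k(relevance: Sequence[int], k: int) -> float:
    """1.0 if any of the top-k results is relevant, else 0.0."""
    return 1.0 if any(r > 0 for r in relevance[:k]) else 0.0


def reciprocal_rank(relevance: Sequence[int]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, r in enumerate(relevance, start=1):
        if r > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(tasks: Sequence[Sequence[int]]) -> float:
    """MRR: the average reciprocal rank over all tasks."""
    if not tasks:
        return 0.0
    return sum(reciprocal_rank(t) for t in tasks) / len(tasks)


def precision_at_k(relevance: Sequence[int], k: int) -> float:
    """Share of the top-k slots holding a relevant result (missing slots count as misses)."""
    return sum(1 for r in relevance[:k] if r > 0) / k


def dcg_at_k(relevance: Sequence[int], k: int) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(r / math.log2(i + 1) for i, r in enumerate(relevance[:k], start=1))


def ndcg_at_k(relevance: Sequence[int], k: int) -> float:
    """DCG divided by the DCG of the ideal ordering of the same labels (assumption: returned results only)."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical labels for one task: the second and fourth results are relevant.
    labels = [0, 1, 0, 1]
    print("Success@3:", success_at_k(labels, 3))
    print("Reciprocal rank:", reciprocal_rank(labels))
    print("Precision@4:", precision_at_k(labels, 4))
    print("NDCG@4:", round(ndcg_at_k(labels, 4), 4))
```

Reporting these alongside latency, error counts, and result counts keeps a high Success@k from hiding a provider that is slow or frequently returns nothing.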
Benchmark planning resources
Use the scoping template and cost guide to turn a vague tool comparison into a concrete decision, task set, evidence plan, and realistic pilot budget. Do this before spending on providers, LLM judging, evidence handling, and review.
Lab note: May 2026 search API pilot
We ran an early pilot with a small task set to test the evidence pipeline. It is useful for transparency, but it is not a provider ranking or buying recommendation.
Get benchmark updates
Join the update list for benchmark releases, tool comparisons, and practical notes on agent-ready infrastructure. No generic AI commentary.