Budget tiers
| Tier | Typical scope | Direct cash budget | Use it for |
|---|---|---|---|
| Methodology pilot | 10–20 tasks, 2–3 providers, saved responses, manual review of failures. | £50–£200 | Finding unclear tasks, broken adapters, unfair expected sources, and missing measurements. |
| Credible category comparison | 50–150 tasks, 4–5 providers, objective metrics, latency and error tracking, limited judging. | £500–£1,500 | A decision-support report for a category such as search APIs, MCP servers, browser automation, or document extraction. |
| Deep workflow benchmark | Multi-step agent runs with search, fetch, browser, code, or tool actions, plus manual review. | £3,000–£10,000+ | Testing whether agents can complete realistic workflows, not just retrieve one result. |
| Recurring benchmark programme | Repeated cohorts, more providers, stability checks, calibrated judging, public evidence bundles. | £10,000–£25,000+ | A durable leaderboard or vendor-quality research programme. |
These ranges cover direct tool, API, LLM, judging, and rerun budget. They do not include every hour of staff or engineering time.
What drives the cost?
Tool calls
Provider API calls, browser sessions, extraction jobs, retries, and reruns after task or adapter bugs.
Agent and judge tokens
LLM agents and rubric-based judges can cost more than the tool being tested, especially on multi-step tasks.
Evidence handling
Responses must be timestamped, redacted, normalized, and preserved so a reader can inspect what happened.
Review and interpretation
Useful findings need relevance labels, spot checks, uncertainty notes, and plain recommendations tied to use cases.
A simple planning formula
Start with the number of tasks, providers, and reruns. Then add judging and publication overhead. Keep the formula visible so stakeholders can change assumptions instead of arguing about a single opaque number.
tool_call_budget = tasks × providers × runs × cost_per_task
judge_budget = judged_items × judge_cost_per_item
rerun_buffer = 20% to 50% for adapter bugs, timeouts, and task fixes
publication_budget = evidence cleanup + charts + review + decision notes
Example: search APIs for agent tasks
A focused search-API comparison can stay relatively cheap because each task may require only one or a few requests per provider. The important work is defining scenario-shaped tasks and scoring against evidence.
- Tasks: 50–100 queries covering official docs, pricing pages, exact errors, current vendor changes, ambiguous names, and regional sources.
- Providers: 4–5 services that expose search results, citations, snippets, or answer APIs suitable for agents.
- Metrics: Success@k, MRR, Precision@k, NDCG@k where relevance labels exist, plus latency, result count, errors, and estimated cost per successful task.
- Evidence: URLs, ranks, snippets, response shape, timestamps, errors, and any visible source dates. Date metadata is an audit signal, not proof that an answer is current.
- Judging: use rubric-based review only where simple source matching is not enough, such as ambiguity or nuanced current-fact tasks.
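The ranking metrics named in the list above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names, the use of URLs as result identifiers, and the linear-gain form of NDCG are all assumptions made for this example. It also assumes each ranked list has already been deduplicated.

```python
import math

def success_at_k(ranked, relevant, k):
    """1.0 if at least one of the top-k results is relevant, else 0.0."""
    return 1.0 if any(url in relevant for url in ranked[:k]) else 0.0

def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant result, or 0.0 if none is returned."""
    for rank, url in enumerate(ranked, start=1):
        if url in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(per_task):
    """MRR: average reciprocal rank over (ranked, relevant) pairs, one per task."""
    if not per_task:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in per_task) / len(per_task)

def precision_at_k(ranked, relevant, k):
    """Share of the top-k slots holding a relevant result.

    Divides by k even when fewer than k results came back, so short
    result lists are penalised rather than rewarded.
    """
    return sum(1 for url in ranked[:k] if url in relevant) / k

def ndcg_at_k(ranked, gains, k):
    """NDCG@k with linear gain; `gains` maps URL -> graded relevance label."""
    dcg = sum(gains.get(url, 0) / math.log2(i + 2)
              for i, url in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def cost_per_successful_task(total_cost, successes):
    """Estimated spend per successful task; None when nothing succeeded."""
    return total_cost / successes if successes else None
```

Keeping each metric as a separate function makes it easy to report raw per-task values first and aggregate later. NDCG only applies to tasks that have graded relevance labels; for the rest, Success@k and reciprocal rank against the expected sources are usually enough.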
Quick benchmark budget estimator
Change the assumptions to produce a plain-language pilot scope. The estimate is not a quote; it gives teams a concrete starting point for deciding whether a benchmark is worth scoping.
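For teams that prefer to run the numbers themselves, the planning formula given earlier can be written as a small script. This is a rough sketch under stated assumptions: the class and field names are invented for illustration, and the rerun buffer is applied to tool-call and judging spend combined, which the formula itself leaves open. Every input value in the usage note is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkPlan:
    tasks: int
    providers: int
    runs: int
    cost_per_task: float        # per task, per provider, per run
    judged_items: int
    judge_cost_per_item: float
    rerun_buffer_rate: float    # 0.20 to 0.50 per the planning formula
    publication_budget: float   # evidence cleanup + charts + review + decision notes

def estimate(plan):
    """Return each budget line and the total, rounded to two decimals."""
    if plan.rerun_buffer_rate < 0:
        raise ValueError("rerun_buffer_rate must not be negative")
    tool_calls = plan.tasks * plan.providers * plan.runs * plan.cost_per_task
    judging = plan.judged_items * plan.judge_cost_per_item
    # Assumption: the buffer covers reruns of both tool calls and judging.
    rerun_buffer = plan.rerun_buffer_rate * (tool_calls + judging)
    total = tool_calls + judging + rerun_buffer + plan.publication_budget
    return {
        "tool_call_budget": round(tool_calls, 2),
        "judge_budget": round(judging, 2),
        "rerun_buffer": round(rerun_buffer, 2),
        "publication_budget": round(plan.publication_budget, 2),
        "total": round(total, 2),
    }

if __name__ == "__main__":
    # Hypothetical pilot inputs, chosen only to show the arithmetic.
    pilot = BenchmarkPlan(
        tasks=15, providers=3, runs=2, cost_per_task=0.05,
        judged_items=30, judge_cost_per_item=0.10,
        rerun_buffer_rate=0.30, publication_budget=50.0,
    )
    for line, amount in estimate(pilot).items():
        print(f"{line}: £{amount:.2f}")
```

With these illustrative inputs the lines are £4.50 for tool calls, £3.00 for judging, £2.25 for the rerun buffer, and £50.00 for publication, for a total of £59.75. Returning each line separately, rather than only the total, keeps the assumptions visible so stakeholders can challenge one input at a time.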
What to avoid
- Do not publish a provider ranking from a 10-task pilot. Use pilots to improve the method.
- Do not hide arbitrary scores behind a single overall number. Show raw observations first, then any weighted view separately.
- Do not mix unlike categories into one leaderboard. A SERP wrapper, an answer API, an extraction API, and a deep-research agent solve different jobs.
- Do not treat the changing web as a footnote. Label cohorts by date and keep enough evidence to make the comparison auditable.
Need a benchmark scoped?
Send the tool category, the decision you need to make, the providers you are considering, and the workflow an agent must support. AgentFirstTools can help scope a pilot or audit before you spend on a full benchmark.