What you are paying for
An agent-tool benchmark has four distinct cost buckets. Tool API fees are the most visible, but they are only one part of the bill.
1. Tool calls
Search requests, extract calls, browser sessions, task API runs, rate-limit retries, and reruns after adapter bugs. This is usually cheap for simple search and expensive for deep research.
2. Agent and judge tokens
If the benchmark uses an LLM agent to search, fetch, reason, and answer, token cost can dominate. LLM-as-judge grading adds another token-heavy pass, and its outputs must be preserved as evidence.
3. Evidence handling
Redacting responses, saving timestamps, normalising URLs, computing IR metrics, and preserving failure cases takes engineering time even when API calls are cheap.
4. Review and publication
Manual spot checks, vendor corrections, methodology notes, charts, and readable verdicts are what turn a run into a trustworthy asset rather than a spreadsheet.
Budget tiers
| Tier | Typical scope | Cash budget | Use it for |
|---|---|---|---|
| Methodology pilot | 10–20 tasks, 2–3 providers, saved evidence, mostly manual inspection. | £50–£200 | Finding bad task definitions, adapter bugs, unfair expected domains, and scoring gaps. |
| Credible search API comparison | 50–150 tasks, 4–5 providers, objective IR metrics, latency/cost tracking, limited judging. | £500–£1,500 | A public AgentFirstTools article with enough evidence to be useful to buyers and credible to vendors. |
| Deep-research benchmark | Multi-step agents, search + fetch loops, long-context answers, expensive competitor APIs, LLM grading. | £3k–£10k+ | Parallel-style deep research comparisons where each question can trigger many tool calls. |
| Leaderboard programme | Repeated cohorts, many providers, reruns, judge calibration, manual review, public evidence bundles. | £10k–£25k+ | A durable benchmark property, not a one-off article. |
GBP figures are planning ranges, not quotes. They include expected API/LLM spend and practical slack for reruns, but not full-time staff cost.
A Parallel-style benchmark cost sanity check
Parallel publishes benchmark tables with cost shown as CPM: US dollars per 1,000 requests or questions. Its pricing page also makes the difference between simple request pricing and deep task pricing clear: the Search API is listed at $0.005 per request for 10 results, while Task API requests range from $0.005 to $2.40 depending on depth.
Using the CPM tables as a public planning proxy, the cash cost scales directly with sample size:
run_cost = displayed_CPM × questions / 1000
Example:
- CPM 156 over 100 questions = $15.60 for that provider/run row
- CPM 156 over 1,000 questions = $156.00 for that provider/run row

| Scenario from public CPM-style tables | Approx. cash cost | Planning meaning |
|---|---|---|
| Search API benchmark family across several provider rows, 50 questions per family | ~$433 / ~£346 | Enough to smoke-test a public comparison, not enough for a final leaderboard. |
| Same search benchmark shape, 100 questions per family | ~$866 / ~£693 | Close to the first serious AgentFirstTools target if tooling is ready. |
| Same shape, 500 questions per family | ~$4,329 / ~£3,463 | Starts becoming a research programme; reruns and manual review matter. |
| Deep-research/task-style benchmark rows from public CPM/sample-size tables | ~$8,327 / ~£6,661 | Comparable to a vendor benchmark campaign in the low-to-mid thousands of pounds. |
These are estimates from public CPM-style reporting and an approximate 0.80 USD→GBP planning conversion. Exact reproduction depends on the hidden harness, model prices, provider settings, and retry policy.
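The planning arithmetic above can be sketched in a few lines. This is a minimal sketch using the formula and the approximate 0.80 USD→GBP conversion from the text; the function names are hypothetical and the CPM values remain planning proxies, not quotes.

```python
USD_TO_GBP = 0.80  # approximate planning conversion used in the text, not a live rate


def run_cost_usd(displayed_cpm: float, questions: int) -> float:
    """Cash cost in USD for one provider/run row: CPM is dollars per 1,000 questions."""
    return displayed_cpm * questions / 1000


def run_cost_gbp(displayed_cpm: float, questions: int) -> float:
    """Same cost converted to GBP with the planning rate above."""
    return run_cost_usd(displayed_cpm, questions) * USD_TO_GBP


# The worked example from the text: CPM 156 at two sample sizes.
print(run_cost_usd(156, 100))    # 15.6
print(run_cost_usd(156, 1000))   # 156.0
```

Scaling sample size by 10× scales the row cost by exactly 10×, which is why the table rows grow linearly from 50 to 500 questions per family.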
The AgentFirstTools first run should be narrower
The first AgentFirstTools benchmark should optimise for credibility per pound, not leaderboard theatre. That means testing one buyer-relevant question well:
- Start with 4 providers. Pick a mix such as Parallel, Exa, Tavily, and Brave or SerpAPI, depending on credentials and publishable terms.
- Use 50–100 scenario-shaped tasks. Include official docs, pricing, recent events, exact errors, ambiguous entities, regional sources, and nuanced legal/compliance searches.
- Lead with objective metrics. Success@k, MRR, NDCG@10, official-source rank, result count, error rate, latency, and estimated cost per successful task.
- Use judging only where needed. Ambiguity, current events, and legal/source-diversity tasks need rubric-based review over saved evidence; they should not hide the raw retrieval metrics.
- Publish dated evidence. Treat the result as a May 2026 cohort, not a timeless truth. New providers should trigger a fresh cohort rather than being inserted into an old leaderboard.
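The objective metrics named above have standard definitions. A minimal sketch of three of them, with hypothetical helper names, assuming Success@k and MRR take the rank of the first relevant result per task (`None` when nothing relevant was returned) and NDCG@10 takes graded relevance scores in retrieved order:

```python
import math


def success_at_k(first_relevant_ranks, k):
    """Share of tasks whose first relevant result appears at rank <= k."""
    hits = sum(1 for r in first_relevant_ranks if r is not None and r <= k)
    return hits / len(first_relevant_ranks)


def mrr(first_relevant_ranks):
    """Mean reciprocal rank; tasks with no relevant result contribute 0."""
    return sum(1 / r for r in first_relevant_ranks if r is not None) / len(first_relevant_ranks)


def ndcg_at_10(relevances):
    """NDCG@10 for one task: DCG of the retrieved order over DCG of the ideal order."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:10]))
    ideal = sorted(relevances, reverse=True)[:10]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


ranks = [1, 3, None, 2]          # first relevant result per task, four tasks
print(success_at_k(ranks, 3))    # 0.75
print(mrr(ranks))                # ~0.458
```

Because these scores are computed from saved result lists, they can be recomputed by anyone challenging the evidence bundle, which is the point of leading with objective metrics before any judging.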
What makes the benchmark worth buying later?
The commercial value is not only traffic. A good benchmark becomes a lead magnet for paid audits and implementation help because it shows that AgentFirstTools can inspect a tool like an operator, not summarise it like a blogger.
- For buyers: the benchmark lowers tool-selection risk and shows cost per successful delegated task.
- For vendors: the evidence bundle shows exactly where their API is agent-friendly or brittle.
- For AgentFirstTools: the same harness can become a paid audit, vendor report, integration playbook, or recurring benchmark sponsorship asset.
Decision for now
Do not spend £5k–£10k before the harness has proved itself. Ship the next milestone as a contained search API benchmark: small enough to afford, rigorous enough to be cited, and explicit enough that a vendor or buyer can challenge the evidence.