# Agent tool benchmark scoping template

Use this worksheet before spending on a benchmark for a tool category that AI agents need to use reliably. The goal is to make the decision, task set, evidence, and budget explicit enough that a pilot can be run without turning into a vague tool bake-off.

## 1. Decision

- What decision should the benchmark support?
- Who will use the result?
- What happens if the answer is inconclusive?
- Is the benchmark for private selection, vendor validation, internal platform work, or public publication?

## 2. Tool category

- Category name:
- Tools/providers to include:
- Tools/providers explicitly out of scope:
- Why these tools are comparable:
- Known category boundaries or caveats:

## 3. Agent workflow

Describe the real workflow the agent must support.

- Trigger or user request:
- Agent role:
- Tool calls required:
- Expected output:
- Verification step:
- Failure recovery path:

## 4. Task set

Start with a small pilot before committing to a public comparison.

| Task group | Example tasks | Expected evidence | Difficulty notes |
| --- | --- | --- | --- |
| Easy / common |  |  |  |
| Ambiguous |  |  |  |
| Current facts |  |  |  |
| Error-prone |  |  |  |
| Multi-step |  |  |  |

Minimum viable pilot:

- Number of tasks:
- Number of providers:
- Runs per task/provider pair:
- Expected pilot duration:
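
The pilot dimensions above multiply, so it helps to compute the total run count before fixing them. The following is a minimal sketch, assuming every task is run against every provider the same number of times; the function name and the example numbers are hypothetical, not recommendations.

```python
def pilot_run_count(tasks: int, providers: int, runs_per_pair: int) -> int:
    """Total runs when every task is run against every provider,
    repeated runs_per_pair times (assumed full cross product)."""
    if min(tasks, providers, runs_per_pair) < 1:
        raise ValueError("all pilot dimensions must be at least 1")
    return tasks * providers * runs_per_pair


# Hypothetical pilot: 20 tasks, 3 providers, 3 runs per task/provider pair.
total = pilot_run_count(tasks=20, providers=3, runs_per_pair=3)
print(total)  # 180
```

If the total looks too large for the expected pilot duration, reducing the number of providers or tasks is usually easier to justify than dropping repeated runs, since repeats are what reveal run-to-run variance.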

## 5. Metrics

Choose metrics that match the job. Do not force every category into a single score.

- Success definition:
- Primary metric:
- Secondary metrics:
- Latency measurement:
- Error taxonomy:
- Cost metric:
- Human review rubric:

For retrieval tasks, consider Success@k, mean reciprocal rank (MRR), Precision@k, NDCG@k, result count, source quality, latency, errors, and cost per successful task.
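
To make the retrieval metrics named above concrete, here is a minimal sketch of how they might be computed from judged results. It assumes each run yields a ranked list of relevance grades (0 for irrelevant, higher for more relevant), which is one common convention rather than a requirement of this worksheet. All function names are illustrative, and the NDCG variant shown normalizes against the ideal ordering of the returned results only, which is a simplification when not all relevant items are known.

```python
import math


def success_at_k(relevance: list[int], k: int) -> float:
    """1.0 if any of the top-k results is relevant, else 0.0."""
    return 1.0 if any(r > 0 for r in relevance[:k]) else 0.0


def reciprocal_rank(relevance: list[int]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, r in enumerate(relevance, start=1):
        if r > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[list[int]]) -> float:
    """MRR: the reciprocal rank averaged over all runs."""
    return sum(reciprocal_rank(r) for r in runs) / len(runs) if runs else 0.0


def precision_at_k(relevance: list[int], k: int) -> float:
    """Fraction of the top-k positions that hold a relevant result."""
    return sum(1 for r in relevance[:k] if r > 0) / k


def dcg_at_k(relevance: list[int], k: int) -> float:
    """Discounted cumulative gain with a log2 position discount."""
    return sum(r / math.log2(i + 1) for i, r in enumerate(relevance[:k], start=1))


def ndcg_at_k(relevance: list[int], k: int) -> float:
    """DCG divided by the DCG of the same grades in ideal order (assumption:
    the ideal list is built from the returned results only)."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0


def cost_per_success(total_cost: float, successes: int) -> float:
    """Total spend divided by successful tasks; infinite if nothing succeeded."""
    return total_cost / successes if successes > 0 else math.inf


# Hypothetical judged run: the second and fourth results are relevant.
grades = [0, 1, 0, 1]
print(success_at_k(grades, 3))   # 1.0
print(reciprocal_rank(grades))   # 0.5
```

Reporting these per task group rather than as one blended number keeps the results aligned with the advice not to force every category into a single score.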

## 6. Evidence to preserve

For each run, decide what must be saved for auditability.

- Prompt/task:
- Provider/tool request:
- Raw response or redacted excerpt:
- URLs / IDs / receipts:
- Timestamps:
- Agent logs:
- Judging notes:
- Known redactions:
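
One way to keep the evidence fields above in a repeatable format is a fixed record written as one JSON line per run. The sketch below is an assumption about how that could look, not a prescribed schema: the `RunEvidence` class, its field names, and the example values are all hypothetical and simply mirror the checklist.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunEvidence:
    """One record per run; fields mirror the evidence checklist (hypothetical schema)."""
    task_id: str
    provider: str
    prompt: str                      # prompt/task as given to the agent
    request: dict                    # provider/tool request payload
    response_excerpt: str            # raw response or redacted excerpt
    source_refs: list[str] = field(default_factory=list)  # URLs, IDs, receipts
    started_at: str = ""             # ISO 8601 timestamps
    finished_at: str = ""
    agent_log_path: str = ""
    judging_notes: str = ""
    redactions: list[str] = field(default_factory=list)   # what was removed and why


def to_jsonl_line(record: RunEvidence) -> str:
    """Serialize a record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


# Hypothetical example record.
record = RunEvidence(
    task_id="task-001",
    provider="example-provider",
    prompt="Find the current release notes for the example library.",
    request={"query": "example library release notes"},
    response_excerpt="[redacted excerpt]",
    source_refs=["https://example.com/release-notes"],
    started_at="2025-01-01T00:00:00+00:00",
    finished_at="2025-01-01T00:00:02+00:00",
    redactions=["API key removed from request headers"],
)
line = to_jsonl_line(record)
```

Appending each `line` to a single file per provider keeps runs easy to diff and re-judge later; listing redactions explicitly lets a reviewer see what is missing without guessing.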

## 7. Budget assumptions

Start with estimated ranges, then replace them with numbers measured during the pilot.

| Driver | Assumption | Estimated cost |
| --- | --- | --- |
| Provider/tool calls |  |  |
| Agent tokens |  |  |
| Judge tokens/review |  |  |
| Reruns/timeouts |  |  |
| Evidence cleanup |  |  |
| Charts/reporting |  |  |
| Manual review |  |  |

Rerun buffer: plan for an extra 20–50% on top of the base run cost to cover adapter bugs, task mistakes, timeouts, and provider instability.
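
As a worked example of the rerun buffer, the sketch below applies the 20% figure to the low end of a base estimate and the 50% figure to the high end. The function name and the base amounts are hypothetical, and reading the buffer as a percentage added on top of the base cost is an assumption.

```python
def buffered_budget(
    base_low: float,
    base_high: float,
    buffer_low: float = 0.20,
    buffer_high: float = 0.50,
) -> tuple[float, float]:
    """Return a (low, high) budget after adding the rerun buffer to a base range."""
    if base_low > base_high or buffer_low > buffer_high:
        raise ValueError("low values must not exceed high values")
    return (
        round(base_low * (1 + buffer_low), 2),
        round(base_high * (1 + buffer_high), 2),
    )


# Hypothetical base estimate of 400 to 600 in any currency unit.
low, high = buffered_budget(400.0, 600.0)
print(low, high)  # 480.0 900.0
```

Once the pilot has run, the measured share of reruns can replace the default buffer values.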

## 8. Publication and interpretation

- Will raw evidence be shared?
- What would make the benchmark unfair?
- What limitations must be stated?
- How will tied or use-case-dependent results be handled?
- What claims are not supported by this benchmark?

## 9. Go/no-go checkpoint

Before expanding the pilot, check:

- Tasks are understandable and representative.
- Providers are comparable enough for the stated decision.
- Evidence is saved in a repeatable format.
- Metrics expose real differences rather than noise.
- Costs are acceptable for a larger cohort.
- The result can support a concrete recommendation.

## 10. Help from AgentFirstTools

AgentFirstTools can help scope a private pilot, review an existing evaluation plan, or run an evidence-backed audit for a tool category. If useful, send the completed worksheet through the benchmark scoping form at:

https://agentfirsttools.com/benchmarks/agent-tool-benchmark-scope-template/
