What the template forces you to decide
Decision
Who needs the result, what choice it will support, and what happens if the answer is inconclusive.
Category boundaries
Which tools are genuinely comparable, which tools are excluded, and which caveats matter before ranking anything.
Agent workflow
The real trigger, tool calls, expected output, verification step, and recovery path the benchmark should represent.
Evidence
The request, response, URLs, timestamps, receipts, logs, and review notes that make the result auditable later.
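One way to make that evidence list concrete is a single record per task and provider call. The sketch below is a hypothetical schema, not a format the template prescribes: the class name `EvidenceRecord`, its field names, and the JSON-lines output are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json


@dataclass
class EvidenceRecord:
    """One auditable record per task/provider call (hypothetical schema)."""
    task_id: str
    provider: str
    request: dict
    response_excerpt: str                               # raw response or redacted excerpt
    urls: list = field(default_factory=list)
    receipts: list = field(default_factory=list)        # IDs, invoices, call receipts
    agent_log: list = field(default_factory=list)
    review_notes: str = ""
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json_line(self) -> str:
        """Serialize to one JSON line so records can be appended to a log file."""
        return json.dumps(asdict(self), sort_keys=True)

    def missing_fields(self) -> list:
        """Return the names of evidence fields that are still empty."""
        data = asdict(self)
        return sorted(k for k, v in data.items() if v in ("", [], {}, None))


record = EvidenceRecord(
    task_id="task-001",            # hypothetical identifiers
    provider="provider-a",
    request={"query": "example"},
    response_excerpt="(redacted excerpt)",
)
print(record.missing_fields())
```

Checking `missing_fields()` before a run is judged is one cheap way to catch evidence gaps while they can still be fixed.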
Recommended pilot shape
| Part | Default starting point | Why it matters |
|---|---|---|
| Task set | 10–20 tasks across easy, ambiguous, current-fact, error-prone, and multi-step cases. | Small enough to debug; varied enough to reveal bad task design and provider mismatch. |
| Providers | 2–4 tools that solve the same job, not adjacent categories mixed together. | A SERP wrapper, answer API, extraction API, and research agent should not share one raw score. |
| Runs | At least one clean run plus reruns for adapter bugs and timeouts. | Most benchmark waste appears in reruns, not the first happy-path request. |
| Metrics | One primary metric, a few secondary metrics, latency, errors, and cost per successful task. | Prevents a single opaque score from hiding the reason a tool is or is not fit for agents. |
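The metrics row can be sketched as a small per-provider summary. This is a minimal illustration under assumptions: `RunResult` and `summarize` are hypothetical names, "success" means the task passed whatever primary metric you chose, and cost per successful task is taken as total spend divided by the number of successes.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RunResult:
    provider: str
    success: bool      # passed the primary metric (assumed definition)
    error: bool        # adapter error, provider error, or timeout
    latency_s: float
    cost_usd: float


def summarize(results):
    """Group runs by provider and report each metric separately."""
    by_provider = {}
    for r in results:
        by_provider.setdefault(r.provider, []).append(r)

    summary = {}
    for provider, runs in by_provider.items():
        successes = sum(r.success for r in runs)
        total_cost = sum(r.cost_usd for r in runs)
        summary[provider] = {
            "success_rate": successes / len(runs),
            "error_rate": sum(r.error for r in runs) / len(runs),
            "median_latency_s": median(r.latency_s for r in runs),
            # Failed calls still cost money, so divide total spend by successes.
            # None signals "no successful task", which is not the same as "free".
            "cost_per_successful_task": (
                total_cost / successes if successes else None
            ),
        }
    return summary
```

Keeping the values in separate keys, rather than folding them into one score, preserves the reason a tool did or did not fit.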
Worksheet excerpt
Decision:
- What decision should the benchmark support?
- Who will use the result?
- What happens if the answer is inconclusive?
Agent workflow:
- Trigger or user request:
- Tool calls required:
- Expected output:
- Verification step:
- Failure recovery path:
Evidence to preserve:
- Prompt/task, request, raw response or redacted excerpt
- URLs / IDs / receipts, timestamps, agent logs, judging notes
How it connects to budget
The worksheet pairs with the benchmark cost guide. Once the task count, provider count, judging plan, and evidence plan are explicit, the budget becomes a set of assumptions instead of a guess.
- Add a 20–50% rerun buffer on top of the base call count to cover adapter bugs, task mistakes, provider timeouts, and task wording fixes.
- Separate direct API/LLM spend from human review, evidence cleanup, charting, and publication.
- Do not expand a pilot until the evidence format and metrics can support a concrete recommendation.
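The budget assumptions above can be written down as a small calculation. This is a rough sketch, not the cost guide's method: the function names, the flat per-call price, and the review-hours input are assumptions, and the dollar figures in the usage example are placeholders. Direct API spend and human review are kept as separate lines, as the list recommends.

```python
def planned_calls(base_calls: int, buffer_pct: int) -> int:
    """Base calls plus a rerun buffer, rounded up (integer math avoids float drift)."""
    return (base_calls * (100 + buffer_pct) + 99) // 100


def estimate_budget(tasks, providers, cost_per_call_usd,
                    review_hours, review_rate_usd,
                    buffer_pct_range=(20, 50)):
    """Return low/high call counts and spend, with human review kept separate."""
    base_calls = tasks * providers
    low, high = (planned_calls(base_calls, pct) for pct in buffer_pct_range)
    return {
        "base_calls": base_calls,
        "calls_low": low,
        "calls_high": high,
        "direct_spend_usd": (
            round(low * cost_per_call_usd, 2),
            round(high * cost_per_call_usd, 2),
        ),
        "human_review_usd": round(review_hours * review_rate_usd, 2),
    }


# Placeholder inputs: 20 tasks, 3 providers, hypothetical prices and hours.
budget = estimate_budget(
    tasks=20, providers=3, cost_per_call_usd=0.05,
    review_hours=4, review_rate_usd=50,
)
print(budget)
```

With 20 tasks and 3 providers, the base is 60 calls, and the 20–50% buffer turns that into 72 to 90 planned calls before any price is applied.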
Want the benchmark scoped with you?
Send the tool category, the decision you need to make, and any providers already on the shortlist. AgentFirstTools can help turn that into a pilot plan, evidence checklist, or paid audit scope.