How to score: give each criterion 0, 1, or 2 points. A score of 9–12 means the tool is ready for bounded agent workflows. 5–8 means useful but fragile. 0–4 means agents need heavy human supervision or workarounds.
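The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the page's own scoring widget: the function name `readiness_band` and the lowercase criterion keys are assumptions made for the example.

```python
# Hypothetical helper: names and keys are illustrative, not part of the scorecard itself.
CRITERIA = ("inspectable", "scriptable", "bounded",
            "verifiable", "recoverable", "composable")


def readiness_band(scores: dict[str, int]) -> tuple[int, str]:
    """Return (total, band) for a complete set of six scores, each 0, 1, or 2."""
    missing = [name for name in CRITERIA if name not in scores]
    if missing:
        raise ValueError(f"missing scores: {missing}")
    for name in CRITERIA:
        if scores[name] not in (0, 1, 2):
            raise ValueError(f"{name} must be 0, 1, or 2")

    total = sum(scores[name] for name in CRITERIA)
    if total >= 9:
        band = "ready for bounded agent workflows"
    elif total >= 5:
        band = "useful but fragile"
    else:
        band = "needs heavy human supervision or workarounds"
    return total, band
```

Rejecting incomplete input mirrors the rule that a band is only meaningful once all six criteria have been scored.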
Score a tool in the browser
Pick a score for each criterion. Nothing is uploaded unless you choose to send the context in the inquiry form below.
0 / 12 selected. Choose all six scores to get a readiness band and next step.
The summary is copied locally so you can paste it into the form, an issue, or a vendor note.
The scoring rubric
1. Inspectable
Can an agent discover capabilities, required inputs, current state, permissions, examples, and failure modes before acting?
0 · Hidden: Docs, state, and permissions are scattered or human-only.
1 · Partial: Some docs or endpoints exist, but the agent must infer important details.
2 · Clear: Machine-readable schemas, examples, state, and limits are easy to query.
2. Scriptable
Can every important workflow be called repeatably through a stable API, CLI, MCP server, webhook, or file interface?
0 · UI-only: Requires clicking a web app for core work.
1 · Mixed: Basic operations are callable, but edge cases still require the UI.
2 · Complete: Core workflows are callable, documented, and versioned.
3. Bounded
Can actions be scoped by workspace, role, resource, budget, time, and approval level?
0 · All or nothing: Credentials grant broad power.
1 · Coarse: Some roles or limits exist, but not enough for safe delegation.
2 · Least privilege: Agents can receive narrow, revocable authority for the task.
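As a rough picture of what "narrow, revocable authority" can look like, the sketch below models a grant scoped by workspace, action, budget, and expiry. The `Grant` type and `is_allowed` function are hypothetical and do not describe any particular tool's permission model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    """Hypothetical task-scoped grant handed to an agent."""
    workspace: str
    actions: frozenset[str]
    budget_cents: int
    expires_at: datetime
    revoked: bool = False


def is_allowed(grant: Grant, workspace: str, action: str,
               cost_cents: int, now: datetime) -> bool:
    """Allow the call only if every scope dimension matches."""
    return (
        not grant.revoked
        and now < grant.expires_at
        and workspace == grant.workspace
        and action in grant.actions
        and cost_cents <= grant.budget_cents
    )
```

A tool scoring 0 on this criterion would effectively skip every check except "has a credential"; a tool scoring 2 lets each dimension be set per task.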
4. Verifiable
Does every meaningful action return durable evidence: IDs, URLs, status endpoints, logs, diffs, previews, audit events, or structured success and failure signals?
0 · Trust me: Only a spinner, toast, or vague success message.
1 · Some receipts: Receipts exist but are inconsistent or hard to query later.
2 · Receipts by default: Agents can cite and re-check proof of what happened.
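One way to read "cite and re-check" is that a receipt carries enough identifiers for the agent to ask the tool again later. The sketch below assumes a hypothetical `Receipt` shape and a caller-supplied `fetch_status` lookup; neither corresponds to a real API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Receipt:
    """Hypothetical durable evidence returned by an action."""
    action: str
    resource_id: str
    status: str
    audit_event_id: str


def recheck(receipt: Receipt, fetch_status: Callable[[str], str]) -> bool:
    """Return True if the tool still reports the status the receipt claims."""
    return fetch_status(receipt.resource_id) == receipt.status
```

In practice `fetch_status` would wrap a status endpoint; here any callable that maps a resource ID to a status string will do, which also makes the check easy to test.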
5. Recoverable
Are failures explicit, retries safe, partial progress visible, and destructive operations reversible or at least clearly marked as irreversible?
0 · Fragile: Timeouts and partial failures leave unknown state.
1 · Manual recovery: Humans can recover, but agents lack safe retry/rollback paths.
2 · Designed recovery: Idempotency, status checks, rollbacks, and takeover points exist.
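The difference between "fragile" and "designed recovery" is easiest to see with a timeout that arrives after the write has already happened. The sketch below uses an in-memory stand-in, `FakeTool`, that honors idempotency keys; the class, its `create` method, and `create_with_retry` are all assumptions made for illustration.

```python
import uuid


class FakeTool:
    """In-memory stand-in for a tool API that honors idempotency keys."""

    def __init__(self, fail_first: int = 0):
        self._results: dict[str, dict] = {}
        self._fail_remaining = fail_first
        self.created = 0  # number of resources actually created

    def create(self, idempotency_key: str, payload: dict) -> dict:
        # A repeated key returns the stored result instead of creating again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.created += 1
        result = {"id": f"res-{self.created}", "status": "succeeded"}
        self._results[idempotency_key] = result
        if self._fail_remaining > 0:
            # Simulate a response lost after the write was applied.
            self._fail_remaining -= 1
            raise TimeoutError("no response received")
        return result


def create_with_retry(tool: FakeTool, payload: dict, attempts: int = 3) -> dict:
    key = str(uuid.uuid4())  # one key shared by every attempt
    for _ in range(attempts):
        try:
            return tool.create(key, payload)
        except TimeoutError:
            continue
    raise RuntimeError("retries exhausted; hand over to a human")
```

Because every attempt reuses the same key, the retry after the simulated timeout returns the original result rather than creating a duplicate. The final `RuntimeError` stands in for a takeover point.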
6. Composable
Can the tool participate in larger agent workflows across repos, terminals, browsers, docs, inboxes, schedulers, CI, deployments, and human handoff?
0 · Silo: No useful integration points.
1 · Integrates: Some integrations exist, but workflow state is hard to pass around.
2 · Workflow-native: Inputs, outputs, permissions, and status fit broader automation loops.
Worksheet
Copy this into an issue, doc, or audit note for each tool you evaluate.
| Criterion | Score | Evidence / next fix |
| --- | --- | --- |
| Inspectable | 0 / 1 / 2 | What can the agent query before acting? |
| Scriptable | 0 / 1 / 2 | Which core workflows are stable and callable? |
| Bounded | 0 / 1 / 2 | How narrowly can credentials and approvals be scoped? |
| Verifiable | 0 / 1 / 2 | What receipt proves the outcome? |
| Recoverable | 0 / 1 / 2 | What happens after timeout, failure, or partial completion? |
| Composable | 0 / 1 / 2 | How does this plug into wider agent workflows? |
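For teams that evaluate many tools, the worksheet can also be filled in programmatically. The sketch below writes the three worksheet columns as CSV using Python's standard `csv` module. The function name `worksheet_csv` is hypothetical, and the column layout is taken from the worksheet shown here, not from the downloadable template, whose exact format is not assumed.

```python
import csv
import io

# Criterion names and prompts copied from the worksheet above.
ROWS = [
    ("Inspectable", "What can the agent query before acting?"),
    ("Scriptable", "Which core workflows are stable and callable?"),
    ("Bounded", "How narrowly can credentials and approvals be scoped?"),
    ("Verifiable", "What receipt proves the outcome?"),
    ("Recoverable", "What happens after timeout, failure, or partial completion?"),
    ("Composable", "How does this plug into wider agent workflows?"),
]


def worksheet_csv(scores: dict[str, int], evidence: dict[str, str]) -> str:
    """Render the worksheet; unscored rows keep a blank score and the prompt."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Criterion", "Score", "Evidence / next fix"])
    for name, prompt in ROWS:
        writer.writerow([name, scores.get(name, ""), evidence.get(name, prompt)])
    return buf.getvalue()
```

Leaving the prompt in place for rows without evidence makes it obvious which criteria still need a link, log, or trace.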
Download the worksheet
Use the public template when you need a repeatable buying note, vendor review, or internal adoption decision. Keep the evidence column specific: link to docs, schemas, command output, screenshots, logs, or workflow traces.
Reusable templates: download the worksheet as Markdown or CSV. If the score exposes a risky gap, capture the evidence before adding more agent automation.
See a completed example
If you are unsure how much evidence to capture, use the worked example first. It shows how a hypothetical documentation search API can score 7 / 12, why that is a guarded pilot rather than a green light, and what fixes would raise confidence.
Example scorecard: compare your evidence quality with a filled-in documentation-retrieval workflow before sending a tool to an autonomous agent.
Want a second opinion?
If a tool or shortlist matters to an AI-agent rollout, send the scorecard context. AgentFirstTools can do a fit check before any paid audit. Useful context: rough score, lowest-scoring criterion, whether this is a purchase/adoption/build decision, deadline, and budget range.
What to do with the score
- 9–12: document the safe agent workflows and start measuring real autonomous usage.
- 5–8: pick the lowest-scoring criterion and fix that before adding more automation.
- 0–4: do not hand it to an autonomous agent yet; add inspectability, scoped access, and verification first.
Next step: if this scorecard reveals repeated gaps, those gaps are the roadmap. The most useful fixes are usually basic infrastructure: status endpoints, scoped tokens, dry runs, receipts, and rollback paths.
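The guidance above can be turned into a small decision helper. This is a sketch under stated assumptions: `next_step` and the lowercase criterion keys are hypothetical, and when several criteria tie for the lowest score it simply picks the first one in rubric order.

```python
# Hypothetical helper: names and keys are illustrative only.
CRITERIA = ("inspectable", "scriptable", "bounded",
            "verifiable", "recoverable", "composable")


def next_step(scores: dict[str, int]) -> str:
    """Map six 0-2 scores to the recommended action for their band."""
    total = sum(scores[name] for name in CRITERIA)
    # min() returns the first minimum, so ties resolve in rubric order.
    weakest = min(CRITERIA, key=lambda name: scores[name])
    if total >= 9:
        return "document the safe agent workflows and measure real autonomous usage"
    if total >= 5:
        return f"fix '{weakest}' before adding more automation"
    return ("do not hand it to an autonomous agent yet; add inspectability, "
            "scoped access, and verification first")


example = {"inspectable": 2, "scriptable": 1, "bounded": 0,
           "verifiable": 1, "recoverable": 1, "composable": 1}
print(next_step(example))
```

The example scores are invented for the illustration and total 6, so the helper points at the weakest criterion, here `bounded`.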