Practical asset
Agent-first tool scorecard
Use this scorecard to evaluate whether a tool is genuinely usable by autonomous AI agents, not just technically automatable. It turns the six AgentFirstTools criteria into a simple audit you can run against APIs, CLIs, SaaS products, internal platforms, and operational workflows.
10–20 minute audit · Six criteria · Designed for teams
How to score: give each criterion 0, 1, or 2 points. A tool scoring 9–12 is agent-ready for bounded workflows. 5–8 means useful but fragile. 0–4 means agents will need heavy human supervision or brittle workarounds.
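If you track scores in code rather than a doc, the bands above translate directly. A minimal sketch in Python, with the example scores filled in purely for illustration:

```python
# Scores per criterion, each 0-2 per the rubric below.
# These example values are illustrative, not from a real audit.
SCORES = {
    "inspectable": 2,
    "scriptable": 1,
    "bounded": 1,
    "verifiable": 2,
    "recoverable": 0,
    "composable": 1,
}

def verdict(scores: dict[str, int]) -> str:
    """Map a total score to the bands defined in the scorecard."""
    total = sum(scores.values())
    if total >= 9:
        return f"{total}/12: agent-ready for bounded workflows"
    if total >= 5:
        return f"{total}/12: useful but fragile"
    return f"{total}/12: needs heavy supervision or brittle workarounds"

print(verdict(SCORES))  # -> "7/12: useful but fragile"
```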
The scoring rubric
1. Inspectable
Can an agent discover capabilities, required inputs, current state, permissions, examples, and failure modes before acting?
0 · Hidden: Docs, state, and permissions are scattered or human-only.
1 · Partial: Some docs or endpoints exist, but the agent must infer important details.
2 · Clear: Machine-readable schemas, examples, state, and limits are easy to query.
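A tool at "2 · Clear" lets the agent answer those questions programmatically before it acts. A minimal sketch, assuming a hypothetical service at api.example.com that publishes a standard OpenAPI document:

```python
import json
from urllib.request import urlopen

# Hypothetical URL; a "2 · Clear" tool publishes something like this.
SPEC_URL = "https://api.example.com/openapi.json"

with urlopen(SPEC_URL) as resp:
    spec = json.load(resp)

# Before acting, the agent can enumerate what the tool supports
# and which inputs each operation requires.
for path, methods in spec.get("paths", {}).items():
    for method, op in methods.items():
        if method not in {"get", "post", "put", "patch", "delete"}:
            continue  # skip path-level keys like shared parameters
        required = [
            p["name"] for p in op.get("parameters", []) if p.get("required")
        ]
        print(f"{method.upper()} {path} requires {required}")
```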
2. Scriptable
Can every important workflow be called repeatably through a stable API, CLI, MCP server, webhook, file interface, or other non-brittle control surface?
0 · UI-only: Requires clicking through a web app for core work.
1 · Mixed: Basic operations are callable, but edge cases still require the UI.
2 · Complete: Core workflows are callable, documented, versioned, and automatable.
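At "2 · Complete", the same workflow the UI offers is a versioned, documented call. A sketch of what that surface might look like; the host, the /v2/reports path, and the response shape are all hypothetical:

```python
import json
from urllib.request import Request, urlopen

def create_report(api_key: str, title: str) -> dict:
    """Hypothetical "2 · Complete" surface: a versioned endpoint that
    covers the same core workflow the web UI does."""
    req = Request(
        "https://api.example.com/v2/reports",   # versioned, stable path
        data=json.dumps({"title": title}).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urlopen(req) as resp:
        return json.load(resp)  # structured response, not a UI toast
```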
3. Bounded
Can actions be scoped by workspace, role, resource, budget, time, and approval level?
0 · All or nothing: Credentials grant broad power.
1 · Coarse: Some roles or limits exist, but not enough for safe delegation.
2 · Least privilege: Agents can receive narrow, revocable authority for the task.
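What "narrow, revocable authority" can look like in practice: a short-lived grant scoped to one capability, one workspace, a budget, and an expiry. The field names below are illustrative, not any real API:

```python
# Hypothetical token-grant request illustrating "2 · Least privilege":
# the agent gets narrow, time-boxed, revocable authority for the task,
# not the owner's full credentials.
grant_request = {
    "scopes": ["reports:create"],        # one capability, not admin
    "workspace": "analytics-sandbox",    # one workspace, not the org
    "budget_usd": 5,                     # spend ceiling
    "expires_in": 900,                   # 15 minutes, then re-approve
    "approver": "oncall@example.com",    # who can revoke mid-task
}
```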
4. Verifiable
Does every meaningful action return durable evidence: IDs, URLs, status endpoints, logs, diffs, previews, audit events, or structured success and failure signals?
0 · Trust me: Only a spinner, toast, or vague success message.
1 · Some receipts: Receipts exist but are inconsistent or hard to query later.
2 · Receipts by default: Agents can cite and re-check proof of what happened.
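A "2 · Receipts by default" response gives the agent something it can store, cite, and re-check later. A sketch, with the response shape and status endpoint both hypothetical:

```python
import json
from urllib.request import urlopen

# Hypothetical receipt returned by the create call: durable IDs and a
# status URL the agent (or a human reviewer) can query again later.
receipt = {
    "id": "rpt_8f3a",
    "status_url": "https://api.example.com/v2/reports/rpt_8f3a/status",
    "audit_event": "evt_201",
}

# Re-verify instead of trusting the original success message.
with urlopen(receipt["status_url"]) as resp:
    status = json.load(resp)
assert status["state"] == "published", f"cannot confirm outcome: {status}"
```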
5. Recoverable
Are failures explicit, retries safe, partial progress visible, and destructive operations reversible or at least clearly marked as irreversible?
0 · Fragile: Timeouts and partial failures leave unknown state.
1 · Manual recovery: Humans can recover, but agents lack safe retry/rollback paths.
2 · Designed recovery: Idempotency, status checks, rollbacks, and takeover points exist.
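The standard building blocks for "2 · Designed recovery" are an idempotency key plus a status check. A sketch assuming the common Idempotency-Key header convention is supported here, with the jobs endpoint hypothetical:

```python
import json
import time
import uuid
from urllib.error import URLError
from urllib.request import Request, urlopen

# One key per logical action: retries with the same key are safe
# because the (hypothetical) server deduplicates on it.
key = str(uuid.uuid4())

def submit(payload: dict) -> dict:
    req = Request(
        "https://api.example.com/v2/jobs",
        data=json.dumps(payload).encode(),
        headers={"Idempotency-Key": key, "Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req, timeout=10) as resp:
        return json.load(resp)

for attempt in range(3):
    try:
        job = submit({"task": "export"})
        break
    except URLError:
        time.sleep(2 ** attempt)  # back off; same key means no duplicate job
else:
    raise RuntimeError("state unknown; check the status endpoint, not blind retry")
```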
6. Composable
Can the tool participate in larger agent workflows across repos, terminals, browsers, docs, inboxes, schedulers, CI, deployments, and human handoff?
0 · Silo: No useful integration points.
1 · Integrates: Some integrations exist, but workflow state is hard to pass around.
2 · Workflow-native: Inputs, outputs, permissions, and status fit broader automation loops.
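One simple way to be workflow-native is to read and write structured state at the process boundary. A sketch of a step that an agent, a CI job, or a scheduler could chain with pipes; the field names are illustrative:

```python
import json
import sys

# Read structured workflow state from the previous step, add this
# step's outputs, and pass everything through to the next step.
state = json.load(sys.stdin)

state["report_id"] = "rpt_8f3a"                  # output of this step
state["needs_human_review"] = state.get("risk", 0) > 0.8  # handoff flag

json.dump(state, sys.stdout)
```

Chained as `previous_step | python this_step.py | next_step`, each stage sees the full workflow state without custom glue code.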
Worksheet
Copy this into an issue, doc, or audit note for each tool you evaluate.
| Criterion | Score | Evidence / next fix |
| --- | --- | --- |
| Inspectable | 0 / 1 / 2 | What can the agent query before acting? |
| Scriptable | 0 / 1 / 2 | Which core workflows are stable and callable? |
| Bounded | 0 / 1 / 2 | How narrowly can credentials and approvals be scoped? |
| Verifiable | 0 / 1 / 2 | What receipt proves the outcome? |
| Recoverable | 0 / 1 / 2 | What happens after timeout, failure, or partial completion? |
| Composable | 0 / 1 / 2 | How does this plug into wider agent workflows? |
What to do with the score
- 9–12: document the safe agent workflows and start measuring real autonomous usage.
- 5–8: pick the lowest-scoring criterion and fix that before adding more automation.
- 0–4: do not hand it to an autonomous agent yet; add inspectability, scoped access, and verification first.
Next step: if this scorecard reveals repeated gaps, those gaps are the roadmap. The highest-value agent-first work is usually boring infrastructure: status endpoints, scoped tokens, dry-runs, receipts, and rollback paths.