Agent-First Tool Scorecard

How to score: give each of the six criteria 0, 1, or 2 points, for a maximum of 12. A tool scoring 9–12 is agent-ready for bounded workflows. A score of 5–8 means useful but fragile. A score of 0–4 means agents will need heavy human supervision or brittle workarounds.
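The arithmetic can be sketched in a few lines of Python. This is only an illustration: the names CRITERIA, total_score, and readiness_band are assumptions made for this example and are not part of any published tool.

```python
# Hypothetical helper names; only the point values and bands come from the scorecard.
CRITERIA = (
    "inspectable", "scriptable", "bounded",
    "verifiable", "recoverable", "composable",
)


def total_score(scores: dict[str, int]) -> int:
    """Sum the six criterion scores, rejecting missing or out-of-range values."""
    missing = [name for name in CRITERIA if name not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for name in CRITERIA:
        if scores[name] not in (0, 1, 2):
            raise ValueError(f"{name} must be 0, 1, or 2")
    return sum(scores[name] for name in CRITERIA)


def readiness_band(total: int) -> str:
    """Map a total out of 12 to the scorecard's three bands."""
    if not 0 <= total <= 12:
        raise ValueError("total must be between 0 and 12")
    if total >= 9:
        return "agent-ready for bounded workflows"
    if total >= 5:
        return "useful but fragile"
    return "needs heavy human supervision or brittle workarounds"
```

For example, a tool that scores 2 on three criteria and 1 on the other three totals 9 and lands in the top band.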

The scoring rubric

1. Inspectable

Can an agent discover capabilities, required inputs, current state, permissions, examples, and failure modes before acting?

0 · Hidden: Docs, state, and permissions are scattered or human-only.

1 · Partial: Some docs or endpoints exist, but the agent must infer important details.

2 · Clear: Machine-readable schemas, examples, state, and limits are easy to query.
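As a rough sketch of what a score of 2 could look like, a tool might publish a machine-readable description of each capability that an agent reads before acting. Everything here (the Capability fields, describe, missing_inputs) is a hypothetical shape chosen for illustration, not an existing schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Capability:
    """Hypothetical descriptor for one action a tool exposes."""
    name: str
    required_inputs: dict[str, str]  # input name -> type description
    required_permission: str
    failure_modes: list[str] = field(default_factory=list)
    example: dict = field(default_factory=dict)


def describe(capabilities: list[Capability]) -> str:
    """Return a JSON document an agent could fetch before acting."""
    body = {"capabilities": [asdict(c) for c in capabilities]}
    return json.dumps(body, indent=2, sort_keys=True)


def missing_inputs(capability: Capability, provided: dict) -> list[str]:
    """List required inputs the agent has not supplied yet."""
    return sorted(set(capability.required_inputs) - set(provided))
```

An agent that can call something like describe() and missing_inputs() does not have to infer required fields from prose documentation or from a failed request.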

2. Scriptable

Can every important workflow be called repeatably through a stable API, CLI, MCP server, webhook, file interface, or other non-brittle control surface?

0 · UI-only: Requires clicking a web app for core work.

1 · Mixed: Basic operations are callable, but edge cases still require the UI.

2 · Complete: Core workflows are callable, documented, versioned, and automatable.

3. Bounded

Can actions be scoped by workspace, role, resource, budget, time, and approval level?

0 · All or nothing: Credentials grant broad power.

1 · Coarse: Some roles or limits exist, but not enough for safe delegation.

2 · Least privilege: Agents can receive narrow, revocable authority for the task.
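One way to picture narrow, revocable authority is a grant object that is checked before every action. The sketch below assumes a hypothetical Grant record scoped by workspace, action, budget, and expiry; real tools will express these limits differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    """Hypothetical scoped credential handed to an agent for one task."""
    workspace: str
    actions: frozenset[str]
    budget_cents: int
    expires_at: datetime
    revoked: bool = False


def authorize(grant: Grant, workspace: str, action: str,
              cost_cents: int, now: datetime) -> tuple[bool, str]:
    """Return (allowed, reason) so a denial is explicit rather than silent."""
    if grant.revoked:
        return False, "grant revoked"
    if now >= grant.expires_at:
        return False, "grant expired"
    if workspace != grant.workspace:
        return False, "wrong workspace"
    if action not in grant.actions:
        return False, "action not granted"
    if cost_cents > grant.budget_cents:
        return False, "over budget"
    return True, "ok"
```

Returning a reason alongside the decision lets the agent report why it stopped, or request a wider grant through an approval step, instead of retrying blindly.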

4. Verifiable

Does every meaningful action return durable evidence: IDs, URLs, status endpoints, logs, diffs, previews, audit events, or structured success and failure signals?

0 · Trust me: Only a spinner, toast, or vague success message.

1 · Some receipts: Receipts exist but are inconsistent or hard to query later.

2 · Receipts by default: Agents can cite and re-check proof of what happened.
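A receipt an agent can cite and re-check might look roughly like the following. The Receipt fields and the status lookup callback are assumptions for illustration; the lookup stands in for whatever status endpoint a real tool exposes.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass(frozen=True)
class Receipt:
    """Hypothetical structured proof returned by an action."""
    action_id: str
    action: str
    status: str        # for example "succeeded" or "failed"
    resource_url: str
    detail: str = ""


def fingerprint(receipt: Receipt) -> str:
    """Stable SHA-256 digest of the receipt, suitable for citing in a log or report."""
    payload = json.dumps(asdict(receipt), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def recheck(receipt: Receipt, lookup_status: Callable[[str], str]) -> bool:
    """Ask the tool for the current status of the action and compare it to the receipt."""
    return lookup_status(receipt.action_id) == receipt.status
```

The important property is that the action ID remains queryable after the fact, so the claim "this succeeded" can be verified later rather than taken on trust.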

5. Recoverable

Are failures explicit, retries safe, partial progress visible, and destructive operations reversible or at least clearly marked as irreversible?

0 · Fragile: Timeouts and partial failures leave unknown state.

1 · Manual recovery: Humans can recover, but agents lack safe retry/rollback paths.

2 · Designed recovery: Idempotency, status checks, rollbacks, and takeover points exist.
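Idempotency is the piece of designed recovery that is easiest to show in code. The sketch below uses an in-memory store keyed by an idempotency key; IdempotentStore and call_with_retry are hypothetical names, and a real service would persist the keys durably.

```python
from typing import Callable


class IdempotentStore:
    """Remembers results by idempotency key so a retry never repeats the side effect."""

    def __init__(self) -> None:
        self._results: dict[str, dict] = {}

    def apply(self, key: str, operation: Callable[[], dict]) -> dict:
        if key in self._results:
            return self._results[key]  # retry: replay the stored result
        result = operation()           # first successful run performs the side effect
        self._results[key] = result
        return result

    def status(self, key: str) -> str:
        """Status check an agent can call after a timeout."""
        return "done" if key in self._results else "unknown"


def call_with_retry(store: IdempotentStore, key: str,
                    operation: Callable[[], dict], attempts: int = 3) -> dict:
    """Retry on timeout, reusing the same key on every attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Exception | None = None
    for _ in range(attempts):
        try:
            return store.apply(key, operation)
        except TimeoutError as exc:
            last_error = exc
    assert last_error is not None
    raise last_error
```

Because the key stays the same across attempts, a retry after a lost response returns the original result instead of, say, charging a customer twice.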

6. Composable

Can the tool participate in larger agent workflows across repos, terminals, browsers, docs, inboxes, schedulers, CI, deployments, and human handoff?

0 · Silo: No useful integration points.

1 · Integrates: Some integrations exist, but workflow state is hard to pass around.

2 · Workflow-native: Inputs, outputs, permissions, and status fit broader automation loops.

Worksheet

Copy this into an issue, doc, or audit note for each tool you evaluate.

Criterion   | Score     | Evidence / next fix
------------|-----------|--------------------------------------------------------
Inspectable | 0 / 1 / 2 | What can the agent query before acting?
Scriptable  | 0 / 1 / 2 | Which core workflows are stable and callable?
Bounded     | 0 / 1 / 2 | How narrowly can credentials and approvals be scoped?
Verifiable  | 0 / 1 / 2 | What receipt proves the outcome?
Recoverable | 0 / 1 / 2 | What happens after timeout, failure, or partial completion?
Composable  | 0 / 1 / 2 | How does this plug into wider agent workflows?

What to do with the score

9–12: document the safe agent workflows and start measuring real autonomous usage.
5–8: pick the lowest-scoring criterion and fix that before adding more automation.
0–4: do not hand it to an autonomous agent yet; add inspectability, scoped access, and verification first.
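The three rules above can be sketched as a single function that turns a filled-in worksheet into a next action. The name next_fix is hypothetical, and the tie-breaking rule (the first lowest-scoring criterion in worksheet order wins) is an assumption the scorecard does not specify.

```python
def next_fix(scores: dict[str, int]) -> str:
    """Suggest the next step from six criterion scores of 0, 1, or 2 each."""
    total = sum(scores.values())
    if total >= 9:
        return "document the safe agent workflows and measure autonomous usage"
    if total >= 5:
        # Assumption: on a tie, the first lowest-scoring criterion in dict order wins.
        weakest = min(scores, key=scores.get)
        return f"fix {weakest} before adding more automation"
    return "keep agents supervised; add inspectability, scoped access, and verification first"
```

For instance, a tool scoring 7 overall with a 0 on Bounded would be told to fix Bounded first.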

Next step: if this scorecard reveals repeated gaps, those gaps are the roadmap. The highest-value agent-first work is usually boring infrastructure: status endpoints, scoped tokens, dry-runs, receipts, and rollback paths.

