Completed example

What a filled-in tool scorecard looks like

Use this worked example as a model before scoring a real API, CLI, SaaS product, MCP server, or internal workflow. The point is not the exact vendor score; it is the kind of evidence an agent operator should collect before trusting a tool.

Example workflow: docs retrieval
Total score: 7 / 12
Decision: pilot with guardrails

Scenario: a product engineering team wants an AI coding agent to search official documentation before proposing API usage or infrastructure changes. They are evaluating a hypothetical documentation search API, not a specific provider recommendation.

Decision summary

Total score: 7 / 12
Recommended decision: Pilot
Fixes before broad rollout: 2

The tool is useful for bounded retrieval tasks, but not yet safe as the only documentation source for autonomous changes. The team should pilot it in read-only workflows, add source-verification checks, and keep a human approval step for code or infrastructure actions that depend on the results.

Filled scorecard

| Criterion | Score | Evidence and next fix |
| --- | --- | --- |
| Inspectable | 1 / 2 | The agent can read endpoint docs, query parameters, pricing pages, and example responses. It cannot query quota state or index freshness programmatically. Next fix: expose quota and freshness metadata in a status endpoint. |
| Scriptable | 2 / 2 | Search calls are available through HTTPS with stable JSON responses. Common filters are documented and easy to call from Python, shell, or an MCP wrapper. |
| Bounded | 1 / 2 | API keys can be restricted by project and rate limit, but not by allowed domain category or maximum spend per workflow. Next fix: use a proxy key with per-agent budget caps and logging. |
| Verifiable | 1 / 2 | Responses include URLs, titles, snippets, and timestamps. They do not prove that a result is the official page, and snippets can be stale. Next fix: require the agent to fetch selected URLs and verify canonical domains before citing. |
| Recoverable | 1 / 2 | 429 and timeout responses are explicit, but retry-after behaviour is inconsistent across failure types. Next fix: wrap calls with exponential backoff, cached last-good results, and a clear fallback source. |
| Composable | 1 / 2 | The output is JSON and can feed a retrieval pipeline, but result objects need normalization before they fit issue comments, audit receipts, and benchmark evidence tables. |
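
The arithmetic behind the total can be kept next to the evidence so a re-score is a one-line change. The following is a minimal sketch, not part of the scorecard itself: the `Criterion` class, the assumed 0 to 2 range per criterion, and the `pilot_at` / `adopt_at` cut-offs are illustrative assumptions chosen so that the example above lands on "pilot with guardrails".

```python
from dataclasses import dataclass

MAX_PER_CRITERION = 2  # assumed range: each criterion is scored 0, 1, or 2


@dataclass(frozen=True)
class Criterion:
    name: str
    score: int
    next_fix: str = ""


def total(scorecard):
    """Validate the scores and return (total, maximum possible)."""
    for c in scorecard:
        if not 0 <= c.score <= MAX_PER_CRITERION:
            raise ValueError(f"{c.name}: score {c.score} out of range")
    return sum(c.score for c in scorecard), MAX_PER_CRITERION * len(scorecard)


def decision(points, pilot_at=6, adopt_at=10):
    """Map a total to a decision. The cut-offs are illustrative assumptions."""
    if points >= adopt_at:
        return "adopt"
    if points >= pilot_at:
        return "pilot with guardrails"
    return "hold"


# Scores copied from the filled scorecard for the hypothetical docs search API.
docs_search = [
    Criterion("Inspectable", 1, "expose quota and freshness metadata"),
    Criterion("Scriptable", 2),
    Criterion("Bounded", 1, "proxy key with per-agent budget caps"),
    Criterion("Verifiable", 1, "fetch URLs and verify canonical domains"),
    Criterion("Recoverable", 1, "backoff, cached results, fallback source"),
    Criterion("Composable", 1, "normalize result objects"),
]

points, maximum = total(docs_search)
print(f"{points} / {maximum}: {decision(points)}")
```

Keeping the next fix alongside each score makes it easy to see which rows would move if a fix lands; your own thresholds for "pilot" and "adopt" may differ.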

Example evidence note

A useful scorecard is evidence-backed. Keep raw commands, response excerpts, docs links, and decision notes together so another agent or teammate can re-check the claim later.

Workflow: agent retrieves official docs before suggesting SDK usage.
Acceptance check: at least 3 official URLs, fetched and cited, before code changes.
Observed risk: search result says "official" in title but canonical domain differs.
Guardrail: reject results outside allowlisted vendor domains unless human approves.
Decision: read-only pilot; no autonomous edits based only on search snippets.
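
The guardrail and acceptance check in this note could be enforced with a small filter. This is a sketch under stated assumptions: the domains in `ALLOWED_DOMAINS` are placeholders, and the function names are hypothetical. It checks only the URL's scheme and host; fetching each page and confirming where it finally resolves is left out.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; replace with the vendor domains your team trusts.
ALLOWED_DOMAINS = {"docs.example.com", "example.com"}
MIN_OFFICIAL_URLS = 3  # from the acceptance check in the evidence note


def is_official(url, allowed=ALLOWED_DOMAINS):
    """True if the URL is https and its host is an allowlisted domain or subdomain."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme != "https" or not host:
        return False
    # Match the exact domain or a subdomain, never a look-alike suffix.
    return any(host == d or host.endswith("." + d) for d in allowed)


def acceptance_check(urls, minimum=MIN_OFFICIAL_URLS):
    """Split URLs into official and rejected; pass only with enough official ones."""
    official = [u for u in urls if is_official(u)]
    rejected = [u for u in urls if not is_official(u)]
    return len(official) >= minimum, official, rejected
```

Rejected URLs are returned rather than dropped so they can be routed to a human for approval, as the guardrail requires.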

What to do after a 7/12 score

  1. Run a narrow pilot. Keep the tool in read-only documentation lookup or research workflows.
  2. Patch the weakest evidence gap. In this example, source verification and budget scoping matter more than adding another integration.
  3. Measure real agent failures. Log missing official results, stale snippets, domain mismatches, 429s, and human overrides.
  4. Re-score after the fix. A 7 can become a 9 if verification and bounded usage are made explicit.
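
The Recoverable fix and the failure logging in step 3 can be sketched together. The wrapper below is an illustration under assumptions, not the API of any real search client: `RateLimited`, `search_with_recovery`, and the counter keys are hypothetical names, and `search` stands for whatever callable performs the lookup.

```python
import random
import time
from collections import Counter

failures = Counter()  # tallies for the failure log, e.g. "429" or "timeout"
last_good = {}        # query -> most recent successful result


class RateLimited(Exception):
    """Raised by the (hypothetical) search client on HTTP 429."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after


def search_with_recovery(search, query, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call search(query) with exponential backoff, then fall back to the cache."""
    for attempt in range(retries):
        try:
            result = search(query)
            last_good[query] = result
            return result, "live"
        except RateLimited as exc:
            failures["429"] += 1
            # Use the server's hint when present; otherwise back off exponentially.
            delay = exc.retry_after or base_delay * 2 ** attempt
        except TimeoutError:
            failures["timeout"] += 1
            delay = base_delay * 2 ** attempt
        if attempt < retries - 1:
            sleep(delay + random.uniform(0, delay / 2))  # add jitter
    if query in last_good:
        failures["served_from_cache"] += 1
        return last_good[query], "cached"
    failures["no_result"] += 1
    return None, "fallback"
```

The second return value tells the caller whether the result is live, cached, or missing, so the agent can switch to the fallback source instead of citing stale data silently. Reviewing `failures` at the end of the pilot gives the evidence needed for the re-score.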

Reusable example: download the completed example as Markdown, then replace the scenario and evidence with your tool context.
