# Completed Agent-First Tool Scorecard Example

Use this worked example as a model before scoring a real API, CLI, SaaS product, MCP server, or internal workflow.

## Scenario

A product engineering team wants an AI coding agent to search official documentation before proposing API usage or infrastructure changes. The documentation search API evaluated here is hypothetical; this example does not recommend a specific provider.

## Decision summary

- Total score: 7 / 12
- Decision: pilot with guardrails
- Recommended scope: read-only documentation lookup and research workflows
- Do not allow: autonomous code or infrastructure changes based only on search snippets

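The total is plain arithmetic over the six criterion scores in the filled scorecard. A minimal sketch of that tally follows; the names `SCORES`, `tally`, and `gaps` are illustrative and not part of any scorecard tooling.

```python
# Scores copied from the filled scorecard in this example.
# Each criterion is scored from 0 to 2.
SCORES = {
    "Inspectable": 1,
    "Scriptable": 2,
    "Bounded": 1,
    "Verifiable": 1,
    "Recoverable": 1,
    "Composable": 1,
}
MAX_PER_CRITERION = 2


def tally(scores):
    """Return (total, maximum) for a scorecard, rejecting out-of-range scores."""
    for name, score in scores.items():
        if not 0 <= score <= MAX_PER_CRITERION:
            raise ValueError(f"{name}: score {score} is outside 0-{MAX_PER_CRITERION}")
    return sum(scores.values()), MAX_PER_CRITERION * len(scores)


def gaps(scores):
    """Return the criteria scored below the maximum, lowest score first."""
    below_max = [name for name, score in scores.items() if score < MAX_PER_CRITERION]
    return sorted(below_max, key=lambda name: scores[name])


total, maximum = tally(SCORES)
print(f"Total score: {total} / {maximum}")  # Total score: 7 / 12
print("Criteria with a next fix:", ", ".join(gaps(SCORES)))
```

Listing the criteria below the maximum alongside the total keeps the "next fix" items visible when the decision is reviewed.
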
## Filled scorecard

| Criterion | Score | Evidence and next fix |
| --- | ---: | --- |
| Inspectable | 1 / 2 | The agent can read endpoint docs, query parameters, pricing pages, and example responses. It cannot query quota state or index freshness programmatically. Next fix: expose quota and freshness metadata in a status endpoint. |
| Scriptable | 2 / 2 | Search calls are available through HTTPS with stable JSON responses. Common filters are documented and easy to call from Python, shell, or an MCP wrapper. |
| Bounded | 1 / 2 | API keys can be restricted by project and rate limit, but not by allowed domain category or maximum spend per workflow. Next fix: use a proxy key with per-agent budget caps and logging. |
| Verifiable | 1 / 2 | Responses include URLs, titles, snippets, and timestamps. They do not prove that a result is the official page, and snippets can be stale. Next fix: require the agent to fetch selected URLs and verify canonical domains before citing. |
| Recoverable | 1 / 2 | 429 and timeout responses are explicit, but retry-after behaviour is inconsistent across failure types. Next fix: wrap calls with exponential backoff, cached last-good results, and a clear fallback source. |
| Composable | 1 / 2 | The output is JSON and can feed a retrieval pipeline, but result objects need normalization before they fit issue comments, audit receipts, and benchmark evidence tables. Next fix: map results to one normalized schema before passing them to those outputs. |

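The Recoverable fix (exponential backoff, cached last-good results, and a clear fallback source) could look roughly like the sketch below. Everything named here is an assumption for illustration: `RetryableError` stands in for however a client surfaces 429 and timeout responses, and `search` and `fallback` are callables supplied by the caller. The hypothetical API in this example defines none of them.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical error raised by a client on a 429 or timeout response."""

    def __init__(self, message, retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after  # seconds, when the server supplied a value


_last_good = {}  # query -> most recent successful results


def search_with_recovery(query, search, fallback,
                         max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Return (results, source), where source is 'live', 'cached', or 'fallback'."""
    for attempt in range(max_attempts):
        try:
            results = search(query)
        except RetryableError as err:
            if attempt == max_attempts - 1:
                break  # out of attempts; fall through to cache or fallback
            # Retry-after is inconsistent across failure types, so honour it
            # when present and otherwise back off exponentially.
            if err.retry_after is not None:
                delay = err.retry_after
            else:
                delay = base_delay * 2 ** attempt
            sleep(delay + random.uniform(0, 0.1 * delay))  # small jitter
        else:
            _last_good[query] = results
            return results, "live"
    if query in _last_good:
        return _last_good[query], "cached"
    return fallback(query), "fallback"
```

Returning the source label with the results lets the agent state in its evidence whether a citation came from a live call, a cached result, or the fallback source.
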
## Evidence note

```text
Workflow: agent retrieves official docs before suggesting SDK usage.
Acceptance check: at least 3 official URLs, fetched and cited, before code changes.
Observed risk: search result says "official" in title but canonical domain differs.
Guardrail: reject results outside allowlisted vendor domains unless human approves.
Decision: read-only pilot; no autonomous edits based only on search snippets.
```

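The guardrail and acceptance check in the note above can be sketched as a small filter. This is a sketch under stated assumptions: the allowlist entries are placeholder domains, the function names are hypothetical, and the caller is assumed to pass the canonical URL of each page after fetching it, not the URL shown in the search snippet.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; a real pilot would list the vendor's own doc domains.
ALLOWED_DOMAINS = {"docs.example.com", "example.com"}
MIN_OFFICIAL_URLS = 3  # from the acceptance check in the evidence note


def is_allowlisted(url, allowed=ALLOWED_DOMAINS):
    """True if the URL's host is an allowlisted domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in allowed)


def acceptance_check(canonical_urls, human_approved=frozenset()):
    """Return (passed, official, rejected) for a list of fetched canonical URLs.

    URLs outside the allowlist count only if a human has approved them.
    """
    official = [u for u in canonical_urls
                if is_allowlisted(u) or u in human_approved]
    rejected = [u for u in canonical_urls if u not in official]
    return len(official) >= MIN_OFFICIAL_URLS, official, rejected


urls = [
    "https://docs.example.com/sdk/install",
    "https://docs.example.com/sdk/auth",
    "https://example.com/changelog",
    "https://example.com.evil.test/official-docs",  # "official" in the path only
]
passed, official, rejected = acceptance_check(urls)
print(passed, rejected)
```

Matching on the parsed hostname instead of a substring is the design choice that matters: a substring test would accept `example.com.evil.test`, which is exactly the observed risk of a result that looks official but lives on a different domain.
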
## After a 7 / 12 score

1. Run a narrow pilot in read-only documentation lookup or research workflows.
2. Patch the weakest evidence gap before broad rollout.
3. Measure missing official results, stale snippets, domain mismatches, 429s, and human overrides.
4. Re-score after verification and bounded usage are made explicit.

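The measurements in step 3 can be collected with a simple per-lookup counter. The sketch below is illustrative only: the event names and the `PilotLog` class are hypothetical, and a real pilot would likely write to whatever logging or metrics system the team already uses.

```python
from collections import Counter

# Event names mirror the measurements listed in step 3; the identifiers are made up.
PILOT_EVENTS = {
    "missing_official_result",
    "stale_snippet",
    "domain_mismatch",
    "http_429",
    "human_override",
}


class PilotLog:
    """Counts pilot events per documentation lookup."""

    def __init__(self):
        self.lookups = 0
        self.events = Counter()

    def record_lookup(self, *events):
        """Record one lookup and any events observed during it."""
        unknown = set(events) - PILOT_EVENTS
        if unknown:
            raise ValueError(f"unknown pilot events: {sorted(unknown)}")
        self.lookups += 1
        self.events.update(events)

    def rates(self):
        """Return events per lookup for every tracked event."""
        if self.lookups == 0:
            return {event: 0.0 for event in sorted(PILOT_EVENTS)}
        return {event: self.events[event] / self.lookups
                for event in sorted(PILOT_EVENTS)}


log = PilotLog()
log.record_lookup()                            # clean lookup
log.record_lookup("domain_mismatch")
log.record_lookup("http_429", "stale_snippet")
log.record_lookup("human_override")
print(log.rates())
```

Per-lookup rates give the re-score in step 4 something concrete to compare against, for example whether domain mismatches dropped after the allowlist guardrail was added.
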
## Blank template

Use the blank scorecard: https://agentfirsttools.com/tools/agent-first-scorecard/
