Why we ran the pilot
Before publishing a search API benchmark, we needed to check whether the harness saved the right evidence. An agent needs more than a list of links. It needs URLs, snippets, timing, result counts, errors, and enough structure to decide what to do next.
The pilot helped us test that evidence format on real API responses.
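To make that concrete, here is a minimal sketch of a per-call evidence record. The `EvidenceRecord` shape and every field name are our illustration of the idea, not the harness's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class SearchResult:
    url: str
    snippet: str
    rank: int  # 1-based position in the provider's response


@dataclass
class EvidenceRecord:
    provider: str             # e.g. "tavily", "exa", "serpapi"
    task_id: str              # stable identifier for the benchmark task
    query: str
    latency_ms: float         # wall-clock time of the API call
    result_count: int
    error: str | None = None  # populated when the call fails
    results: list[SearchResult] = field(default_factory=list)


# A record with this shape gives an agent enough structure to decide
# its next step: retry on error, follow a URL, or reformulate the query.
record = EvidenceRecord(
    provider="exa",
    task_id="docs-0042",
    query="official pandas merge documentation",
    latency_ms=412.0,
    result_count=10,
    results=[SearchResult(url="https://pandas.pydata.org/docs/", snippet="...", rank=1)],
)
```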
What changed after the pilot
The pilot was useful, but mostly for improving the benchmark process. It led to five decisions:
- Separate AI search APIs from SERP wrapper APIs. SerpAPI is useful as a baseline, but it is not the same kind of product as Tavily or Exa.
- Publish objective observations first: result count, source rank, latency, errors, and standard retrieval metrics (see the sketch after this list).
- Avoid composite overall scores unless the weights are explicit and tied to a specific buyer use case.
- Use reviewed relevance labels before publishing rankings. URL and domain patterns are not enough for a public benchmark.
- Keep pilot notes out of the main site journey unless they give readers a practical decision they can act on.
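To ground the second and third decisions, the sketch below computes two standard retrieval metrics, recall@k and MRR, from reviewed relevance labels, then shows what an explicitly weighted score would look like. The metric definitions are standard; the weights, the 2000 ms latency ceiling, and the function names are illustrative assumptions, not values from the pilot.

```python
def recall_at_k(ranked_urls: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of relevant URLs that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for url in ranked_urls[:k] if url in relevant)
    return hits / len(relevant)


def mrr(ranked_urls: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result, 0.0 if none appears."""
    for i, url in enumerate(ranked_urls, start=1):
        if url in relevant:
            return 1.0 / i
    return 0.0


# A composite score is only honest when the weights are explicit and
# tied to a stated use case. These weights are hypothetical examples.
WEIGHTS = {"recall_at_10": 0.5, "mrr": 0.3, "latency": 0.2}


def composite_score(recall10: float, mrr_value: float, latency_ms: float) -> float:
    # Normalise latency into [0, 1]; treating 2000 ms as "too slow"
    # is an assumption a real benchmark would have to justify.
    latency_score = max(0.0, 1.0 - latency_ms / 2000.0)
    return (WEIGHTS["recall_at_10"] * recall10
            + WEIGHTS["mrr"] * mrr_value
            + WEIGHTS["latency"] * latency_score)
```

Publishing the per-metric observations alongside any composite keeps readers able to re-weight for their own use case.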
Draft metric summary
The draft figures show only that the metric pipeline ran end to end. They do not show which provider is best, and the relevance labels still need review.
What evidence we kept
For each provider and task, the pilot saved the timestamp, latency, result count, top URLs, source ranks, response notes, redacted raw output, pooled results, draft relevance labels, and draft metrics.
A compact public summary is available as evidence-summary.json.
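One plausible way to derive that file from the full records is sketched below. The field selection and the `summarise` function are our guess at the shape; the published evidence-summary.json may keep different fields.

```python
import json


def summarise(records: list[dict]) -> list[dict]:
    """Reduce full evidence records to a compact public summary."""
    compact = []
    for rec in records:
        compact.append({
            "provider": rec["provider"],
            "task_id": rec["task_id"],
            "latency_ms": rec["latency_ms"],
            "result_count": rec["result_count"],
            # Keep only the top URLs; redacted raw output and draft
            # labels stay in the private evidence bundle.
            "top_urls": [r["url"] for r in rec.get("results", [])][:5],
            "error": rec.get("error"),
        })
    return compact


# In practice, records would be loaded from the full pilot bundle.
records = [{
    "provider": "tavily",
    "task_id": "docs-0001",
    "latency_ms": 388.0,
    "result_count": 8,
    "results": [{"url": "https://docs.python.org/3/"}],
}]

with open("evidence-summary.json", "w") as f:
    json.dump(summarise(records), f, indent=2)
```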
What a publishable benchmark needs
- One clear category first, probably AI search APIs for official documentation retrieval.
- Four to six comparable providers, with SERP wrappers, if included, labelled only as baselines.
- Fifty to one hundred scenario-based tasks in one dated run.
- Pooled and reviewed relevance labels (a pooling sketch follows this list).
- Plain-English findings backed by the evidence bundle.
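Pooling here means the standard IR practice of merging each provider's top results into one deduplicated candidate set before human review. A minimal sketch, assuming results are plain URL lists per provider; the `pool_results` function is hypothetical.

```python
def pool_results(runs: dict[str, list[str]], depth: int = 10) -> list[str]:
    """Merge each provider's top `depth` URLs into one deduplicated
    pool so reviewers label every candidate exactly once."""
    pool: list[str] = []
    seen: set[str] = set()
    for urls in runs.values():
        for url in urls[:depth]:
            if url not in seen:
                seen.add(url)
                pool.append(url)
    return pool


# Reviewers see the pooled URLs without provider attribution, so
# URL or domain patterns cannot stand in for real relevance review.
pooled = pool_results({
    "tavily": ["https://a.example/docs", "https://b.example/guide"],
    "exa": ["https://b.example/guide", "https://c.example/ref"],
})
```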