Why we ran the pilot
Before publishing a search API benchmark, we needed to check whether the harness saved the right evidence. An agent needs more than a list of links. It needs URLs, snippets, timing, result counts, errors, and enough structure to decide what to do next.
The pilot helped us test that evidence format on real API responses.
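To make that concrete, here is a minimal sketch of a per-call evidence record. The `EvidenceRecord` shape and every field name are our illustration of the idea, not the harness's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class SearchResult:
    url: str
    snippet: str
    rank: int  # 1-based position in the provider's response


@dataclass
class EvidenceRecord:
    provider: str             # e.g. "tavily", "exa", "serpapi"
    task_id: str              # stable identifier for the benchmark task
    query: str
    latency_ms: float         # wall-clock time of the API call
    result_count: int
    error: str | None = None  # populated when the call fails
    results: list[SearchResult] = field(default_factory=list)


# A record with this shape gives an agent enough structure to decide
# its next step: retry on error, follow a URL, or reformulate the query.
record = EvidenceRecord(
    provider="exa",
    task_id="docs-0042",
    query="official pandas merge documentation",
    latency_ms=412.0,
    result_count=10,
    results=[SearchResult(url="https://pandas.pydata.org/docs/", snippet="...", rank=1)],
)
```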
What changed after the pilot
The pilot was useful, but mostly for improving the benchmark process. It led to five decisions:
- Separate AI search APIs from SERP wrapper APIs. SerpAPI is useful as a baseline, but it is not the same kind of product as Tavily or Exa.
- Publish objective observations first: result count, source rank, latency, errors, and standard retrieval metrics (see the sketch after this list).
- Avoid composite overall scores unless the weights are explicit and tied to a specific buyer use case.
- Use reviewed relevance labels before publishing rankings. URL and domain patterns are not enough for a public benchmark.
- Keep pilot notes out of the main site journey unless they give readers a practical decision they can act on.
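To ground the second and third decisions, the sketch below computes two standard retrieval metrics, recall@k and MRR, from reviewed relevance labels, then shows what an explicitly weighted score would look like. The metric definitions are standard; the weights, the 2000 ms latency ceiling, and the function names are illustrative assumptions, not values from the pilot.

```python
def recall_at_k(ranked_urls: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of relevant URLs that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for url in ranked_urls[:k] if url in relevant)
    return hits / len(relevant)


def mrr(ranked_urls: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result, 0.0 if none appears."""
    for i, url in enumerate(ranked_urls, start=1):
        if url in relevant:
            return 1.0 / i
    return 0.0


# A composite score is only honest when the weights are explicit and
# tied to a stated use case. These weights are hypothetical examples.
WEIGHTS = {"recall_at_10": 0.5, "mrr": 0.3, "latency": 0.2}


def composite_score(recall10: float, mrr_value: float, latency_ms: float) -> float:
    # Normalise latency into [0, 1]; treating 2000 ms as "too slow"
    # is an assumption a real benchmark would have to justify.
    latency_score = max(0.0, 1.0 - latency_ms / 2000.0)
    return (WEIGHTS["recall_at_10"] * recall10
            + WEIGHTS["mrr"] * mrr_value
            + WEIGHTS["latency"] * latency_score)
```

Publishing the per-metric observations alongside any composite keeps readers able to re-weight for their own use case.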
Draft metric summary
The draft figures show only that the metric pipeline ran end to end. They do not show which provider is best, and the relevance labels still need review.
What evidence we kept
For each provider and task, the pilot saved the timestamp, latency, result count, top URLs, source ranks, response notes, redacted raw output, pooled results, draft relevance labels, and draft metrics.
A compact public summary is available as evidence-summary.json.
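One plausible way to derive that file from the full records is sketched below. The field selection and the `summarise` function are our guess at the shape; the published evidence-summary.json may keep different fields.

```python
import json


def summarise(records: list[dict]) -> list[dict]:
    """Reduce full evidence records to a compact public summary."""
    compact = []
    for rec in records:
        compact.append({
            "provider": rec["provider"],
            "task_id": rec["task_id"],
            "latency_ms": rec["latency_ms"],
            "result_count": rec["result_count"],
            # Keep only the top URLs; redacted raw output and draft
            # labels stay in the private evidence bundle.
            "top_urls": [r["url"] for r in rec.get("results", [])][:5],
            "error": rec.get("error"),
        })
    return compact


# In practice, records would be loaded from the full pilot bundle.
records = [{
    "provider": "tavily",
    "task_id": "docs-0001",
    "latency_ms": 388.0,
    "result_count": 8,
    "results": [{"url": "https://docs.python.org/3/"}],
}]

with open("evidence-summary.json", "w") as f:
    json.dump(summarise(records), f, indent=2)
```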
What a publishable benchmark needs
- One clear category first, probably AI search APIs for official documentation retrieval.
- Four to six comparable providers, with SERP wrappers, if included, labelled only as baselines.
- Fifty to one hundred scenario-based tasks in one dated run.
- Pooled and reviewed relevance labels (a pooling sketch follows this list).
- Plain-English findings backed by the evidence bundle.
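Pooling here means the standard IR practice of merging each provider's top results into one deduplicated candidate set before human review. A minimal sketch, assuming results are plain URL lists per provider; the `pool_results` function is hypothetical.

```python
def pool_results(runs: dict[str, list[str]], depth: int = 10) -> list[str]:
    """Merge each provider's top `depth` URLs into one deduplicated
    pool so reviewers label every candidate exactly once."""
    pool: list[str] = []
    seen: set[str] = set()
    for urls in runs.values():
        for url in urls[:depth]:
            if url not in seen:
                seen.add(url)
                pool.append(url)
    return pool


# Reviewers see the pooled URLs without provider attribution, so
# URL or domain patterns cannot stand in for real relevance review.
pooled = pool_results({
    "tavily": ["https://a.example/docs", "https://b.example/guide"],
    "exa": ["https://b.example/guide", "https://c.example/ref"],
})
```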