Lab note · May 2026 pilot

Search API benchmark pilot note

We ran a small pilot with Brave, Tavily, and SerpAPI to test the benchmark process. This page is not a provider ranking or a buying recommendation.

Run ID: 20260506T213530Z · Providers: 3 · Tasks: 10 · Status: internal pilot note
Do not use these numbers to choose a provider. The task set was small, the relevance labels were still drafts, and one provider was a SERP wrapper rather than an AI search API. We are keeping this note for transparency, but future benchmark pages should use reviewed labels and a clearer provider category.

Why we ran the pilot

Before publishing a search API benchmark, we needed to check whether the harness saved the right evidence. An agent needs more than a list of links. It needs URLs, snippets, timing, result counts, errors, and enough structure to decide what to do next.

The pilot helped us test that evidence format on real API responses.
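
To make that concrete, here is a minimal sketch of the kind of per-call record we mean. The class and field names are illustrative assumptions, not the pilot's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SearchEvidence:
    """One provider response for one task (illustrative schema)."""
    provider: str                 # e.g. "brave", "tavily", "serpapi"
    task_id: str
    query: str
    timestamp_utc: str            # ISO 8601, so runs can be dated and replayed
    latency_ms: float             # wall-clock time of the API call
    result_count: int             # how many results came back
    urls: list[str] = field(default_factory=list)      # ranked result URLs
    snippets: list[str] = field(default_factory=list)  # text an agent can actually read
    error: Optional[str] = None   # non-empty when the call failed
```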

What changed after the pilot

The pilot was useful, but mostly for improving the benchmark process. It led to the five decisions listed under "What a publishable benchmark needs" below.

Draft metric summary

These figures show that the metric pipeline ran. They do not show which provider is best. The relevance labels still need review.

Provider  NDCG@10  Success@3  Success@5  MRR     P@5     Median latency
Brave     0.6100   0.7000     0.7000     0.6433  0.3200  1055 ms
SerpAPI   0.6553   0.7000     0.7000     0.7000  0.3800    41 ms
Tavily    0.5921   0.6000     0.7000     0.5468  0.3800   623 ms
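
For context on how figures like these are computed, here is a minimal sketch of the standard formulas, assuming binary relevance labels in rank order. This is the textbook maths, not the pilot harness itself:

```python
import math

def ndcg_at_k(rels: list[int], k: int = 10) -> float:
    """NDCG@k for one query; rels are relevance grades in rank order."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = sum(r / math.log2(i + 2)
                for i, r in enumerate(sorted(rels, reverse=True)[:k]))
    return dcg / ideal if ideal > 0 else 0.0

def mrr(rels: list[int]) -> float:
    """Reciprocal rank of the first relevant result (0 if none)."""
    for i, r in enumerate(rels):
        if r > 0:
            return 1.0 / (i + 1)
    return 0.0

def precision_at_k(rels: list[int], k: int = 5) -> float:
    """Fraction of the top-k results labelled relevant."""
    return sum(1 for r in rels[:k] if r > 0) / k

def success_at_k(rels: list[int], k: int = 3) -> float:
    """1.0 if any of the top-k results is relevant, else 0.0."""
    return 1.0 if any(r > 0 for r in rels[:k]) else 0.0
```

Per-provider values in a table like the one above would then typically be means of these per-task scores over the run, with latency reported as a median.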

What evidence we kept

For each provider and task, the pilot saved the timestamp, latency, result count, top URLs, source ranks, response notes, redacted raw output, pooled results, draft relevance labels, and draft metrics.

A compact public summary is available as evidence-summary.json.
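
The schema of that file is not documented in this note; the sketch below assumes a top-level list of per-provider, per-task entries with provider and latency_ms keys, which is an assumption rather than the real format:

```python
import json
from statistics import median

# Assumed structure: [{"provider": ..., "task_id": ..., "latency_ms": ...}, ...]
with open("evidence-summary.json") as f:
    entries = json.load(f)

by_provider: dict[str, list[float]] = {}
for entry in entries:
    by_provider.setdefault(entry["provider"], []).append(entry["latency_ms"])

for provider, latencies in sorted(by_provider.items()):
    print(f"{provider}: median latency {median(latencies):.0f} ms "
          f"over {len(latencies)} tasks")
```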

What a publishable benchmark needs

  1. One clear category first, probably AI search APIs for official documentation retrieval.
  2. Four to six comparable providers; SERP wrappers, if included at all, labelled only as baselines.
  3. Fifty to one hundred scenario-based tasks in one dated run.
  4. Pooled and reviewed relevance labels (see the pooling sketch after this list).
  5. Plain-English findings backed by the evidence bundle.
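
On point 4, pooling means merging each provider's top results into one deduplicated set per task, so every candidate URL is labelled once and no provider is judged against labels it had no chance to match. A minimal sketch, assuming ranked URL lists per provider:

```python
def pool_results(runs: dict[str, list[str]], depth: int = 10) -> list[str]:
    """Merge the top-`depth` URLs from each provider's run into one
    deduplicated pool for labelling (order of first appearance kept)."""
    pool: list[str] = []
    seen: set[str] = set()
    for urls in runs.values():
        for url in urls[:depth]:
            if url not in seen:
                seen.add(url)
                pool.append(url)
    return pool
```
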
Current editorial decision: this pilot should stay as a low-profile lab note, not a featured benchmark. Future public benchmark pages should be more complete before they appear in navigation or the sitemap.