Search Quality Benchmarks
This page documents the search behavior that exists in code today and suggests how to evaluate it against your own registry.
Code-backed behavior
Implemented in src/registry.rs:
- BM25 keyword search
- optional semantic retrieval when the semantic feature is enabled
- Reciprocal Rank Fusion for hybrid ranking
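As an illustration of the hybrid ranking step, here is a minimal sketch of Reciprocal Rank Fusion. It is not the code in src/registry.rs: the function name `rrf_fuse`, the use of plain tool names as items, and the smoothing constant `k` (60 is a commonly used default, but this page does not say which value the registry uses) are all assumptions.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists with Reciprocal Rank Fusion.
/// Each item scores the sum of 1 / (k + rank) over the lists that contain it,
/// with rank starting at 1. `k` is a hypothetical smoothing parameter.
fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (idx, item) in ranking.iter().enumerate() {
            let rank = (idx + 1) as f64;
            *scores.entry(item.to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties are broken by name so output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Calling `rrf_fuse(&[bm25_hits, semantic_hits], 60.0)` with two ordered lists of names returns one merged list. Items that appear in both lists accumulate two terms, which is why agreement between retrievers tends to win over a single high rank.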
Current fixed settings:
- BM25 k1 = 1.2
- BM25 b = 0.75
- name-token boost by duplication
- hybrid search fetches at least 30 candidates from each retriever before fusion
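To show where these settings enter the score, here is a hedged sketch of a per-term BM25 contribution and of boosting name tokens by repeating them. The constants come from the list above; everything else is an assumption for illustration: the function names, the smoothed IDF variant, whitespace tokenization, and the `name_copies` parameter (this page does not say how many times name tokens are duplicated).

```rust
const K1: f64 = 1.2; // term-frequency saturation
const B: f64 = 0.75; // strength of document-length normalization

/// BM25 contribution of one query term to one document.
/// tf: occurrences of the term in the document; doc_len / avg_len: token counts;
/// n_docs: documents in the index; df: documents containing the term.
fn bm25_term_score(tf: f64, doc_len: f64, avg_len: f64, n_docs: f64, df: f64) -> f64 {
    // Smoothed, non-negative IDF (an assumed variant).
    let idf = (1.0 + (n_docs - df + 0.5) / (df + 0.5)).ln();
    let norm = 1.0 - B + B * doc_len / avg_len;
    idf * tf * (K1 + 1.0) / (tf + K1 * norm)
}

/// Build the token list for an entry, repeating the name tokens so that
/// a match on the name raises term frequency more than a match in the description.
fn index_tokens(name: &str, description: &str, name_copies: usize) -> Vec<String> {
    let mut tokens = Vec::new();
    for _ in 0..name_copies {
        tokens.extend(name.split_whitespace().map(|t| t.to_lowercase()));
    }
    tokens.extend(description.split_whitespace().map(|t| t.to_lowercase()));
    tokens
}
```

Because BM25 saturates with term frequency, duplicating a name token raises its score by less than the duplication factor, so the boost stays bounded.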
What to test
Build a benchmark set with three categories:
- exact-name or exact-term queries
- conceptual queries that need semantic matching
- ambiguous queries that need good ranking, not just recall
Suggested fields:
Useful metrics
- MRR for "did the first useful tool rank high enough?"
- Recall@5 for "did discovery surface the right tool set?"
- nDCG@5 for "were the best tools near the top?"
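The three metrics above can be computed per query with a few small functions. This is a sketch under stated assumptions: results and judgments are plain tool names, relevance is binary (a tool is either in the expected set or not), and the function names are hypothetical. Average each value over all benchmark queries to get the reported score.

```rust
/// Reciprocal rank of the first relevant result, or 0.0 if none appears.
/// The mean of this value over all queries is MRR.
fn reciprocal_rank(ranked: &[&str], relevant: &[&str]) -> f64 {
    ranked
        .iter()
        .position(|r| relevant.contains(r))
        .map_or(0.0, |i| 1.0 / (i as f64 + 1.0))
}

/// Fraction of the relevant tools that appear in the top `k` results.
fn recall_at_k(ranked: &[&str], relevant: &[&str], k: usize) -> f64 {
    if relevant.is_empty() {
        return 0.0;
    }
    let hits = ranked.iter().take(k).filter(|r| relevant.contains(*r)).count();
    hits as f64 / relevant.len() as f64
}

/// nDCG@k with binary gains: each relevant hit at 0-based position i adds 1 / log2(i + 2).
fn ndcg_at_k(ranked: &[&str], relevant: &[&str], k: usize) -> f64 {
    let dcg: f64 = ranked
        .iter()
        .take(k)
        .enumerate()
        .filter(|(_, r)| relevant.contains(*r))
        .map(|(i, _)| 1.0 / (i as f64 + 2.0).log2())
        .sum();
    // Ideal ordering puts every relevant tool first, up to k slots.
    let ideal: f64 = (0..relevant.len().min(k))
        .map(|i| 1.0 / (i as f64 + 2.0).log2())
        .sum();
    if ideal == 0.0 { 0.0 } else { dcg / ideal }
}
```

If your benchmark set records graded relevance instead of a yes/no label, the nDCG gain term would need to use that grade; the binary form here is the simplest starting point.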
Scaling guidance
For the current implementation:
- BM25 is a scan over registry entries and short descriptions
- semantic lookup is brute-force cosine over stored vectors
Both paths do work proportional to the number of entries on every query, which is a good fit for the current design space of dozens or hundreds of tools. If the registry grows into the many-thousands range, a more specialized indexing strategy may be worth considering.
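For a sense of what the brute-force semantic lookup costs, here is a minimal sketch of scoring every stored vector against a query. The function names, the `(String, Vec<f32>)` storage layout, and the handling of mismatched or zero-length vectors are assumptions, not a description of src/registry.rs.

```rust
/// Cosine similarity; returns 0.0 for mismatched dimensions or zero-norm vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Score every stored vector against the query and keep the best `limit` names.
/// Work per query is (number of entries) x (vector dimension).
fn top_by_cosine<'a>(
    query: &[f32],
    stored: &'a [(String, Vec<f32>)],
    limit: usize,
) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = stored
        .iter()
        .map(|(name, vector)| (name.as_str(), cosine(query, vector)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1)); // highest similarity first
    scored.truncate(limit);
    scored
}
```

With a few hundred entries this loop is cheap, so timing it against your own registry is a reasonable way to decide when an approximate nearest-neighbor index would start to pay off.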