Improve reference lookup with BM25F-style sparse retrieval by natarajsundar · Pull Request #216 · andrewyng/context-hub

natarajsundar · 2026-04-11T01:00:08Z

Summary

This PR improves Context Hub relevance for reference-file lookups by upgrading the first-pass sparse retriever instead of adding a second-pass reranker.

Specifically, it:

- extracts deterministic search signals from references/ files at build time:
  - file stems (for example, raw-body.md → raw body)
  - humanized reference paths
  - first Markdown headings
- indexes those signals into search-index.json
- switches sparse scoring to a BM25F-style combined-field scorer across:
  - id
  - name
  - tags
  - description
  - referenceTitles
  - referencePaths
  - referenceHeadings
- adds regression tests for acronym and reference-stem queries
- adds a short design note under docs/reference-search-relevance.md
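
To illustrate the build-time extraction step listed above, here is a minimal sketch in Python. It is not the code in this PR: the helper names (`humanize`, `first_heading`, `reference_signals`), the separator set, and the choice to keep the `references` prefix in the humanized path are all assumptions made for illustration.

```python
import re
from pathlib import PurePosixPath


def humanize(text: str) -> str:
    """Replace runs of -, _, ., / and whitespace with one space, then lowercase."""
    return re.sub(r"[-_./\s]+", " ", text).strip().lower()


def first_heading(markdown: str) -> str:
    """Return the text of the first ATX heading ('# ...'), or '' if there is none.

    Simplification: lines inside fenced code blocks are not skipped.
    """
    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.+?)\s*$", line)
        if match:
            return match.group(1)
    return ""


def reference_signals(rel_path: str, markdown: str) -> dict:
    """Derive deterministic search signals from one reference file (hypothetical shape)."""
    path = PurePosixPath(rel_path)
    return {
        "title": humanize(path.stem),                    # raw-body.md -> "raw body"
        "path": humanize(str(path.with_suffix(""))),     # keeps the directory names
        "heading": first_heading(markdown),
    }
```

Under these assumptions, `reference_signals("references/raw-body.md", text)` yields the title `raw body`, which is the kind of token a short query like "raw body" can then match in the first pass.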

Why

A lot of the highest-signal content in Context Hub lives in references/*.md, but the current first-pass index is built from top-level entry metadata only. That makes short, realistic developer queries like:

rrf
raw body
hnsw

harder to retrieve in the first pass even when the answer is present in a reference file.

This PR keeps retrieval fully local and sparse, but makes the index aware of the reference structure that already exists in the content model.

Design notes

This is intentionally different from a reranking-based approach:

- no extra runtime retrieval stage
- no external model or embedding dependency
- no new service boundary
- still explainable and deterministic

The retrieval idea is closer to classical IR improvements:

- BM25F / combined-field sparse scoring
- index-time document expansion using structured metadata already present in the corpus
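
For readers unfamiliar with combined-field scoring, the following Python sketch shows the general BM25F idea: weighted, length-normalized term frequencies are summed across fields first, and saturation is applied once to the combined value. It is an illustration only. The field weights, `K1`, `B`, the whitespace tokenizer, and the assumption that reference fields are lists of strings are all hypothetical; the PR does not state the values or data shapes it uses.

```python
import math

# Hypothetical weights for a subset of fields; not taken from the PR.
FIELD_WEIGHTS = {"name": 3.0, "description": 1.0, "referenceTitles": 2.0}
K1 = 1.2   # saturation parameter (a common default, assumed here)
B = 0.75   # length-normalization strength (a common default, assumed here)


def tokenize(value):
    """Lowercase whitespace tokenizer; accepts a string or a list of strings."""
    text = " ".join(value) if isinstance(value, (list, tuple)) else value
    return text.lower().split()


def bm25f_scores(query, docs):
    """Return one BM25F-style score per document (dicts of field -> text)."""
    if not docs:
        return []
    # Missing fields tokenize to [], so older documents still score.
    tokens = [{f: tokenize(d.get(f, "")) for f in FIELD_WEIGHTS} for d in docs]
    n_docs = len(docs)
    avg_len = {
        f: (sum(len(t[f]) for t in tokens) / n_docs) or 1.0 for f in FIELD_WEIGHTS
    }
    scores = [0.0] * n_docs
    for term in set(tokenize(query)):
        # Document frequency: documents containing the term in any field.
        df = sum(any(term in t[f] for f in FIELD_WEIGHTS) for t in tokens)
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for i, t in enumerate(tokens):
            pseudo_tf = 0.0
            for field, weight in FIELD_WEIGHTS.items():
                tf = t[field].count(term)
                if tf:
                    norm = 1 - B + B * len(t[field]) / avg_len[field]
                    pseudo_tf += weight * tf / norm
            # Saturate once, on the combined pseudo-frequency.
            scores[i] += idf * pseudo_tf / (K1 + pseudo_tf)
    return scores
```

Saturating after combining fields is what distinguishes this from running BM25 per field and summing the results: a term repeated across several fields does not get counted as several independent strong matches.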

Backward compatibility

Existing search-index.json payloads continue to work:

- older indexes without the new reference fields still score correctly
- the new fields only activate when the build step emits them
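
One way to get this behavior, sketched in Python under assumptions, is to treat any missing reference field as empty when an index entry is loaded, so an older entry contributes nothing from those fields and is otherwise unchanged. The helper name `normalize_entry` and the list-of-strings field shape are hypothetical and not taken from the PR.

```python
# Field names come from the PR summary; the list-valued shape is an assumption.
REFERENCE_FIELDS = ("referenceTitles", "referencePaths", "referenceHeadings")


def normalize_entry(entry: dict) -> dict:
    """Return a copy of an index entry with missing reference fields set to []."""
    normalized = dict(entry)  # shallow copy; the caller's entry is not mutated
    for field in REFERENCE_FIELDS:
        normalized.setdefault(field, [])
    return normalized
```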

Testing

- npm test passes
- chub build sample-content/ --validate-only succeeds
- Manual testing done (describe below)
- verified new regression cases for:
  - rrf
  - raw body stripe
  - backward compatibility with pre-existing sparse index documents

natarajsundar · 2026-04-11T01:27:47Z

Hi @rohitprasad15 @danielhorvath-cleo @andrewyng @Ivanye2509 I’d really appreciate a review from contributors who are closest to Context Hub’s search, ranking, and content architecture. This PR proposes an additive relevance improvement path with a working benchmark and companion implementation, while keeping the current content model intact. I’d especially value feedback on scope, fit with the existing CLI direction, and what subset would make the best upstream first step. Thanks for taking a look.

Improve reference lookup with BM25F-style sparse retrieval

3bb050a

natarajsundar wants to merge 1 commit into andrewyng:main from natarajsundar:feat/bm25f
