Improve reference lookup with BM25F-style sparse retrieval #216
Open
natarajsundar wants to merge 1 commit into andrewyng:main from
Hi @rohitprasad15 @danielhorvath-cleo @andrewyng @Ivanye2509 I’d really appreciate a review from contributors who are closest to Context Hub’s search, ranking, and content architecture. This PR proposes an additive relevance improvement path with a working benchmark and companion implementation, while keeping the current content model intact. I’d especially value feedback on scope, fit with the existing CLI direction, and what subset would make the best upstream first step. Thanks for taking a look.
Summary
This PR improves Context Hub relevance for reference-file lookups by upgrading the first-pass sparse retriever instead of adding a second-pass reranker.
Specifically, it:
- … references/ files at build time
- … (raw-body.md → raw body)
- … search-index.json
- … id, name, tags, description, referenceTitles, referencePaths, referenceHeadings
- … docs/reference-search-relevance.md

Why
A lot of the highest-signal content in Context Hub lives in references/*.md, but the current first-pass index is built from top-level entry metadata only. That makes short, realistic developer queries like "rrf", "raw body", and "hnsw" harder to match in the first pass even when the answer is present in a reference file.
This PR keeps retrieval fully local and sparse, but makes the index aware of the reference structure that already exists in the content model.
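As a rough illustration of what "aware of the reference structure" could mean at build time, the sketch below derives searchable fields from reference files. It is not the PR's code: the helper names (pathToWords, extractHeadings, buildReferenceFields), the input shape, and the choice to treat the first heading as the title are all assumptions made for the example. Only the output field names (referenceTitles, referencePaths, referenceHeadings) come from the PR description.

```javascript
const FENCE = "`".repeat(3); // Markdown code-fence marker

// Hypothetical helper: turn a reference path into searchable words,
// e.g. "references/raw-body.md" becomes "raw body".
function pathToWords(refPath) {
  const base = refPath.split("/").pop().replace(/\.md$/i, "");
  return base.replace(/[-_]+/g, " ").trim().toLowerCase();
}

// Hypothetical helper: collect ATX headings ("# ...", "## ...") from
// Markdown text, skipping lines inside fenced code blocks.
function extractHeadings(markdown) {
  const headings = [];
  let inFence = false;
  for (const line of markdown.split(/\r?\n/)) {
    if (line.trimStart().startsWith(FENCE)) {
      inFence = !inFence;
      continue;
    }
    if (inFence) continue;
    const match = /^#{1,6}\s+(.+?)\s*#*\s*$/.exec(line);
    if (match) headings.push(match[1]);
  }
  return headings;
}

// Hypothetical build step. `references` is an assumed input shape:
// an array of { path, content } objects, one per reference file.
function buildReferenceFields(references) {
  const fields = { referenceTitles: [], referencePaths: [], referenceHeadings: [] };
  for (const { path, content } of references) {
    const headings = extractHeadings(content);
    const pathWords = pathToWords(path);
    fields.referencePaths.push(pathWords);
    // Assumption: the first heading acts as the title; fall back to the path words.
    fields.referenceTitles.push(headings[0] ?? pathWords);
    fields.referenceHeadings.push(...headings.slice(1));
  }
  return fields;
}
```

With fields like these merged into each index entry, a query such as "raw body" has something to match even when the entry's own name and description never mention it.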
Design notes
This is intentionally different from a reranking-based approach:
The retrieval idea is closer to classical IR improvements:
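One classical technique in this family, and the one named in the PR title, is BM25F-style scoring: term frequencies from several fields are length-normalized, weighted, and summed before a single BM25 saturation step is applied. The sketch below is a minimal, self-contained version of that idea and not the PR's implementation. The field weights, the K1 and B values, the tokenizer, and the function names are illustrative assumptions; only the field names come from the PR description.

```javascript
// Illustrative weights and parameters (assumptions, not taken from the PR).
const FIELD_WEIGHTS = {
  id: 3, name: 3, tags: 2, description: 1,
  referenceTitles: 2, referencePaths: 2, referenceHeadings: 1.5,
};
const K1 = 1.2; // term-frequency saturation
const B = 0.75; // strength of per-field length normalization

function tokenize(text) {
  return String(text ?? "").toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Fields may hold a string or an array of strings.
function fieldTokens(doc, field) {
  const value = doc[field];
  return tokenize(Array.isArray(value) ? value.join(" ") : value);
}

// Collection statistics: average length per field and document frequency per term.
function buildStats(docs) {
  const avgLen = {};
  const docFreq = new Map();
  for (const field of Object.keys(FIELD_WEIGHTS)) {
    const total = docs.reduce((sum, d) => sum + fieldTokens(d, field).length, 0);
    avgLen[field] = total / docs.length;
  }
  for (const doc of docs) {
    const seen = new Set();
    for (const field of Object.keys(FIELD_WEIGHTS)) {
      for (const t of fieldTokens(doc, field)) seen.add(t);
    }
    for (const t of seen) docFreq.set(t, (docFreq.get(t) ?? 0) + 1);
  }
  return { n: docs.length, avgLen, docFreq };
}

function bm25fScore(query, doc, stats) {
  let score = 0;
  for (const term of new Set(tokenize(query))) {
    // Combine weighted, length-normalized term frequencies across fields.
    let weightedTf = 0;
    for (const [field, weight] of Object.entries(FIELD_WEIGHTS)) {
      const tokens = fieldTokens(doc, field);
      const tf = tokens.filter((t) => t === term).length;
      if (tf === 0) continue;
      const norm = 1 - B + B * (tokens.length / stats.avgLen[field]);
      weightedTf += (weight * tf) / norm;
    }
    if (weightedTf === 0) continue;
    const df = stats.docFreq.get(term) ?? 0;
    const idf = Math.log(1 + (stats.n - df + 0.5) / (df + 0.5));
    // Saturate once, after the fields have been combined.
    score += idf * (weightedTf / (K1 + weightedTf));
  }
  return score;
}
```

Saturating after the fields are combined, rather than scoring each field separately and adding the results, is what distinguishes this from a plain sum of per-field BM25 scores: a term repeated across many fields cannot accumulate unbounded credit.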
Backward compatibility
Existing search-index.json payloads continue to work:

Testing
- npm test passes
- chub build sample-content/ --validate-only succeeds
- … rrf, raw body stripe