Improve reference lookup with BM25F-style sparse retrieval #216

Open
natarajsundar wants to merge 1 commit into andrewyng:main from natarajsundar:feat/bm25f

natarajsundar commented Apr 11, 2026

Summary

This PR improves Context Hub relevance for reference-file lookups by upgrading the first-pass sparse retriever instead of adding a second-pass reranker.

Specifically, it:

  • extracts deterministic search signals from references/ files at build time
    • file stems (for example raw-body.md → raw body)
    • humanized reference paths
    • first Markdown headings
  • indexes those signals into search-index.json
  • switches sparse scoring to a BM25F-style combined-field scorer across:
    • id
    • name
    • tags
    • description
    • referenceTitles
    • referencePaths
    • referenceHeadings
  • adds regression tests for acronym and reference-stem queries
  • adds a short design note under docs/reference-search-relevance.md
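
The build-time signal extraction listed above could look roughly like the sketch below. This is an illustration only, not the PR's code: the helper names (humanizeStem, humanizePath, firstHeading, extractReferenceSignals), the input shape, and the mapping of file stems to referenceTitles are all assumptions.

```javascript
// Hypothetical helpers: names and normalization rules are assumptions,
// not taken from the PR.

// "raw-body.md" -> "raw body"
function humanizeStem(fileName) {
  return fileName
    .replace(/\.[^./]+$/, "") // drop the file extension
    .replace(/[-_]+/g, " ")   // treat dashes and underscores as spaces
    .trim()
    .toLowerCase();
}

// "references/webhooks/raw-body.md" -> "references webhooks raw body"
function humanizePath(relPath) {
  return relPath.split("/").map(humanizeStem).filter(Boolean).join(" ");
}

// Returns the text of the first ATX heading ("# Title"), or null.
// Simplification: "#" lines inside fenced code blocks are not skipped.
function firstHeading(markdown) {
  for (const line of markdown.split(/\r?\n/)) {
    const m = /^#{1,6}\s+(.+?)\s*#*\s*$/.exec(line);
    if (m) return m[1].trim();
  }
  return null;
}

// files: [{ path, content }] for one entry's references/ directory (assumed shape).
function extractReferenceSignals(files) {
  const signals = { referenceTitles: [], referencePaths: [], referenceHeadings: [] };
  for (const { path, content } of files) {
    const fileName = path.split("/").pop();
    signals.referenceTitles.push(humanizeStem(fileName)); // assumed: stem -> title
    signals.referencePaths.push(humanizePath(path));
    const heading = firstHeading(content);
    if (heading) signals.referenceHeadings.push(heading);
  }
  return signals;
}
```

Because every step is a pure string transformation, the same input files always produce the same signals, which is what keeps the extraction deterministic.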

Why

A lot of the highest-signal content in Context Hub lives in references/*.md, but the current first-pass index is built from top-level entry metadata only. That makes short, realistic developer queries like:

  • rrf
  • raw body
  • hnsw

harder to match in the first pass even when the answer is present in a reference file.

This PR keeps retrieval fully local and sparse, but makes the index aware of the reference structure that already exists in the content model.

Design notes

This is intentionally different from a reranking-based approach:

  • no extra runtime retrieval stage
  • no external model or embedding dependency
  • no new service boundary
  • still explainable and deterministic

The retrieval idea is closer to classical IR improvements:

  • BM25F / combined-field sparse scoring
  • index-time document expansion using structured metadata already present in the corpus
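
For readers unfamiliar with BM25F, here is a minimal sketch of combined-field scoring in its simple form: per-field term frequencies are length-normalized, weighted, and summed into one pseudo-frequency before saturation is applied. The field weights, the K1 and B values, the tokenizer, and all function names are illustrative assumptions, not values or code from this PR.

```javascript
// Illustrative weights and parameters; the PR's actual values are not shown here.
const FIELD_WEIGHTS = {
  id: 3, name: 3, tags: 2, description: 1,
  referenceTitles: 2, referencePaths: 1, referenceHeadings: 1.5,
};
const K1 = 1.2; // saturation
const B = 0.75; // length normalization, shared by all fields for simplicity

function tokenize(value) {
  const text = Array.isArray(value) ? value.join(" ") : String(value ?? "");
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function buildIndex(docs) {
  const fields = Object.keys(FIELD_WEIGHTS);
  const entries = docs.map((doc) => {
    const tokens = {};
    for (const f of fields) tokens[f] = tokenize(doc[f]); // missing field -> []
    return { doc, tokens };
  });
  const avgLen = {};
  for (const f of fields) {
    const total = entries.reduce((sum, e) => sum + e.tokens[f].length, 0);
    avgLen[f] = total / Math.max(entries.length, 1);
  }
  const df = new Map(); // term -> number of documents containing it in any field
  for (const e of entries) {
    const seen = new Set(fields.flatMap((f) => e.tokens[f]));
    for (const t of seen) df.set(t, (df.get(t) ?? 0) + 1);
  }
  return { entries, avgLen, df, fields };
}

function scoreEntry(index, entry, query) {
  const n = index.entries.length;
  let total = 0;
  for (const term of new Set(tokenize(query))) {
    let pseudoTf = 0; // weighted, length-normalized tf summed across fields
    for (const f of index.fields) {
      const tf = entry.tokens[f].filter((t) => t === term).length;
      if (tf === 0) continue;
      const norm = 1 - B + B * (entry.tokens[f].length / index.avgLen[f]);
      pseudoTf += (FIELD_WEIGHTS[f] * tf) / norm;
    }
    if (pseudoTf === 0) continue;
    const d = index.df.get(term) ?? 0;
    const idf = Math.log(1 + (n - d + 0.5) / (d + 0.5));
    total += idf * (pseudoTf / (K1 + pseudoTf)); // saturate once, after combining
  }
  return total;
}

function search(docs, query) {
  const index = buildIndex(docs);
  return index.entries
    .map((e) => ({ id: e.doc.id, score: scoreEntry(index, e, query) }))
    .filter((r) => r.score > 0)
    .sort((a, b) => b.score - a.score);
}
```

With an entry whose referenceTitles contains "raw body", a query such as "raw body stripe" can match on both the reference signal and the entry name in a single pass, with no second retrieval stage.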

Backward compatibility

Existing search-index.json payloads continue to work:

  • older indexes without the new reference fields still score correctly
  • the new fields only activate when the build step emits them
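
One way to get this behavior is to treat absent reference fields as empty when an index is loaded, so they contribute nothing to the score. The sketch below assumes the new fields are arrays of strings and that entries are plain objects; normalizeEntry is a hypothetical name, not a function from the PR.

```javascript
// Assumed field names (from the PR description) and assumed array-of-strings shape.
const REFERENCE_FIELDS = ["referenceTitles", "referencePaths", "referenceHeadings"];

// Returns a copy of an index entry with any missing reference field set to [].
// Entries from older indexes pass through otherwise unchanged.
function normalizeEntry(entry) {
  const normalized = { ...entry };
  for (const field of REFERENCE_FIELDS) {
    if (!Array.isArray(normalized[field])) normalized[field] = [];
  }
  return normalized;
}
```

An entry from an older search-index.json then looks, to the scorer, like an entry whose reference fields simply have no terms.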

Testing

  • npm test passes
  • chub build sample-content/ --validate-only succeeds
  • Manual testing done (describe below)
  • verified new regression cases for:
    • rrf
    • raw body stripe
    • backward compatibility with pre-existing sparse index documents

natarajsundar (Author) commented Apr 11, 2026

Hi @rohitprasad15 @danielhorvath-cleo @andrewyng @Ivanye2509, I’d really appreciate a review from contributors who are closest to Context Hub’s search, ranking, and content architecture. This PR proposes an additive relevance improvement path with a working benchmark and companion implementation, while keeping the current content model intact. I’d especially value feedback on scope, fit with the existing CLI direction, and what subset would make the best first step upstream. Thanks for taking a look.
