[ZEPPELIN-6411] Semantic search for Zeppelin by kkalyan · Pull Request #5218 · apache/zeppelin

kkalyan · 2026-04-19T18:05:26Z

What is this PR for?

Added EmbeddingSearch — a new SearchService implementation that enables natural language search across Zeppelin notebooks using ONNX-based sentence embeddings (all-MiniLM-L6-v2).
Disabled by default, enabled with zeppelin.search.semantic.enable = true.

The problem:
Zeppelin's built-in search uses Lucene's keyword matching, which works well for exact terms but falls short for the way analysts actually search.
A user looking for "yesterday's spending" gets zero results — even though their notebooks contain SELECT sum(cost) WHERE date = current_date -
interval '1' day. The words don't match, so Lucene can't find it.

This PR adds EmbeddingSearch, an alternative SearchService that uses sentence embeddings (all-MiniLM-L6-v2 via ONNX Runtime) to match by meaning
instead of keywords. It runs entirely in-process with no external services required.
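To illustrate what "match by meaning" involves, here is a minimal sketch of the two numeric steps named in the Todos below (mean pooling and cosine similarity). It assumes the model returns one vector per token plus an attention mask. The class `EmbeddingMath` and its method names are hypothetical and not taken from the PR's EmbeddingSearch.java.

```java
// Illustrative sketch only; not the PR's actual EmbeddingSearch code.
final class EmbeddingMath {

  /**
   * Averages the token vectors whose attention-mask entry is non-zero,
   * so padding tokens do not dilute the sentence embedding.
   * Assumes tokenEmbeddings is non-empty and rectangular.
   */
  static float[] meanPool(float[][] tokenEmbeddings, long[] attentionMask) {
    int dim = tokenEmbeddings[0].length;
    float[] pooled = new float[dim];
    int counted = 0;
    for (int t = 0; t < tokenEmbeddings.length; t++) {
      if (attentionMask[t] == 0) {
        continue; // skip padding
      }
      for (int d = 0; d < dim; d++) {
        pooled[d] += tokenEmbeddings[t][d];
      }
      counted++;
    }
    if (counted > 0) {
      for (int d = 0; d < dim; d++) {
        pooled[d] /= counted;
      }
    }
    return pooled;
  }

  /** Cosine similarity in [-1, 1]; returns 0 when either vector is all zeros. */
  static double cosine(float[] a, float[] b) {
    double dot = 0;
    double normA = 0;
    double normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    if (normA == 0 || normB == 0) {
      return 0.0;
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }
}
```

A query and a paragraph would each be pooled into one vector, and paragraphs ranked by `cosine(queryVector, paragraphVector)`. Whether the PR normalizes vectors at index time is not stated here.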

Beyond semantic matching, EmbeddingSearch addresses other gaps in notebook search:

Indexes paragraph output — table results and text output become searchable, not just the code
Extracts SQL table names — FROM/JOIN references are extracted and used to boost related paragraphs in a two-phase ranking
Strips interpreter prefixes — %spark.sql, %python etc. are removed so they don't pollute search results
Live indexing — new or updated paragraphs are searchable immediately, no restart needed
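As a rough sketch of the prefix stripping and table name extraction listed above: the regular expressions below are assumptions chosen for illustration, and the class `ParagraphTextPrep` is hypothetical. The PR's actual patterns may differ.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; patterns are assumed, not copied from the PR.
final class ParagraphTextPrep {

  // Assumed: FROM or JOIN followed by an optionally schema-qualified identifier.
  private static final Pattern TABLE_REF =
      Pattern.compile("\\b(?:FROM|JOIN)\\s+([A-Za-z_][\\w.]*)", Pattern.CASE_INSENSITIVE);

  // Assumed: a leading interpreter token such as %spark.sql or %python.
  private static final Pattern INTERPRETER_PREFIX = Pattern.compile("^\\s*%[\\w.]+\\s*");

  /** Removes a leading %interpreter marker so it is not embedded with the code. */
  static String stripInterpreterPrefix(String paragraphText) {
    return INTERPRETER_PREFIX.matcher(paragraphText).replaceFirst("");
  }

  /** Returns lower-cased table names in order of first appearance. */
  static Set<String> extractTableNames(String sql) {
    Set<String> tables = new LinkedHashSet<>();
    Matcher m = TABLE_REF.matcher(sql);
    while (m.find()) {
      tables.add(m.group(1).toLowerCase(Locale.ROOT));
    }
    return tables;
  }
}
```

A regex like this is deliberately simple: it skips subqueries that start with a parenthesis, but it would misread `EXTRACT(year FROM col)` as a table reference and does not handle quoted identifiers.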

What type of PR is it?

Feature

Todos

EmbeddingSearch core implementation (ONNX inference, mean pooling, cosine similarity)
Table name extraction from SQL (FROM/JOIN regex) with two-phase search boosting
Paragraph output indexing (TABLE, TEXT results)
Versioned binary persistence (v3 format)
Live indexing (new paragraphs searchable immediately)
Angular UI: render search results with separate code/output/tables blocks
Classic UI: same improvements
11 unit tests including semantic validation
Documentation
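The "two-phase search boosting" item above is only described at a high level in this PR (discover relevant tables, then boost matches). One possible reading, sketched under that assumption: take the tables referenced by the top-K hits, add a fixed boost to every hit that references any of them, then re-sort. `TwoPhaseRanker`, `Hit`, `topK`, and `boost` are hypothetical names and parameters.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; the PR's real ranking logic may differ.
final class TwoPhaseRanker {

  /** One candidate paragraph with its similarity score and extracted tables. */
  static final class Hit {
    final String paragraphId;
    final Set<String> tables;
    double score;

    Hit(String paragraphId, double score, Set<String> tables) {
      this.paragraphId = paragraphId;
      this.score = score;
      this.tables = tables;
    }
  }

  /** Returns a new list sorted by boosted score; note that Hit.score is mutated. */
  static List<Hit> rank(List<Hit> hits, int topK, double boost) {
    Comparator<Hit> byScoreDesc = Comparator.comparingDouble((Hit h) -> h.score).reversed();
    List<Hit> sorted = new ArrayList<>(hits);
    sorted.sort(byScoreDesc);

    // Phase 1: collect the tables referenced by the best topK hits.
    Set<String> relevantTables = new HashSet<>();
    for (Hit h : sorted.subList(0, Math.min(topK, sorted.size()))) {
      relevantTables.addAll(h.tables);
    }

    // Phase 2: boost every hit that touches one of those tables.
    for (Hit h : sorted) {
      for (String table : h.tables) {
        if (relevantTables.contains(table)) {
          h.score += boost;
          break;
        }
      }
    }

    // Re-sort so the boosted scores actually change the order.
    sorted.sort(byScoreDesc);
    return sorted;
  }
}
```

The final re-sort matters: without it, boosted scores would be computed but the result order would still reflect the unboosted ranking.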

What is the Jira issue?

https://issues.apache.org/jira/browse/ZEPPELIN-6411

How should this be tested?

Automated tests:

# Embedding search tests (requires ~86MB model download, one-time)
ZEPPELIN_EMBEDDING_TEST=true mvn test -pl zeppelin-zengine -Dtest=EmbeddingSearchTest

# Verify no regressions to existing Lucene search
mvn test -pl zeppelin-zengine -Dtest=LuceneSearchTest

Manual testing:

1. Set zeppelin.search.semantic.enable = true in zeppelin-site.xml
2. Restart Zeppelin
3. Search for natural language queries like:
  - "yesterday's spending" (Lucene: 0 results → Semantic: finds spend queries)
  - "how much do drivers earn" (finds taxi tip analysis)
  - "late deliveries" (finds shipping performance queries)
  - "airport rides" (both work — keyword match exists)

Screenshots (if appropriate)

Semantic Search with New UI

Semantic Search with Classic UI

Questions:

Do the license files need to be updated?
Yes. NOTICE is updated with ONNX Runtime (MIT) and DJL Tokenizers (Apache 2.0) attribution.
Are there breaking changes for older versions?
No. The feature is disabled by default, and existing LuceneSearch behavior is unchanged.

Add EmbeddingSearch, a new SearchService implementation that enables natural language search across notebooks using ONNX-based sentence embeddings (all-MiniLM-L6-v2).

Disabled by default, enabled with: zeppelin.search.semantic.enable = true

Key improvements over keyword search:
- Understands meaning, not just exact keywords
- Indexes paragraph output (table data, text results)
- Strips interpreter prefixes for cleaner matching
- Zero external services: runs entirely in-process

JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411

Add EmbeddingSearch, a new SearchService implementation that enables natural language search across notebooks using ONNX-based sentence embeddings (all-MiniLM-L6-v2).

Disabled by default, enabled with: zeppelin.search.semantic.enable = true

Key improvements over keyword search:
- Understands meaning, not just exact keywords
- Indexes paragraph output (table data, text results)
- Extracts and boosts SQL table names (FROM/JOIN)
- Two-phase search: discover relevant tables, then boost matches
- Strips interpreter prefixes for cleaner matching
- Zero external services: runs entirely in-process

Frontend improvements (both Angular and Classic UI):
- Search results show SQL code, output data, and table names in separate styled blocks instead of a single code editor
- Language badges (sql/python/md) on search result cards

New files:
- EmbeddingSearch.java: core implementation
- EmbeddingSearchTest.java: 11 tests including semantic validation
- docs/embedding-search.md: architecture documentation

JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411

Add two-phase search, table extraction, output indexing, frontend changes, and live indexing test to documentation. JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411

Copilot

Pull request overview

This PR introduces an optional semantic notebook search implementation (EmbeddingSearch) that uses ONNX-based sentence embeddings to match queries by meaning (plus output indexing and SQL table boosting), and updates both Classic and Angular UIs to render richer search results.

Changes:

Add EmbeddingSearch (ONNX Runtime + DJL tokenizer) with binary persistence and live indexing.
Add semantic-search config flag (zeppelin.search.semantic.enable) and wire the server to select Lucene vs semantic search.
Update Classic + Angular search result rendering (code/output/tables blocks) and adjust TypeScript typing/build settings.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 13 comments.

Show a summary per file

File	Description
zeppelin-zengine/src/main/java/org/apache/zeppelin/search/EmbeddingSearch.java	New semantic search service (model download, embedding, indexing, persistence, query ranking).
zeppelin-zengine/src/test/java/org/apache/zeppelin/search/EmbeddingSearchTest.java	New gated tests for semantic indexing/query behavior.
zeppelin-zengine/src/main/java/org/apache/zeppelin/conf/ZeppelinConfiguration.java	Add config accessor + conf var for semantic search enablement.
zeppelin-server/src/main/java/org/apache/zeppelin/server/ZeppelinServer.java	Bind `EmbeddingSearch` when semantic search is enabled.
zeppelin-zengine/pom.xml	Add ONNX Runtime + DJL tokenizers dependencies.
zeppelin-web/src/app/search/result-list.html	Classic UI search results layout changes.
zeppelin-web/src/app/search/result-list.controller.js	Classic UI result parsing for code/output/tables + language badge.
zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.ts	Angular UI result parsing + simplified rendering (no Monaco/highlighting).
zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.html	Angular UI template to show code/output/tables and badge.
zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.less	Angular UI styling for new result layout.
zeppelin-web-angular/tsconfig.base.json	TS compiler option changes.
zeppelin-web-angular/projects/zeppelin-sdk/tsconfig.json	TS compiler option changes for SDK build.
zeppelin-web-angular/src/app/utility/get-keyword-positions.ts	Tighten type for `positions`.
zeppelin-web-angular/src/app/share/run-scripts/run-scripts.directive.ts	Type annotations / casts for script execution logic.
zeppelin-web-angular/src/app/services/save-as.service.ts	Type annotation for `binaryData`.
zeppelin-web-angular/src/app/pages/workspace/notebook/paragraph/code-editor/code-editor.component.ts	Type annotation for `newDecorations`.
zeppelin-web-angular/src/app/pages/workspace/notebook/notebook.component.ts	Safer optional chaining on permissions access.
zeppelin-web-angular/src/app/pages/workspace/credential/credential.component.ts	Type cast for destructuring credentials.
docs/embedding-search.md	New documentation for semantic search design and usage.
NOTICE	Add attributions for ONNX Runtime and DJL tokenizers.


jongyoul · 2026-04-20T03:01:00Z

@kkalyan Thank you for the contribution. Could you please check the CI first and fix it? Moreover, I will review it but I have a simple question. Do we download the model when we start the server?

kkalyan · 2026-04-20T05:10:42Z

> @kkalyan Thank you for the contribution. Could you please check the CI first and fix it? Moreover, I will review it but I have a simple question. Do we download the model when we start the server?

Hi @jongyoul - Yes, the model (~86MB) is downloaded on first start when zeppelin.search.semantic.enable=true (disabled by default). It's cached at {zeppelin.search.index.path}/models/all-MiniLM-L6-v2/ and reused on subsequent starts. I'll fix the CI issues — ESLint brace-style in the classic UI controller and the missing ASF license header on the docs file.
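For readers following the download question, the cache-then-download behavior described in this reply could look roughly like the sketch below. The directory layout comes from the comment above, and the 30s/60s timeouts are the values a later commit on this PR mentions (assumed here to mean connect/read). The class `ModelCache`, the method `ensureFile`, and the file name used in the usage note are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustrative sketch only; names are hypothetical, not the PR's API.
final class ModelCache {

  /** Cache layout from the comment above: {index path}/models/all-MiniLM-L6-v2/ */
  static Path modelDir(Path indexPath) {
    return indexPath.resolve("models").resolve("all-MiniLM-L6-v2");
  }

  /** Returns the cached file if present; otherwise downloads it once. */
  static Path ensureFile(Path indexPath, String fileName, URL source) throws IOException {
    Path target = modelDir(indexPath).resolve(fileName);
    if (Files.isRegularFile(target)) {
      return target; // already cached, no network access
    }
    Files.createDirectories(target.getParent());

    URLConnection conn = source.openConnection();
    conn.setConnectTimeout(30_000); // assumed: 30s connect timeout
    conn.setReadTimeout(60_000);    // assumed: 60s read timeout

    // Download to a sibling file first so a failed transfer never looks cached.
    Path partial = target.resolveSibling(fileName + ".part");
    try (InputStream in = conn.getInputStream()) {
      Files.copy(in, partial, StandardCopyOption.REPLACE_EXISTING);
    }
    Files.move(partial, target, StandardCopyOption.REPLACE_EXISTING);
    return target;
  }
}
```

A call such as `ModelCache.ensureFile(indexPath, "model.onnx", url)` would hit the network only on first use. An air-gapped deployment could pre-stage the file in the same directory, which is the concern raised later in this thread.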

Expand single-line if blocks in detectLang() to satisfy ESLint brace-style rule, and add ASF license header to embedding-search.md to pass Apache RAT audit. JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411

- Fix table boosting bug: results now re-sorted by boosted score
- Add connect/read timeouts to model download (30s/60s)
- Atomic index persistence: write to temp file, then rename
- Strip <B> highlight tags from LuceneSearch results in both UIs
- Hide language badge for unknown content types (return '' not 'text')
- Remove unused SNIPPET_LENGTH constant
- Share model directory across test methods to avoid 86MB re-download

JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411
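The "write to temp file, then rename" persistence fix in this commit follows a common pattern, sketched below with `java.nio.file`. The class `AtomicIndexWriter` is hypothetical, and the fallback to a non-atomic move is an assumption about how one might handle file systems that do not support atomic renames.

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustrative sketch only; not the PR's persistence code.
final class AtomicIndexWriter {

  /** Writes indexBytes to target so readers never observe a half-written file. */
  static void write(Path target, byte[] indexBytes) throws IOException {
    Path dir = target.toAbsolutePath().getParent();
    Files.createDirectories(dir);

    // The temp file lives in the same directory so the rename stays on one file system.
    Path tmp = Files.createTempFile(dir, target.getFileName().toString(), ".tmp");
    try {
      Files.write(tmp, indexBytes);
      try {
        Files.move(tmp, target,
            StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING);
      } catch (AtomicMoveNotSupportedException e) {
        // Assumed fallback: a plain replace is still better than writing in place.
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
      }
    } finally {
      Files.deleteIfExists(tmp); // no-op after a successful move
    }
  }
}
```

If the process dies mid-write, the previous index file is left untouched and only a stray `.tmp` file remains.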

jongyoul · 2026-04-20T05:34:26Z

I think it's better to have it by default as we need to assume the environment not to download it dynamically. Moreover, don't we need to wait until it's downloaded when starting the server?

kkalyan · 2026-04-20T05:39:11Z

> I think it's better to have it by default as we need to assume the environment not to download it dynamically. Moreover, don't we need to wait until it's downloaded when starting the server?

Thank you @jongyoul. You're right — downloading at startup is problematic for production/air-gapped environments.
A couple of questions so I get this right:

1. Should the model be bundled in the distribution (~86MB), or would a configurable local path (zeppelin.search.model.path) where admins pre-stage the model be better?
2. If the model isn't found, should Zeppelin fall back to LuceneSearch with a warning, or fail fast?
3. Would an optional helper script (bin/install-search-model.sh) be acceptable for the download convenience?

Happy to implement whichever direction you prefer.

Kalyan Kanuri added 3 commits April 19, 2026 07:45

docs: Update embedding search documentation

e2393e5


kkalyan changed the title ~~Zeppelin 6411 semantic search~~ [feat] Semantic search for Zeppelin Apr 19, 2026

chore: Revert unrelated package-lock.json change

e2d0cc7

kkalyan changed the title ~~[feat] Semantic search for Zeppelin~~ [ZEPPELIN-6411] Semantic search for Zeppelin Apr 19, 2026

jongyoul requested a review from Copilot April 20, 2026 01:00

Copilot started reviewing on behalf of jongyoul April 20, 2026 01:01

Copilot AI reviewed Apr 20, 2026


Kalyan Kanuri added 2 commits April 19, 2026 22:17

fix: Resolve CI failures for ESLint brace-style and RAT license check

254f3de


kkalyan wants to merge 6 commits into apache:master from kkalyan:ZEPPELIN-6411-semantic-search


3 participants
