[ZEPPELIN-6411] Semantic search for Zeppelin#5218
kkalyan wants to merge 6 commits into apache:master
Add EmbeddingSearch, a new SearchService implementation that enables natural language search across notebooks using ONNX-based sentence embeddings (all-MiniLM-L6-v2).

Disabled by default, enabled with: zeppelin.search.semantic.enable = true

Key improvements over keyword search:
- Understands meaning, not just exact keywords
- Indexes paragraph output (table data, text results)
- Strips interpreter prefixes for cleaner matching
- Zero external services: runs entirely in-process

JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411
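The commit message mentions stripping interpreter prefixes before matching. A minimal sketch of that idea follows, assuming a paragraph starts with a directive such as `%sql` or `%spark.pyspark`, optionally followed by parenthesized properties. The class name `InterpreterPrefix` and the regular expression are hypothetical and are not taken from the PR.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not the PR's code: removes a leading interpreter directive. */
final class InterpreterPrefix {
  // Assumed directive shape: "%name", "%group.name" or "%name(key=value)" at the very start.
  private static final Pattern PREFIX =
      Pattern.compile("^\\s*%[A-Za-z0-9_.]+(?:\\([^)]*\\))?\\s*");

  /** Returns the paragraph text without its leading directive; null becomes "". */
  static String strip(String paragraphText) {
    if (paragraphText == null) {
      return "";
    }
    return PREFIX.matcher(paragraphText).replaceFirst("");
  }
}
```

Because the pattern is anchored at the start of the text, a `%` that appears later (for example a modulo operator) is left untouched.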
Add EmbeddingSearch, a new SearchService implementation that enables natural language search across notebooks using ONNX-based sentence embeddings (all-MiniLM-L6-v2).

Disabled by default, enabled with: zeppelin.search.semantic.enable = true

Key improvements over keyword search:
- Understands meaning, not just exact keywords
- Indexes paragraph output (table data, text results)
- Extracts and boosts SQL table names (FROM/JOIN)
- Two-phase search: discover relevant tables, then boost matches
- Strips interpreter prefixes for cleaner matching
- Zero external services: runs entirely in-process

Frontend improvements (both Angular and Classic UI):
- Search results show SQL code, output data, and table names in separate styled blocks instead of a single code editor
- Language badges (sql/python/md) on search result cards

New files:
- EmbeddingSearch.java: core implementation
- EmbeddingSearchTest.java: 11 tests including semantic validation
- docs/embedding-search.md: architecture documentation

JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411
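The table-name extraction from FROM/JOIN clauses could look roughly like the sketch below. It is an illustration only: the class `SqlTableExtractor`, the regular expression, and the lower-casing of names are assumptions, and the PR's actual extraction logic may differ.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not the PR's code: pulls table names out of SQL text. */
final class SqlTableExtractor {
  // Captures the (optionally dotted) identifier right after FROM or JOIN.
  private static final Pattern TABLE_REF = Pattern.compile(
      "\\b(?:FROM|JOIN)\\s+([A-Za-z_][A-Za-z0-9_]*(?:\\.[A-Za-z_][A-Za-z0-9_]*)*)",
      Pattern.CASE_INSENSITIVE);

  /** Returns the distinct table names in order of first appearance, lower-cased. */
  static Set<String> extractTables(String sql) {
    Set<String> tables = new LinkedHashSet<>();
    Matcher m = TABLE_REF.matcher(sql);
    while (m.find()) {
      tables.add(m.group(1).toLowerCase(Locale.ROOT));
    }
    return tables;
  }
}
```

A regex like this is deliberately naive: it skips subqueries (`FROM (`) but would also pick up the column in `EXTRACT(day FROM ts)`, so a real implementation would likely need extra filtering.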
Add two-phase search, table extraction, output indexing, frontend changes, and live indexing test to documentation. JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411
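For readers unfamiliar with the two-phase search named in these commits, the sketch below shows one possible reading of "discover relevant tables, then boost matches". It assumes that phase one takes the tables referenced by the top-K hits and that phase two adds a fixed bonus to every hit touching one of those tables. The `Hit` type, the `topK` and `boost` parameters, and the additive bonus are all hypothetical, not the PR's implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch, not the PR's code: table-aware re-ranking of search hits. */
final class TwoPhaseRanker {
  /** Hypothetical hit: paragraph id, similarity score, and the tables its SQL references. */
  static final class Hit {
    final String id;
    final double score;
    final Set<String> tables;

    Hit(String id, double score, Set<String> tables) {
      this.id = id;
      this.score = score;
      this.tables = tables;
    }
  }

  private static final Comparator<Hit> BY_SCORE_DESC =
      Comparator.comparingDouble((Hit h) -> h.score).reversed();

  /** Assumes topK >= 0; returns a new list sorted by boosted score, highest first. */
  static List<Hit> rank(List<Hit> hits, int topK, double boost) {
    List<Hit> sorted = new ArrayList<>(hits);
    sorted.sort(BY_SCORE_DESC);

    // Phase 1 (assumed): the tables used by the top-K hits count as relevant.
    Set<String> relevant = new HashSet<>();
    for (Hit h : sorted.subList(0, Math.min(topK, sorted.size()))) {
      relevant.addAll(h.tables);
    }

    // Phase 2: add the bonus to every hit that references a relevant table.
    List<Hit> boosted = new ArrayList<>();
    for (Hit h : sorted) {
      boolean touches = !Collections.disjoint(h.tables, relevant);
      boosted.add(touches ? new Hit(h.id, h.score + boost, h.tables) : h);
    }

    // Re-sort, because boosting can change the relative order of hits.
    boosted.sort(BY_SCORE_DESC);
    return boosted;
  }
}
```

The final sort matters: without it, a boosted hit keeps its pre-boost position and the bonus has no visible effect on ranking.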
Pull request overview
This PR introduces an optional semantic notebook search implementation (EmbeddingSearch) that uses ONNX-based sentence embeddings to match queries by meaning (plus output indexing and SQL table boosting), and updates both Classic and Angular UIs to render richer search results.
Changes:
- Add `EmbeddingSearch` (ONNX Runtime + DJL tokenizer) with binary persistence and live indexing.
- Add semantic-search config flag (`zeppelin.search.semantic.enable`) and wire the server to select Lucene vs semantic search.
- Update Classic + Angular search result rendering (code/output/tables blocks) and adjust TypeScript typing/build settings.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| zeppelin-zengine/src/main/java/org/apache/zeppelin/search/EmbeddingSearch.java | New semantic search service (model download, embedding, indexing, persistence, query ranking). |
| zeppelin-zengine/src/test/java/org/apache/zeppelin/search/EmbeddingSearchTest.java | New gated tests for semantic indexing/query behavior. |
| zeppelin-zengine/src/main/java/org/apache/zeppelin/conf/ZeppelinConfiguration.java | Add config accessor + conf var for semantic search enablement. |
| zeppelin-server/src/main/java/org/apache/zeppelin/server/ZeppelinServer.java | Bind EmbeddingSearch when semantic search is enabled. |
| zeppelin-zengine/pom.xml | Add ONNX Runtime + DJL tokenizers dependencies. |
| zeppelin-web/src/app/search/result-list.html | Classic UI search results layout changes. |
| zeppelin-web/src/app/search/result-list.controller.js | Classic UI result parsing for code/output/tables + language badge. |
| zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.ts | Angular UI result parsing + simplified rendering (no Monaco/highlighting). |
| zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.html | Angular UI template to show code/output/tables and badge. |
| zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.less | Angular UI styling for new result layout. |
| zeppelin-web-angular/tsconfig.base.json | TS compiler option changes. |
| zeppelin-web-angular/projects/zeppelin-sdk/tsconfig.json | TS compiler option changes for SDK build. |
| zeppelin-web-angular/src/app/utility/get-keyword-positions.ts | Tighten type for positions. |
| zeppelin-web-angular/src/app/share/run-scripts/run-scripts.directive.ts | Type annotations / casts for script execution logic. |
| zeppelin-web-angular/src/app/services/save-as.service.ts | Type annotation for binaryData. |
| zeppelin-web-angular/src/app/pages/workspace/notebook/paragraph/code-editor/code-editor.component.ts | Type annotation for newDecorations. |
| zeppelin-web-angular/src/app/pages/workspace/notebook/notebook.component.ts | Safer optional chaining on permissions access. |
| zeppelin-web-angular/src/app/pages/workspace/credential/credential.component.ts | Type cast for destructuring credentials. |
| docs/embedding-search.md | New documentation for semantic search design and usage. |
| NOTICE | Add attributions for ONNX Runtime and DJL tokenizers. |
@kkalyan Thank you for the contribution. Could you please check the CI first and fix it? Moreover, I will review it but I have a simple question. Do we download the model when we start the server?
Hi @jongyoul - Yes, the model (~86MB) is downloaded on first start when
Expand single-line if blocks in detectLang() to satisfy ESLint brace-style rule, and add ASF license header to embedding-search.md to pass Apache RAT audit. JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411
- Fix table boosting bug: results now re-sorted by boosted score
- Add connect/read timeouts to model download (30s/60s)
- Atomic index persistence: write to temp file, then rename
- Strip <B> highlight tags from LuceneSearch results in both UIs
- Hide language badge for unknown content types (return '' not 'text')
- Remove unused SNIPPET_LENGTH constant
- Share model directory across test methods to avoid 86MB re-download

JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-6411
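The "write to temp file, then rename" pattern from this commit can be sketched with `java.nio.file` as shown below. The class name `AtomicIndexWriter` and the byte-array payload are hypothetical; the PR's own persistence code and file format are not shown in this conversation.

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper, not the PR's code: replaces a file without exposing partial writes. */
final class AtomicIndexWriter {
  static void write(Path target, byte[] data) throws IOException {
    // Create the temp file next to the target so the rename stays on one filesystem.
    Path dir = target.toAbsolutePath().getParent();
    Path tmp = Files.createTempFile(dir, target.getFileName().toString(), ".tmp");
    try {
      Files.write(tmp, data);
      try {
        Files.move(tmp, target,
            StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING);
      } catch (AtomicMoveNotSupportedException e) {
        // Fallback for filesystems without atomic rename; no longer atomic.
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
      }
    } finally {
      // No-op after a successful move; cleans up if writing or moving failed.
      Files.deleteIfExists(tmp);
    }
  }
}
```

With this approach a crash mid-write leaves at most a stray `.tmp` file, while readers see either the old index or the complete new one.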
I think it's better to have it by default as we need to assume the environment not to download it dynamically. Moreover, don't we need to wait until it's downloaded when starting the server?
Thank you @jongyoul. You're right — downloading at startup is problematic for production/air-gapped environments.
Happy to implement whichever direction you prefer.
What is this PR for?
Added `EmbeddingSearch`, a new `SearchService` implementation that enables natural language search across Zeppelin notebooks using ONNX-based sentence embeddings (all-MiniLM-L6-v2).
Disabled by default, enabled with `zeppelin.search.semantic.enable = true`.

The problem:
Zeppelin's built-in search uses Lucene's keyword matching, which works well for exact terms but falls short for the way analysts actually search.
A user looking for "yesterday's spending" gets zero results, even though their notebooks contain `SELECT sum(cost) WHERE date = current_date - interval '1' day`. The words don't match, so Lucene can't find it.
This PR adds EmbeddingSearch, an alternative SearchService that uses sentence embeddings (all-MiniLM-L6-v2 via ONNX Runtime) to match by meaning instead of keywords. It runs entirely in-process with no external services required.
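Matching by meaning with sentence embeddings usually comes down to comparing vectors. The sketch below assumes cosine similarity between a query embedding and each paragraph embedding; the PR description does not state which similarity measure `EmbeddingSearch` uses, and the class `CosineSimilarity` and the toy vectors are hypothetical.

```java
/** Hypothetical sketch, not the PR's code: ranks embeddings by cosine similarity. */
final class CosineSimilarity {
  /** Cosine of the angle between two vectors; 0.0 if either vector is all zeros. */
  static double cosine(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vectors must have the same length");
    }
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    if (normA == 0.0 || normB == 0.0) {
      return 0.0;
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  /** Index of the document embedding most similar to the query, or -1 if there are none. */
  static int bestMatch(float[] query, float[][] docs) {
    int best = -1;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (int i = 0; i < docs.length; i++) {
      double score = cosine(query, docs[i]);
      if (score > bestScore) {
        bestScore = score;
        best = i;
      }
    }
    return best;
  }
}
```

In this picture, "yesterday's spending" and the SQL paragraph share no keywords, yet their embeddings can still point in similar directions and score highly.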
Beyond semantic matching, EmbeddingSearch addresses other gaps in notebook search:
What type of PR is it?
Feature
Todos
What is the Jira issue?
How should this be tested?
Automated tests:
Screenshots (if appropriate)
Semantic Search with New UI


Semantic Search with Classic UI
Questions: