Using as a "traditional" search engine? #370

BarrySmith · 2026-02-07T15:06:50Z


This is an amazing technical contribution, but it appears to be focused entirely on gathering data for use with embeddings/RAG. Is that assessment accurate?

After using both RAG and "classical" website search engines (for example, the one that comes with Sphinx websites), I am convinced that "classical" non-embedding-based text search engines are often much better than RAG, both for presenting results directly to humans and for feeding into LLM prompts. So my question is: have you considered adding classical search technology to the data you gather? What I mean by this is providing APIs in multiple languages for "searching" the processed data, giving developers the freedom to display the results as they want or to feed the results into further pipelines. (For example, though Sphinx websites are great at search, it is all hardwired directly into the webpage environment, and there is no API for calling the search outside of that environment, for example from pure JavaScript or Python.) I, for one, would start using such a search API immediately if you provided one. Or am I missing something you already provide and just don't emphasize in your front-facing documentation?

Thanks

Goldziher · 2026-02-08T10:01:54Z

Maintainer

@BarrySmith -- thanks for the proposal. Can you explain more what you would like to see / have?

1 reply

BarrySmith Feb 8, 2026
Author

Thanks for responding. Here, my naiveté will show, so I will try with an example. petsc.org has a search box in the upper corner where you type in a few words, and it returns relevant pages using "classical" search techniques. I find the results very useful. We implemented the webpages with https://www.sphinx-doc.org/en/master/. If you dig into that code, you will find a file _static/searchtools.js (developed by the Sphinx team), which begins:

/*
 * searchtools.js
 * ~~~~~~~~~~~~~~~~
 *
 * Sphinx JavaScript utilities for the full-text search.
 *
 * :copyright: Copyright 2007-2024 by the Sphinx team, see AUTHORS.
 * :license: BSD, see LICENSE for details.
 *
 */

which has things like

 _parseQuery: (query) => {
    // stem the search terms and add them to the correct list
    const stemmer = new Stemmer();
    const searchTerms = new Set();
    const excludedTerms = new Set();
    const highlightTerms = new Set();
    const objectTerms = new Set(splitQuery(query.toLowerCase().trim()));
    splitQuery(query.trim()).forEach((queryTerm) => {
      const queryTermLower = queryTerm.toLowerCase();

      // maybe skip this "word"
      // stopwords array is from language_data.js
      if (
        stopwords.indexOf(queryTermLower) !== -1 ||
        queryTerm.match(/^\d+$/)
      )
        return;

      // stem the word
      let word = stemmer.stemWord(queryTermLower);
      // select the correct list
      if (word[0] === "-") excludedTerms.add(word.substr(1));
      else {
        searchTerms.add(word);
        highlightTerms.add(queryTermLower);
      }
    });

 /**
  * execute search (requires search index to be loaded)
  */
 _performSearch: (query, searchTerms, excludedTerms, highlightTerms, objectTerms) => {
    const filenames = Search._index.filenames;
    const docNames = Search._index.docnames;
    const titles = Search._index.titles;
    const allTitles = Search._index.alltitles;
    const indexEntries = Search._index.indexentries;

    // Collect multiple result groups to be sorted separately and then ordered.
    // Each is an array of [docname, title, anchor, descr, score, filename].
    const normalResults = [];
    const nonMainIndexResults = [];

    _removeChildren(document.getElementById("search-progress"));

    const queryLower = query.toLowerCase().trim();
    for (const [title, foundTitles] of Object.entries(allTitles)) {

etc
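
The query-parsing step quoted above does not itself depend on a browser. As a rough sketch of that one step only, the function below skips stopwords and purely numeric tokens and treats a leading "-" as an exclusion. It makes several assumptions that are not Sphinx's: `parseQuery` and `STOPWORDS` are hypothetical names, the stopword list is a tiny placeholder, tokens are split on whitespace rather than with Sphinx's `splitQuery`, and the default "stemmer" is the identity function.

```javascript
// Browser-free sketch of the query-parsing step quoted above.
// STOPWORDS and the identity stemmer are placeholders, not Sphinx's own data.
const STOPWORDS = new Set(["a", "and", "in", "of", "the", "to"]);

function parseQuery(query, stem = (w) => w) {
  const searchTerms = new Set();
  const excludedTerms = new Set();
  for (const raw of query.trim().split(/\s+/)) {
    if (raw === "") continue;
    const lower = raw.toLowerCase();
    // Skip stopwords and purely numeric tokens.
    if (STOPWORDS.has(lower) || /^\d+$/.test(lower)) continue;
    // A leading "-" marks a term to exclude from the results.
    if (lower.startsWith("-")) {
      if (lower.length > 1) excludedTerms.add(stem(lower.slice(1)));
    } else {
      searchTerms.add(stem(lower));
    }
  }
  return { searchTerms, excludedTerms };
}

const parsed = parseQuery("The Krylov solver -deprecated 2024");
console.log([...parsed.searchTerms], [...parsed.excludedTerms]);
```

The harder part to detach from the browser is the search itself, since it reads the prebuilt index through `Search._index` and touches `document` directly.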

I tried to use this search software outside a browser, but had to give up because it is so tightly integrated with how the document is displayed in a web browser.  

Since you already have a fantastic infrastructure for document processing, I figured you might be able to implement an API in Rust, similar to Sphinx's, that could be used in many projects, both AI- and non-AI-related. Perhaps I am missing some available software, but I see a big gap in useful open source search infrastructure that you could perhaps fill.

Goldziher · 2026-02-08T17:56:51Z

Maintainer

I see. Kreuzberg is focused on text extraction and what's called "document intelligence". It's not a search engine in this regard; it's adjacent, but not what you are looking for. You can integrate Kreuzberg as part of a pipeline for full-text search. There are libraries for this in TS/JS, Python, Rust, and many others.
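
To illustrate the search half of such a pipeline, here is a minimal sketch of a classical inverted index, assuming the text has already been extracted. The `docs` object is a hypothetical stand-in for extractor output and nothing here is Kreuzberg's API; `tokenize`, `buildIndex`, and `search` are illustrative names, and the scoring (summed term counts, all terms required) is deliberately naive compared with what a real full-text search library would do.

```javascript
// Minimal inverted-index sketch over already-extracted text.
// The input is a plain { docId: text } object; it is NOT Kreuzberg's API.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function buildIndex(docs) {
  // term -> Map(docId -> number of occurrences)
  const index = new Map();
  for (const [docId, text] of Object.entries(docs)) {
    for (const term of tokenize(text)) {
      if (!index.has(term)) index.set(term, new Map());
      const postings = index.get(term);
      postings.set(docId, (postings.get(docId) ?? 0) + 1);
    }
  }
  return index;
}

function search(index, query) {
  const terms = [...new Set(tokenize(query))];
  if (terms.length === 0) return [];
  const scores = new Map();
  // Start from documents containing the first term, then require the rest.
  for (const [docId, count] of index.get(terms[0]) ?? []) scores.set(docId, count);
  for (const term of terms.slice(1)) {
    const postings = index.get(term) ?? new Map();
    for (const docId of [...scores.keys()]) {
      if (postings.has(docId)) scores.set(docId, scores.get(docId) + postings.get(docId));
      else scores.delete(docId);
    }
  }
  // Highest total term count first; ties broken by document id.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .map(([docId, score]) => ({ docId, score }));
}

// Hypothetical extracted documents, for illustration only.
const docs = {
  "manual.pdf": "Krylov solver options. The solver supports preconditioners.",
  "faq.html": "How do I choose a solver?",
  "notes.txt": "Mesh partitioning notes.",
};
console.log(search(buildIndex(docs), "solver options"));
```

Because the index is a plain in-memory structure, the same functions run in Node or in a browser, and the results can be displayed or passed on to another pipeline stage.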

1 reply

BarrySmith Feb 8, 2026
Author

Thanks for taking a look at my suggestion.
