@MBkkt
Contributor

@MBkkt MBkkt commented Dec 15, 2023

  • Analyzer description
  • View examples
  • Inverted index examples
  • Explain how the search works internally?
  • Release notes

Upstream PRs

  • 3.10:
  • 3.11:
  • 3.12: arangodb/enterprise-preview:devel-nightly

@arangodb-docs-automation
Contributor

Deploy Preview Available Via
https://deploy-preview-392--docs-hugo.netlify.app

@cla-bot cla-bot bot added the cla-signed label Dec 15, 2023
@Simran-B
Contributor

The Analyzer seems to work differently than the one in Elastic, at least based on what they show in their blog post. They don't seem to create all of the n-grams, e.g. for avocado and n-gram size = 3, they would apparently add avo, oca, ado to the index but our Analyzer returns:

av, avo, voc, oca, cad, ado, do, o.

The first token seems to always be a prefix with ngramSize - 1? Then all the trigrams, and I suppose the rest of the strings down to a size of 1 are for suffix matching. I don't quite understand the first entry. If ngramSize is e.g. 5 but the input is shorter than this, the first two entries are identical. Is this intentional?

@MBkkt
Contributor Author

MBkkt commented Dec 22, 2023

@Simran-B

It returns all n-grams of the specified ngramSize, plus all suffix n-grams shorter than ngramSize.

It also changes the input text to \xFFtext\xFF. The byte \xFF is invalid in UTF-8 and cannot appear in any text, so the real output is:
\xFFav, avo, voc, oca, cad, ado, do\xFF, o\xFF

This is necessary to speed up queries whose maximum sub-pattern size is smaller than ngramSize, as well as prefix/suffix queries.
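
As a rough illustration of the tokenization described above, here is a minimal Python sketch. It is not the Analyzer's actual implementation: the function name wildcard_ngrams is hypothetical, the input is split per Unicode code point, and stopping the suffix n-grams before the lone \xFF marker is an assumption chosen to match the output quoted above. How the real Analyzer treats inputs shorter than ngramSize is not covered here.

```python
MARKER = b"\xff"  # byte that never occurs in valid UTF-8


def wildcard_ngrams(text: str, ngram_size: int) -> list[bytes]:
    """Hypothetical sketch: all n-grams of the marked text plus shorter suffixes."""
    # Wrap the input as \xFF text \xFF, keeping one list entry per character
    # so that multi-byte UTF-8 characters are never split.
    units = [MARKER] + [ch.encode("utf-8") for ch in text] + [MARKER]

    # Every n-gram of exactly ngram_size units.
    tokens = [
        b"".join(units[i:i + ngram_size])
        for i in range(len(units) - ngram_size + 1)
    ]

    # Suffix n-grams shorter than ngram_size, longest first.
    # Assumption: the suffix consisting of the marker alone is not emitted.
    for size in range(min(ngram_size - 1, len(units)), 1, -1):
        tokens.append(b"".join(units[-size:]))

    return tokens


print(wildcard_ngrams("avocado", 3))
```

For avocado with ngramSize 3, this sketch yields the same eight tokens as listed in the comment above.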

@MBkkt
Contributor Author

MBkkt commented Dec 22, 2023

@Simran-B

The Analyzer seems to work differently than the one in Elastic, at least based on what they show in their blog post. They don't seem to create all of the n-grams, e.g. for avocado and n-gram size = 3, they would apparently add avo, oca, ado to the index but our Analyzer returns:

That's just a simplification in the blog post.
If the Analyzer produced only avo, oca, ado for avocado, it would be impossible to quickly find it with a query like %cad%.

However, in the search phase it's possible to skip all tokens that overlap, except for the first and last n-gram.
So if you index every n-gram of avocado, it's possible to search for avocado with \xFFav, oca, do\xFF, and you don't need to search for avo, voc, cad, ado.
This can be better in some cases, but according to our measurements it is commonly slower (because the approximation query contains fewer distinct terms).
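
The two points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ArangoSearch query planner: the helper names all_ngrams and sparse_query_terms are hypothetical, and BLOG_TERMS simply hardcodes the three terms quoted from the blog post.

```python
MARKER = b"\xff"  # byte that never occurs in valid UTF-8

# The three terms the blog post shows for "avocado" (copied from the thread).
BLOG_TERMS = {b"avo", b"oca", b"ado"}


def all_ngrams(word: str, n: int) -> list[bytes]:
    """Every n-gram of \\xFF + word + \\xFF (hypothetical helper)."""
    units = [MARKER] + [ch.encode("utf-8") for ch in word] + [MARKER]
    return [b"".join(units[i:i + n]) for i in range(len(units) - n + 1)]


def sparse_query_terms(word: str, n: int) -> list[bytes]:
    """Non-overlapping n-grams starting at the first one, plus the last one."""
    grams = all_ngrams(word, n)
    picked = grams[::n]  # first n-gram, then every n-th (no overlap)
    if grams and (len(grams) - 1) % n != 0:
        picked.append(grams[-1])  # always keep the final n-gram
    return picked


indexed = set(all_ngrams("avocado", 3))

# The sub-pattern of %cad% is a term in the full index, but not in the reduced set.
print(b"cad" in indexed)
print(b"cad" in BLOG_TERMS)

# Reduced term set for an exact search of "avocado".
print(sparse_query_terms("avocado", 3))
```

With ngramSize 3, the reduced term set for avocado is \xFFav, oca, do\xFF, which matches the example in the comment. Whether the reduced or the full term set is faster depends on the data, as noted above.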

@Simran-B Simran-B self-assigned this Dec 22, 2023
Contributor Author

@MBkkt MBkkt left a comment

lgtm

@Simran-B Simran-B marked this pull request as ready for review February 5, 2024 13:18
@Simran-B
Contributor

Simran-B commented Feb 5, 2024

/generate

@Simran-B
Contributor

Simran-B commented Feb 6, 2024

/generate

@Simran-B
Contributor

Simran-B commented Feb 6, 2024

The API change described in #447 broke an example. The example erroneously tried to drop a collection that is part of a graph but dropping the example graph drops all graph collections anyway.

Another issue that surfaced is that curl examples that purposefully trigger an error need to make use of // xpError(...) or the new toolchain complains about an unexpected error. In the old toolchain, the correct example behavior was only ensured by assert() statements.

…--- to em dash

This interfered with ArangoSearch wildcard Analyzer examples in result tables where the verbatim -- needs to be displayed
@Simran-B
Contributor

Simran-B commented Feb 6, 2024

/commit

@cla-bot cla-bot bot removed the cla-signed label Feb 6, 2024
Contributor

@nerpaula nerpaula left a comment

LGTM

@nerpaula nerpaula merged commit 8515e34 into main Feb 7, 2024
@nerpaula nerpaula deleted the MBkkt-patch-1 branch February 7, 2024 10:48