Use a more sane tokenizer for source code search #32220

@bsofiato

Feature Description

As of today, the Elasticsearch-based search uses the default analyzer when indexing source code contents. This implementation splits tokens on whitespace.

I feel this approach is not particularly suitable for source code search. To illustrate the issue, let us consider the code snippet below:

public baz(Foo foo) {
   return foo.bar();
}

It is fair to expect a search for bar to return the code above. As of today, however, this is not the case: ES treats foo.bar() as a single token, so it does not match the query bar.
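The behavior described above can be reproduced outside Elasticsearch. The following is a rough Python approximation of whitespace-only tokenization, not the actual analyzer; `whitespace_tokens` is a hypothetical helper name introduced for illustration.

```python
SNIPPET = """public baz(Foo foo) {
   return foo.bar();
}"""


def whitespace_tokens(text: str) -> list[str]:
    """Split on runs of whitespace only and lowercase each token."""
    return [token.lower() for token in text.split()]


tokens = whitespace_tokens(SNIPPET)

# The call expression survives as one token, punctuation included,
# so an exact-term lookup for "bar" finds nothing.
print(tokens)
print("bar" in tokens)
```

Under this assumption, the method name is only reachable through a prefix, wildcard, or full-expression query, which is not how most people search code.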

I suggest we use the pattern tokenizer instead. It uses a regular expression to separate tokens. By default, it treats any run of non-word characters (\W+) as a token separator. In that case, the snippet foo.bar() would yield two tokens, foo and bar, and the second token would match the given query.
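As a sketch of what this proposal could look like, the example below approximates the `\W+` split with Python's `re` module and shows index settings that select the pattern tokenizer. The names `pattern_tokens`, `code_analyzer`, and `code_tokenizer` are hypothetical, the added `lowercase` filter is an assumption, and Python's `\W` is Unicode-aware while the Java regex engine used by Elasticsearch is ASCII-based by default, so results can differ for non-ASCII identifiers.

```python
import json
import re

NON_WORD = re.compile(r"\W+")  # same pattern as the tokenizer's documented default


def pattern_tokens(text: str) -> list[str]:
    """Split on runs of non-word characters, dropping empty pieces."""
    return [token.lower() for token in NON_WORD.split(text) if token]


print(pattern_tokens("foo.bar()"))

# Hypothetical index settings wiring a pattern tokenizer into a custom analyzer.
INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "code_tokenizer": {"type": "pattern", "pattern": r"\W+"}
            },
            "analyzer": {
                "code_analyzer": {
                    "type": "custom",
                    "tokenizer": "code_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    }
}

print(json.dumps(INDEX_SETTINGS, indent=2))
```

Because `_` counts as a word character, this split keeps snake_case identifiers such as `foo_bar` whole while still separating `foo.bar()` into `foo` and `bar`.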

What do you guys think?

Screenshots

No response

Metadata

Assignees

No one assigned

    Labels

    type/proposal: The new feature has not been accepted yet but needs to be discussed first.