Description
Feature Description
As of today, the Elasticsearch code search uses the default analyzer when indexing source code contents. This implementation uses whitespace to break the text into tokens.
I feel this approach is not particularly suitable for source code search. To illustrate the issue, let us consider the code snippet below:
public baz(Foo foo) {
    return foo.bar();
}
It is reasonable to expect that searching for bar would return the code above. As of today, however, this is not the case: ES treats foo.bar() as a single token, so it does not match the query bar.
I suggest we use the pattern tokenizer instead. It uses a regular expression to separate tokens. By default, it treats any run of non-word characters as a token separator. In that case, the snippet foo.bar() would yield two tokens, foo and bar, and the second token would match the query.
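To make the difference concrete, here is a rough sketch in Python. It only simulates the two splitting strategies with the standard library; it does not call Elasticsearch. The settings dictionary shows what an index configuration using the pattern tokenizer could look like; the names code_analyzer and code_tokenizer are made up for illustration, and \W+ is the tokenizer's documented default pattern.

```python
import re

# Hypothetical index settings using the pattern tokenizer.
# "code_analyzer" and "code_tokenizer" are illustrative names, not existing ones.
INDEX_SETTINGS = {
    "analysis": {
        "tokenizer": {
            "code_tokenizer": {"type": "pattern", "pattern": r"\W+"}
        },
        "analyzer": {
            "code_analyzer": {"type": "custom", "tokenizer": "code_tokenizer"}
        },
    }
}

SNIPPET = "public baz(Foo foo) {\n    return foo.bar();\n}"


def whitespace_tokens(text):
    """Split on whitespace only, as described for the current behaviour."""
    return text.split()


def pattern_tokens(text, pattern=r"\W+"):
    """Split on runs of non-word characters, dropping empty tokens."""
    return [token for token in re.split(pattern, text) if token]


if __name__ == "__main__":
    print(whitespace_tokens(SNIPPET))
    print(pattern_tokens(SNIPPET))
```

With whitespace splitting, bar only appears inside the larger token foo.bar(); so a search for bar finds nothing. With the pattern split, bar is a token of its own.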
What do you guys think?
Screenshots
No response