Skip to content

Commit 28a5fc4

Browse files
author
Luis Cabrera
authored
Impact of an analyzer on wildcard queries
Explaining a bit better how selecting an analyzer will impact wildcard search results
1 parent d64215b commit 28a5fc4

File tree

1 file changed

+10
-1
lines changed

1 file changed

+10
-1
lines changed

articles/search/query-lucene-syntax.md

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -179,7 +179,16 @@ Suffix matching, where `*` or `?` precedes the string (as in `search=/.*numeric.
179179
> [!NOTE]
180180
> As a rule, pattern matching is slow so you might want to explore alternative methods, such as edge n-gram tokenization that creates tokens for sequences of characters in a term. The index will be larger, but queries might execute faster, depending on the pattern construction and the length of strings you are indexing.
181181
>
182-
> During query parsing, queries that are formulated as prefix, suffix, wildcard, or regular expressions are passed as-is to the query tree, bypassing [lexical analysis](search-lucene-query-architecture.md#stage-2-lexical-analysis). Matches will only be found if the index contains the strings in the format your query specifies. In most cases, you will need an alternative analyzer during indexing that preserves string integrity so that partial term and pattern matching succeeds. For more information, see [Partial term search in Azure Cognitive Search queries](search-query-partial-matching.md).
182+
183+
### Impact of an analyzer on wildcard queries
184+
185+
During query parsing, queries that are formulated as prefix, suffix, wildcard, or regular expressions are passed as-is to the query tree, bypassing [lexical analysis](search-lucene-query-architecture.md#stage-2-lexical-analysis). Matches will only be found if the index contains the strings in the format your query specifies. In most cases, you will need an analyzer during indexing that preserves string integrity so that partial term and pattern matching succeeds. For more information, see [Partial term search in Azure Cognitive Search queries](search-query-partial-matching.md).
186+
187+
Consider a situation where you may want the search query 'terminat*' to return results that contain terms such as 'terminate', 'termination' and 'terminates'.
188+
189+
If you were to use the en.lucene (English Lucene) analyzer, it would apply aggressive stemming of each term. For example, 'terminate', 'termination', 'terminates' will all be tokenized down to the token 'termi' in your index. On the other side, terms in queries using wildcards or fuzzy search are not analyzed at all., so there would be no results that would match the 'terminat*' query.
190+
191+
On the other side, the Microsoft analyzers (in this case, the en.microsoft analyzer) are a bit more advanced and use lemmatization instead of stemming. This means that all generated tokens should be valid English words. For example, 'terminate', 'terminates' and 'termination' will mostly stay whole in the index, and would be a preferable choice for scenarios that depend a lot on wildcards and fuzzy search.
183192

184193
## <a name="bkmk_searchscoreforwildcardandregexqueries"></a> Scoring wildcard and regex queries
185194

0 commit comments

Comments
 (0)