articles/search/search-query-partial-matching.md (31 additions & 53 deletions)
@@ -8,36 +8,39 @@ author: HeidiSteen
ms.author: heidist
ms.service: cognitive-search
ms.topic: conceptual
-ms.date: 01/12/2020
+ms.date: 01/14/2020
---
# Match on patterns and special characters (dashes)
-For queries that include special characters (`-, *, (, ), /, \, =`)or query patterns based on partial terms within a larger term, custom analyzers are typically needed to ensure that the index contains the expected content in the right format.
+For queries that include special characters (`-, *, (, ), /, \, =`), or for query patterns based on partial terms within a larger term, additional configuration steps are typically needed to ensure that the index contains the expected content, in the right format.
By default, a phone number like `"+1 (425) 703-6214"` is tokenized as
-`"1"`, `"425"`, `"703"`, `"6214"`. Searching on `"3-62"`, partial terms that include a dash, will fail because that content doesn't exist like that in the index.
+`"1"`, `"425"`, `"703"`, `"6214"`. As you can imagine, searching on `"3-62"`, a partial term that includes a dash, will fail because that content doesn't exist in that form in the index.
-When you need to search on partial strings or special characters, you can override the default analyzer with a custom analyzer that operates under simpler tokenization rules that keeps terms intact. Taking a step back, the approach looks like this:
+When you need to search on partial strings or special characters, you can override the default analyzer with a custom analyzer that operates under simpler tokenization rules that preserve whole terms. Taking a step back, the approach looks like this:
+ Choose a predefined analyzer or define a custom analyzer that produces the desired output
+ Assign the analyzer to the field
+ Build the index and test
This article walks you through each step. The approach described here can be used in other scenarios. Wildcard and regular expression queries also need whole terms as the basis for pattern matching.
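For instance, a regular expression query over whole terms might look like the following sketch, which uses the full Lucene syntax (the service name, index name, and api-version are placeholders):

```http
GET https://<SEARCH-SERVICE>.search.windows.net/indexes/<INDEX>/docs?search=/.*3-62.*/&queryType=full&api-version=2019-05-06
```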
+> [!TIP]
+> Evaluating analyzers requires frequent index rebuilds. You can make this step easier by using Postman and the REST APIs for [Create Index](https://docs.microsoft.com/rest/api/searchservice/create-index), [Delete Index](https://docs.microsoft.com/rest/api/searchservice/delete-index), and [Load Documents](https://docs.microsoft.com/rest/api/searchservice//addupdate-or-delete-documents), where the request body provides a small representative data set that you want to test.
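As a rough sketch of that rebuild cycle (the service name, index name, field definitions, and sample document are hypothetical; an `api-key` and `Content-Type: application/json` header are also required on each call):

```http
DELETE https://<SEARCH-SERVICE>.search.windows.net/indexes/test-index?api-version=2019-05-06

PUT https://<SEARCH-SERVICE>.search.windows.net/indexes/test-index?api-version=2019-05-06
{
  "name": "test-index",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true },
    { "name": "phoneNumber", "type": "Edm.String", "searchable": true, "analyzer": "whitespace" }
  ]
}

POST https://<SEARCH-SERVICE>.search.windows.net/indexes/test-index/docs/index?api-version=2019-05-06
{
  "value": [
    { "@search.action": "upload", "id": "1", "phoneNumber": "+1 (425) 703-6214" }
  ]
}
```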
## Choosing an analyzer
-Azure Cognitive Search uses the Standard Lucene analyzer by default. You can override this analyzer on a per-field basis to get different output in the index. The following analyzers are commonly used when you want to keep terms intact during indexing:
+When choosing an analyzer that produces whole-term tokens, the following analyzers are common choices:
| Analyzer | Behaviors |
|----------|-----------|
|[keyword](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/KeywordAnalyzer.html)| Content of the entire field is tokenized as a single term. |
|[whitespace](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/WhitespaceAnalyzer.html)| Separates on white spaces only. Terms that include dashes or other characters are treated as a single token. |
-|[custom analyzer](index-add-custom-analyzers.md)| (recommended) Create a custom analyzer so that you can specify both the tokenizer and token filter. A recommended combination is the keyword tokenizer with a lower-case token filter. When used by itself, the built-in keyword analyzer does not lower-case any upper-case text, which can cause queries to fail. Creating a custom analyzer give you a mechanism for adding the token filter. |
+|[custom analyzer](index-add-custom-analyzers.md)| (recommended) Creating a custom analyzer lets you specify both the tokenizer and the token filter. The previous analyzers must be used as-is; a custom analyzer lets you pick which tokenizers and token filters to use. A recommended combination is the keyword tokenizer with a lower-case token filter. By itself, the built-in [keyword](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/KeywordAnalyzer.html) analyzer does not lower-case any upper-case text, which can cause queries to fail. A custom analyzer gives you a mechanism for adding the lower-case token filter. |
-To help you choose an analyzer, test each one to view the tokens it emits. Using a web API test tool like Postman, create a request that calls the [Test Analyzer REST API](https://docs.microsoft.com/rest/api/searchservice/test-analyzer), passing in the analyzer and term. An existing service and index is required for this test.
+The best way to evaluate an analyzer is to use a web API test tool like Postman and the [Test Analyzer REST API](https://docs.microsoft.com/rest/api/searchservice/test-analyzer). Given an existing index and a field containing dashes or partial terms, you can try various analyzers over specific terms to see what tokens are emitted.
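As a sketch, a Test Analyzer request takes the analyzer name and sample text in the request body (the service name, index name, and api-version are placeholders):

```http
POST https://<SEARCH-SERVICE>.search.windows.net/indexes/<INDEX>/analyze?api-version=2019-05-06
{
  "text": "+1 (425) 703-6214",
  "analyzer": "standard"
}
```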
-1. Start with the Standard analyzer to understand the default behavior.
+1. Check the Standard analyzer to see how terms are tokenized by default.
```json
{
@@ -46,7 +49,7 @@ To help you choose an analyzer, test each one to view the tokens it emits. Using
}
```
-1. Evaluate the response to see how the text is tokenized within the index.
+1. Evaluate the response to see how the text is tokenized within the index. Notice how each term is lower-cased and broken up.
```json
{
@@ -81,7 +84,8 @@ To help you choose an analyzer, test each one to view the tokens it emits. Using
}
```
-1. Now the response consists of a single token, with dashes preserved as a part of the string. If you need to search on a pattern or a partial term, the query engine now has the basis for finding a match.
+1. Now the response consists of a single token, upper-cased, with dashes preserved as a part of the string. If you need to search on a pattern or a partial term, the query engine now has the basis for finding a match.
```json
{
@@ -96,19 +100,16 @@ To help you choose an analyzer, test each one to view the tokens it emits. Using
]
}
```
+> [!Important]
+> Be aware that query parsers often lower-case terms in a search expression when building the query tree. If you are using an analyzer that does not lower-case text inputs, and you are not getting expected results, this could be the reason.
-## Set up analyzers
-Gaining control over tokenization starts with switching out the default Standard Lucene analyzer for a built-in or [custom analyzer](index-add-custom-analyzers.md) that delivers minimal processing (typical when using advanced wildcard queries), or additional processing that generates more tokens.
+## Analyzer definitions
+Whether you are evaluating analyzers or moving forward with a specific configuration, you will need to specify the analyzer on the field definition, and possibly configure the analyzer itself if you are not using a built-in analyzer. When swapping analyzers, you typically need to rebuild the index (drop, recreate, and reload).
### Use built-in analyzers
-If you're using a built-in analyzer, you can reference it by name on an `analyzer` property of a field definition, with no additional configuration required:
-+ `keyword`
-+ `whitespace`
-+ `pattern`
+Built-in or predefined analyzers can be specified by name on an `analyzer` property of a field definition, with no additional configuration required. The following example demonstrates the use of the `whitespace` analyzer.
```json
{
@@ -117,14 +118,14 @@ If you're using a built-in analyzer, you can reference it by name on an `analyze
"key": false,
"retrievable": true,
"searchable": true,
-"analyzer": "keyword"
+"analyzer": "whitespace"
}
```
For more information about all available built-in analyzers, see [Predefined analyzers list](https://docs.microsoft.com/azure/search/index-add-custom-analyzers#predefined-analyzers-reference).
### Use custom analyzers
-A custom analyzer is a user-defined combination of tokenizer, tokenfilter, and possible configuration settings, giving you more control over the indexing process. The definition of a custom analyzer is specified in the index, and then referenced on a field definition.
+If you are using a [custom analyzer](index-add-custom-analyzers.md), define it in the index with a user-defined combination of tokenizer and token filter, with possible configuration settings. Next, reference it on a field definition, just as you would a built-in analyzer.
When the objective is whole-term tokenization, a custom analyzer that consists of a **keyword tokenizer** and **lower-case token filter** is recommended.
@@ -163,11 +164,16 @@ The following example illustrates a custom analyzer that provides the keyword to
```
> [!NOTE]
-> The keyword_v2 tokenizer and lowercase token filter are known to the system and using their default configurations, which is why you can reference them by name without having to define them first.
+> The `keyword_v2` tokenizer and `lowercase` token filter are known to the system and use their default configurations, which is why you can reference them by name without having to define them first.
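For reference, a minimal sketch of such a custom analyzer definition in the index (the analyzer name `whole_term_analyzer` is a made-up example):

```json
"analyzers": [
  {
    "name": "whole_term_analyzer",
    "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
    "charFilters": [],
    "tokenizer": "keyword_v2",
    "tokenFilters": [ "lowercase" ]
  }
]
```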
+## Tips and best practices
-### Add prefix and suffix token filters to generate partial strings
-A token filter adds additional processing over existing tokens in your index. The following example adds an EdgeNGramTokenFilter to make prefix matches faster. Additional tokens are generated for in 2-25 character combinations: (not only MS, MSF, MSFT, MSFT/, but also embedded/internal partial strings like SQL, SQL., SQL.2)
+### Tune query performance
+If you implement the recommended configuration that includes the keyword_v2 tokenizer and lower-case token filter, you might notice a decrease in query performance due to the additional token filter processing over existing tokens in your index.
+The following example adds an [EdgeNGramTokenFilter](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramTokenizer.html) to make prefix matches faster. Additional tokens are generated in 2-25 character combinations: not only MS, MSF, MSFT, MSFT/, but also embedded partial strings like SQL, SQL., SQL.2.
```json
{
@@ -206,34 +212,6 @@ A token filter adds additional processing over existing tokens in your index. Th
]
```
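As a rough sketch, an edge n-gram token filter of this kind might be declared as follows in the `tokenFilters` section of the index (the filter name, gram sizes, and side are assumptions):

```json
"tokenFilters": [
  {
    "name": "my_edgeNGram",
    "@odata.type": "#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
    "minGram": 2,
    "maxGram": 25,
    "side": "front"
  }
]
```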
-## Build and test the index
-Once the index definition and analyzer configuration work is done, your next step is to run [Create Index](https://docs.microsoft.com/rest/api/searchservice/create-index) on the service, and then import data. Index names must be unique. If the index already exists, you can either rename the index or delete and recreate it.
-### Advanced query patterns
-Once you have an index that contains terms in the correct format, you can specify patterns to find matching documents.
-[Wildcard](search-query-lucene-examples.md#example-7-wildcard-search) and [Regular expression (RegEx)](search-query-lucene-examples.md#example-6-regex) queries are often used to find patterns on content that is expressed as full tokens in an index.
-1. On the query request, add `querytype=full` to specify the full Lucene query syntax used for wildcard and RegEx queries.
-```http
-GET https://<SEARCH-SERVICE>.search.windows.net/indexes/<INDEX>/docs?search=*&query-type=true&api-version=2019-05-06
-````
-2. In the `search=` expression:
-+ For wildcard search, combine text with `*` or `?` wildcard characters
-+ For RegEx queries, enclose your pattern or term with `/`, such as `fieldCode:/SQL*Java-Ext/`
-> [!NOTE]
-> You might be inclined to also use `searchFields` as a field constraint, or set `searchMode=all` as an operator contraint, but in most cases you won't need either one. A regular expression query is typically sufficient for finding an exact match.
-## Additional customizations
-If changing the `analyzer` property doesn't produce expected results, explore these additional mechanisms.
### Use different analyzers for indexing and query processing
Analyzers are called during indexing and during query execution. It's common to use the same analyzer for both but you can configure custom analyzers for each workload. Analyzer overrides are specified in the [index definition](https://docs.microsoft.com/rest/api/searchservice/create-index) in an `analyzers` section, and then referenced on specific fields.
@@ -248,7 +226,7 @@ To specify role-specific analysis, you can set properties on the field for each
"searchAnalyzer":"standard",
```
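A field definition that assigns a different analyzer to each workload might look roughly like this sketch (the field name and custom analyzer name are hypothetical):

```json
{
  "name": "featureCode",
  "type": "Edm.String",
  "searchable": true,
  "indexAnalyzer": "whole_term_analyzer",
  "searchAnalyzer": "standard"
}
```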
-### Consider duplicating fields for query optimization
+### Duplicate fields for different scenarios
Another option leverages the per-field analyzer assignment to optimize for different scenarios. Specifically, you might define "featureCode" and "featureCodeRegex" to support regular full text search on the first, and advanced pattern matching on the second.
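A hedged sketch of that setup, using the two field names from the paragraph and a made-up custom analyzer name for the regex-oriented field:

```json
{
  "name": "featureCode",
  "type": "Edm.String",
  "searchable": true,
  "analyzer": "standard"
},
{
  "name": "featureCodeRegex",
  "type": "Edm.String",
  "searchable": true,
  "analyzer": "whole_term_analyzer"
}
```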