---
title: Match patterns and special characters
titleSuffix: Azure Cognitive Search
description: Use wildcard and prefix queries to match on whole or partial terms in an Azure Cognitive Search query request. Hard-to-match patterns that include special characters can be resolved using full query syntax and custom analyzers.

manager: nitinme
author: HeidiSteen
ms.author: heidist
ms.service: cognitive-search
ms.topic: conceptual
ms.date: 01/14/2020
---
# Match on patterns and special characters (dashes)

For queries that include special characters (`-, *, (, ), /, \, =`), or for query patterns based on partial terms within a larger term, additional configuration steps are typically needed to ensure that the index contains the expected content, in the right format.

By default, a phone number like `+1 (425) 703-6214` is tokenized as `"1"`, `"425"`, `"703"`, `"6214"`. As you can imagine, searching on `"3-62"`, a partial term that includes a dash, will fail because that content doesn't actually exist in the index.

When you need to search on partial strings or special characters, you can override the default analyzer with a custom analyzer that operates under simpler tokenization rules, preserving whole terms. Whole terms are necessary when query strings include parts of a term or special characters. Taking a step back, the approach looks like this:
| 21 | + |
| 22 | ++ Choose a predefined analyzer or define a custom analyzer that produces the desired output |
| 23 | ++ Assign the analyzer to the field |
| 24 | ++ Build the index and test |
| 25 | + |
| 26 | +This article walks you through these tasks. The approach described here is useful in other scenarios: wildcard and regular expression queries also need whole terms as the basis for pattern matching. |
| 27 | + |
| 28 | +> [!TIP] |
| 29 | +> Evaluating analyers is an iterative process that requires frequent index rebuilds. You can make this step easier by using Postman, the REST APIs for [Create Index](https://docs.microsoft.com/rest/api/searchservice/create-index), [Delete Index](https://docs.microsoft.com/rest/api/searchservice/delete-index),[Load Documents](https://docs.microsoft.com/rest/api/searchservice/addupdate-or-delete-documents), and [Search Documents](https://docs.microsoft.com/rest/api/searchservice/search-documents). For Load Documents, the request body should contain a small representative data set that you want to test (for example, a field with phone numbers or product codes). With these APIs in the same Postman collection, you can cycle through these steps quickly. |
| 30 | +

## Choosing an analyzer

When choosing an analyzer that produces whole-term tokens, the following analyzers are common choices:

| Analyzer | Behaviors |
|----------|-----------|
| [keyword](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/KeywordAnalyzer.html) | Content of the entire field is tokenized as a single term. |
| [whitespace](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/WhitespaceAnalyzer.html) | Separates on white spaces only. Terms that include dashes or other characters are treated as a single token. |
| [custom analyzer](index-add-custom-analyzers.md) | (recommended) While the predefined analyzers must be used as-is, a custom analyzer lets you pick which tokenizers and token filters to use. <br><br>A recommended combination is the [keyword tokenizer](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/KeywordTokenizer.html) with a [lower-case token filter](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilter.html). By itself, the predefined [keyword analyzer](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/KeywordAnalyzer.html) does not lower-case any upper-case text, which can cause queries to fail. A custom analyzer gives you a mechanism for adding the lower-case token filter. |

If you are using a web API test tool like Postman, you can add the [Test Analyzer REST call](https://docs.microsoft.com/rest/api/searchservice/test-analyzer) to inspect tokenized output. Given an existing index and a field containing dashes or partial terms, you can try various analyzers over specific terms to see what tokens are emitted.

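The numbered steps below show only the request body. A full Test Analyzer request is sketched here; the service name, index name, admin key, and `api-version` value are placeholders that you would replace with your own:

```http
POST https://[service name].search.windows.net/indexes/[index name]/analyze?api-version=2019-05-06
Content-Type: application/json
api-key: [admin key]

{
  "text": "SVP10-NOR-00",
  "analyzer": "standard"
}
```
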
1. Check the Standard analyzer to see how terms are tokenized by default.

    ```json
    {
      "text": "SVP10-NOR-00",
      "analyzer": "standard"
    }
    ```

1. Evaluate the response to see how the text is tokenized within the index. Notice how each term is lower-cased and broken up.

    ```json
    {
      "tokens": [
        {
          "token": "svp10",
          "startOffset": 0,
          "endOffset": 5,
          "position": 0
        },
        {
          "token": "nor",
          "startOffset": 6,
          "endOffset": 9,
          "position": 1
        },
        {
          "token": "00",
          "startOffset": 10,
          "endOffset": 12,
          "position": 2
        }
      ]
    }
    ```

1. Modify the request to use the `whitespace` or `keyword` analyzer:

    ```json
    {
      "text": "SVP10-NOR-00",
      "analyzer": "keyword"
    }
    ```

1. Now the response consists of a single token, upper-cased, with dashes preserved as a part of the string. If you need to search on a pattern or a partial term, the query engine now has the basis for finding a match.

    ```json
    {
      "tokens": [
        {
          "token": "SVP10-NOR-00",
          "startOffset": 0,
          "endOffset": 12,
          "position": 0
        }
      ]
    }
    ```

> [!IMPORTANT]
> Be aware that query parsers often lower-case terms in a search expression when building the query tree. If you are using an analyzer that does not lower-case text inputs, and you are not getting expected results, this could be the reason. The solution is to add a lower-case token filter.

## Analyzer definitions

Whether you are evaluating analyzers or moving forward with a specific configuration, you will need to specify the analyzer on the field definition, and possibly configure the analyzer itself if you are not using a built-in analyzer. When swapping analyzers, you typically need to rebuild the index (drop, recreate, and reload).

### Use built-in analyzers

Built-in or predefined analyzers can be specified by name in the `analyzer` property of a field definition, with no additional configuration required in the index. The following example demonstrates how you would set the `whitespace` analyzer on a field.

```json
{
  "name": "phoneNumber",
  "type": "Edm.String",
  "key": false,
  "retrievable": true,
  "searchable": true,
  "analyzer": "whitespace"
}
```

For more information about all available built-in analyzers, see [Predefined analyzers list](https://docs.microsoft.com/azure/search/index-add-custom-analyzers#predefined-analyzers-reference).

### Use custom analyzers

If you are using a [custom analyzer](index-add-custom-analyzers.md), define it in the index with a user-defined combination of tokenizer and token filters, plus any configuration settings those components need. Next, reference it in a field definition, just as you would a built-in analyzer.

When the objective is whole-term tokenization, a custom analyzer that consists of a **keyword tokenizer** and **lower-case token filter** is recommended.

+ The keyword tokenizer creates a single token for the entire contents of a field.
+ The lowercase token filter transforms upper-case letters into lower-case text. Query parsers typically lowercase any uppercase text inputs. Lowercasing homogenizes the inputs with the tokenized terms.

The following example illustrates a custom analyzer that provides the keyword tokenizer and a lowercase token filter.


```json
{
  "fields": [
    {
      "name": "accountNumber",
      "analyzer": "myCustomAnalyzer",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true,
      "retrievable": true,
      "sortable": false,
      "facetable": false
    }
  ],

  "analyzers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "myCustomAnalyzer",
      "charFilters": [],
      "tokenizer": "keyword_v2",
      "tokenFilters": ["lowercase"]
    }
  ],
  "tokenizers": [],
  "charFilters": [],
  "tokenFilters": []
}
```

> [!NOTE]
> The `keyword_v2` tokenizer and `lowercase` token filter are known to the system and use their default configurations, which is why you can reference them by name without having to define them first.
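
After rebuilding the index with this definition, you can verify the output by calling the Test Analyzer API with the custom analyzer's name instead of a built-in one. This request body is a sketch that assumes the `myCustomAnalyzer` definition above:

```json
{
  "text": "SVP10-NOR-00",
  "analyzer": "myCustomAnalyzer"
}
```

The response should now contain a single lower-cased token, `svp10-nor-00`, with the dashes preserved.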

## Tips and best practices

### Tune query performance

If you implement the recommended configuration that includes the keyword_v2 tokenizer and lower-case token filter, you might notice a decrease in query performance due to the additional token filter processing over existing tokens in your index.

The following example adds an [EdgeNGramTokenFilter](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html) to make prefix matches faster. Additional tokens are generated for 2-25 character combinations that include characters: MS, MSF, MSFT, MSFT/, MSFT/S, MSFT/SQ, MSFT/SQL, and so on. As you can imagine, the additional tokenization results in a larger index.

```json
{
  "fields": [
    {
      "name": "accountNumber",
      "analyzer": "myCustomAnalyzer",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true,
      "retrievable": true,
      "sortable": false,
      "facetable": false
    }
  ],

  "analyzers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "myCustomAnalyzer",
      "charFilters": [],
      "tokenizer": "keyword_v2",
      "tokenFilters": ["lowercase", "my_edgeNGram"]
    }
  ],
  "tokenizers": [],
  "charFilters": [],
  "tokenFilters": [
    {
      "@odata.type": "#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
      "name": "my_edgeNGram",
      "minGram": 2,
      "maxGram": 25,
      "side": "front"
    }
  ]
}
```
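
With the edge n-gram tokens in place, a short prefix can match as an ordinary term, without a trailing wildcard. The following Search Documents request body is a sketch; the field contents (an account number such as `MSFT/SQL`) are an assumption for illustration:

```json
{
  "search": "msft",
  "searchFields": "accountNumber",
  "select": "accountNumber"
}
```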

### Use different analyzers for indexing and query processing

Analyzers are called during indexing and during query execution. It's common to use the same analyzer for both, but you can configure custom analyzers for each workload. Analyzer overrides are specified in the [index definition](https://docs.microsoft.com/rest/api/searchservice/create-index) in an `analyzers` section, and then referenced by name on specific fields.

When custom analysis is only required during indexing, you can apply the custom analyzer to just indexing and continue to use the standard Lucene analyzer (or another analyzer) for queries.

To specify role-specific analysis, set the `indexAnalyzer` and `searchAnalyzer` properties on the field instead of the single `analyzer` property.

```json
"name": "featureCode",
"indexAnalyzer": "my_customanalyzer",
"searchAnalyzer": "standard",
```

### Duplicate fields for different scenarios

Another option leverages the per-field analyzer assignment to optimize for different scenarios. Specifically, you might define "featureCode" and "featureCodeRegex" to support regular full text search on the first, and advanced pattern matching on the second.

```json
{
  "name": "featureCode",
  "type": "Edm.String",
  "retrievable": true,
  "searchable": true,
  "analyzer": null
},
{
  "name": "featureCodeRegex",
  "type": "Edm.String",
  "retrievable": true,
  "searchable": true,
  "analyzer": "my_customanalyzer"
},
```
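
Queries can then target the pattern-friendly field with the full Lucene syntax, which supports regular expressions when `queryType` is set to `full`. The following request body is a sketch; the pattern and the `select` list are assumptions for illustration:

```json
{
  "queryType": "full",
  "search": "featureCodeRegex:/.*SQL.*/",
  "select": "featureCode"
}
```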

## Next steps

This article explains how analyzers can both cause and solve query problems. As a next step, take a closer look at analyzer impact on indexing and query processing. In particular, consider using the Analyze Text API to return tokenized output so that you can see exactly what an analyzer is creating for your index.

+ [Language analyzers](search-language-support.md)
+ [Analyzers for text processing in Azure Cognitive Search](search-analyzers.md)
+ [Analyze Text API (REST)](https://docs.microsoft.com/rest/api/searchservice/test-analyzer)
+ [How full text search works (query architecture)](search-lucene-query-architecture.md)