Commit fe4d241

Merge pull request #280766 from mattgotteiner/matt/prefix-suffix
Azure Search: Update prefix and suffix matching docs
2 parents 2682a4b + 749fbf6 commit fe4d241

2 files changed: +79 −36

articles/search/query-lucene-syntax.md (2 additions, 2 deletions)

@@ -148,7 +148,7 @@ If you get syntax errors in your regular expression, review the [escape rules](#
 
 ## <a name="bkmk_wildcard"></a> Wildcard search
 
-You can use generally recognized syntax for multiple (`*`) or single (`?`) character wildcard searches. Full Lucene syntax supports prefix, infix, and suffix matching.
+You can use generally recognized syntax for multiple (`*`) or single (`?`) character wildcard searches. Full Lucene syntax supports prefix and infix matching. Use [regular expression](#bkmk_regex) syntax for suffix matching.
 
 Note that the Lucene query parser supports the use of these symbols with a single term, not a phrase.
 
@@ -163,7 +163,7 @@ You can combine operators in one expression. For example, `980?2*` matches on `9
 Suffix matching requires the regular expression forward slash `/` delimiters. Generally, you can't use a `*` or `?` symbol as the first character of a term without the `/`. It's also important to note that `*` behaves differently outside of regex queries: outside the regex forward slash `/` delimiters, `*` is a wildcard character that matches any series of characters, much like `.*` in regex. As an example, `search=/non.*al/` produces the same result set as `search=non*al`.
 
 > [!NOTE]
-> As a rule, pattern matching is slow, so you might want to explore alternative methods, such as edge n-gram tokenization that creates tokens for sequences of characters in a term. With n-gram tokenization, the index will be larger, but queries might execute faster, depending on the pattern construction and the length of strings you are indexing. For more information, see [Partial term search and patterns with special characters](search-query-partial-matching.md#tune-query-performance).
+> As a rule, pattern matching is slow, so you might want to explore alternative methods, such as edge n-gram tokenization that creates tokens for sequences of characters in a term. With n-gram tokenization, the index will be larger, but queries might execute faster, depending on the pattern construction and the length of strings you are indexing. For more information, see [Partial term search and patterns with special characters](search-query-partial-matching.md#optimizing-prefix-and-suffix-queries).
 >
 
 ### Effect of an analyzer on wildcard queries
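The equivalence claimed in this hunk between `search=non*al` and `search=/non.*al/` can be sketched locally with Python's `re` module. The sample terms below are hypothetical illustrations, not from the docs:

```python
import re

# Hypothetical sample terms to illustrate the equivalence claimed above:
# outside regex delimiters, `*` matches any series of characters, like `.*`.
terms = ["nonnumerical", "nonsensical", "numerical", "nationally"]

wildcard = "non*al"                  # as in: search=non*al
regex = wildcard.replace("*", ".*")  # as in: search=/non.*al/

# fullmatch mirrors matching a whole term against the pattern
matches = [t for t in terms if re.fullmatch(regex, t)]
print(matches)  # ['nonnumerical', 'nonsensical']
```

Both forms describe the same pattern; the difference in the service is only which query parser feature handles it.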

articles/search/search-query-partial-matching.md (77 additions, 34 deletions)

@@ -190,24 +190,24 @@ The following example illustrates a custom analyzer that provides the keyword to
 {
   "fields": [
     {
-      "name": "accountNumber",
-      "analyzer":"myCustomAnalyzer",
-      "type": "Edm.String",
-      "searchable": true,
-      "filterable": true,
-      "retrievable": true,
-      "sortable": false,
-      "facetable": false
+      "name": "accountNumber",
+      "analyzer":"myCustomAnalyzer",
+      "type": "Edm.String",
+      "searchable": true,
+      "filterable": true,
+      "retrievable": true,
+      "sortable": false,
+      "facetable": false
     }
   ],
 
   "analyzers": [
     {
-      "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
-      "name":"myCustomAnalyzer",
-      "charFilters":[],
-      "tokenizer":"keyword_v2",
-      "tokenFilters":["lowercase"]
+      "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
+      "name":"myCustomAnalyzer",
+      "charFilters":[],
+      "tokenizer":"keyword_v2",
+      "tokenFilters":["lowercase"]
     }
   ],
   "tokenizers":[],
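For context on the analyzer in this hunk, here's a minimal local sketch, not the service's implementation, of what `keyword_v2` plus the `lowercase` token filter produce for a field value (the sample account number is hypothetical):

```python
# A local sketch (not Azure's implementation) of the custom analyzer above:
# keyword_v2 emits the whole field value as a single token, and the
# lowercase token filter then normalizes it.
def analyze_keyword_lowercase(value: str) -> list[str]:
    token = value            # keyword tokenization: one token per value
    return [token.lower()]   # lowercase token filter

print(analyze_keyword_lowercase("AB-123-456"))  # ['ab-123-456']
```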
@@ -241,52 +241,95 @@ The previous sections explained the logic. This section steps through each API y
 
 For infix and suffix queries, such as querying "num" or "numeric" to find a match on "alphanumeric", use the full Lucene syntax and a regular expression: `search=/.*num.*/&queryType=full`
 
-## Tune query performance
+## Optimizing prefix and suffix queries
 
-If you implement the recommended configuration that includes the keyword_v2 tokenizer and lower-case token filter, you might notice a decrease in query performance due to the extra token filter processing over existing tokens in your index.
+Matching prefixes and suffixes using the default analyzer requires additional query features. Prefixes require [wildcard search](query-lucene-syntax.md#bkmk_wildcard), and suffixes require [regular expression search](query-lucene-syntax.md#bkmk_regex). Both of these features can reduce query performance.
 
-The following example adds an [EdgeNGramTokenFilter](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramTokenizer.html) to make prefix matches faster. Tokens are generated in 2-25 character combinations that include characters. Here's an example progression from two to seven tokens: MS, MSF, MSFT, MSFT/, MSFT/S, MSFT/SQ, MSFT/SQL.
+The following example adds an [`EdgeNGramTokenFilter`](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramTokenizer.html) to make prefix or suffix matches faster. Tokens are generated in combinations of 2-25 characters. Here's an example progression from two to seven tokens: MS, MSF, MSFT, MSFT/, MSFT/S, MSFT/SQ, MSFT/SQL. `EdgeNGramTokenFilter` requires a `side` parameter, which determines which side of the string the character combinations are generated from. Use `front` for prefix queries and `back` for suffix queries.
 
 Extra tokenization results in a larger index. If you have sufficient capacity to accommodate the larger index, this approach, with its faster response time, might be the best solution.
 
 ```json
 {
   "fields": [
     {
-      "name": "accountNumber",
-      "analyzer":"myCustomAnalyzer",
-      "type": "Edm.String",
-      "searchable": true,
-      "filterable": true,
-      "retrievable": true,
-      "sortable": false,
-      "facetable": false
+      "name": "accountNumber_prefix",
+      "indexAnalyzer": "ngram_front_analyzer",
+      "searchAnalyzer": "keyword",
+      "type": "Edm.String",
+      "searchable": true,
+      "filterable": false,
+      "retrievable": true,
+      "sortable": false,
+      "facetable": false
+    },
+    {
+      "name": "accountNumber_suffix",
+      "indexAnalyzer": "ngram_back_analyzer",
+      "searchAnalyzer": "keyword",
+      "type": "Edm.String",
+      "searchable": true,
+      "filterable": false,
+      "retrievable": true,
+      "sortable": false,
+      "facetable": false
     }
   ],
 
   "analyzers": [
     {
-      "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
-      "name":"myCustomAnalyzer",
-      "charFilters":[],
-      "tokenizer":"keyword_v2",
-      "tokenFilters":["lowercase", "my_edgeNGram"]
+      "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
+      "name":"ngram_front_analyzer",
+      "charFilters":[],
+      "tokenizer":"keyword_v2",
+      "tokenFilters":["lowercase", "front_edgeNGram"]
+    },
+    {
+      "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
+      "name":"ngram_back_analyzer",
+      "charFilters":[],
+      "tokenizer":"keyword_v2",
+      "tokenFilters":["lowercase", "back_edgeNGram"]
     }
   ],
   "tokenizers":[],
   "charFilters": [],
   "tokenFilters": [
     {
-      "@odata.type":"#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
-      "name":"my_edgeNGram",
-      "minGram": 2,
-      "maxGram": 25,
-      "side": "front"
+      "@odata.type":"#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
+      "name":"front_edgeNGram",
+      "minGram": 2,
+      "maxGram": 25,
+      "side": "front"
+    },
+    {
+      "@odata.type":"#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
+      "name":"back_edgeNGram",
+      "minGram": 2,
+      "maxGram": 25,
+      "side": "back"
     }
   ]
 }
 ```
 
+To search for account numbers that start with `123`, use the following query:
+
+```json
+{
+  "search": "123",
+  "searchFields": "accountNumber_prefix"
+}
+```
+
+To search for account numbers that end with `456`, use the following query:
+
+```json
+{
+  "search": "456",
+  "searchFields": "accountNumber_suffix"
+}
+```
+
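The token progression this hunk describes (MS through MSFT/SQL) can be reproduced with a small local sketch. The helper below is hypothetical, not the service's filter implementation, but follows the `minGram`/`maxGram`/`side` settings from the example:

```python
# Local sketch of EdgeNGramTokenFilterV2 behavior with minGram=2, maxGram=25.
# side="front" takes leading substrings (prefix queries); side="back" takes
# trailing substrings (suffix queries).
def edge_ngrams(term: str, min_gram: int = 2, max_gram: int = 25,
                side: str = "front") -> list[str]:
    grams = []
    for n in range(min_gram, min(max_gram, len(term)) + 1):
        grams.append(term[:n] if side == "front" else term[-n:])
    return grams

# Front grams reproduce the docs' progression: MS, MSF, ..., MSFT/SQL.
print(edge_ngrams("MSFT/SQL", side="front"))
# Back grams make suffix queries a plain term match: "SQL" is itself a gram.
print(edge_ngrams("MSFT/SQL", side="back"))
```

This is why the example pairs an n-gram `indexAnalyzer` with a `keyword` `searchAnalyzer`: the query term is left whole and matched directly against the stored grams.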
 
 ## Next steps
 
 This article explains how analyzers both contribute to query problems and solve query problems. As a next step, take a closer look at how analyzers affect indexing and query processing.
