
Commit d880581

committed
checkpoint
1 parent ffc4ae8 commit d880581

File tree

1 file changed: +24 -16 lines changed

articles/search/search-query-partial-matching.md

Lines changed: 24 additions & 16 deletions
@@ -12,20 +12,19 @@ ms.date: 04/02/2020
 ---
 # Partial term search in Azure Cognitive Search queries (wildcard, regex, fuzzy search, patterns)

-A *partial term search* refers to queries consisting of term fragments, including first, last, or interior parts of a string, or a pattern consisting of a combination of fragments, often separated by special characters such as dashes or slashes. Common use-cases include querying for portions of a phone number, URL, people or product codes, or compound words.
+A *partial term search* refers to queries consisting of term fragments, such as the first, last, or interior parts of a string, or a pattern consisting of a combination of fragments, often separated by special characters such as dashes or slashes. Common use-cases include querying for portions of a phone number, URL, people or product codes, or compound words.

-Partial search can be problematic because the index itself does not typically store terms in a way that is conducive to partial string and pattern matching. During the text analysis phase of indexing, special characters are discarded, composite and compound strings are split up, which means pattern queries will fail because no match can be found. For example, a phone number like `+1 (425) 703-6214` - tokenized as `"1"`, `"425"`, `"703"`, `"6214"` - won't show up in a `"3-62"` query because that content doesn't actually exist in the index.
+Partial search can be problematic because the index itself does not typically store terms in a way that is conducive to partial string and pattern matching. During the text analysis phase of indexing, special characters are discarded, composite and compound strings are split up, causing pattern queries to fail when no match is found. For example, a phone number like `+1 (425) 703-6214` - tokenized as `"1"`, `"425"`, `"703"`, `"6214"` - won't show up in a `"3-62"` query because that content doesn't actually exist in the index.

 The solution is to store intact versions of these strings in the index, to specifically support partial search scenarios. Creating an additional field for an intact string, plus using a content-preserving analyzer, is the basis of the solution.

 ## What is partial search in Azure Cognitive Search

 In Azure Cognitive Search, partial search is available in these forms:

-+ Prefix search (`search=sea~`, matching on "seaside", "Seattle", "seam", and so forth)
-+ Wildcard search and RegEx search
-+ Fuzzy search that infers a valid "near match" query or corrects a misspelled term
-+ Autocomplete and "search-as-you-type" suggestions
++ [Simple query expressions](query-simple-syntax.md) that use a fragment
++ [Wildcard or prefix search](query-lucene-syntax.md#bkmk_wildcard), such as `search=sea~` or `search=sea*`, matching on "seaside", "Seattle", "seam", and so forth.
++ [Regular expressions](query-lucene-syntax.md#bkmk_regex)

 When any of the above query types are needed in your client application, follow the steps in this article to ensure the necessary content exists in your index.
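Expressed as a Search Documents request, the regex form in that new list could look like the following sketch. It assumes `queryType=full` (full Lucene syntax) and a hypothetical `phoneNumberIntact` field that stores the whole string; both names are illustrative, not from this commit.

```json
{
  "search": "/.*3-62.*/",
  "queryType": "full",
  "searchFields": "phoneNumberIntact"
}
```

Posted to `/indexes/<index-name>/docs/search`, a query like this should match the `+1 (425) 703-6214` example above once the intact token exists in the index; the default analyzer's fragments (`1`, `425`, `703`, `6214`) would not.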

@@ -34,12 +33,10 @@ When any of the above query types are needed in your client application, follow
 When you need to search on patterns or special characters, you can override the default analyzer with a custom analyzer that operates under simpler tokenization rules, preserving whole terms, necessary when query strings include parts of a term or special characters. Taking a step back, the approach looks like this:

 + Define a field to store an intact version of the field (assuming you want analyzed and non-analyzed text)
-+ Choose a predefined analyzer or define a custom analyzer that produces the desired output
++ Choose a predefined analyzer or define a custom analyzer to output an intact string
 + Assign the analyzer to the field
 + Build the index and test

-This article walks you through these tasks. The approach described here is useful in other scenarios: wildcard and regular expression queries also need whole terms as the basis for pattern matching.
-
 > [!TIP]
 > Evaluating analyzers is an iterative process that requires frequent index rebuilds. You can make this step easier by using Postman, the REST APIs for [Create Index](https://docs.microsoft.com/rest/api/searchservice/create-index), [Delete Index](https://docs.microsoft.com/rest/api/searchservice/delete-index), [Load Documents](https://docs.microsoft.com/rest/api/searchservice/addupdate-or-delete-documents), and [Search Documents](https://docs.microsoft.com/rest/api/searchservice/search-documents). For Load Documents, the request body should contain a small representative data set that you want to test (for example, a field with phone numbers or product codes). With these APIs in the same Postman collection, you can cycle through these steps quickly.
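A minimal sketch of the first step in that list, assuming hypothetical `phoneNumber` and `phoneNumberIntact` field names and a custom analyzer named `myCustomAnalyzer` (a sketch of its definition appears further down this page):

```json
"fields": [
  { "name": "id", "type": "Edm.String", "key": true },
  { "name": "phoneNumber", "type": "Edm.String", "searchable": true, "analyzer": "standard.lucene" },
  { "name": "phoneNumberIntact", "type": "Edm.String", "searchable": true, "analyzer": "myCustomAnalyzer" }
]
```

The client pushes the same source value into both fields; the analyzer only changes how each copy is tokenized, so ordinary full text search continues to work against `phoneNumber` while wildcard and regex queries target `phoneNumberIntact`.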
@@ -74,7 +71,7 @@ When choosing an analyzer that produces whole-term tokens, the following analyze
 | [whitespace](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/WhitespaceAnalyzer.html) | Separates on white spaces only. Terms that include dashes or other characters are treated as a single token. |
 | [custom analyzer](index-add-custom-analyzers.md) | (recommended) Creating a custom analyzer lets you specify both the tokenizer and token filter. The previous analyzers must be used as-is. A custom analyzer lets you pick which tokenizers and token filters to use. <br><br>A recommended combination is the [keyword tokenizer](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/KeywordTokenizer.html) with a [lower-case token filter](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilter.html). By itself, the predefined [keyword analyzer](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/KeywordAnalyzer.html) does not lower-case any upper-case text, which can cause queries to fail. A custom analyzer gives you a mechanism for adding the lower-case token filter. |

-If you are using a web API test tool like Postman, you can add the [Test Analyzer REST call](https://docs.microsoft.com/rest/api/searchservice/test-analyzer) to inspect tokenized output.
+If you are using a web API test tool like Postman, you can add the [Test Analyzer REST call](https://docs.microsoft.com/rest/api/searchservice/test-analyzer) to inspect tokenized output.

 You must have an existing index to work with. Given an existing index and a field containing dashes or partial terms, you can try various analyzers over specific terms to see what tokens are emitted.
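For example, posting the following body to `/indexes/<index-name>/analyze` (the Test Analyzer call referenced above) shows what `whitespace` emits for the phone number from the introduction; since only spaces separate terms, the expected tokens are `+1`, `(425)`, and `703-6214`:

```json
{
  "text": "+1 (425) 703-6214",
  "analyzer": "whitespace"
}
```

Swapping `"whitespace"` for `"standard"` in the same body reproduces the fragmented `1`, `425`, `703`, `6214` output that makes a `"3-62"` query fail.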

@@ -139,15 +136,15 @@ You must have an existing index to work with. Given an existing index and a fiel
 }
 ```
 > [!Important]
-> Be aware that query parsers often lower-case terms in a search expression when building the query tree. If you are using an analyzer that does not lower-case text inputs, and you are not getting expected results, this could be the reason. The solution is to add a lwower-case token filter.
+> Be aware that query parsers often lower-case terms in a search expression when building the query tree. If you are using an analyzer that does not lower-case text inputs, and you are not getting expected results, this could be the reason. The solution is to add a lower-case token filter, as described in the "Use custom analyzers" section below.

 ## Configure an analyzer

 Whether you are evaluating analyzers or moving forward with a specific configuration, you will need to specify the analyzer on the field definition, and possibly configure the analyzer itself if you are not using a built-in analyzer. When swapping analyzers, you typically need to rebuild the index (drop, recreate, and reload).

 ### Use built-in analyzers

-Built-in or predefined analyzers can be specified by name on an `analyzer` property of a field definition, with no additional configuration required in the index. The following example demonstrates how you would set the `whitespace` analyzer on a field.
+Built-in or predefined analyzers can be specified by name on an `analyzer` property of a field definition, with no additional configuration required in the index. The following example demonstrates how you would set the `whitespace` analyzer on a field. For more information about available built-in analyzers, see [Predefined analyzers list](https://docs.microsoft.com/azure/search/index-add-custom-analyzers#predefined-analyzers-reference).

 ```json
 {
@@ -159,7 +156,6 @@ Built-in or predefined analyzers can be specified by name on an `analyzer` prope
     "analyzer": "whitespace"
 }
 ```
-For more information about all available built-in analyzers, see [Predefined analyzers list](https://docs.microsoft.com/azure/search/index-add-custom-analyzers#predefined-analyzers-reference).

 ### Use custom analyzers


@@ -168,7 +164,7 @@ If you are using a [custom analyzer](index-add-custom-analyzers.md), define it i
 When the objective is whole-term tokenization, a custom analyzer that consists of a **keyword tokenizer** and **lower-case token filter** is recommended.

 + The keyword tokenizer creates a single token for the entire contents of a field.
-+ The lowercase token filter transforms upper-case letters into lower-case text. Query parsers typically lowercase any uppercase text inputs. Lowercasing homogenizes the inputs with the tokenized terms.
++ The lowercase token filter transforms upper-case letters into lower-case text. Query parsers typically lowercase any uppercase text inputs. Lower-casing homogenizes the inputs with the tokenized terms.

 The following example illustrates a custom analyzer that provides the keyword tokenizer and a lowercase token filter.

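The example itself falls outside this diff's context lines, but a custom analyzer matching that description would pair the keyword tokenizer with the lowercase filter roughly as follows (the analyzer name is illustrative, carried over from the field sketch earlier on this page):

```json
"analyzers": [
  {
    "name": "myCustomAnalyzer",
    "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
    "tokenizer": "keyword_v2",
    "tokenFilters": [ "lowercase" ]
  }
]
```

A field then opts in with `"analyzer": "myCustomAnalyzer"`, in the same way the built-in example earlier sets `"analyzer": "whitespace"`.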

@@ -206,7 +202,19 @@ The following example illustrates a custom analyzer that provides the keyword to

 ## Build and test

-Once you have defined an index with analyzers and field definitions that support your scenario, load documents that have representative strings so that you can test partial string queries. Recall that the Test Analyzer API is called against an existing index. Be sure to include this API as part of your test, as verification that terms are tokenized or preserved in the expected format.
+Once you have defined an index with analyzers and field definitions that support your scenario, load documents that have representative strings so that you can test partial string queries.
+
+The previous sections explained the logic. This section steps through each API you should call when testing your solution. As previously noted, if you use an interactive web test tool such as Postman, you can step through these tasks quickly.
+
++ [Delete Index](https://docs.microsoft.com/rest/api/searchservice/delete-index) removes an existing index of the same name so that you can recreate it.
+
++ [Create Index](https://docs.microsoft.com/rest/api/searchservice/create-index) creates the index structure on your search service, including analyzer definitions and fields with an analyzer specification.
+
++ [Load Documents](https://docs.microsoft.com/rest/api/searchservice/addupdate-or-delete-documents) imports documents having the same structure as your index, as well as searchable content. After this step, your index is ready to query or test.
+
++ [Test Analyzer](https://docs.microsoft.com/rest/api/searchservice/test-analyzer) was introduced in [Choose an analyzer](#choose-an-analyzer). Test some of the strings in your index using a variety of analyzers to understand how terms are tokenized.
+
++ [Search Documents](https://docs.microsoft.com/rest/api/searchservice/search-documents) explains how to construct a query request, using either [simple syntax](query-simple-syntax.md) or [full Lucene syntax](query-lucene-syntax.md) for wildcard and regular expressions.

 ## Tips and best practices

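For the Load Documents step in that list, a small representative payload posted to `/indexes/<index-name>/docs/index` might look like this sketch; the field names follow the earlier hypothetical schema, and the second number is fictitious test data:

```json
{
  "value": [
    {
      "@search.action": "upload",
      "id": "1",
      "phoneNumber": "+1 (425) 703-6214",
      "phoneNumberIntact": "+1 (425) 703-6214"
    },
    {
      "@search.action": "upload",
      "id": "2",
      "phoneNumber": "+1 (425) 555-0100",
      "phoneNumberIntact": "+1 (425) 555-0100"
    }
  ]
}
```

With these documents loaded, the regex query sketched near the top of this page (`/.*3-62.*/` against `phoneNumberIntact`) should return document 1 and not document 2, confirming that the intact field preserves the pattern.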

@@ -274,4 +282,4 @@ This article explains how analyzers both contribute to query problems and solve
 + [Language analyzers](search-language-support.md)
 + [Analyzers for text processing in Azure Cognitive Search](search-analyzers.md)
 + [Analyze Text API (REST)](https://docs.microsoft.com/rest/api/searchservice/test-analyzer)
-+ [How full text search works (query architecture)](search-lucene-query-architecture.md)
++ [How full text search works (query architecture)](search-lucene-query-architecture.md)
