Commit e4e3686

Merge pull request #110406 from HeidiSteen/heidist-search
[Azure Cognitive Search] Fuzzy search doc
2 parents a1d1301 + 3bc51e7 commit e4e3686

5 files changed: +145 -16 lines changed

articles/search/TOC.yml

Lines changed: 8 additions & 6 deletions
Original file line number | Diff line number | Diff line change
@@ -114,12 +114,6 @@
114114
href: index-add-language-analyzers.md
115115
- name: Add a custom analyzer
116116
href: index-add-custom-analyzers.md
117-
- name: Typeahead
118-
items:
119-
- name: Create a suggester
120-
href: index-add-suggesters.md
121-
- name: Add suggestions or autocomplete
122-
href: search-autocomplete-tutorial.md
123117
- name: Synonyms
124118
items:
125119
- name: Add synonyms
@@ -230,6 +224,14 @@
230224
href: search-query-lucene-examples.md
231225
- name: Partial terms and special characters
232226
href: search-query-partial-matching.md
227+
- name: Fuzzy search
228+
href: search-query-fuzzy.md
229+
- name: Autocomplete "Search-as-you-type"
230+
items:
231+
- name: Create a suggester
232+
href: index-add-suggesters.md
233+
- name: Add suggestions or autocomplete
234+
href: search-autocomplete-tutorial.md
233235
- name: Query from Power Apps
234236
href: search-howto-powerapps.md
235237
- name: Syntax reference

articles/search/search-pagination-page-layout.md

Lines changed: 5 additions & 1 deletion
@@ -90,7 +90,11 @@ Another option is using a [custom scoring profile](index-add-scoring-profiles.md
9090

9191
Hit highlighting refers to text formatting (such as bold or yellow highlights) applied to matching terms in a result, making it easy to spot the match. Hit highlighting instructions are provided on the [query request](https://docs.microsoft.com/rest/api/searchservice/search-documents). The search engine encloses the matching term in tags, `highlightPreTag` and `highlightPostTag`, and your code handles the response (for example, applying a bold font).
9292

93-
Formatting is applied to whole term queries. In the following example, the terms "sandy", "sand", "beaches", "beach" found within the Description field are tagged for highlighting. Queries on partial terms, such as fuzzy search or wildcard search that result in query expansion in the engine, cannot use hit highlighting.
93+
Formatting is applied to whole term queries. In the following example, the terms "sandy", "sand", "beaches", "beach" found within the Description field are tagged for highlighting. Queries that trigger query expansion in the engine, such as fuzzy and wildcard search, have limited support for hit highlighting.
94+
95+
```http
96+
GET /indexes/hotels-sample-index/docs?search=sandy%20beaches&highlight=Description&api-version=2019-05-06
97+
```
9498

9599
```http
96100
POST /indexes/hotels-sample-index/docs/search?api-version=2019-05-06

articles/search/search-query-fuzzy.md

Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
1+
---
2+
title: Fuzzy search
3+
titleSuffix: Azure Cognitive Search
4+
description: Implement a "did you mean" search experience to auto-correct a misspelled term or typo.
5+
6+
manager: nitinme
7+
author: HeidiSteen
8+
ms.author: heidist
9+
ms.service: cognitive-search
10+
ms.topic: conceptual
11+
ms.date: 04/08/2020
12+
---
13+
# Fuzzy search to correct misspellings and typos
14+
15+
Azure Cognitive Search supports fuzzy search, a type of query that compensates for typos and misspelled terms in the input string. It does this by scanning for terms having a similar composition. Expanding search to cover near-matches has the effect of auto-correcting a typo when the discrepancy is just a few misplaced characters.
16+
17+
## What is fuzzy search?
18+
19+
It's an expansion exercise that produces a match on terms having a similar composition. When a fuzzy search is specified, the engine builds a graph of similarly composed terms, for all whole terms in the query. For example, if your query includes three terms "university of washington", a graph is created for each term (`search=university~ of~ washington~`).
20+
21+
The graph consists of up to 50 expansions, or permutations, of each term, capturing both correct and incorrect variants in the process. The engine then returns the topmost relevant matches in the response.
22+
23+
For a term like "university", the graph might have "unversty, universty, university, universe, inverse". Any documents that match on those in the graph are included in results. In contrast with language analyzers that can handle irregularities between singular and plural forms of the same word ("mice" and "mouse"), the comparisons in a fuzzy query are taken at face value with no attempt at reconciling the semantic differences. "Universe" and "inverse" will match because the character discrepancies are small.
24+
25+
A match succeeds if the discrepancies are limited to two or fewer edits, where an edit is an inserted, deleted, substituted, or transposed character. The string correction algorithm that specifies the differential is the [Damerau-Levenshtein distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance) metric, described as the "minimum number of operations (insertions, deletions, substitutions, or transpositions of two adjacent characters) required to change one word into the other".
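To make the edit-distance rule concrete, here is a minimal Python sketch of the restricted (optimal string alignment) form of Damerau-Levenshtein distance. It is an illustration only, not the engine's implementation; the function names `osa_distance` and `within_fuzzy_range` are hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Count insertions, deletions, substitutions, and transpositions of
    two adjacent characters needed to turn `a` into `b`."""
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] holds the number of edits needed to turn a[:i] into b[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        dist[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[rows - 1][cols - 1]


def within_fuzzy_range(query_term: str, indexed_term: str, max_edits: int = 2) -> bool:
    """Assumed rule for illustration: lowercase both terms, then compare."""
    return osa_distance(query_term.lower(), indexed_term.lower()) <= max_edits
```

With this sketch, `osa_distance("blue", "glue")` is 1 (one substitution) and `osa_distance("unviersty", "university")` is 2 (one transposition plus one insertion), so both fall inside the two-edit limit.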
26+
27+
In Azure Cognitive Search:
28+
29+
+ Fuzzy query applies to whole terms, but you can support phrases through AND constructions. For example, "Unviersty~ of~ Wshington~" would match on "University of Washington".
30+

31+
+ The default edit distance is 2. A value of `~0` signifies no expansion (only the exact term is considered a match), but you could specify `~1` for one degree of difference, or one edit.
32+
33+
+ A fuzzy query can expand a term up to 50 additional permutations. This limit is not configurable, but you can effectively reduce the number of expansions by decreasing the edit distance to 1.
34+
35+
+ Responses consist of documents containing a relevant match (up to 50).
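Because the tilde applies per term, a phrase has to be rewritten term by term before it is sent. The following Python sketch shows one way to do that; `to_fuzzy_expression` is a hypothetical helper, and it assumes the terms contain no characters that need escaping in Lucene syntax.

```python
from typing import Optional


def to_fuzzy_expression(phrase: str, max_edits: Optional[int] = None) -> str:
    """Append the fuzzy operator to every whitespace-separated term.

    max_edits=None leaves the distance unspecified (the default of 2 applies);
    otherwise it must be 0, 1, or 2.
    """
    if max_edits is not None and max_edits not in (0, 1, 2):
        raise ValueError("edit distance must be 0, 1, or 2")
    suffix = "~" if max_edits is None else f"~{max_edits}"
    return " ".join(term + suffix for term in phrase.split())
```

For example, `to_fuzzy_expression("university of washington")` produces `university~ of~ washington~`, and passing `max_edits=1` produces `university~1 of~1 washington~1`.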
36+
37+
Collectively, the graphs are submitted as match criteria against tokens in the index. As you can imagine, fuzzy search is inherently slower than other query forms. The size and complexity of your index can determine whether the benefits are enough to offset the latency of the response.
38+
39+
> [!NOTE]
40+
> Because fuzzy search tends to be slow, it might be worthwhile to investigate alternatives such as n-gram indexing, with its progression of short character sequences (two and three character sequences for bigram and trigram tokens). Depending on your language and query surface, n-gram might give you better performance.
41+
>
42+
> Another alternative, which you could consider if you want to handle just the most egregious cases, would be a [synonym map](search-synonyms.md). For example, mapping "search" to "serach, serch, sarch", or "retrieve" to "retreive".
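To illustrate what bigram and trigram tokens look like, here is a short Python sketch that splits a term into character n-grams and measures how many n-grams two terms share. It is a conceptual illustration, not the tokenizer the service uses; `char_ngrams` and `shared_ngram_ratio` are hypothetical names.

```python
from typing import List


def char_ngrams(term: str, n: int) -> List[str]:
    """Return the consecutive n-character sequences of a lowercased term."""
    term = term.lower()
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def shared_ngram_ratio(a: str, b: str, n: int = 3) -> float:
    """Fraction of distinct n-grams that two terms have in common."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

For "search", the bigrams are `se`, `ea`, `ar`, `rc`, `ch`. The misspelling "serach" still shares `se` and `ch`, which is why n-gram tokens can match a misspelled term without any query-time expansion.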
43+
44+
## Indexing for fuzzy search
45+
46+
Analyzers are not used during query processing to create an expansion graph, but that doesn't mean analyzers should be ignored in fuzzy search scenarios. After all, analyzers are used during indexing to create tokens against which matching is done, whether the query is free form, filtered search, or a fuzzy search with a graph as input.
47+
48+
Generally, when assigning analyzers on a per-field basis, the decision to fine-tune the analysis chain is based on the primary use case (a filter or full text search) rather than specialized query forms like fuzzy search. For this reason, there is not a specific analyzer recommendation for fuzzy search.
49+
50+
However, if test queries are not producing the matches you expect, you could try varying the indexing analyzer, setting it to a [language analyzer](index-add-language-analyzers.md), to see if you get better results. Some languages, particularly those with vowel mutations, can benefit from the inflection and irregular word forms generated by the Microsoft natural language processors. In some cases, using the right language analyzer can make a difference in whether a term is tokenized in a way that is compatible with the value provided by the user.
51+
52+
## How to use fuzzy search
53+
54+
Fuzzy queries are constructed using the full Lucene query syntax, invoking the [Lucene query parser](https://lucene.apache.org/core/6_6_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html).
55+
56+
1. Set the full Lucene parser on the query (`queryType=full`).
57+
58+
1. Optionally, scope the request to specific fields, using this parameter (`searchFields=<field1,field2>`).
59+
60+
1. Append the tilde (`~`) operator at the end of the whole term (`search=<string>~`).
61+
62+
Include an optional parameter, a number between 0 and 2 (default), if you want to specify the edit distance (`~1`). For example, "blue~" or "blue~1" would return "blue", "blues", and "glue".
63+
64+
In Azure Cognitive Search, besides the term and distance (maximum of 2), there are no additional parameters to set on the query.
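The steps above can be sketched in Python as a function that assembles a GET request URL. The service name and index name in the usage note are placeholders, `build_fuzzy_query_url` is a hypothetical helper, and the sketch assumes the same `api-version` used elsewhere in this article. A real request also needs an `api-key` header, which is omitted here.

```python
from typing import Optional, Sequence
from urllib.parse import urlencode


def build_fuzzy_query_url(
    service: str,
    index: str,
    term: str,
    max_edits: int = 2,
    search_fields: Optional[Sequence[str]] = None,
    api_version: str = "2019-05-06",
) -> str:
    """Assemble a query URL for a single-term fuzzy search."""
    if max_edits not in (0, 1, 2):
        raise ValueError("edit distance must be 0, 1, or 2")
    params = {
        "api-version": api_version,
        "queryType": "full",              # step 1: full Lucene parser
        "search": f"{term}~{max_edits}",  # step 3: tilde plus edit distance
    }
    if search_fields:
        # step 2 (optional): scope the query to specific fields
        params["searchFields"] = ",".join(search_fields)
    query = urlencode(params)  # percent-encodes reserved characters
    return f"https://{service}.search.windows.net/indexes/{index}/docs?{query}"
```

Calling `build_fuzzy_query_url("my-service", "hotels-sample-index", "scial", max_edits=1, search_fields=["Description"])` returns a URL whose query string carries `queryType=full`, `searchFields=Description`, and a `search` value of `scial~1`.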
65+
66+
> [!NOTE]
67+
> During query processing, fuzzy queries do not undergo [lexical analysis](search-lucene-query-architecture.md#stage-2-lexical-analysis). The query input is added directly to the query tree and expanded to create a graph of terms. The only transformation performed is lower casing.
68+
69+
## How to test fuzzy search
70+
71+
For simple testing, we recommend [Search explorer](search-explorer.md) or [Postman](search-get-started-postman.md) for iterating over a query expression. Both tools are interactive, which means you can quickly step through multiple variants of a term and evaluate the responses that come back.
72+
73+
When results are ambiguous, [hit highlighting](search-pagination-page-layout.md#hit-highlighting) can help you identify the match in the response.
74+
75+
> [!NOTE]
76+
> The use of hit highlighting to identify fuzzy matches has limitations and only works for basic fuzzy search. If your index has scoring profiles, or if you layer the query with additional syntax, hit highlighting might fail to identify the match.
77+
78+
### Example 1: fuzzy search with the exact term
79+
80+
Assume the following string exists in a `"Description"` field in a search document: `"Test queries with special characters, plus strings for MSFT, SQL and Java."`
81+
82+
Start with a fuzzy search on "special" and add hit highlighting to the Description field:
83+
84+
search=special~&highlight=Description
85+
86+
In the response, because you added hit highlighting, formatting is applied to "special" as the matching term.
87+
88+
"@search.highlights": {
89+
"Description": [
90+
"Test queries with <em>special</em> characters, plus strings for MSFT, SQL and Java."
91+
]
92+
93+
Try the request again, misspelling "special" by taking out several letters ("pe"):
94+
95+
search=scial~&highlight=Description
96+
97+
So far, no change to the response. Using the default of 2 degrees distance, removing two characters "pe" from "special" still allows for a successful match on that term.
98+
99+
"@search.highlights": {
100+
"Description": [
101+
"Test queries with <em>special</em> characters, plus strings for MSFT, SQL and Java."
102+
]
103+
104+
Trying one more request, further modify the search term by taking out one last character for a total of three deletions (from "special" to "scal"):
105+
106+
search=scal~&highlight=Description
107+
108+
Notice that the same document is returned, but now instead of matching on "special", the fuzzy match is on "SQL".
109+
110+
"@search.score": 0.4232868,
111+
"@search.highlights": {
112+
"Description": [
113+
"Mix of special characters, plus strings for MSFT, <em>SQL</em>, 2019, Linux, Java."
114+
]
115+
116+
The point of this expanded example is to illustrate the clarity that hit highlighting can bring to ambiguous results. In all cases, the same document is returned. Had you relied on document IDs to verify a match, you might have missed the shift from "special" to "SQL".
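When you run many test queries, it can help to pull the highlighted terms out of each result programmatically. The following Python sketch assumes a result shaped like the fragments shown above and the `<em>` tags that appear in them; `highlighted_terms` is a hypothetical helper, not part of any SDK.

```python
import re
from typing import Dict, List


def highlighted_terms(
    result: dict, pre_tag: str = "<em>", post_tag: str = "</em>"
) -> Dict[str, List[str]]:
    """Map each highlighted field to the terms wrapped in highlight tags."""
    pattern = re.compile(re.escape(pre_tag) + r"(.*?)" + re.escape(post_tag))
    terms: Dict[str, List[str]] = {}
    for field, fragments in result.get("@search.highlights", {}).items():
        if not isinstance(fragments, list):
            continue  # skip any non-list entries, such as annotations
        terms[field] = [m for fragment in fragments for m in pattern.findall(fragment)]
    return terms


# Sample result copied from the third response above.
sample = {
    "@search.score": 0.4232868,
    "@search.highlights": {
        "Description": [
            "Mix of special characters, plus strings for MSFT, <em>SQL</em>, 2019, Linux, Java."
        ]
    },
}
print(highlighted_terms(sample))  # {'Description': ['SQL']}
```

Comparing the extracted terms across query variants surfaces a shift such as "special" to "SQL" without reading each response by eye.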
117+
118+
## See also
119+
120+
+ [How full text search works in Azure Cognitive Search (query parsing architecture)](search-lucene-query-architecture.md)
121+
+ [Search explorer](search-explorer.md)
122+
+ [How to query in .NET](search-query-dotnet.md)
123+
+ [How to query in REST](search-create-index-rest-api.md)

articles/search/search-query-lucene-examples.md

Lines changed: 6 additions & 6 deletions
@@ -82,7 +82,7 @@ This first example is not Lucene-specific, but we lead with it to introduce the
8282

8383
For brevity, the query targets only the *business_title* field and specifies only business titles are returned. The **searchFields** parameter restricts query execution to just the business_title field, and **select** specifies which fields are included in the response.
8484

85-
### Partial query string
85+
### Search expression
8686

8787
```http
8888
&search=*&searchFields=business_title&$select=business_title
@@ -115,7 +115,7 @@ You might have noticed the search score in the response. Uniform scores of 1 occ
115115

116116
Full Lucene syntax supports scoping individual search expressions to a specific field. This example searches for business titles with the term senior in them, but not junior.
117117

118-
### Partial query string
118+
### Search expression
119119

120120
```http
121121
$select=business_title&search=business_title:(senior NOT junior)
@@ -153,7 +153,7 @@ The field specified in **fieldName:searchExpression** must be a searchable field
153153
Full Lucene syntax also supports fuzzy search, matching on terms that have a similar construction.
154154
To do a fuzzy search, append the tilde `~` symbol at the end of a single word with an optional parameter, a value between 0 and 2, that specifies the edit distance. For example, `blue~` or `blue~1` would return blue, blues, and glue.
155155

156-
### Partial query string
156+
### Search expression
157157

158158
```http
159159
searchFields=business_title&$select=business_title&search=business_title:asosiate~
@@ -183,7 +183,7 @@ https://azs-playground.search.windows.net/indexes/nycjobs/docs?api-version=2019-
183183
## Example 4: Proximity search
184184
Proximity searches are used to find terms that are near each other in a document. Insert a tilde "~" symbol at the end of a phrase followed by the number of words that create the proximity boundary. For example, "hotel airport"~5 will find the terms hotel and airport within 5 words of each other in a document.
185185

186-
### Partial query string
186+
### Search expression
187187

188188
```http
189189
searchFields=business_title&$select=business_title&search=business_title:%22senior%20analyst%22~1
@@ -236,7 +236,7 @@ When setting the factor level, the higher the boost factor, the more relevant th
236236

237237
A regular expression search finds a match based on the contents between forward slashes "/", as documented in the [RegExp class](https://lucene.apache.org/core/6_6_1/core/org/apache/lucene/util/automaton/RegExp.html).
238238

239-
### Partial query string
239+
### Search expression
240240

241241
```http
242242
searchFields=business_title&$select=business_title&search=business_title:/(Sen|Jun)ior/
@@ -259,7 +259,7 @@ https://azs-playground.search.windows.net/indexes/nycjobs/docs?api-version=2019-
259259
## Example 7: Wildcard search
260260
You can use generally recognized syntax for multiple (\*) or single (?) character wildcard searches. Note the Lucene query parser supports the use of these symbols with a single term, and not a phrase.
261261

262-
### Partial query string
262+
### Search expression
263263

264264
```http
265265
searchFields=business_title&$select=business_title&search=business_title:prog*

articles/search/search-query-partial-matching.md

Lines changed: 3 additions & 3 deletions
@@ -10,13 +10,13 @@ ms.service: cognitive-search
1010
ms.topic: conceptual
1111
ms.date: 04/02/2020
1212
---
13-
# Partial term search and patterns with special characters - Azure Cognitive Search (wildcard, regex, patterns)
13+
# Partial term search and patterns with special characters (wildcard, regex, patterns)
1414

1515
A *partial term search* refers to queries consisting of term fragments, such as the first, last, or interior parts of a string. A *pattern* might be a combination of fragments, sometimes with special characters such as dashes or slashes that are part of the query. Common use-cases include querying for portions of a phone number, URL, people or product codes, or compound words.
1616

1717
Partial search can be problematic if the index doesn't have terms in the format required for pattern matching. During the text analysis phase of indexing, using the default standard analyzer, special characters are discarded and composite and compound strings are split up, causing pattern queries to fail when no match is found. For example, a phone number like `+1 (425) 703-6214` (tokenized as `"1"`, `"425"`, `"703"`, `"6214"`) won't show up in a `"3-62"` query because that content doesn't actually exist in the index.
1818

19-
The solution is to invoke an analyzer that preserves a complete string, including spaces and special characters if necessary, so that you can support partial terms and patterns. Creating an additional field for an intact string, plus using a content-preserving analyzer, is the basis of the solution.
19+
The solution is to invoke an analyzer that preserves a complete string, including spaces and special characters if necessary, so that you can match on partial terms and patterns. Creating an additional field for an intact string, plus using a content-preserving analyzer, is the basis of the solution.
2020

2121
## What is partial search in Azure Cognitive Search
2222

@@ -59,7 +59,7 @@ Analyzers are assigned on a per-field basis, which means you can create fields i
5959
"type": "Edm.String",
6060
"retrievable": true,
6161
"searchable": true,
62-
"analyzer": "my_customanalyzer"
62+
"analyzer": "my_custom_analyzer"
6363
},
6464
```
6565
