articles/search/search-normalizers.md
70 additions & 66 deletions
@@ -8,52 +8,43 @@ manager: jlembicz
8
8
ms.author: ishansri
9
9
ms.service: cognitive-search
10
10
ms.topic: how-to
11
-
ms.date: 03/23/2022
11
+
ms.date: 07/14/2022
12
12
---
13
13
14
14
# Text normalization for case-insensitive filtering, faceting and sorting
15
15
16
16
> [!IMPORTANT]
17
17
> This feature is in public preview under [Supplemental Terms of Use](https://azure.microsoft.com/support/legal/preview-supplemental-terms/). The [preview REST API](/rest/api/searchservice/index-preview) supports this feature.
18
18
19
-
In Azure Cognitive Search, a *normalizer* is a component of the search engine responsible for pre-processing text for keyword matching in filters, facets, and sorts. Normalizers behave similar to [analyzers](search-analyzers.md) in how they process text, except they don't tokenize the query. Some of the transformations that can be achieved using normalizers are:
19
+
In Azure Cognitive Search, a *normalizer* is a component that pre-processes text for keyword matching over fields marked as "filterable", "facetable", or "sortable". In contrast with full text "searchable" fields that are paired with [text analyzers](search-analyzers.md), content that's created for filter-facet-sort operations doesn't undergo analysis or tokenization. Omission of text analysis can produce unexpected results when casing and character differences show up.
20
20
21
-
+ Convert to lowercase or upper-case
21
+
By applying a normalizer, you can achieve light text transformations that improve results:
22
+
23
+
+ Consistent casing (such as all lowercase or uppercase)
22
24
+ Normalize accents and diacritics like ö or ê to ASCII equivalent characters "o" and "e"
23
25
+ Map characters like `-` and whitespace into a user-specified character
24
26
25
-
Normalizers are specified on string fields in the index and are applied during indexing and query execution.
26
-
27
27
## Benefits of normalizers
28
28
29
-
Searching and retrieving documents from a search index requires matching the query to the contents of the document. The content can be analyzed to produce tokens for matching as is the case when "search" parameter is used, or can be used as-is for strict keyword matching as seen with "$filter", "facets", and "$orderby". This all-or-nothing approach covers most scenarios but falls short where simple pre-processing like casing, accent removal, asciifolding and so forth is required without undergoing through the entire analysis chain.
29
+
Searching and retrieving documents from a search index requires matching the query input to the contents of the document. Matching is either over tokenized content, as is the case when you invoke "search", or over non-tokenized content if the request is a [filter](search-query-odata-filter.md), [facet](search-faceted-navigation.md), or [orderby](search-query-odata-orderby.md) operation.
30
30
31
-
Consider the following examples:
31
+
Because non-tokenized content is also not analyzed, small differences in the content are evaluated as distinctly different values. Consider the following examples:
32
32
33
-
+ `$filter=City eq 'Las Vegas'` will only return documents that contain the exact text "Las Vegas" and exclude documents with "LAS VEGAS" and "las vegas" which is inadequate when the use-case requires all documents regardless of the casing.
33
+
+ `$filter=City eq 'Las Vegas'` will only return documents that contain the exact text "Las Vegas" and exclude documents with "LAS VEGAS" and "las vegas", which is inadequate when the use-case requires all documents regardless of the casing.
34
34
35
35
+ `search=*&facet=City,count:5` will return "Las Vegas", "LAS VEGAS" and "las vegas" as distinct values despite being the same city.
36
36
37
-
+ `search=usa&$orderby=City` will return the cities in lexicographical order: "Las Vegas", "Seattle", "las vegas", even if the intent is to order the same cities together irrespective of the case.
37
+
+ `search=usa&$orderby=City` will return the cities in lexicographical order: "Las Vegas", "Seattle", "las vegas", even if the intent is to order the same cities together irrespective of the case.
38
38
39
-
## Predefined and custom normalizers
39
+
A normalizer, which is invoked during indexing and query execution, adds light transformations that smooth out minor differences in text for filter, facet, and sort scenarios. In the previous examples, the variants of "Las Vegas" would be processed according to the normalizer you select (for example, all text is lower-cased) for more uniform results.
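As a rough illustration of the effect described above, the following Python sketch simulates exact-match filtering, faceting, and sorting over the "Las Vegas" variants, with and without a lowercase step. It's a local simulation only and doesn't call Azure Cognitive Search. The helper names (`lowercase_normalizer`, `filter_eq`, `facet_counts`, `order_by`) are hypothetical, and plain `str.lower` is assumed as a stand-in for the "lowercase" normalizer.

```python
from collections import Counter

# Sample field values, as they might be stored in a "City" field.
cities = ["Las Vegas", "LAS VEGAS", "las vegas", "Seattle"]


def identity(text: str) -> str:
    """No normalizer: the value is compared as-is."""
    return text


def lowercase_normalizer(text: str) -> str:
    """Stand-in for a 'lowercase' normalizer (assumption: plain str.lower)."""
    return text.lower()


def filter_eq(values, target, normalize=identity):
    """Simulate `$filter=City eq '<target>'` as a strict equality match."""
    wanted = normalize(target)
    return [v for v in values if normalize(v) == wanted]


def facet_counts(values, normalize=identity):
    """Simulate a facet: count documents per distinct (normalized) value."""
    return Counter(normalize(v) for v in values)


def order_by(values, normalize=identity):
    """Simulate `$orderby=City` using the (normalized) value as the sort key."""
    return sorted(values, key=normalize)


if __name__ == "__main__":
    print(filter_eq(cities, "Las Vegas"))
    print(filter_eq(cities, "Las Vegas", lowercase_normalizer))
    print(facet_counts(cities, lowercase_normalizer))
    print(order_by(cities, lowercase_normalizer))
```

Without the normalizer, the filter matches only one of the three variants and the facet reports each variant separately. With it, the variants collapse into a single value for matching, counting, and ordering.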
40
40
41
-
Azure Cognitive Search provides built-in normalizers for common use-cases along with the capability to customize as required.
42
-
43
-
| Category | Description |
44
-
|----------|-------------|
45
-
|[Predefined normalizers](#predefined-normalizers)| Provided out-of-the-box and can be used without any configuration. |
46
-
|[Custom normalizers](#add-custom-normalizers) <sup>1</sup> | For advanced scenarios. Requires user-defined configuration of a combination of existing elements, consisting of char and token filters.|
47
-
48
-
<sup>(1)</sup> Custom normalizers don't specify tokenizers since normalizers always produce a single token.
49
-
50
-
## How to specify normalizers
41
+
## How to specify a normalizer
51
42
52
43
Normalizers are specified in an index definition, on a per-field basis, on text fields (`Edm.String` and `Collection(Edm.String)`) that have at least one of "filterable", "sortable", or "facetable" properties set to true. Setting a normalizer is optional and it's null by default. We recommend evaluating predefined normalizers before configuring a custom one.
53
44
54
-
Normalizers can only be specified when a new field is added to the index. Try to assess the normalization needs upfront and assign normalizers in the initial stages of development when dropping and recreating indexes is routine. Normalizers can't be specified on a field that has already been created.
45
+
Normalizers can only be specified when you add a new field to the index, so if possible, try to assess the normalization needs upfront and assign normalizers in the initial stages of development when dropping and recreating indexes is routine.
55
46
56
-
1. When creating a field definition in the [index](/rest/api/searchservice/create-index), set the "normalizer" property to one of the following: a [predefined normalizer](#predefined-normalizers) such as "lowercase", or a custom normalizer (defined in the same index schema).
47
+
1. When creating a field definition in the [index](/rest/api/searchservice/create-index), set the "normalizer" property to one of the following values: a [predefined normalizer](#predefined-normalizers) such as "lowercase", or a custom normalizer (defined in the same index schema).
57
48
58
49
```json
59
50
"fields": [
@@ -66,12 +57,12 @@ Normalizers can only be specified when a new field is added to the index. Try to
66
57
"analyzer": "en.microsoft",
67
58
"normalizer": "lowercase"
68
59
...
69
-
},
60
+
}
61
+
]
70
62
```
71
63
72
64
1. Custom normalizers are defined in the "normalizers" section of the index first, and then assigned to the field definition as shown in the previous step. For more information, see [Create Index](/rest/api/searchservice/create-index) and also [Add custom normalizers](#add-custom-normalizers).
73
65
74
-
75
66
```json
76
67
"fields": [
77
68
{
@@ -83,12 +74,60 @@ Normalizers can only be specified when a new field is added to the index. Try to
83
74
"normalizer": "my_custom_normalizer"
84
75
},
85
76
```
86
-
77
+
87
78
> [!NOTE]
88
79
> To change the normalizer of an existing field, you'll have to rebuild the index entirely (you cannot rebuild individual fields).
89
80
90
81
A good workaround for production indexes, where rebuilding indexes is costly, is to create a new field identical to the old one but with the new normalizer, and use it in place of the old one. Use [Update Index](/rest/api/searchservice/update-index) to incorporate the new field and [mergeOrUpload](/rest/api/searchservice/addupdate-or-delete-documents) to populate it. Later, as part of planned index servicing, you can clean up the index to remove obsolete fields.
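The workaround above could be sketched as follows. This Python snippet only builds the JSON payloads and sends no HTTP requests. The field names (`City`, `CityNormalized`), the key field (`HotelId`), and both helper functions are hypothetical and chosen purely for illustration. The payload shapes are assumptions based on the field definition shown earlier and on the `@search.action` convention of the documents API.

```python
import json


def new_field_definition(name: str, normalizer: str) -> dict:
    """Field definition to append to the index's "fields" array via Update Index.

    Assumption: the new field mirrors the old one's attributes.
    """
    return {
        "name": name,
        "type": "Edm.String",
        "filterable": True,
        "sortable": True,
        "facetable": True,
        "normalizer": normalizer,
    }


def merge_or_upload_batch(docs, key_field, source_field, target_field) -> dict:
    """Build a batch that copies source_field into target_field for each document."""
    return {
        "value": [
            {
                "@search.action": "mergeOrUpload",
                key_field: doc[key_field],
                target_field: doc[source_field],
            }
            for doc in docs
        ]
    }


if __name__ == "__main__":
    existing_docs = [
        {"HotelId": "1", "City": "Las Vegas"},
        {"HotelId": "2", "City": "LAS VEGAS"},
    ]
    field = new_field_definition("CityNormalized", "lowercase")
    batch = merge_or_upload_batch(existing_docs, "HotelId", "City", "CityNormalized")
    print(json.dumps(field, indent=2))
    print(json.dumps(batch, indent=2))
```

The batch sends the original, unmodified text. The normalizer assigned to the new field is what applies the transformation during indexing and query execution.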
91
82
83
+
## Predefined and custom normalizers
84
+
85
+
Azure Cognitive Search provides built-in normalizers for common use-cases along with the capability to customize as required.
86
+
87
+
| Category | Description |
88
+
|----------|-------------|
89
+
| [Predefined normalizers](#predefined-normalizers) | Provided out-of-the-box and can be used without any configuration. |
90
+
|[Custom normalizers](#add-custom-normalizers) <sup>1</sup> | For advanced scenarios. Requires user-defined configuration of a combination of existing elements, consisting of char and token filters.|
91
+
92
+
<sup>(1)</sup> Custom normalizers don't specify tokenizers since normalizers always produce a single token.
93
+
94
+
## Normalizers reference
95
+
96
+
### Predefined normalizers
97
+
98
+
|**Name**|**Description and Options**|
99
+
|-|-|
100
+
|standard| Lowercases the text followed by asciifolding.|
101
+
|lowercase| Transforms characters to lowercase.|
102
+
|uppercase| Transforms characters to uppercase.|
103
+
|asciifolding| Transforms characters that aren't in the Basic Latin Unicode block to their ASCII equivalent, if one exists. For example, changing à to a.|
104
+
|elision| Removes elision from beginning of the tokens.|
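To give a feel for what the predefined normalizers in the table do, here is a small Python approximation. It isn't the service's implementation. In particular, `asciifolding` is approximated by Unicode decomposition plus removal of combining marks, which covers accented letters such as à, ö, or ê but not every character the real filter folds. The `elision` normalizer is omitted, and all function names are hypothetical.

```python
import unicodedata


def lowercase(text: str) -> str:
    """Approximates the 'lowercase' normalizer."""
    return text.lower()


def uppercase(text: str) -> str:
    """Approximates the 'uppercase' normalizer."""
    return text.upper()


def asciifolding(text: str) -> str:
    """Rough approximation of 'asciifolding'.

    Decomposes each character (NFKD) and drops combining marks, so that
    an accented letter is reduced to its base letter.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def standard(text: str) -> str:
    """Approximates 'standard': lowercase followed by asciifolding."""
    return asciifolding(lowercase(text))


if __name__ == "__main__":
    for name, fn in [("standard", standard), ("lowercase", lowercase),
                     ("uppercase", uppercase), ("asciifolding", asciifolding)]:
        print(name, "->", fn("Café Zürich"))
```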
105
+
106
+
### Supported char filters
107
+
108
+
Normalizers support two character filters that are identical to their counterparts in [custom analyzer character filters](index-add-custom-analyzers.md#CharFilter):
The list below shows the token filters supported for normalizers and is a subset of the overall [token filters used in custom analyzers](index-add-custom-analyzers.md#TokenFilters).
Custom normalizers are [defined within the index schema](/rest/api/searchservice/create-index). The definition includes a name, a type, one or more character filters and token filters. The character filters and token filters are the building blocks for a custom normalizer and responsible for the processing of the text. These filters are applied from left to right.
@@ -131,52 +170,15 @@ Custom normalizers are [defined within the index schema](/rest/api/searchservice
131
170
132
171
Custom normalizers can be added during index creation or later by updating an existing one. Adding a custom normalizer to an existing index requires the "allowIndexDowntime" flag to be specified in [Update Index](/rest/api/searchservice/update-index) and will cause the index to be unavailable for a few seconds.
133
172
134
-
## Normalizers reference
135
-
136
-
### Predefined normalizers
137
-
138
-
|**Name**|**Description and Options**|
139
-
|-|-|
140
-
|standard| Lowercases the text followed by asciifolding.|
141
-
|lowercase| Transforms characters to lowercase.|
142
-
|uppercase| Transforms characters to uppercase.|
143
-
|asciifolding| Transforms characters that aren't in the Basic Latin Unicode block to their ASCII equivalent, if one exists. For example, changing à to a.|
144
-
|elision| Removes elision from beginning of the tokens.|
145
-
146
-
### Supported char filters
147
-
148
-
Normalizers support two character filters that are identical to their counterparts in [custom analyzer character filters](index-add-custom-analyzers.md#CharFilter):
The list below shows the token filters supported for normalizers and is a subset of the overall [token filters used in custom analyzers](index-add-custom-analyzers.md#TokenFilters).
The example below illustrates a custom normalizer definition with corresponding character filters and token filters. Custom options for character filters and token filters are specified separately as named constructs, and then referenced in the normalizer definition as illustrated below.
174
176
175
-
* A custom normalizer named "my_custom_normalizer" is defined in the "normalizers" section of the index definition.
177
+
+ A custom normalizer named "my_custom_normalizer" is defined in the "normalizers" section of the index definition.
176
178
177
-
* The normalizer is composed of two character filters and three token filters: elision, lowercase, and customized asciifolding filter "my_asciifolding".
179
+
+ The normalizer is composed of two character filters and three token filters: elision, lowercase, and customized asciifolding filter "my_asciifolding".
178
180
179
-
* The first character filter "map_dash" replaces all dashes with underscores while the second one "remove_whitespace" removes all spaces.
181
+
+ The first character filter "map_dash" replaces all dashes with underscores while the second one "remove_whitespace" removes all spaces.
180
182
181
183
```json
182
184
{
@@ -235,6 +237,8 @@ The example below illustrates a custom normalizer definition with corresponding
235
237
236
238
## See also
237
239
240
+
+ [Querying concepts in Azure Cognitive Search](search-query-overview.md)
241
+
238
242
+ [Analyzers for linguistic and text processing](search-analyzers.md)