
Commit e5c23ee

Merge pull request #204804 from HeidiSteen/heidist-support-case-art
[azure search] Clarify normalizer/analyzer diff, per a Stackoverflow question
2 parents 2084c9f + 69de4e7 commit e5c23ee

File tree

1 file changed: +70 additions, −66 deletions


articles/search/search-normalizers.md

Lines changed: 70 additions & 66 deletions
@@ -8,52 +8,43 @@ manager: jlembicz
 ms.author: ishansri
 ms.service: cognitive-search
 ms.topic: how-to
-ms.date: 03/23/2022
+ms.date: 07/14/2022
 ---
 
 # Text normalization for case-insensitive filtering, faceting and sorting
 
 > [!IMPORTANT]
 > This feature is in public preview under [Supplemental Terms of Use](https://azure.microsoft.com/support/legal/preview-supplemental-terms/). The [preview REST API](/rest/api/searchservice/index-preview) supports this feature.
 
-In Azure Cognitive Search, a *normalizer* is a component of the search engine responsible for pre-processing text for keyword matching in filters, facets, and sorts. Normalizers behave similar to [analyzers](search-analyzers.md) in how they process text, except they don't tokenize the query. Some of the transformations that can be achieved using normalizers are:
+In Azure Cognitive Search, a *normalizer* is a component that pre-processes text for keyword matching over fields marked as "filterable", "facetable", or "sortable". In contrast with full text "searchable" fields that are paired with [text analyzers](search-analyzers.md), content that's created for filter-facet-sort operations doesn't undergo analysis or tokenization. Omission of text analysis can produce unexpected results when casing and character differences show up.
 
-+ Convert to lowercase or upper-case
+By applying a normalizer, you can achieve light text transformations that improve results:
+
++ Consistent casing (such as all lowercase or uppercase)
 + Normalize accents and diacritics like ö or ê to ASCII equivalent characters "o" and "e"
 + Map characters like `-` and whitespace into a user-specified character
 
-Normalizers are specified on string fields in the index and are applied during indexing and query execution.
-
 ## Benefits of normalizers
 
-Searching and retrieving documents from a search index requires matching the query to the contents of the document. The content can be analyzed to produce tokens for matching as is the case when "search" parameter is used, or can be used as-is for strict keyword matching as seen with "$filter", "facets", and "$orderby". This all-or-nothing approach covers most scenarios but falls short where simple pre-processing like casing, accent removal, asciifolding and so forth is required without undergoing through the entire analysis chain.
+Searching and retrieving documents from a search index requires matching the query input to the contents of the document. Matching is either over tokenized content, as is the case when you invoke "search", or over non-tokenized content if the request is a [filter](search-query-odata-filter.md), [facet](search-faceted-navigation.md), or [orderby](search-query-odata-orderby.md) operation.
 
-Consider the following examples:
+Because non-tokenized content is also not analyzed, small differences in the content are evaluated as distinctly different values. Consider the following examples:
 
-+ `$filter=City eq 'Las Vegas'` will only return documents that contain the exact text "Las Vegas" and exclude documents with "LAS VEGAS" and "las vegas" which is inadequate when the use-case requires all documents regardless of the casing.
++ `$filter=City eq 'Las Vegas'` will only return documents that contain the exact text "Las Vegas" and exclude documents with "LAS VEGAS" and "las vegas", which is inadequate when the use-case requires all documents regardless of the casing.
 
 + `search=*&facet=City,count:5` will return "Las Vegas", "LAS VEGAS" and "las vegas" as distinct values despite being the same city.
 
-+ `search=usa&$orderby=City` will return the cities in lexicographical order: "Las Vegas", "Seattle", "las vegas", even if the intent is to order the same cities together irrespective of the case.
++ `search=usa&$orderby=City` will return the cities in lexicographical order: "Las Vegas", "Seattle", "las vegas", even if the intent is to order the same cities together irrespective of the case.
 
-## Predefined and custom normalizers
+A normalizer, which is invoked during indexing and query execution, adds light transformations that smooth out minor differences in text for filter, facet, and sort scenarios. In the previous examples, the variants of "Las Vegas" would be processed according to the normalizer you select (for example, all text is lower-cased) for more uniform results.
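The mismatches listed in the bullets above are easy to reproduce in a toy model. This short Python sketch is illustrative only — the field name, documents, and matching logic are simplified stand-ins for the service behavior — but it shows exact keyword matching missing case variants, and a lowercase normalization (applied to both stored and query text) smoothing them out:

```python
# Toy model of filter/facet matching before and after lowercase normalization.
# Illustration only; Azure Cognitive Search applies normalizers server-side
# to "filterable", "facetable", and "sortable" fields.
from collections import Counter

docs = [{"City": "Las Vegas"}, {"City": "LAS VEGAS"}, {"City": "las vegas"}]

def filter_eq(docs, field, value, normalize=None):
    """Exact keyword match, optionally normalizing both sides first."""
    norm = normalize or (lambda s: s)
    return [d for d in docs if norm(d[field]) == norm(value)]

def facet(docs, field, normalize=None):
    """Count facet values, optionally normalizing stored text first."""
    norm = normalize or (lambda s: s)
    return Counter(norm(d[field]) for d in docs)

# Without a normalizer: exact matching, so only one of three documents matches,
# and faceting reports three distinct values for the same city.
print(len(filter_eq(docs, "City", "Las Vegas")))             # 1
print(facet(docs, "City"))                                   # three distinct facet values

# With a "lowercase"-style normalizer applied to stored and query text alike:
print(len(filter_eq(docs, "City", "Las Vegas", str.lower)))  # 3
print(facet(docs, "City", str.lower))                        # one facet value: "las vegas"
```

The point of the sketch is that normalization must happen on both sides — which is why the service invokes the normalizer during indexing and again at query time.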
 
-Azure Cognitive Search provides built-in normalizers for common use-cases along with the capability to customize as required.
-
-| Category | Description |
-|----------|-------------|
-| [Predefined normalizers](#predefined-normalizers) | Provided out-of-the-box and can be used without any configuration. |
-|[Custom normalizers](#add-custom-normalizers) <sup>1</sup> | For advanced scenarios. Requires user-defined configuration of a combination of existing elements, consisting of char and token filters.|
-
-<sup>(1)</sup> Custom normalizers don't specify tokenizers since normalizers always produce a single token.
-
-## How to specify normalizers
+## How to specify a normalizer
 
 Normalizers are specified in an index definition, on a per-field basis, on text fields (`Edm.String` and `Collection(Edm.String)`) that have at least one of "filterable", "sortable", or "facetable" properties set to true. Setting a normalizer is optional and it's null by default. We recommend evaluating predefined normalizers before configuring a custom one.
 
-Normalizers can only be specified when a new field is added to the index. Try to assess the normalization needs upfront and assign normalizers in the initial stages of development when dropping and recreating indexes is routine. Normalizers can't be specified on a field that has already been created.
+Normalizers can only be specified when you add a new field to the index, so if possible, try to assess the normalization needs upfront and assign normalizers in the initial stages of development when dropping and recreating indexes is routine.
 
-1. When creating a field definition in the [index](/rest/api/searchservice/create-index), set the "normalizer" property to one of the following: a [predefined normalizer](#predefined-normalizers) such as "lowercase", or a custom normalizer (defined in the same index schema).
+1. When creating a field definition in the [index](/rest/api/searchservice/create-index), set the "normalizer" property to one of the following values: a [predefined normalizer](#predefined-normalizers) such as "lowercase", or a custom normalizer (defined in the same index schema).
 
 ```json
 "fields": [
@@ -66,12 +57,12 @@ Normalizers can only be specified when a new field is added to the index. Try to
 "analyzer": "en.microsoft",
 "normalizer": "lowercase"
 ...
-},
+}
+]
 ```
 
 1. Custom normalizers are defined in the "normalizers" section of the index first, and then assigned to the field definition as shown in the previous step. For more information, see [Create Index](/rest/api/searchservice/create-index) and also [Add custom normalizers](#add-custom-normalizers).
 
-
 ```json
 "fields": [
 {
@@ -83,12 +74,60 @@ Normalizers can only be specified when a new field is added to the index. Try to
 "normalizer": "my_custom_normalizer"
 },
 ```
-
+
 > [!NOTE]
 > To change the normalizer of an existing field, you'll have to rebuild the index entirely (you cannot rebuild individual fields).
 
 A good workaround for production indexes, where rebuilding indexes is costly, is to create a new field identical to the old one but with the new normalizer, and use it in place of the old one. Use [Update Index](/rest/api/searchservice/update-index) to incorporate the new field and [mergeOrUpload](/rest/api/searchservice/addupdate-or-delete-documents) to populate it. Later, as part of planned index servicing, you can clean up the index to remove obsolete fields.
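The backfill step of that workaround boils down to posting a mergeOrUpload batch to the documents endpoint. Here's a minimal sketch of building that payload — the index shape, "HotelId" key, and field names are hypothetical, and actually sending the request is omitted:

```python
import json

def build_merge_payload(docs, old_field, new_field):
    """Build a mergeOrUpload batch that copies an old field's value into a
    new field (for example, one created with a different normalizer).
    Document shape and field names here are hypothetical examples."""
    actions = [
        {
            "@search.action": "mergeOrUpload",
            "HotelId": d["HotelId"],   # assumed document key field
            new_field: d[old_field],
        }
        for d in docs
    ]
    return {"value": actions}

docs = [{"HotelId": "1", "City": "Las Vegas"},
        {"HotelId": "2", "City": "LAS VEGAS"}]
payload = build_merge_payload(docs, "City", "CityNormalized")
print(json.dumps(payload, indent=2))
# This body would be POSTed to
# https://{service}.search.windows.net/indexes/{index}/docs/index
# with the required api-version query parameter; the HTTP call is left out here.
```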
 
+## Predefined and custom normalizers
+
+Azure Cognitive Search provides built-in normalizers for common use-cases along with the capability to customize as required.
+
+| Category | Description |
+|----------|-------------|
+| [Predefined normalizers](#predefined-normalizers) | Provided out-of-the-box and can be used without any configuration. |
+|[Custom normalizers](#add-custom-normalizers) <sup>1</sup> | For advanced scenarios. Requires user-defined configuration of a combination of existing elements, consisting of char and token filters.|
+
+<sup>(1)</sup> Custom normalizers don't specify tokenizers since normalizers always produce a single token.
+
+## Normalizers reference
+
+### Predefined normalizers
+
+|**Name**|**Description and Options**|
+|-|-|
+|standard| Lowercases the text followed by asciifolding.|
+|lowercase| Transforms characters to lowercase.|
+|uppercase| Transforms characters to uppercase.|
+|asciifolding| Transforms characters that aren't in the Basic Latin Unicode block to their ASCII equivalent, if one exists. For example, changing à to a.|
+|elision| Removes elision from beginning of the tokens.|
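For a feel of what the "standard" normalizer does, here's a rough Python approximation — lowercase followed by an NFKD-based fold. It's only an approximation: Lucene's ASCIIFoldingFilter covers many more character mappings than Unicode decomposition alone.

```python
import unicodedata

def standard_normalize(text: str) -> str:
    """Approximate the 'standard' predefined normalizer: lowercase the text,
    then fold characters outside Basic Latin to an ASCII equivalent where one
    exists. (Lucene's ASCIIFoldingFilter handles more cases than NFKD.)"""
    lowered = text.lower()
    # Decompose accented characters into base character + combining mark,
    # then drop anything that doesn't encode as ASCII.
    decomposed = unicodedata.normalize("NFKD", lowered)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(standard_normalize("Über Café"))  # -> "uber cafe"
print(standard_normalize("LAS VEGAS"))  # -> "las vegas"
```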
+
+### Supported char filters
+
+Normalizers support two character filters that are identical to their counterparts in [custom analyzer character filters](index-add-custom-analyzers.md#CharFilter):
+
++ [mapping](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/charfilter/MappingCharFilter.html)
++ [pattern_replace](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceCharFilter.html)
+
+### Supported token filters
+
+The list below shows the token filters supported for normalizers and is a subset of the overall [token filters used in custom analyzers](index-add-custom-analyzers.md#TokenFilters).
+
++ [arabic_normalization](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/ar/ArabicNormalizationFilter.html)
++ [asciifolding](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html)
++ [cjk_width](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html)
++ [elision](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/util/ElisionFilter.html)
++ [german_normalization](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html)
++ [hindi_normalization](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/hi/HindiNormalizationFilter.html)
++ [indic_normalization](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/in/IndicNormalizationFilter.html)
++ [persian_normalization](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/fa/PersianNormalizationFilter.html)
++ [scandinavian_normalization](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.html)
++ [scandinavian_folding](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianFoldingFilter.html)
++ [sorani_normalization](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/ckb/SoraniNormalizationFilter.html)
++ [lowercase](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilter.html)
++ [uppercase](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/UpperCaseFilter.html)
+
 ## Add custom normalizers
 
 Custom normalizers are [defined within the index schema](/rest/api/searchservice/create-index). The definition includes a name, a type, one or more character filters and token filters. The character filters and token filters are the building blocks for a custom normalizer and responsible for the processing of the text. These filters are applied from left to right.
@@ -131,52 +170,15 @@ Custom normalizers are [defined within the index schema](/rest/api/searchservice
 
 Custom normalizers can be added during index creation or later by updating an existing one. Adding a custom normalizer to an existing index requires the "allowIndexDowntime" flag to be specified in [Update Index](/rest/api/searchservice/update-index) and will cause the index to be unavailable for a few seconds.
 
-## Normalizers reference
-
-### Predefined normalizers
-
-|**Name**|**Description and Options**|
-|-|-|
-|standard| Lowercases the text followed by asciifolding.|
-|lowercase| Transforms characters to lowercase.|
-|uppercase| Transforms characters to uppercase.|
-|asciifolding| Transforms characters that aren't in the Basic Latin Unicode block to their ASCII equivalent, if one exists. For example, changing à to a.|
-|elision| Removes elision from beginning of the tokens.|
-
-### Supported char filters
-
-Normalizers support two character filters that are identical to their counterparts in [custom analyzer character filters](index-add-custom-analyzers.md#CharFilter):
-
-+ [mapping](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/charfilter/MappingCharFilter.html)
-+ [pattern_replace](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceCharFilter.html)
-
-### Supported token filters
-
-The list below shows the token filters supported for normalizers and is a subset of the overall [token filters used in custom analyzers](index-add-custom-analyzers.md#TokenFilters).
-
-+ [arabic_normalization](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/ar/ArabicNormalizationFilter.html)
-+ [asciifolding](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html)
-+ [cjk_width](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html)
-+ [elision](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/util/ElisionFilter.html)
-+ [german_normalization](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html)
-+ [hindi_normalization](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/hi/HindiNormalizationFilter.html)
-+ [indic_normalization](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/in/IndicNormalizationFilter.html)
-+ [persian_normalization](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/fa/PersianNormalizationFilter.html)
-+ [scandinavian_normalization](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.html)
-+ [scandinavian_folding](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianFoldingFilter.html)
-+ [sorani_normalization](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/ckb/SoraniNormalizationFilter.html)
-+ [lowercase](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilter.html)
-+ [uppercase](https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/UpperCaseFilter.html)
-
 
 ## Custom normalizer example
 
 The example below illustrates a custom normalizer definition with corresponding character filters and token filters. Custom options for character filters and token filters are specified separately as named constructs, and then referenced in the normalizer definition as illustrated below.
 
-* A custom normalizer named "my_custom_normalizer" is defined in the "normalizers" section of the index definition.
++ A custom normalizer named "my_custom_normalizer" is defined in the "normalizers" section of the index definition.
 
-* The normalizer is composed of two character filters and three token filters: elision, lowercase, and customized asciifolding filter "my_asciifolding".
++ The normalizer is composed of two character filters and three token filters: elision, lowercase, and customized asciifolding filter "my_asciifolding".
 
-* The first character filter "map_dash" replaces all dashes with underscores while the second one "remove_whitespace" removes all spaces.
++ The first character filter "map_dash" replaces all dashes with underscores while the second one "remove_whitespace" removes all spaces.
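The filter chain these bullets describe — char filters first, then token filters — can be emulated as a toy Python function. This is a sketch rather than the Lucene behavior: the elision filter is omitted and asciifolding is approximated with Unicode decomposition.

```python
import unicodedata

def my_custom_normalizer_approx(text: str) -> str:
    """Toy approximation of the custom normalizer described above:
    char filters run first, then token filters, left to right."""
    # Char filter "map_dash": dashes become underscores.
    text = text.replace("-", "_")
    # Char filter "remove_whitespace": strip all spaces.
    text = text.replace(" ", "")
    # Token filter "lowercase".
    text = text.lower()
    # Token filter "my_asciifolding", approximated via NFKD decomposition
    # (the real ASCIIFoldingFilter maps more characters than this).
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

print(my_custom_normalizer_approx("Las - Vegas"))  # -> "las_vegas"
```

Note that the order matters: because char filters run before token filters, the dash is already an underscore by the time lowercasing and folding happen.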
 
 ```json
 {
@@ -235,6 +237,8 @@ The example below illustrates a custom normalizer definition with corresponding
 
 ## See also
 
++ [Querying concepts in Azure Cognitive Search](search-query-overview.md)
+
 + [Analyzers for linguistic and text processing](search-analyzers.md)
 
-+ [Search Documents REST API](/rest/api/searchservice/search-documents)
++ [Search Documents REST API](/rest/api/searchservice/search-documents)
