Commit fd55bfc

lukas-vlcek authored and oeyh committed
Add documentation for Normalizers (opensearch-project#6415)
Improve the documentation for Analyzers (some terminology was not correct) also fix some broken links along the way.

Note: this commit is a spin-off of opensearch-project#6252

Signed-off-by: Lukáš Vlček <[email protected]>
1 parent 88b1e8c commit fd55bfc

5 files changed: +144 -23 lines changed

_analyzers/index.md

Lines changed: 15 additions & 5 deletions
@@ -15,16 +15,24 @@ redirect_from:
1515

1616
# Text analysis
1717

18-
When you are searching documents using a full-text search, you want to receive all relevant results and not only exact matches. If you're looking for "walk", you're interested in results that contain any form of the word, like "Walk", "walked", or "walking." To facilitate full-text search, OpenSearch uses text analysis.
18+
When you are searching documents using a full-text search, you want to receive all relevant results. If you're looking for "walk", you're interested in results that contain any form of the word, like "Walk", "walked", or "walking". To facilitate full-text search, OpenSearch uses text analysis.
1919

20-
Text analysis consists of the following steps:
20+
The objective of text analysis is to split the unstructured free text content of the source document into a sequence of terms, which are then stored in an inverted index. Subsequently, when a similar text analysis is applied to a user's query, the resulting sequence of terms facilitates the matching of relevant source documents.
2121

22-
1. _Tokenize_ text into terms: For example, after tokenization, the phrase `Actions speak louder than words` is split into tokens `Actions`, `speak`, `louder`, `than`, and `words`.
23-
1. _Normalize_ the terms by converting them into a standard format, for example, converting them to lowercase or performing stemming (reducing the word to its root): For example, after normalization, `Actions` becomes `action`, `louder` becomes `loud`, and `words` becomes `word`.
22+
From a technical point of view, the text analysis process consists of several steps, some of which are optional:
23+
24+
1. Before the free text content can be split into individual words, it may be beneficial to refine the text at the character level. The primary aim of this optional step is to help the tokenizer (the subsequent stage in the analysis process) generate better tokens. This can include removal of markup tags (such as HTML) or handling specific character patterns (like replacing the &#x1F642; emoji with the text `:slightly_smiling_face:`).
25+
26+
2. The next step is to split the free text into individual words---_tokens_. This is performed by a _tokenizer_. For example, after tokenization, the sentence `Actions speak louder than words` is split into tokens `Actions`, `speak`, `louder`, `than`, and `words`.
27+
28+
3. The last step is to process individual tokens by applying a series of token filters. The aim is to convert each token into a predictable form that is directly stored in the index, for example, by converting them to lowercase or performing stemming (reducing the word to its root). For example, the token `Actions` becomes `action`, `louder` becomes `loud`, and `words` becomes `word`.
29+
30+
Although the terms ***token*** and ***term*** may sound similar and are occasionally used interchangeably, it is helpful to understand the difference between the two. In the context of Apache Lucene, each holds a distinct role. A ***token*** is created by a tokenizer during text analysis and often undergoes a number of additional modifications as it passes through the chain of token filters. Each token is associated with metadata that can be further used during the text analysis process. A ***term*** is a data value that is directly stored in the inverted index and is associated with much less metadata. During search, matching operates at the term level.
31+
{: .note}
2432

2533
## Analyzers
2634

27-
In OpenSearch, text analysis is performed by an _analyzer_. Each analyzer contains the following sequentially applied components:
35+
In OpenSearch, the abstraction that encompasses text analysis is referred to as an _analyzer_. Each analyzer contains the following sequentially applied components:
2836

2937
1. **Character filters**: First, a character filter receives the original text as a stream of characters and adds, removes, or modifies characters in the text. For example, a character filter can strip HTML characters from a string so that the text `<p><b>Actions</b> speak louder than <em>words</em></p>` becomes `\nActions speak louder than words\n`. The output of a character filter is a stream of characters.
3038

@@ -35,6 +43,8 @@ In OpenSearch, text analysis is performed by an _analyzer_. Each analyzer contai
3543
An analyzer must contain exactly one tokenizer and may contain zero or more character filters and zero or more token filters.
3644
{: .note}
3745

46+
There is also a special type of analyzer called a ***normalizer***. A normalizer is similar to an analyzer except that it does not contain a tokenizer and can only include specific types of character filters and token filters. These filters can perform only character-level operations, such as character or pattern replacement, and cannot perform operations on the token as a whole. This means that replacing a token with a synonym or stemming is not supported. See [Normalizers]({{site.url}}{{site.baseurl}}/analyzers/normalizers/) for further details.
47+
3848
## Built-in analyzers
3949

4050
The following table lists the built-in analyzers that OpenSearch provides. The last column of the table contains the result of applying the analyzer to the string `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`.
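
The three analysis steps described in the `_analyzers/index.md` changes above (character filtering, tokenization, token filtering) and the role of the inverted index can be illustrated with a small Python sketch. This is not how OpenSearch or Lucene implement analysis: every function name here is hypothetical, the tokenizer is a plain regular expression, and `toy_stem` is a deliberately crude suffix stripper that only handles the words from the example sentence.

```python
import re
from collections import defaultdict


def strip_html(text: str) -> str:
    """Character filter (toy): replace markup tags with spaces."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text: str) -> list[str]:
    """Tokenizer (toy): split the character stream into word tokens."""
    return re.findall(r"\w+", text)


def toy_stem(token: str) -> str:
    """Token filter (toy): strip one common suffix; NOT a real stemmer."""
    for suffix in ("er", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def analyze(text: str) -> list[str]:
    """Run the three steps in order and return the resulting terms."""
    tokens = tokenize(strip_html(text))
    return [toy_stem(token.lower()) for token in tokens]


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Store each term in an inverted index: term -> IDs of matching docs."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Analyze the query the same way, then match at the term level."""
    hits: set[int] = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits


docs = {
    1: "<p><b>Actions</b> speak louder than <em>words</em></p>",
    2: "Loud music",
}
index = build_index(docs)

print(analyze(docs[1]))        # ['action', 'speak', 'loud', 'than', 'word']
print(search(index, "WORDS"))  # {1}
```

Because the query passes through the same `analyze` function as the documents, `WORDS` is reduced to the term `word` and matches document 1 even though that exact string never appears in it.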

_analyzers/normalizers.md

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
1+
---
2+
layout: default
3+
title: Normalizers
4+
nav_order: 100
5+
---
6+
7+
# Normalizers
8+
9+
A _normalizer_ functions similarly to an analyzer but outputs only a single token. It does not contain a tokenizer and can only include specific types of character and token filters. These filters can perform only character-level operations, such as character or pattern replacement, and cannot operate on the token as a whole. This means that replacing a token with a synonym or stemming is not supported.
10+
11+
A normalizer is useful in keyword search (that is, in term-based queries) because it allows you to run token and character filters on any given input. For instance, it makes it possible to match an incoming query `Naïve` with the index term `naive`.
12+
13+
Consider the following example.
14+
15+
Create a new index with a custom normalizer:
16+
```json
17+
PUT /sample-index
18+
{
19+
"settings": {
20+
"analysis": {
21+
"normalizer": {
22+
"normalized_keyword": {
23+
"type": "custom",
24+
"char_filter": [],
25+
"filter": [ "asciifolding", "lowercase" ]
26+
}
27+
}
28+
}
29+
},
30+
"mappings": {
31+
"properties": {
32+
"approach": {
33+
"type": "keyword",
34+
"normalizer": "normalized_keyword"
35+
}
36+
}
37+
}
38+
}
39+
```
40+
{% include copy-curl.html %}
41+
42+
Index a document:
43+
```json
44+
POST /sample-index/_doc/
45+
{
46+
"approach": "naive"
47+
}
48+
```
49+
{% include copy-curl.html %}
50+
51+
The following query matches the document. This is expected:
52+
```json
53+
GET /sample-index/_search
54+
{
55+
"query": {
56+
"term": {
57+
"approach": "naive"
58+
}
59+
}
60+
}
61+
```
62+
{% include copy-curl.html %}
63+
64+
But this query matches the document as well:
65+
```json
66+
GET /sample-index/_search
67+
{
68+
"query": {
69+
"term": {
70+
"approach": "Naïve"
71+
}
72+
}
73+
}
74+
```
75+
{% include copy-curl.html %}
76+
77+
To understand why, consider the effect of the normalizer:
78+
```json
79+
GET /sample-index/_analyze
80+
{
81+
"normalizer" : "normalized_keyword",
82+
"text" : "Naïve"
83+
}
84+
```
85+
86+
Internally, a normalizer accepts only filters that are instances of either `NormalizingTokenFilterFactory` or `NormalizingCharFilterFactory`. The following is a list of compatible filters found in modules and plugins that are part of the core OpenSearch repository.
87+
88+
### The `common-analysis` module
89+
90+
This module does not require installation; it is available by default.
91+
92+
Character filters: `pattern_replace`, `mapping`
93+
94+
Token filters: `arabic_normalization`, `asciifolding`, `bengali_normalization`, `cjk_width`, `decimal_digit`, `elision`, `german_normalization`, `hindi_normalization`, `indic_normalization`, `lowercase`, `persian_normalization`, `scandinavian_folding`, `scandinavian_normalization`, `serbian_normalization`, `sorani_normalization`, `trim`, `uppercase`
95+
96+
### The `analysis-icu` plugin
97+
98+
Character filters: `icu_normalizer`
99+
100+
Token filters: `icu_normalizer`, `icu_folding`, `icu_transform`
101+
102+
### The `analysis-kuromoji` plugin
103+
104+
Character filters: `normalize_kanji`, `normalize_kana`
105+
106+
### The `analysis-nori` plugin
107+
108+
Character filters: `normalize_kanji`, `normalize_kana`
109+
110+
These lists of filters include only analysis components found in the [additional plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/#additional-plugins) that are part of the core OpenSearch repository.
111+
{: .note}
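
To make the effect of the `normalized_keyword` normalizer from `_analyzers/normalizers.md` more concrete, the following Python sketch approximates its two token filters. It is an assumption-laden stand-in, not OpenSearch code: `ascii_fold` uses Unicode decomposition to drop diacritics, which only roughly mirrors what `asciifolding` does, and both function names are hypothetical.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Rough stand-in for `asciifolding`: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_keyword(value: str) -> str:
    """Apply the filters in the configured order: asciifolding, then lowercase.

    Unlike an analyzer, no tokenizer runs, so the whole input stays one token.
    """
    return ascii_fold(value).lower()


print(normalize_keyword("Naïve"))  # naive
```

Under these assumptions, the indexed value `naive` and the query value `Naïve` both normalize to the same single term, which is why the second `term` query in the example matches the document.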

_field-types/supported-field-types/keyword.md

Lines changed: 1 addition & 1 deletion
@@ -52,7 +52,7 @@ Parameter | Description
5252
`index` | A Boolean value that specifies whether the field should be searchable. Default is `true`. To reduce disk space, set `index` to `false`.
5353
`index_options` | Information to be stored in the index that will be considered when calculating relevance scores. Can be set to `freqs` for term frequency. Default is `docs`.
5454
`meta` | Accepts metadata for this field.
55-
`normalizer` | Specifies how to preprocess this field before indexing (for example, make it lowercase). Default is `null` (no preprocessing).
55+
[`normalizer`]({{site.url}}{{site.baseurl}}/analyzers/normalizers/) | Specifies how to preprocess this field before indexing (for example, make it lowercase). Default is `null` (no preprocessing).
5656
`norms` | A Boolean value that specifies whether the field length should be used when calculating relevance scores. Default is `false`.
5757
[`null_value`]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/index#null-value) | A value to be used in place of `null`. Must be of the same type as the field. If this parameter is not specified, the field is treated as missing when its value is `null`. Default is `null`.
5858
`similarity` | The ranking algorithm for calculating relevance scores. Default is `BM25`.

_install-and-configure/plugins.md

Lines changed: 16 additions & 16 deletions
@@ -31,12 +31,12 @@ If you are running OpenSearch in a Docker container, plugins must be installed,
3131

3232
Use `list` to see a list of plugins that have already been installed.
3333

34-
#### Usage:
34+
#### Usage
3535
```bash
3636
bin/opensearch-plugin list
3737
```
3838

39-
#### Example:
39+
#### Example
4040
```bash
4141
$ ./opensearch-plugin list
4242
opensearch-alerting
@@ -84,20 +84,20 @@ opensearch-node1 opensearch-notifications-core 2.0.1.0
8484

8585
There are three ways to install plugins using the `opensearch-plugin`:
8686

87-
- [Install a plugin by name]({{site.url}}{{site.baseurl}}/opensearch/install/plugins#install-a-plugin-by-name)
88-
- [Install a plugin by from a zip file]({{site.url}}{{site.baseurl}}/opensearch/install/plugins#install-a-plugin-from-a-zip-file)
89-
- [Install a plugin using Maven coordinates]({{site.url}}{{site.baseurl}}/opensearch/install/plugins#install-a-plugin-using-maven-coordinates)
87+
- [Install a plugin by name](#install-a-plugin-by-name).
88+
- [Install a plugin from a ZIP file](#install-a-plugin-from-a-zip-file).
89+
- [Install a plugin using Maven coordinates](#install-a-plugin-using-maven-coordinates).
9090

9191
### Install a plugin by name:
9292

93-
For a list of plugins that can be installed by name, see [Additional plugins]({{site.url}}{{site.baseurl}}/opensearch/install/plugins#additional-plugins).
93+
For a list of plugins that can be installed by name, see [Additional plugins](#additional-plugins).
9494

95-
#### Usage:
95+
#### Usage
9696
```bash
9797
bin/opensearch-plugin install <plugin-name>
9898
```
9999

100-
#### Example:
100+
#### Example
101101
```bash
102102
$ sudo ./opensearch-plugin install analysis-icu
103103
-> Installing analysis-icu
@@ -106,16 +106,16 @@ $ sudo ./opensearch-plugin install analysis-icu
106106
-> Installed analysis-icu with folder name analysis-icu
107107
```
108108

109-
### Install a plugin from a zip file:
109+
### Install a plugin from a zip file
110110

111111
Remote zip files can be installed by replacing `<zip-file>` with the URL of the hosted file. The tool only supports downloading over HTTP/HTTPS protocols. For local zip files, replace `<zip-file>` with `file:` followed by the absolute or relative path to the plugin zip file as in the second example below.
112112

113-
#### Usage:
113+
#### Usage
114114
```bash
115115
bin/opensearch-plugin install <zip-file>
116116
```
117117

118-
#### Example:
118+
#### Example
119119
```bash
120120
# Zip file is hosted on a remote server - in this case, Maven central repository.
121121
$ sudo ./opensearch-plugin install https://repo1.maven.org/maven2/org/opensearch/plugin/opensearch-anomaly-detection/2.2.0.0/opensearch-anomaly-detection-2.2.0.0.zip
@@ -166,16 +166,16 @@ Continue with installation? [y/N]y
166166
-> Installed opensearch-anomaly-detection with folder name opensearch-anomaly-detection
167167
```
168168

169-
### Install a plugin using Maven coordinates:
169+
### Install a plugin using Maven coordinates
170170

171171
The `opensearch-plugin install` tool also accepts Maven coordinates for available artifacts and versions hosted on [Maven Central](https://search.maven.org/search?q=org.opensearch.plugin). `opensearch-plugin` will parse the Maven coordinates you provide and construct a URL. As a result, the host must be able to connect directly to [Maven Central](https://search.maven.org/search?q=org.opensearch.plugin). The plugin installation will fail if you pass coordinates to a proxy or local repository.
172172

173-
#### Usage:
173+
#### Usage
174174
```bash
175175
bin/opensearch-plugin install <groupId>:<artifactId>:<version>
176176
```
177177

178-
#### Example:
178+
#### Example
179179
```bash
180180
$ sudo ./opensearch-plugin install org.opensearch.plugin:opensearch-anomaly-detection:2.2.0.0
181181
-> Installing org.opensearch.plugin:opensearch-anomaly-detection:2.2.0.0
@@ -222,12 +222,12 @@ $ sudo $ ./opensearch-plugin install analysis-nori repository-s3
222222

223223
You can remove a plugin that has already been installed with the `remove` option.
224224

225-
#### Usage:
225+
#### Usage
226226
```bash
227227
bin/opensearch-plugin remove <plugin-name>
228228
```
229229

230-
#### Example:
230+
#### Example
231231
```bash
232232
$ sudo $ ./opensearch-plugin remove opensearch-anomaly-detection
233233
-> removing [opensearch-anomaly-detection]...

_query-dsl/term-vs-full-text.md

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ redirect_from:
88

99
# Term-level and full-text queries compared
1010

11-
You can use both term-level and full-text queries to search text, but while term-level queries are usually used to search structured data, full-text queries are used for full-text search. The main difference between term-level and full-text queries is that term-level queries search documents for an exact specified term, while full-text queries analyze the query string. The following table summarizes the differences between term-level and full-text queries.
11+
You can use both term-level and full-text queries to search text, but while term-level queries are usually used to search structured data, full-text queries are used for full-text search. The main difference between term-level and full-text queries is that term-level queries search documents for an exact specified term, while full-text queries [analyze]({{site.url}}{{site.baseurl}}/analyzers/) the query string. The following table summarizes the differences between term-level and full-text queries.
1212

1313
| | Term-level queries | Full-text queries
1414
:--- | :--- | :---
