Add documentation for Normalizers (opensearch-project#6415)

lukas-vlcek · oeyh · commit fd55bfcdcf39 · 2024-03-14T12:40:58.000-05:00
Improve the documentation for Analyzers (some terminology was not correct) and fix some broken links along the way. Note: this commit is a spin-off of opensearch-project#6252. Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
diff --git a/_analyzers/index.md b/_analyzers/index.md
@@ -15,16 +15,24 @@ redirect_from:
 
 # Text analysis
 
-When you are searching documents using a full-text search, you want to receive all relevant results and not only exact matches. If you're looking for "walk", you're interested in results that contain any form of the word, like "Walk", "walked", or "walking." To facilitate full-text search, OpenSearch uses text analysis.
+When you are searching documents using a full-text search, you want to receive all relevant results. If you're looking for "walk", you're interested in results that contain any form of the word, like "Walk", "walked", or "walking". To facilitate full-text search, OpenSearch uses text analysis.
 
-Text analysis consists of the following steps:
+The objective of text analysis is to split the unstructured free text content of the source document into a sequence of terms, which are then stored in an inverted index. Subsequently, when a similar text analysis is applied to a user's query, the resulting sequence of terms facilitates the matching of relevant source documents.
 
-1. _Tokenize_ text into terms: For example, after tokenization, the phrase `Actions speak louder than words` is split into tokens `Actions`, `speak`, `louder`, `than`, and `words`.
-1. _Normalize_ the terms by converting them into a standard format, for example, converting them to lowercase or performing stemming (reducing the word to its root): For example, after normalization, `Actions` becomes `action`, `louder` becomes `loud`, and `words` becomes `word`.
+From a technical point of view, the text analysis process consists of several steps, some of which are optional:
+
+1. Before the free text content can be split into individual words, it may be beneficial to refine the text at the character level. The primary aim of this optional step is to help the tokenizer (the subsequent stage in the analysis process) generate better tokens. This can include removal of markup tags (such as HTML) or handling specific character patterns (like replacing the &#x1F642; emoji with the text `:slightly_smiling_face:`).
+
+2. The next step is to split the free text into individual words---_tokens_. This is performed by a _tokenizer_. For example, after tokenization, the sentence `Actions speak louder than words` is split into tokens `Actions`, `speak`, `louder`, `than`, and `words`.
+
+3. The last step is to process individual tokens by applying a series of token filters. The aim is to convert each token into a predictable form that is directly stored in the index, for example, by converting it to lowercase or performing stemming (reducing the word to its root). After this step, the token `Actions` becomes `action`, `louder` becomes `loud`, and `words` becomes `word`.
+
+Although the terms ***token*** and ***term*** may sound similar and are occasionally used interchangeably, it is helpful to understand the difference between the two. In the context of Apache Lucene, each holds a distinct role. A ***token*** is created by a tokenizer during text analysis and often undergoes a number of additional modifications as it passes through the chain of token filters. Each token is associated with metadata that can be further used during the text analysis process. A ***term*** is a data value that is directly stored in the inverted index and is associated with much less metadata. During search, matching operates at the term level.
+{: .note}
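+
+The three steps can be tried out with the `_analyze` API. The following request is a minimal sketch rather than a recommended configuration: the choice of the `html_strip` character filter, the `standard` tokenizer, and the `lowercase` and `porter_stem` token filters is an assumption made for illustration, and the exact stems returned depend on the stemmer used:
+```json
+GET /_analyze
+{
+  "char_filter": [ "html_strip" ],
+  "tokenizer": "standard",
+  "filter": [ "lowercase", "porter_stem" ],
+  "text": "<p>Actions speak louder than words</p>"
+}
+```
+{% include copy-curl.html %}
+
+The response lists one entry per token, including metadata such as its position and character offsets.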
 
 ## Analyzers
 
-In OpenSearch, text analysis is performed by an _analyzer_. Each analyzer contains the following sequentially applied components:
+In OpenSearch, the abstraction that encompasses text analysis is referred to as an _analyzer_. Each analyzer contains the following sequentially applied components:
 
 1. **Character filters**: First, a character filter receives the original text as a stream of characters and adds, removes, or modifies characters in the text. For example, a character filter can strip HTML characters from a string so that the text `<p><b>Actions</b> speak louder than <em>words</em></p>` becomes `\nActions speak louder than words\n`. The output of a character filter is a stream of characters.
 
@@ -35,6 +43,8 @@ In OpenSearch, text analysis is performed by an _analyzer_. Each analyzer contai
 An analyzer must contain exactly one tokenizer and may contain zero or more character filters and zero or more token filters.
 {: .note}
 
+There is also a special type of analyzer called a ***normalizer***. A normalizer is similar to an analyzer except that it does not contain a tokenizer and can only include specific types of character filters and token filters. These filters can perform only character-level operations, such as character or pattern replacement, and cannot perform operations on the token as a whole. This means that replacing a token with a synonym or stemming is not supported. See [Normalizers]({{site.url}}{{site.baseurl}}/analyzers/normalizers/) for further details.
+
 ## Built-in analyzers
 
 The following table lists the built-in analyzers that OpenSearch provides. The last column of the table contains the result of applying the analyzer to the string `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`.
diff --git a/_analyzers/normalizers.md b/_analyzers/normalizers.md
@@ -0,0 +1,111 @@
+---
+layout: default
+title: Normalizers
+nav_order: 100
+---
+
+# Normalizers
+
+A _normalizer_ functions similarly to an analyzer but outputs only a single token. It does not contain a tokenizer and can only include specific types of character and token filters. These filters can perform only character-level operations, such as character or pattern replacement, and cannot operate on the token as a whole. This means that replacing a token with a synonym or stemming is not supported.
+
+A normalizer is useful in keyword search (that is, in term-based queries) because it allows you to run token and character filters on any given input. For instance, it makes it possible to match an incoming query `Naïve` with the index term `naive`.
+
+Consider the following example.
+
+Create a new index with a custom normalizer:
+```json
+PUT /sample-index
+{
+  "settings": {
+    "analysis": {
+      "normalizer": {
+        "normalized_keyword": {
+          "type": "custom",
+          "char_filter": [],
+          "filter": [ "asciifolding", "lowercase" ]
+        }
+      }
+    }
+  },
+  "mappings": {
+    "properties": {
+      "approach": {
+        "type": "keyword",
+        "normalizer": "normalized_keyword"
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+Index a document:
+```json
+POST /sample-index/_doc/
+{
+  "approach": "naive"
+}
+```
+{% include copy-curl.html %}
+
+The following query matches the document. This is expected:
+```json
+GET /sample-index/_search
+{
+  "query": {
+    "term": {
+      "approach": "naive"
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+But this query matches the document as well:
+```json
+GET /sample-index/_search
+{
+  "query": {
+    "term": {
+      "approach": "Naïve"
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+To understand why, consider the effect of the normalizer:
+```json
+GET /sample-index/_analyze
+{
+  "normalizer" : "normalized_keyword",
+  "text" : "Naïve"
+}
+```
+{% include copy-curl.html %}
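+
+The same effect can be reproduced without an index. The sketch below assumes that an `_analyze` request that lists `filter` (or `char_filter`) entries but no `tokenizer` applies them as an ad hoc normalizer; under that assumption, the request should return the single token `naive`:
+```json
+GET /_analyze
+{
+  "filter" : [ "asciifolding", "lowercase" ],
+  "text" : "Naïve"
+}
+```
+{% include copy-curl.html %}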
+
+Internally, a normalizer accepts only filters that are instances of either `NormalizingTokenFilterFactory` or `NormalizingCharFilterFactory`. The following is a list of compatible filters found in modules and plugins that are part of the core OpenSearch repository.
+
+### The `common-analysis` module
+
+This module does not require installation; it is available by default.
+
+Character filters: `pattern_replace`, `mapping`
+
+Token filters: `arabic_normalization`, `asciifolding`, `bengali_normalization`, `cjk_width`, `decimal_digit`, `elision`, `german_normalization`, `hindi_normalization`, `indic_normalization`, `lowercase`, `persian_normalization`, `scandinavian_folding`, `scandinavian_normalization`, `serbian_normalization`, `sorani_normalization`, `trim`, `uppercase`
+
+### The `analysis-icu` plugin
+
+Character filters: `icu_normalizer`
+
+Token filters: `icu_normalizer`, `icu_folding`, `icu_transform`
+
+### The `analysis-kuromoji` plugin
+
+Character filters: `normalize_kanji`, `normalize_kana`
+
+### The `analysis-nori` plugin
+
+Character filters: `normalize_kanji`, `normalize_kana`
+
+These lists of filters include only analysis components found in the [additional plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/#additional-plugins) that are part of the core OpenSearch repository.
+{: .note}
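+
+As an illustration of combining a compatible character filter with compatible token filters, the following sketch defines a normalizer that uses a `pattern_replace` character filter together with the `trim` and `lowercase` token filters. The index name `example-index`, the character filter name `example_strip_hyphens`, the normalizer name `example_code_normalizer`, and the `code` field are all hypothetical:
+```json
+PUT /example-index
+{
+  "settings": {
+    "analysis": {
+      "char_filter": {
+        "example_strip_hyphens": {
+          "type": "pattern_replace",
+          "pattern": "-",
+          "replacement": ""
+        }
+      },
+      "normalizer": {
+        "example_code_normalizer": {
+          "type": "custom",
+          "char_filter": [ "example_strip_hyphens" ],
+          "filter": [ "trim", "lowercase" ]
+        }
+      }
+    }
+  },
+  "mappings": {
+    "properties": {
+      "code": {
+        "type": "keyword",
+        "normalizer": "example_code_normalizer"
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}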
diff --git a/_field-types/supported-field-types/keyword.md b/_field-types/supported-field-types/keyword.md
@@ -52,7 +52,7 @@ Parameter | Description
 `index` | A Boolean value that specifies whether the field should be searchable. Default is `true`. To reduce disk space, set `index` to `false`.
 `index_options` | Information to be stored in the index that will be considered when calculating relevance scores. Can be set to `freqs` for term frequency. Default is `docs`.
 `meta` | Accepts metadata for this field.
-`normalizer` | Specifies how to preprocess this field before indexing (for example, make it lowercase). Default is `null` (no preprocessing).
+[`normalizer`]({{site.url}}{{site.baseurl}}/analyzers/normalizers/) | Specifies how to preprocess this field before indexing (for example, make it lowercase). Default is `null` (no preprocessing).
 `norms` | A Boolean value that specifies whether the field length should be used when calculating relevance scores. Default is `false`.
 [`null_value`]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/index#null-value) | A value to be used in place of `null`. Must be of the same type as the field. If this parameter is not specified, the field is treated as missing when its value is `null`. Default is `null`.
 `similarity` | The ranking algorithm for calculating relevance scores. Default is `BM25`. 
diff --git a/_install-and-configure/plugins.md b/_install-and-configure/plugins.md
@@ -31,12 +31,12 @@ If you are running OpenSearch in a Docker container, plugins must be installed,
 
 Use `list` to see a list of plugins that have already been installed.
 
-#### Usage:
+#### Usage
 ```bash
 bin/opensearch-plugin list
 ```
 
-#### Example:
+#### Example
 ```bash
 $ ./opensearch-plugin list
 opensearch-alerting
@@ -84,20 +84,20 @@ opensearch-node1 opensearch-notifications-core        2.0.1.0
 
 There are three ways to install plugins using the `opensearch-plugin`:
 
-- [Install a plugin by name]({{site.url}}{{site.baseurl}}/opensearch/install/plugins#install-a-plugin-by-name)
-- [Install a plugin by from a zip file]({{site.url}}{{site.baseurl}}/opensearch/install/plugins#install-a-plugin-from-a-zip-file)
-- [Install a plugin using Maven coordinates]({{site.url}}{{site.baseurl}}/opensearch/install/plugins#install-a-plugin-using-maven-coordinates)
+- [Install a plugin by name](#install-a-plugin-by-name).
+- [Install a plugin from a ZIP file](#install-a-plugin-from-a-zip-file).
+- [Install a plugin using Maven coordinates](#install-a-plugin-using-maven-coordinates).
 
 ### Install a plugin by name:
 
-For a list of plugins that can be installed by name, see [Additional plugins]({{site.url}}{{site.baseurl}}/opensearch/install/plugins#additional-plugins).
+For a list of plugins that can be installed by name, see [Additional plugins](#additional-plugins).
 
-#### Usage:
+#### Usage
 ```bash
 bin/opensearch-plugin install <plugin-name>
 ```
 
-#### Example:
+#### Example
 ```bash
 $ sudo ./opensearch-plugin install analysis-icu
 -> Installing analysis-icu
@@ -106,16 +106,16 @@ $ sudo ./opensearch-plugin install analysis-icu
 -> Installed analysis-icu with folder name analysis-icu
 ```
 
-### Install a plugin from a zip file:
+### Install a plugin from a zip file
 
 Remote zip files can be installed by replacing `<zip-file>` with the URL of the hosted file. The tool only supports downloading over HTTP/HTTPS protocols. For local zip files, replace `<zip-file>` with `file:` followed by the absolute or relative path to the plugin zip file as in the second example below.
 
-#### Usage:
+#### Usage
 ```bash
 bin/opensearch-plugin install <zip-file>
 ```
 
-#### Example:
+#### Example
 ```bash
 # Zip file is hosted on a remote server - in this case, Maven central repository.
 $ sudo ./opensearch-plugin install https://repo1.maven.org/maven2/org/opensearch/plugin/opensearch-anomaly-detection/2.2.0.0/opensearch-anomaly-detection-2.2.0.0.zip
@@ -166,16 +166,16 @@ Continue with installation? [y/N]y
 -> Installed opensearch-anomaly-detection with folder name opensearch-anomaly-detection
 ```
 
-### Install a plugin using Maven coordinates:
+### Install a plugin using Maven coordinates
 
 The `opensearch-plugin install` tool also accepts Maven coordinates for available artifacts and versions hosted on [Maven Central](https://search.maven.org/search?q=org.opensearch.plugin). `opensearch-plugin` will parse the Maven coordinates you provide and construct a URL. As a result, the host must be able to connect directly to [Maven Central](https://search.maven.org/search?q=org.opensearch.plugin). The plugin installation will fail if you pass coordinates to a proxy or local repository.
 
-#### Usage:
+#### Usage
 ```bash
 bin/opensearch-plugin install <groupId>:<artifactId>:<version>
 ```
 
-#### Example:
+#### Example
 ```bash
 $ sudo ./opensearch-plugin install org.opensearch.plugin:opensearch-anomaly-detection:2.2.0.0
 -> Installing org.opensearch.plugin:opensearch-anomaly-detection:2.2.0.0
@@ -222,12 +222,12 @@ $ sudo $ ./opensearch-plugin install analysis-nori repository-s3
 
 You can remove a plugin that has already been installed with the `remove` option. 
 
-#### Usage:
+#### Usage
 ```bash
 bin/opensearch-plugin remove <plugin-name>
 ```
 
-#### Example:
+#### Example
 ```bash
 $ sudo ./opensearch-plugin remove opensearch-anomaly-detection
 -> removing [opensearch-anomaly-detection]...
diff --git a/_query-dsl/term-vs-full-text.md b/_query-dsl/term-vs-full-text.md
@@ -8,7 +8,7 @@ redirect_from:
 
 # Term-level and full-text queries compared
 
-You can use both term-level and full-text queries to search text, but while term-level queries are usually used to search structured data, full-text queries are used for full-text search. The main difference between term-level and full-text queries is that term-level queries search documents for an exact specified term, while full-text queries analyze the query string. The following table summarizes the differences between term-level and full-text queries.
+You can use both term-level and full-text queries to search text, but while term-level queries are usually used to search structured data, full-text queries are used for full-text search. The main difference between term-level and full-text queries is that term-level queries search documents for an exact specified term, while full-text queries [analyze]({{site.url}}{{site.baseurl}}/analyzers/) the query string. The following table summarizes the differences between term-level and full-text queries.
 
 | | Term-level queries | Full-text queries
 :--- | :--- | :---