
Commit aeed788

Add documentation for node analyzer components
This commit introduces documentation for a new extension of the Nodes API that exposes the names of the analysis components available on cluster node(s). This commit also contains additional changes:

- It makes a strict distinction between the terms "token" and "term".
- It replaces the term "Normalize" in the analysis part of the documentation because it has a special meaning in this context.
- It introduces a dedicated page for normalization (which is a specific type of analysis).

This commit is part of PR OpenSearch/#10296

Signed-off-by: Lukáš Vlček <[email protected]>
1 parent d41ccb8 commit aeed788

5 files changed (+429, -7 lines changed)


_analyzers/index.md

Lines changed: 326 additions & 5 deletions
@@ -15,16 +15,24 @@ redirect_from:
 
 # Text analysis
 
-When you are searching documents using a full-text search, you want to receive all relevant results and not only exact matches. If you're looking for "walk", you're interested in results that contain any form of the word, like "Walk", "walked", or "walking." To facilitate full-text search, OpenSearch uses text analysis.
+When you are searching documents using a full-text search, you want to receive all relevant results. If you're looking for "walk", you're interested in results that contain any form of the word, like "Walk", "walked", or "walking". To facilitate full-text search, OpenSearch uses text analysis.
 
-Text analysis consists of the following steps:
+The objective of text analysis is to break down the unstructured free-text content of the source document into a sequence of terms, which are then stored in an inverted index. When a similar text analysis is subsequently applied to a user's query, the resulting sequence of terms facilitates matching the relevant source documents.
 
-1. _Tokenize_ text into terms: For example, after tokenization, the phrase `Actions speak louder than words` is split into tokens `Actions`, `speak`, `louder`, `than`, and `words`.
-1. _Normalize_ the terms by converting them into a standard format, for example, converting them to lowercase or performing stemming (reducing the word to its root): For example, after normalization, `Actions` becomes `action`, `louder` becomes `loud`, and `words` becomes `word`.
+From a technical point of view, the text analysis process consists of several steps, some of which are optional:
+
+1. Before the free-text content can be broken down into individual words, it may be beneficial to process it at the character level. The primary aim of this optional step is to help the tokenizer (the subsequent stage of the analysis process) generate better tokens. This can include removing markup tags (such as HTML) or handling specific character patterns (for example, replacing the emoji &#x1F642; with the text `:slightly_smiling_face:`).
+
+2. The next step is to split the free-text content into individual words, that is, tokens. This is the job of the tokenizer. For example, after tokenization, the sentence `Actions speak louder than words` is split into the tokens `Actions`, `speak`, `louder`, `than`, and `words`.
+
+3. The last step is to process individual tokens with a series of token filters. The aim is to convert tokens into a predictable form that is stored directly in the index, for example, by converting them to lowercase or performing stemming (reducing the word to its root). For example, the token `Actions` becomes `action`, `louder` becomes `loud`, and `words` becomes `word`.
+
+Although the terms "**token**" and "**term**" may sound similar and are occasionally used interchangeably, it is helpful to understand the difference between the two. In the context of Apache Lucene®, each has a distinct role. A **token** is created by the tokenizer during the text analysis process and often undergoes a number of additional modifications as it passes through the chain of token filters. Each token is associated with metadata that can be used later in the text analysis process. A **term**, on the other hand, is a data value that is stored directly in the inverted index and is associated with much less metadata. During a search, the matching process operates at the term level.
+{: .note}
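
As a quick way to see the steps above in action, the `_analyze` API (the standard analysis-testing endpoint, not part of this commit's diff) can run an ad hoc chain of components against sample text. The sketch below is only illustrative: it combines the built-in `html_strip` character filter, `standard` tokenizer, and `lowercase` token filter, and the response lists each resulting token together with its position and offsets.

```json
GET /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p><b>Actions</b> speak louder than <em>words</em></p>"
}
```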

 ## Analyzers
 
-In OpenSearch, text analysis is performed by an _analyzer_. Each analyzer contains the following sequentially applied components:
+In OpenSearch, the abstraction that encompasses text analysis is referred to as an _analyzer_. Each analyzer contains the following sequentially applied components:
 
 1. **Character filters**: First, a character filter receives the original text as a stream of characters and adds, removes, or modifies characters in the text. For example, a character filter can strip HTML characters from a string so that the text `<p><b>Actions</b> speak louder than <em>words</em></p>` becomes `\nActions speak louder than words\n`. The output of a character filter is a stream of characters.
 
@@ -35,6 +43,8 @@ In OpenSearch, text analysis is performed by an _analyzer_. Each analyzer contai
 An analyzer must contain exactly one tokenizer and may contain zero or more character filters and zero or more token filters.
 {: .note}
 
+There is also a special type of analyzer called a **normalizer**. A normalizer is similar to an analyzer except that it does not contain a tokenizer and can only include specific types of character filters and token filters. These filters can perform only character-level operations, such as character or pattern replacement, and cannot operate on the token as a whole, which means that replacing a token with a synonym or stemming is not supported. See [Normalizers]({{site.url}}{{site.baseurl}}/analyzers/normalizers/) for further details.
+
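
To make the normalizer concept concrete, the following is a minimal sketch of how a normalizer is typically declared in index settings and attached to a `keyword` field. The index name `products`, field `sku`, and normalizer name `lowercase_ascii` are made up for the example; only built-in character-level filters are used.

```json
PUT /products
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_ascii": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "sku": {
        "type": "keyword",
        "normalizer": "lowercase_ascii"
      }
    }
  }
}
```

Because no tokenizer is involved, the whole field value is kept as a single term after these character-level transformations.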
 ## Built-in analyzers
 
 The following table lists the built-in analyzers that OpenSearch provides. The last column of the table contains the result of applying the analyzer to the string `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`.
@@ -54,6 +64,317 @@ Analyzer | Analysis performed | Analyzer output
 
 If needed, you can combine tokenizers, token filters, and character filters to create a custom analyzer.
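
As an illustrative sketch of such a combination (the index name `blog-posts`, analyzer name `html_english`, and field `body` are made up; the referenced components are built in), a custom analyzer is declared in index settings and then referenced from a `text` field mapping:

```json
PUT /blog-posts
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_english": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "porter_stem"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "analyzer": "html_english"
      }
    }
  }
}
```

With this definition, the `body` field is passed through the character filter, the tokenizer, and the token filters, in that order, at indexing time.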

+With the introduction of OpenSearch `v2.11.??`, you can retrieve a comprehensive list of all available text analysis components using the [Nodes Info]({{site.url}}{{site.baseurl}}/api-reference/nodes-apis/nodes-info/) API. This can be helpful when building custom analyzers, especially when you need to recall a component's name or identify the analysis plugin that the component is part of.
+
+Introduced 2.11.??
+{: .label .label-purple }
+
+```json
+GET /_nodes/analysis_components?pretty=true&filter_path=nodes.*.analysis_components
+```
+{% include copy-curl.html %}
+
+<details open markdown="block">
+  <summary>
+    Response
+  </summary>
+  {: .text-delta}
+
+The following is an example of a response from a node that has the `analysis-common` module (a module that is present by default).
+
+```json
+{
+  "nodes" : {
+    "cZidmv5kQbWQN8M8dz9f5g" : {
+      "analysis_components" : {
+        "analyzers" : [
+          "arabic",
+          "armenian",
+          "basque",
+          "bengali",
+          "brazilian",
+          "bulgarian",
+          "catalan",
+          "chinese",
+          "cjk",
+          "czech",
+          "danish",
+          "default",
+          "dutch",
+          "english",
+          "estonian",
+          "fingerprint",
+          "finnish",
+          "french",
+          "galician",
+          "german",
+          "greek",
+          "hindi",
+          "hungarian",
+          "indonesian",
+          "irish",
+          "italian",
+          "keyword",
+          "latvian",
+          "lithuanian",
+          "norwegian",
+          "pattern",
+          "persian",
+          "portuguese",
+          "romanian",
+          "russian",
+          "simple",
+          "snowball",
+          "sorani",
+          "spanish",
+          "standard",
+          "stop",
+          "swedish",
+          "thai",
+          "turkish",
+          "whitespace"
+        ],
+        "tokenizers" : [
+          "PathHierarchy",
+          "char_group",
+          "classic",
+          "edgeNGram",
+          "edge_ngram",
+          "keyword",
+          "letter",
+          "lowercase",
+          "nGram",
+          "ngram",
+          "path_hierarchy",
+          "pattern",
+          "simple_pattern",
+          "simple_pattern_split",
+          "standard",
+          "thai",
+          "uax_url_email",
+          "whitespace"
+        ],
+        "tokenFilters" : [
+          "apostrophe",
+          "arabic_normalization",
+          "arabic_stem",
+          "asciifolding",
+          "bengali_normalization",
+          "brazilian_stem",
+          "cjk_bigram",
+          "cjk_width",
+          "classic",
+          "common_grams",
+          "concatenate_graph",
+          "condition",
+          "czech_stem",
+          "decimal_digit",
+          "delimited_payload",
+          "delimited_term_freq",
+          "dictionary_decompounder",
+          "dutch_stem",
+          "edgeNGram",
+          "edge_ngram",
+          "elision",
+          "fingerprint",
+          "flatten_graph",
+          "french_stem",
+          "german_normalization",
+          "german_stem",
+          "hindi_normalization",
+          "hunspell",
+          "hyphenation_decompounder",
+          "indic_normalization",
+          "keep",
+          "keep_types",
+          "keyword_marker",
+          "kstem",
+          "length",
+          "limit",
+          "lowercase",
+          "min_hash",
+          "multiplexer",
+          "nGram",
+          "ngram",
+          "pattern_capture",
+          "pattern_replace",
+          "persian_normalization",
+          "porter_stem",
+          "predicate_token_filter",
+          "remove_duplicates",
+          "reverse",
+          "russian_stem",
+          "scandinavian_folding",
+          "scandinavian_normalization",
+          "serbian_normalization",
+          "shingle",
+          "snowball",
+          "sorani_normalization",
+          "standard",
+          "stemmer",
+          "stemmer_override",
+          "stop",
+          "synonym",
+          "synonym_graph",
+          "trim",
+          "truncate",
+          "unique",
+          "uppercase",
+          "word_delimiter",
+          "word_delimiter_graph"
+        ],
+        "charFilters" : [
+          "html_strip",
+          "mapping",
+          "pattern_replace"
+        ],
+        "normalizers" : [
+          "lowercase"
+        ],
+        "plugins" : [
+          {
+            "name" : "analysis-common",
+            "classname" : "org.opensearch.analysis.common.CommonAnalysisModulePlugin",
+            "analyzers" : [
+              "arabic",
+              "armenian",
+              "basque",
+              "bengali",
+              "brazilian",
+              "bulgarian",
+              "catalan",
+              "chinese",
+              "cjk",
+              "czech",
+              "danish",
+              "dutch",
+              "english",
+              "estonian",
+              "fingerprint",
+              "finnish",
+              "french",
+              "galician",
+              "german",
+              "greek",
+              "hindi",
+              "hungarian",
+              "indonesian",
+              "irish",
+              "italian",
+              "latvian",
+              "lithuanian",
+              "norwegian",
+              "pattern",
+              "persian",
+              "portuguese",
+              "romanian",
+              "russian",
+              "snowball",
+              "sorani",
+              "spanish",
+              "swedish",
+              "thai",
+              "turkish"
+            ],
+            "tokenizers" : [
+              "PathHierarchy",
+              "char_group",
+              "classic",
+              "edgeNGram",
+              "edge_ngram",
+              "keyword",
+              "letter",
+              "lowercase",
+              "nGram",
+              "ngram",
+              "path_hierarchy",
+              "pattern",
+              "simple_pattern",
+              "simple_pattern_split",
+              "thai",
+              "uax_url_email",
+              "whitespace"
+            ],
+            "tokenFilters" : [
+              "apostrophe",
+              "arabic_normalization",
+              "arabic_stem",
+              "asciifolding",
+              "bengali_normalization",
+              "brazilian_stem",
+              "cjk_bigram",
+              "cjk_width",
+              "classic",
+              "common_grams",
+              "concatenate_graph",
+              "condition",
+              "czech_stem",
+              "decimal_digit",
+              "delimited_payload",
+              "delimited_term_freq",
+              "dictionary_decompounder",
+              "dutch_stem",
+              "edgeNGram",
+              "edge_ngram",
+              "elision",
+              "fingerprint",
+              "flatten_graph",
+              "french_stem",
+              "german_normalization",
+              "german_stem",
+              "hindi_normalization",
+              "hyphenation_decompounder",
+              "indic_normalization",
+              "keep",
+              "keep_types",
+              "keyword_marker",
+              "kstem",
+              "length",
+              "limit",
+              "lowercase",
+              "min_hash",
+              "multiplexer",
+              "nGram",
+              "ngram",
+              "pattern_capture",
+              "pattern_replace",
+              "persian_normalization",
+              "porter_stem",
+              "predicate_token_filter",
+              "remove_duplicates",
+              "reverse",
+              "russian_stem",
+              "scandinavian_folding",
+              "scandinavian_normalization",
+              "serbian_normalization",
+              "snowball",
+              "sorani_normalization",
+              "stemmer",
+              "stemmer_override",
+              "synonym",
+              "synonym_graph",
+              "trim",
+              "truncate",
+              "unique",
+              "uppercase",
+              "word_delimiter",
+              "word_delimiter_graph"
+            ],
+            "charFilters" : [
+              "html_strip",
+              "mapping",
+              "pattern_replace"
+            ],
+            "hunspellDictionaries" : [ ]
+          }
+        ]
+      }
+    }
+  }
+}
+```
+</details>
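
Because the full listing is long, the `filter_path` parameter can be narrowed further. For example, a request along the following lines (a sketch against the same endpoint) would return only the token filter names reported by each node:

```json
GET /_nodes/analysis_components?filter_path=nodes.*.analysis_components.tokenFilters
```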
 
 ## Text analysis at indexing time and query time
 
 OpenSearch performs text analysis on text fields when you index a document and when you send a search request. Depending on the time of text analysis, the analyzers used for it are classified as follows:

0 commit comments
