This commit introduces documentation for a new extension of the Nodes API
that exposes the names of the analysis components available on cluster node(s).
This commit also contains the following additional changes:
- It makes a strict distinction between the terms "token" and "term".
- It replaces the term "normalize" in the analysis documentation because it has
a special meaning in this context.
- It introduces a dedicated page for normalization (which is a
specific type of analysis).
This commit is part of PR OpenSearch/#10296.
Signed-off-by: Lukáš Vlček <[email protected]>
`_analyzers/index.md`: 326 additions and 5 deletions
# Text analysis
When you are searching documents using a full-text search, you want to receive all relevant results. If you're looking for "walk", you're interested in results that contain any form of the word, like "Walk", "walked", or "walking". To facilitate full-text search, OpenSearch uses text analysis.
The objective of text analysis is to break down the unstructured free-text content of the source document into a sequence of terms, which are then stored in an inverted index. Subsequently, when a similar text analysis is applied to a user's query, the resulting sequence of terms facilitates matching the query against relevant source documents.
From a technical point of view, the text analysis process consists of several steps, some of which are optional.
1. Before the free-text content can be broken down into individual words, it may be beneficial to process it at the character level. The primary aim of this optional step is to help the tokenizer (the subsequent stage in the analysis process) generate better tokens. This can include removing markup tags (such as HTML) or handling specific character patterns (like replacing the emoji 🙂 with the text `:slightly_smiling_face:`).
2. The next step is to split the free text into individual words, that is, tokens. This is the job of the tokenizer. For example, after tokenization, the sentence `Actions speak louder than words` is split into the tokens `Actions`, `speak`, `louder`, `than`, and `words`.
3. The last step is to process individual tokens through a series of token filters. The aim is to convert the tokens into a predictable form that is directly stored in the index, for example, by converting them to lowercase or performing stemming (reducing a word to its root). For example, the token `Actions` becomes `action`, `louder` becomes `loud`, and `words` becomes `word`.
Although the terms "**token**" and "**term**" may sound similar and are occasionally used interchangeably, it is helpful to understand the difference between the two. In the context of Apache Lucene®, each plays a distinct role. A **token** is created by the tokenizer during the text analysis process and often undergoes a number of additional modifications as it passes through the chain of token filters. Each token is associated with metadata that can be further used during the text analysis process. A **term**, on the other hand, is a data value that is directly stored in the inverted index and is associated with far less metadata. During search, the matching process operates at the term level.
{: .note}
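
The steps above can be observed directly using the `_analyze` API. The following is a minimal sketch, assuming the built-in `html_strip` character filter, `standard` tokenizer, and `lowercase` token filter (all available by default):

```json
GET /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p><b>Actions</b> speak louder than <em>words</em></p>"
}
```
{% include copy-curl.html %}

The response lists each resulting token together with its metadata, such as position and character offsets, illustrating the token-level view of the analysis chain.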
## Analyzers
In OpenSearch, the abstraction that encompasses text analysis is referred to as an _analyzer_. Each analyzer contains the following sequentially applied components:
1. **Character filters**: First, a character filter receives the original text as a stream of characters and adds, removes, or modifies characters in the text. For example, a character filter can strip HTML characters from a string so that the text `<p><b>Actions</b> speak louder than <em>words</em></p>` becomes `\nActions speak louder than words\n`. The output of a character filter is a stream of characters.
An analyzer must contain exactly one tokenizer and may contain zero or more character filters and zero or more token filters.
{: .note}
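
As an illustrative sketch (the index and analyzer names below are hypothetical), a custom analyzer combining one component of each kind can be registered in the index settings:

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}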
There is also a special type of analyzer called a **normalizer**. A normalizer is similar to an analyzer except that it does not contain a tokenizer and can only include specific types of character filters and token filters. These filters can perform only character-level operations, such as character or pattern replacement, and cannot operate on the token as a whole. This means that replacing a token with a synonym or stemming is not supported. See [Normalizers]({{site.url}}{{site.baseurl}}/analyzers/normalizers/) for further details.
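
For illustration, a normalizer is defined in the index settings in much the same way as a custom analyzer, just without a tokenizer. The names below are hypothetical; `lowercase` and `asciifolding` are built-in token filters that perform character-level operations:

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product_code": {
        "type": "keyword",
        "normalizer": "my_normalizer"
      }
    }
  }
}
```
{% include copy-curl.html %}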
## Built-in analyzers
The following table lists the built-in analyzers that OpenSearch provides. The last column of the table contains the result of applying the analyzer to the string `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`.
If needed, you can combine tokenizers, token filters, and character filters to create a custom analyzer.
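To inspect what a given built-in analyzer produces for the sample string, you can pass the analyzer name to the `_analyze` API, for example, using the built-in `standard` analyzer:

```json
GET /_analyze
{
  "analyzer": "standard",
  "text": "It’s fun to contribute a brand-new PR or 2 to OpenSearch!"
}
```
{% include copy-curl.html %}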
With the introduction of OpenSearch `v2.11.??`, you can retrieve a comprehensive list of all available text analysis components using the [Nodes Info]({{site.url}}{{site.baseurl}}/api-reference/nodes-apis/nodes-info/) API. This can be helpful when building custom analyzers, especially when you need to recall a component's name or identify the analysis plugin that a component is part of.
Introduced 2.11.??
{: .label .label-purple }
```json
GET /_nodes/analysis_components?pretty=true&filter_path=nodes.*.analysis_components
```
{% include copy-curl.html %}
<details open markdown="block">
  <summary>
    Response
  </summary>
  {: .text-delta}
The following is an example response from a node that has the `common-analysis` module (a module that is present by default).
OpenSearch performs text analysis on text fields when you index a document and when you send a search request. Depending on when the text analysis occurs, the analyzers used for it are classified as follows: