
Commit aeed788

Add documentation for node analyzer components
This commit introduces documentation for a new extension of the Nodes API that exposes the names of the analysis components available on cluster node(s). This commit also contains additional changes:

- It makes a strict distinction between the terms "token" and "term".
- It replaces the term "Normalize" in the analysis part of the documentation because it has a special meaning in this context.
- It introduces a dedicated page for normalization (which is a specific type of analysis).

This commit is part of PR OpenSearch/#10296

Signed-off-by: Lukáš Vlček <[email protected]>
1 parent d41ccb8 commit aeed788

5 files changed (+429, -7 lines changed)


_analyzers/index.md

Lines changed: 326 additions & 5 deletions
@@ -15,16 +15,24 @@ redirect_from:
 
 # Text analysis
 
-When you are searching documents using a full-text search, you want to receive all relevant results and not only exact matches. If you're looking for "walk", you're interested in results that contain any form of the word, like "Walk", "walked", or "walking." To facilitate full-text search, OpenSearch uses text analysis.
+When you are searching documents using a full-text search, you want to receive all relevant results. If you're looking for "walk", you're interested in results that contain any form of the word, like "Walk", "walked", or "walking". To facilitate full-text search, OpenSearch uses text analysis.
 
-Text analysis consists of the following steps:
+The objective of text analysis is to break down the unstructured free-text content of the source document into a sequence of terms, which are then stored in an inverted index. When a similar text analysis is subsequently applied to a user's query, the resulting sequence of terms facilitates matching the relevant source documents.
 
-1. _Tokenize_ text into terms: For example, after tokenization, the phrase `Actions speak louder than words` is split into tokens `Actions`, `speak`, `louder`, `than`, and `words`.
-1. _Normalize_ the terms by converting them into a standard format, for example, converting them to lowercase or performing stemming (reducing the word to its root): For example, after normalization, `Actions` becomes `action`, `louder` becomes `loud`, and `words` becomes `word`.
+From a technical point of view, the text analysis process consists of several steps, some of which are optional:
+
+1. Before the free-text content can be broken down into individual words, it may be beneficial to process it at the character level. The primary aim of this optional step is to help the tokenizer (the subsequent stage of the analysis process) generate better tokens. This can include removing markup tags (such as HTML) or handling specific character patterns (for example, replacing the emoji &#x1F642; with the text `:slightly_smiling_face:`).
+
+2. The next step is to split the free-text content into individual words, that is, tokens. This is the job of the tokenizer. For example, after tokenization, the sentence `Actions speak louder than words` is split into the tokens `Actions`, `speak`, `louder`, `than`, and `words`.
+
+3. The last step is to process individual tokens with a series of token filters. The aim is to convert tokens into a predictable form that is stored directly in the index, for example, by converting them to lowercase or performing stemming (reducing the word to its root). For example, the token `Actions` becomes `action`, `louder` becomes `loud`, and `words` becomes `word`.
+
+Although the terms "**token**" and "**term**" may sound similar and are occasionally used interchangeably, it is helpful to understand the difference between the two. In the context of Apache Lucene®, each has a distinct role. A **token** is created by the tokenizer during the text analysis process and often undergoes a number of additional modifications as it passes through the chain of token filters. Each token is associated with metadata that can be used later in the text analysis process. A **term**, on the other hand, is a data value that is stored directly in the inverted index and is associated with much less metadata. During a search, the matching process operates at the term level.
+{: .note}
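
As a quick way to see the steps above in action, the `_analyze` API (the standard analysis-testing endpoint, not part of this commit's diff) can run an ad hoc chain of components against sample text. The sketch below is only illustrative: it combines the built-in `html_strip` character filter, `standard` tokenizer, and `lowercase` token filter, and the response lists each resulting token together with its position and offsets.

```json
GET /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p><b>Actions</b> speak louder than <em>words</em></p>"
}
```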

 ## Analyzers
 
-In OpenSearch, text analysis is performed by an _analyzer_. Each analyzer contains the following sequentially applied components:
+In OpenSearch, the abstraction that encompasses text analysis is referred to as an _analyzer_. Each analyzer contains the following sequentially applied components:
 
 1. **Character filters**: First, a character filter receives the original text as a stream of characters and adds, removes, or modifies characters in the text. For example, a character filter can strip HTML characters from a string so that the text `<p><b>Actions</b> speak louder than <em>words</em></p>` becomes `\nActions speak louder than words\n`. The output of a character filter is a stream of characters.
 
@@ -35,6 +43,8 @@ In OpenSearch, text analysis is performed by an _analyzer_. Each analyzer contai
 An analyzer must contain exactly one tokenizer and may contain zero or more character filters and zero or more token filters.
 {: .note}
 
+There is also a special type of analyzer called a **normalizer**. A normalizer is similar to an analyzer except that it does not contain a tokenizer and can only include specific types of character filters and token filters. These filters can perform only character-level operations, such as character or pattern replacement, and cannot operate on the token as a whole, which means that replacing a token with a synonym or stemming is not supported. See [Normalizers]({{site.url}}{{site.baseurl}}/analyzers/normalizers/) for further details.
+
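
To make the normalizer concept concrete, the following is a minimal sketch of how a normalizer is typically declared in index settings and attached to a `keyword` field. The index name `products`, field `sku`, and normalizer name `lowercase_ascii` are made up for the example; only built-in character-level filters are used.

```json
PUT /products
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_ascii": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "sku": {
        "type": "keyword",
        "normalizer": "lowercase_ascii"
      }
    }
  }
}
```

Because no tokenizer is involved, the whole field value is kept as a single term after these character-level transformations.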
 ## Built-in analyzers
 
 The following table lists the built-in analyzers that OpenSearch provides. The last column of the table contains the result of applying the analyzer to the string `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`.
@@ -54,6 +64,317 @@ Analyzer | Analysis performed | Analyzer output
 
 If needed, you can combine tokenizers, token filters, and character filters to create a custom analyzer.
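
As an illustrative sketch of such a combination (the index name `blog-posts`, analyzer name `html_english`, and field `body` are made up; the referenced components are built in), a custom analyzer is declared in index settings and then referenced from a `text` field mapping:

```json
PUT /blog-posts
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_english": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "porter_stem"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "analyzer": "html_english"
      }
    }
  }
}
```

With this definition, the `body` field is passed through the character filter, the tokenizer, and the token filters, in that order, at indexing time.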

+With the introduction of OpenSearch `v2.11.??`, you can retrieve a comprehensive list of all available text analysis components using the [Nodes Info]({{site.url}}{{site.baseurl}}/api-reference/nodes-apis/nodes-info/) API. This can be helpful when building custom analyzers, especially when you need to recall a component's name or identify the analysis plugin that the component is part of.
+
+Introduced 2.11.??
+{: .label .label-purple }
+
+```json
+GET /_nodes/analysis_components?pretty=true&filter_path=nodes.*.analysis_components
+```
+{% include copy-curl.html %}
+
+<details open markdown="block">
+  <summary>
+    Response
+  </summary>
+  {: .text-delta}
+
+The following is an example of a response from a node that has the `analysis-common` module (a module that is present by default).
+
+```json
+{
+  "nodes" : {
+    "cZidmv5kQbWQN8M8dz9f5g" : {
+      "analysis_components" : {
+        "analyzers" : [
+          "arabic",
+          "armenian",
+          "basque",
+          "bengali",
+          "brazilian",
+          "bulgarian",
+          "catalan",
+          "chinese",
+          "cjk",
+          "czech",
+          "danish",
+          "default",
+          "dutch",
+          "english",
+          "estonian",
+          "fingerprint",
+          "finnish",
+          "french",
+          "galician",
+          "german",
+          "greek",
+          "hindi",
+          "hungarian",
+          "indonesian",
+          "irish",
+          "italian",
+          "keyword",
+          "latvian",
+          "lithuanian",
+          "norwegian",
+          "pattern",
+          "persian",
+          "portuguese",
+          "romanian",
+          "russian",
+          "simple",
+          "snowball",
+          "sorani",
+          "spanish",
+          "standard",
+          "stop",
+          "swedish",
+          "thai",
+          "turkish",
+          "whitespace"
+        ],
+        "tokenizers" : [
+          "PathHierarchy",
+          "char_group",
+          "classic",
+          "edgeNGram",
+          "edge_ngram",
+          "keyword",
+          "letter",
+          "lowercase",
+          "nGram",
+          "ngram",
+          "path_hierarchy",
+          "pattern",
+          "simple_pattern",
+          "simple_pattern_split",
+          "standard",
+          "thai",
+          "uax_url_email",
+          "whitespace"
+        ],
+        "tokenFilters" : [
+          "apostrophe",
+          "arabic_normalization",
+          "arabic_stem",
+          "asciifolding",
+          "bengali_normalization",
+          "brazilian_stem",
+          "cjk_bigram",
+          "cjk_width",
+          "classic",
+          "common_grams",
+          "concatenate_graph",
+          "condition",
+          "czech_stem",
+          "decimal_digit",
+          "delimited_payload",
+          "delimited_term_freq",
+          "dictionary_decompounder",
+          "dutch_stem",
+          "edgeNGram",
+          "edge_ngram",
+          "elision",
+          "fingerprint",
+          "flatten_graph",
+          "french_stem",
+          "german_normalization",
+          "german_stem",
+          "hindi_normalization",
+          "hunspell",
+          "hyphenation_decompounder",
+          "indic_normalization",
+          "keep",
+          "keep_types",
+          "keyword_marker",
+          "kstem",
+          "length",
+          "limit",
+          "lowercase",
+          "min_hash",
+          "multiplexer",
+          "nGram",
+          "ngram",
+          "pattern_capture",
+          "pattern_replace",
+          "persian_normalization",
+          "porter_stem",
+          "predicate_token_filter",
+          "remove_duplicates",
+          "reverse",
+          "russian_stem",
+          "scandinavian_folding",
+          "scandinavian_normalization",
+          "serbian_normalization",
+          "shingle",
+          "snowball",
+          "sorani_normalization",
+          "standard",
+          "stemmer",
+          "stemmer_override",
+          "stop",
+          "synonym",
+          "synonym_graph",
+          "trim",
+          "truncate",
+          "unique",
+          "uppercase",
+          "word_delimiter",
+          "word_delimiter_graph"
+        ],
+        "charFilters" : [
+          "html_strip",
+          "mapping",
+          "pattern_replace"
+        ],
+        "normalizers" : [
+          "lowercase"
+        ],
+        "plugins" : [
+          {
+            "name" : "analysis-common",
+            "classname" : "org.opensearch.analysis.common.CommonAnalysisModulePlugin",
+            "analyzers" : [
+              "arabic",
+              "armenian",
+              "basque",
+              "bengali",
+              "brazilian",
+              "bulgarian",
+              "catalan",
+              "chinese",
+              "cjk",
+              "czech",
+              "danish",
+              "dutch",
+              "english",
+              "estonian",
+              "fingerprint",
+              "finnish",
+              "french",
+              "galician",
+              "german",
+              "greek",
+              "hindi",
+              "hungarian",
+              "indonesian",
+              "irish",
+              "italian",
+              "latvian",
+              "lithuanian",
+              "norwegian",
+              "pattern",
+              "persian",
+              "portuguese",
+              "romanian",
+              "russian",
+              "snowball",
+              "sorani",
+              "spanish",
+              "swedish",
+              "thai",
+              "turkish"
+            ],
+            "tokenizers" : [
+              "PathHierarchy",
+              "char_group",
+              "classic",
+              "edgeNGram",
+              "edge_ngram",
+              "keyword",
+              "letter",
+              "lowercase",
+              "nGram",
+              "ngram",
+              "path_hierarchy",
+              "pattern",
+              "simple_pattern",
+              "simple_pattern_split",
+              "thai",
+              "uax_url_email",
+              "whitespace"
+            ],
+            "tokenFilters" : [
+              "apostrophe",
+              "arabic_normalization",
+              "arabic_stem",
+              "asciifolding",
+              "bengali_normalization",
+              "brazilian_stem",
+              "cjk_bigram",
+              "cjk_width",
+              "classic",
+              "common_grams",
+              "concatenate_graph",
+              "condition",
+              "czech_stem",
+              "decimal_digit",
+              "delimited_payload",
+              "delimited_term_freq",
+              "dictionary_decompounder",
+              "dutch_stem",
+              "edgeNGram",
+              "edge_ngram",
+              "elision",
+              "fingerprint",
+              "flatten_graph",
+              "french_stem",
+              "german_normalization",
+              "german_stem",
+              "hindi_normalization",
+              "hyphenation_decompounder",
+              "indic_normalization",
+              "keep",
+              "keep_types",
+              "keyword_marker",
+              "kstem",
+              "length",
+              "limit",
+              "lowercase",
+              "min_hash",
+              "multiplexer",
+              "nGram",
+              "ngram",
+              "pattern_capture",
+              "pattern_replace",
+              "persian_normalization",
+              "porter_stem",
+              "predicate_token_filter",
+              "remove_duplicates",
+              "reverse",
+              "russian_stem",
+              "scandinavian_folding",
+              "scandinavian_normalization",
+              "serbian_normalization",
+              "snowball",
+              "sorani_normalization",
+              "stemmer",
+              "stemmer_override",
+              "synonym",
+              "synonym_graph",
+              "trim",
+              "truncate",
+              "unique",
+              "uppercase",
+              "word_delimiter",
+              "word_delimiter_graph"
+            ],
+            "charFilters" : [
+              "html_strip",
+              "mapping",
+              "pattern_replace"
+            ],
+            "hunspellDictionaries" : [ ]
+          }
+        ]
+      }
+    }
+  }
+}
+```
+</details>
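
Because the full listing is long, the `filter_path` parameter can be narrowed further. For example, a request along the following lines (a sketch against the same endpoint) would return only the token filter names reported by each node:

```json
GET /_nodes/analysis_components?filter_path=nodes.*.analysis_components.tokenFilters
```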
 
 ## Text analysis at indexing time and query time
 
 OpenSearch performs text analysis on text fields when you index a document and when you send a search request. Depending on the time of text analysis, the analyzers used for it are classified as follows:

0 commit comments
