Improve the documentation for analyzers (some terminology was
not correct) and fix some broken links along the way.
Note: this commit is a spin-off of opensearch-project#6252
Signed-off-by: Lukáš Vlček <[email protected]>
_analyzers/index.md (+15 -5: 15 additions & 5 deletions)
@@ -15,16 +15,24 @@ redirect_from:

 # Text analysis

-When you are searching documents using a full-text search, you want to receive all relevant results and not only exact matches. If you're looking for "walk", you're interested in results that contain any form of the word, like "Walk", "walked", or "walking." To facilitate full-text search, OpenSearch uses text analysis.
+When you are searching documents using a full-text search, you want to receive all relevant results. If you're looking for "walk", you're interested in results that contain any form of the word, like "Walk", "walked", or "walking". To facilitate full-text search, OpenSearch uses text analysis.

-Text analysis consists of the following steps:
+The objective of text analysis is to split the unstructured free text content of the source document into a sequence of terms, which are then stored in an inverted index. Subsequently, when a similar text analysis is applied to a user's query, the resulting sequence of terms facilitates the matching of relevant source documents.
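To make the index-then-match idea in the added paragraph concrete, here is a minimal Python sketch. It is an illustration only, not how OpenSearch or Lucene is implemented: the `analyze` helper (whitespace split plus lowercasing) and the sample documents are assumptions made up for this example.

```python
from collections import defaultdict


def analyze(text):
    # Hypothetical stand-in for an analyzer: split on whitespace, then lowercase.
    return [word.lower() for word in text.split()]


def build_inverted_index(docs):
    # Map each term to the set of document IDs that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    # Apply the same analysis to the query, then collect every document
    # that contains at least one of the resulting terms.
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits


docs = {1: "Actions speak louder than words", 2: "Words matter"}
index = build_inverted_index(docs)
print(sorted(search(index, "WORDS")))  # [1, 2]
```

Because the query passes through the same `analyze` function as the documents, `WORDS` and `words` end up as the same term and both documents match.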

-1. _Tokenize_ text into terms: For example, after tokenization, the phrase `Actions speak louder than words` is split into tokens `Actions`, `speak`, `louder`, `than`, and `words`.
-1. _Normalize_ the terms by converting them into a standard format, for example, converting them to lowercase or performing stemming (reducing the word to its root): For example, after normalization, `Actions` becomes `action`, `louder` becomes `loud`, and `words` becomes `word`.
+From a technical point of view, the text analysis process consists of several steps, some of which are optional:
+
+1. Before the free text content can be split into individual words, it may be beneficial to refine the text at the character level. The primary aim of this optional step is to help the tokenizer (the subsequent stage in the analysis process) generate better tokens. This can include removal of markup tags (such as HTML) or handling specific character patterns (like replacing the 🙂 emoji with the text `:slightly_smiling_face:`).
+
+2. The next step is to split the free text into individual words---_tokens_. This is performed by a _tokenizer_. For example, after tokenization, the sentence `Actions speak louder than words` is split into tokens `Actions`, `speak`, `louder`, `than`, and `words`.
+
+3. The last step is to process individual tokens by applying a series of token filters. The aim is to convert each token into a predictable form that is directly stored in the index, for example, by converting them to lowercase or performing stemming (reducing the word to its root). For example, the token `Actions` becomes `action`, `louder` becomes `loud`, and `words` becomes `word`.
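The three steps above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the OpenSearch implementation: the function names, the regex-based tag stripping, and especially the `naive_stem` suffix rule are hypothetical stand-ins for real character filters, tokenizers, and token filters.

```python
import re


def strip_html(text):
    # Character filter (step 1): replace anything that looks like a tag with a space.
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    # Tokenizer (step 2): split the character stream into word tokens.
    return re.findall(r"\w+", text)


def lowercase(tokens):
    # Token filter (step 3): convert each token to lowercase.
    return [t.lower() for t in tokens]


def naive_stem(tokens):
    # Token filter (step 3): toy stemmer that strips one "er" or "s" suffix.
    # Real stemmers use far more careful rules; this is only for illustration.
    stemmed = []
    for t in tokens:
        for suffix in ("er", "s"):
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed


def analyze(text, char_filters, tokenizer, token_filters):
    # Apply the components in order: character filters, one tokenizer, token filters.
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


html = "<p><b>Actions</b> speak louder than <em>words</em></p>"
print(analyze(html, [strip_html], tokenize, [lowercase, naive_stem]))
# ['action', 'speak', 'loud', 'than', 'word']
```

The character filters and token filters are passed as lists because either list may be empty, while exactly one tokenizer is always required.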
+
+Although the terms ***token*** and ***term*** may sound similar and are occasionally used interchangeably, it is helpful to understand the difference between the two. In the context of Apache Lucene, each holds a distinct role. A ***token*** is created by a tokenizer during text analysis and often undergoes a number of additional modifications as it passes through the chain of token filters. Each token is associated with metadata that can be further used during the text analysis process. A ***term*** is a data value that is directly stored in the inverted index and is associated with much less metadata. During search, matching operates at the term level.
+{: .note}

 ## Analyzers

-In OpenSearch, text analysis is performed by an _analyzer_. Each analyzer contains the following sequentially applied components:
+In OpenSearch, the abstraction that encompasses text analysis is referred to as an _analyzer_. Each analyzer contains the following sequentially applied components:

 1. **Character filters**: First, a character filter receives the original text as a stream of characters and adds, removes, or modifies characters in the text. For example, a character filter can strip HTML characters from a string so that the text `<p><b>Actions</b> speak louder than <em>words</em></p>` becomes `\nActions speak louder than words\n`. The output of a character filter is a stream of characters.

@@ -35,6 +43,8 @@ In OpenSearch, text analysis is performed by an _analyzer_. Each analyzer contai
 An analyzer must contain exactly one tokenizer and may contain zero or more character filters and zero or more token filters.
 {: .note}

+There is also a special type of analyzer called a ***normalizer***. A normalizer is similar to an analyzer except that it does not contain a tokenizer and can only include specific types of character filters and token filters. These filters can perform only character-level operations, such as character or pattern replacement, and cannot perform operations on the token as a whole. This means that replacing a token with a synonym or stemming is not supported. See [Normalizers]({{site.url}}{{site.baseurl}}/analyzers/normalizers/) for further details.
+
 ## Built-in analyzers

 The following table lists the built-in analyzers that OpenSearch provides. The last column of the table contains the result of applying the analyzer to the string `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`.
A _normalizer_ functions similarly to an analyzer but outputs only a single token. It does not contain a tokenizer and can only include specific types of character and token filters. These filters can perform only character-level operations, such as character or pattern replacement, and cannot operate on the token as a whole. This means that replacing a token with a synonym or stemming is not supported.
+
+A normalizer is useful in keyword search (that is, in term-based queries) because it allows you to run token and character filters on any given input. For instance, it makes it possible to match an incoming query `Naïve` with the index term `naive`.
+
+Consider the following example.
+
+Create a new index with a custom normalizer:
+```json
+PUT /sample-index
+{
+  "settings": {
+    "analysis": {
+      "normalizer": {
+        "normalized_keyword": {
+          "type": "custom",
+          "char_filter": [],
+          "filter": [ "asciifolding", "lowercase" ]
+        }
+      }
+    }
+  },
+  "mappings": {
+    "properties": {
+      "approach": {
+        "type": "keyword",
+        "normalizer": "normalized_keyword"
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+Index a document:
+```json
+POST /sample-index/_doc/
+{
+  "approach": "naive"
+}
+```
+{% include copy-curl.html %}
+
+The following query matches the document. This is expected:
+```json
+GET /sample-index/_search
+{
+  "query": {
+    "term": {
+      "approach": "naive"
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+But this query matches the document as well:
+```json
+GET /sample-index/_search
+{
+  "query": {
+    "term": {
+      "approach": "Naïve"
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+To understand why, consider the effect of the normalizer:
+```json
+GET /sample-index/_analyze
+{
+  "normalizer" : "normalized_keyword",
+  "text" : "Naïve"
+}
+```
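As a rough illustration of what the two filters in `normalized_keyword` do to the input, the following Python sketch approximates them. It is a stand-in built on assumptions: `ascii_fold` and `normalize_keyword` are hypothetical names, and `ascii_fold` only removes combining marks after Unicode decomposition, which handles `ï` but is much narrower than the real `asciifolding` filter.

```python
import unicodedata


def ascii_fold(text):
    # Approximation of asciifolding: decompose characters (NFD), then drop
    # the combining marks so that "ï" becomes a plain "i".
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_keyword(value):
    # Apply the filters in the configured order: asciifolding, then lowercase.
    # There is no tokenizer, so the whole value stays a single token.
    return ascii_fold(value).lower()


print(normalize_keyword("Naïve"))     # naive
print(normalize_keyword("New York"))  # new york
```

Both the indexed value and the value in the `term` query would pass through the same normalization, which is why `Naïve` and `naive` meet on the same term.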
+
+Internally, a normalizer accepts only filters that are instances of either `NormalizingTokenFilterFactory` or `NormalizingCharFilterFactory`. The following is a list of compatible filters found in modules and plugins that are part of the core OpenSearch repository.
+
+### The `common-analysis` module
+
+This module does not require installation; it is available by default.
Character filters: `normalize_kanji`, `normalize_kana`
+
+### The `analysis-nori` plugin
+
+Character filters: `normalize_kanji`, `normalize_kana`
+
+These lists of filters include only analysis components found in the [additional plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/#additional-plugins) that are part of the core OpenSearch repository.
_field-types/supported-field-types/keyword.md (+1 -1: 1 addition & 1 deletion)
@@ -52,7 +52,7 @@ Parameter | Description
 `index` | A Boolean value that specifies whether the field should be searchable. Default is `true`. To reduce disk space, set `index` to `false`.
 `index_options` | Information to be stored in the index that will be considered when calculating relevance scores. Can be set to `freqs` for term frequency. Default is `docs`.
 `meta` | Accepts metadata for this field.
-`normalizer` | Specifies how to preprocess this field before indexing (for example, make it lowercase). Default is `null` (no preprocessing).
+[`normalizer`]({{site.url}}{{site.baseurl}}/analyzers/normalizers/) | Specifies how to preprocess this field before indexing (for example, make it lowercase). Default is `null` (no preprocessing).
 `norms` | A Boolean value that specifies whether the field length should be used when calculating relevance scores. Default is `false`.
 [`null_value`]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/index#null-value) | A value to be used in place of `null`. Must be of the same type as the field. If this parameter is not specified, the field is treated as missing when its value is `null`. Default is `null`.
 `similarity` | The ranking algorithm for calculating relevance scores. Default is `BM25`.
There are three ways to install plugins using the `opensearch-plugin`:

-- [Install a plugin by name]({{site.url}}{{site.baseurl}}/opensearch/install/plugins#install-a-plugin-by-name)
-- [Install a plugin by from a zip file]({{site.url}}{{site.baseurl}}/opensearch/install/plugins#install-a-plugin-from-a-zip-file)
-- [Install a plugin using Maven coordinates]({{site.url}}{{site.baseurl}}/opensearch/install/plugins#install-a-plugin-using-maven-coordinates)
+- [Install a plugin by name](#install-a-plugin-by-name).
+- [Install a plugin from a ZIP file](#install-a-plugin-from-a-zip-file).
+- [Install a plugin using Maven coordinates](#install-a-plugin-using-maven-coordinates).

 ### Install a plugin by name:

-For a list of plugins that can be installed by name, see [Additional plugins]({{site.url}}{{site.baseurl}}/opensearch/install/plugins#additional-plugins).
+For a list of plugins that can be installed by name, see [Additional plugins](#additional-plugins).
-> Installed analysis-icu with folder name analysis-icu
 ```

-### Install a plugin from a zip file:
+### Install a plugin from a zip file

 Remote zip files can be installed by replacing `<zip-file>` with the URL of the hosted file. The tool only supports downloading over HTTP/HTTPS protocols. For local zip files, replace `<zip-file>` with `file:` followed by the absolute or relative path to the plugin zip file as in the second example below.

-#### Usage:
+#### Usage
 ```bash
 bin/opensearch-plugin install <zip-file>
 ```

-#### Example:
+#### Example
 ```bash
 # Zip file is hosted on a remote server - in this case, Maven central repository.
@@ -166,16 +166,16 @@ Continue with installation? [y/N]y
 -> Installed opensearch-anomaly-detection with folder name opensearch-anomaly-detection
 ```

-### Install a plugin using Maven coordinates:
+### Install a plugin using Maven coordinates

 The `opensearch-plugin install` tool also accepts Maven coordinates for available artifacts and versions hosted on [Maven Central](https://search.maven.org/search?q=org.opensearch.plugin). `opensearch-plugin` will parse the Maven coordinates you provide and construct a URL. As a result, the host must be able to connect directly to [Maven Central](https://search.maven.org/search?q=org.opensearch.plugin). The plugin installation will fail if you pass coordinates to a proxy or local repository.
_query-dsl/term-vs-full-text.md (+1 -1: 1 addition & 1 deletion)
@@ -8,7 +8,7 @@ redirect_from:

 # Term-level and full-text queries compared

-You can use both term-level and full-text queries to search text, but while term-level queries are usually used to search structured data, full-text queries are used for full-text search. The main difference between term-level and full-text queries is that term-level queries search documents for an exact specified term, while full-text queries analyze the query string. The following table summarizes the differences between term-level and full-text queries.
+You can use both term-level and full-text queries to search text, but while term-level queries are usually used to search structured data, full-text queries are used for full-text search. The main difference between term-level and full-text queries is that term-level queries search documents for an exact specified term, while full-text queries [analyze]({{site.url}}{{site.baseurl}}/analyzers/) the query string. The following table summarizes the differences between term-level and full-text queries.
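The contrast described in this paragraph can be shown with a small Python sketch. It rests on assumptions and is not OpenSearch code: `analyze`, `term_query`, and `match_query` are hypothetical helpers, and the index is a plain dictionary from term to document IDs.

```python
def analyze(text):
    # Hypothetical analyzer: lowercase the text and split it on whitespace.
    return text.lower().split()


# Terms as they would be stored after analyzing document 1 at index time.
index = {term: {1} for term in analyze("Actions speak louder than words")}


def term_query(index, value):
    # Term-level: look up the value exactly as given, with no analysis.
    return index.get(value, set())


def match_query(index, text):
    # Full-text: analyze the query string first, then look up each resulting term.
    hits = set()
    for term in analyze(text):
        hits |= index.get(term, set())
    return hits


print(term_query(index, "Actions"))   # set()
print(match_query(index, "Actions"))  # {1}
```

The term-level lookup misses because the stored term is `actions`, while the full-text lookup analyzes `Actions` into `actions` before matching.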