diff --git a/manage-data/data-store/text-analysis.md b/manage-data/data-store/text-analysis.md index cdb1fb438e..3f3355dc33 100644 --- a/manage-data/data-store/text-analysis.md +++ b/manage-data/data-store/text-analysis.md @@ -7,15 +7,47 @@ mapped_urls: # Text analysis -% What needs to be done: Refine +_Text analysis_ is the process of converting unstructured text, like the body of an email or a product description, into a structured format that’s [optimized for search](/solutions/search/full-text.md). -% GitHub issue: docs-projects#371 +Text analysis enables {{es}} to perform full-text search, where the search returns all *relevant* results rather than just exact matches. For example, if you search for `Quick fox jumps`, you probably want the document that contains `A quick brown fox jumps over the lazy dog`, and you might also want documents that contain related words like `fast fox` or `foxes leap`. -% Scope notes: Combine the linked sources into a single intro/overview and add links to the relevant reference pages. +{{es}} performs text analysis when indexing or searching [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html) fields. If your index does _not_ contain `text` fields, no further setup is needed; you can skip the pages in this section. If you _do_ use `text` fields or your text searches aren’t returning results as expected, configuring text analysis can often help. 
You should also look into analysis configuration if you’re using {{es}} to: -% Use migrated content from existing pages that map to this page: +* Build a search engine +* Mine unstructured data +* Fine-tune search for a specific language +* Perform lexicographic or linguistic research -% - [ ] ./raw-migrated-files/elasticsearch/elasticsearch-reference/analysis.md -% Notes: Introduce analysis plugins, placed here because in an indexing context it's called from the mapping or the index settings, you can also call it from search but maybe we can just reference it in the context of the search API -% - [ ] ./raw-migrated-files/elasticsearch/elasticsearch-reference/analysis-overview.md -% - [ ] ./raw-migrated-files/elasticsearch/elasticsearch-reference/index-modules-analysis.md \ No newline at end of file +## Tokenization [tokenization] + +Analysis makes full-text search possible through *tokenization*: breaking a text down into smaller chunks, called *tokens*. In most cases, these tokens are individual words. + +If you index the phrase `the quick brown fox jumps` as a single string and the user searches for `quick fox`, it isn’t considered a match. However, if you tokenize the phrase and index each word separately, the terms in the query string can be looked up individually. This means they can be matched by searches for `quick fox`, `fox brown`, or other variations. + +## Normalization [normalization] + +Tokenization enables matching on individual terms, but each token is still matched literally. This means: + +* A search for `Quick` would not match `quick`, even though you likely want either term to match the other +* Although `fox` and `foxes` share the same root word, a search for `foxes` would not match `fox` or vice versa. +* A search for `jumps` would not match `leaps`. While they don’t share a root word, they are synonyms and have a similar meaning. + +To solve these problems, text analysis can *normalize* these tokens into a standard format. 
This allows you to match tokens that are not exactly the same as the search terms, but similar enough to still be relevant. For example: + +* `Quick` can be lowercased: `quick`. +* `foxes` can be *stemmed*, or reduced to its root word: `fox`. +* `jump` and `leap` are synonyms and can be indexed as a single word: `jump`. + +To ensure search terms match these words as intended, you can apply the same tokenization and normalization rules to the query string. For example, a search for `Foxes leap` can be normalized to a search for `fox jump`. + +## Customize text analysis [analysis-customization] + +Text analysis is performed by an [*analyzer*](/manage-data/data-store/text-analysis/anatomy-of-an-analyzer.md), a set of rules that govern the entire process. + +{{es}} includes a default analyzer, called the [standard analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html), which works well for most use cases right out of the box. + +If you want to tailor your search experience, you can choose a different [built-in analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html) or even [configure a custom one](/manage-data/data-store/text-analysis/create-custom-analyzer.md). A custom analyzer gives you control over each step of the analysis process, including: + +* Changes to the text *before* tokenization +* How text is converted to tokens +* Normalization changes made to tokens before indexing or search diff --git a/manage-data/data-store/text-analysis/index-search-analysis.md b/manage-data/data-store/text-analysis/index-search-analysis.md index 6cab9eec64..ca750c0d90 100644 --- a/manage-data/data-store/text-analysis/index-search-analysis.md +++ b/manage-data/data-store/text-analysis/index-search-analysis.md @@ -11,10 +11,9 @@ Index time : When a document is indexed, any [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html) field values are analyzed. 
Search time -: When running a [full-text search](https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html) on a `text` field, the query string (the text the user is searching for) is analyzed. - - Search time is also called *query time*. +: When running a [full-text search](https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html) on a `text` field, the query string (the text the user is searching for) is analyzed. Search time is also called *query time*. + For more details on text analysis at search time, refer to [Text analysis during search](/solutions/search/full-text/text-analysis-during-search.md). The analyzer, or set of analysis rules, used at each time is called the *index analyzer* or *search analyzer* respectively. @@ -22,7 +21,7 @@ The analyzer, or set of analysis rules, used at each time is called the *index a In most cases, the same analyzer should be used at index and search time. This ensures the values and query strings for a field are changed into the same form of tokens. In turn, this ensures the tokens match as expected during a search. -::::{dropdown} **Example** +::::{dropdown} Example A document is indexed with the following value in a `text` field: ```text @@ -79,7 +78,7 @@ While less common, it sometimes makes sense to use different analyzers at index Generally, a separate search analyzer should only be specified when using the same form of tokens for field values and query strings would create unexpected or irrelevant search matches. -::::{dropdown} **Example** +::::{dropdown} Example :name: different-analyzer-ex {{es}} is used to create a search engine that matches only words that start with a provided prefix. For instance, a search for `tr` should return `tram` or `trope`—but never `taxi` or `bat`. 
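Note on the `different-analyzer-ex` example in `index-search-analysis.md`: the prefix-search case can be illustrated with a minimal sketch. It assumes the index analyzer emits every leading prefix of each word (an edge n-gram style approach, which the hunk above does not show) while the search analyzer only lowercases and splits. The function names `edge_ngrams`, `index_analyzer`, `search_analyzer`, and `matches` are hypothetical and are not {{es}} APIs.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return the leading prefixes of a token, e.g. 'tram' -> t, tr, tra, tram."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]


def index_analyzer(text):
    # Assumed index-time analysis: lowercase, split on whitespace, emit prefixes.
    return [gram for word in text.lower().split() for gram in edge_ngrams(word)]


def search_analyzer(text):
    # Assumed search-time analysis: lowercase and split only, no prefixes.
    return text.lower().split()


def matches(query_tokens, doc_tokens):
    # A document matches if any query token is among its indexed tokens.
    return any(token in doc_tokens for token in query_tokens)


docs = ["tram", "trope", "taxi", "bat"]
index = {doc: set(index_analyzer(doc)) for doc in docs}

# Different analyzers: the query `tr` stays a single token.
different = [d for d in docs if matches(search_analyzer("tr"), index[d])]

# Same analyzer at both times: the query `tr` also produces the token `t`.
same = [d for d in docs if matches(index_analyzer("tr"), index[d])]

print(different)
print(same)
```

In this sketch, reusing the index analyzer at search time turns `tr` into `t` and `tr`, so `taxi` matches on `t`. That is the kind of irrelevant match a separate search analyzer is meant to avoid.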
diff --git a/manage-data/data-store/text-analysis/token-graphs.md b/manage-data/data-store/text-analysis/token-graphs.md index 55dbc063e1..657632bcbf 100644 --- a/manage-data/data-store/text-analysis/token-graphs.md +++ b/manage-data/data-store/text-analysis/token-graphs.md @@ -50,7 +50,7 @@ In the following graph, `domain name system` and its synonym, `dns`, both have a However, queries, such as the [`match`](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html) or [`match_phrase`](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html) query, can use these graphs to generate multiple sub-queries from a single query string. -:::::{dropdown} **Example** +:::::{dropdown} Example A user runs a search for the following phrase using the `match_phrase` query: `domain name system is fragile` diff --git a/raw-migrated-files/elasticsearch/elasticsearch-reference/analysis-overview.md b/raw-migrated-files/elasticsearch/elasticsearch-reference/analysis-overview.md deleted file mode 100644 index d5fe4067e1..0000000000 --- a/raw-migrated-files/elasticsearch/elasticsearch-reference/analysis-overview.md +++ /dev/null @@ -1,48 +0,0 @@ ---- -navigation_title: "Overview" ---- - -# Text analysis overview [analysis-overview] - - -Text analysis enables {{es}} to perform full-text search, where the search returns all *relevant* results rather than just exact matches. - -If you search for `Quick fox jumps`, you probably want the document that contains `A quick brown fox jumps over the lazy dog`, and you might also want documents that contain related words like `fast fox` or `foxes leap`. - - -## Tokenization [tokenization] - -Analysis makes full-text search possible through *tokenization*: breaking a text down into smaller chunks, called *tokens*. In most cases, these tokens are individual words. 
- -If you index the phrase `the quick brown fox jumps` as a single string and the user searches for `quick fox`, it isn’t considered a match. However, if you tokenize the phrase and index each word separately, the terms in the query string can be looked up individually. This means they can be matched by searches for `quick fox`, `fox brown`, or other variations. - - -## Normalization [normalization] - -Tokenization enables matching on individual terms, but each token is still matched literally. This means: - -* A search for `Quick` would not match `quick`, even though you likely want either term to match the other -* Although `fox` and `foxes` share the same root word, a search for `foxes` would not match `fox` or vice versa. -* A search for `jumps` would not match `leaps`. While they don’t share a root word, they are synonyms and have a similar meaning. - -To solve these problems, text analysis can *normalize* these tokens into a standard format. This allows you to match tokens that are not exactly the same as the search terms, but similar enough to still be relevant. For example: - -* `Quick` can be lowercased: `quick`. -* `foxes` can be *stemmed*, or reduced to its root word: `fox`. -* `jump` and `leap` are synonyms and can be indexed as a single word: `jump`. - -To ensure search terms match these words as intended, you can apply the same tokenization and normalization rules to the query string. For example, a search for `Foxes leap` can be normalized to a search for `fox jump`. - - -## Customize text analysis [analysis-customization] - -Text analysis is performed by an [*analyzer*](../../../manage-data/data-store/text-analysis/anatomy-of-an-analyzer.md), a set of rules that govern the entire process. - -{{es}} includes a default analyzer, called the [standard analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html), which works well for most use cases right out of the box. 
- -If you want to tailor your search experience, you can choose a different [built-in analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html) or even [configure a custom one](../../../manage-data/data-store/text-analysis/create-custom-analyzer.md). A custom analyzer gives you control over each step of the analysis process, including: - -* Changes to the text *before* tokenization -* How text is converted to tokens -* Normalization changes made to tokens before indexing or search - diff --git a/raw-migrated-files/elasticsearch/elasticsearch-reference/analysis.md b/raw-migrated-files/elasticsearch/elasticsearch-reference/analysis.md deleted file mode 100644 index bb28ad81e1..0000000000 --- a/raw-migrated-files/elasticsearch/elasticsearch-reference/analysis.md +++ /dev/null @@ -1,30 +0,0 @@ -# Text analysis [analysis] - -*Text analysis* is the process of converting unstructured text, like the body of an email or a product description, into a structured format that’s [optimized for search](../../../solutions/search/full-text.md). - - -## When to configure text analysis [when-to-configure-analysis] - -{{es}} performs text analysis when indexing or searching [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html) fields. - -If your index doesn’t contain `text` fields, no further setup is needed; you can skip the pages in this section. - -However, if you use `text` fields or your text searches aren’t returning results as expected, configuring text analysis can often help. 
You should also look into analysis configuration if you’re using {{es}} to: - -* Build a search engine -* Mine unstructured data -* Fine-tune search for a specific language -* Perform lexicographic or linguistic research - - -## In this section [analysis-toc] - -* [Overview](../../../manage-data/data-store/text-analysis.md) -* [Concepts](../../../manage-data/data-store/text-analysis/concepts.md) -* [*Configure text analysis*](../../../manage-data/data-store/text-analysis/configure-text-analysis.md) -* [*Built-in analyzer reference*](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html) -* [*Tokenizer reference*](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html) -* [*Token filter reference*](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html) -* [*Character filters reference*](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-charfilters.html) -* [*Normalizers*](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-normalizers.html) - diff --git a/raw-migrated-files/elasticsearch/elasticsearch-reference/index-modules-analysis.md b/raw-migrated-files/elasticsearch/elasticsearch-reference/index-modules-analysis.md deleted file mode 100644 index 487406604f..0000000000 --- a/raw-migrated-files/elasticsearch/elasticsearch-reference/index-modules-analysis.md +++ /dev/null @@ -1,9 +0,0 @@ -# Analysis [index-modules-analysis] - -The index analysis module acts as a configurable registry of *analyzers* that can be used in order to convert a string field into individual terms which are: - -* added to the inverted index in order to make the document searchable -* used by high level queries such as the [`match` query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html) to generate search terms. 
- -See [Text analysis](../../../solutions/search/full-text/text-analysis-during-search.md) for configuration details. - diff --git a/raw-migrated-files/toc.yml b/raw-migrated-files/toc.yml index 44c9dd8bfc..2936a41867 100644 --- a/raw-migrated-files/toc.yml +++ b/raw-migrated-files/toc.yml @@ -537,8 +537,6 @@ toc: children: - file: elasticsearch/elasticsearch-reference/_usage_example.md - file: elasticsearch/elasticsearch-reference/active-directory-realm.md - - file: elasticsearch/elasticsearch-reference/analysis-overview.md - - file: elasticsearch/elasticsearch-reference/analysis.md - file: elasticsearch/elasticsearch-reference/autoscaling-deciders.md - file: elasticsearch/elasticsearch-reference/autoscaling-fixed-decider.md - file: elasticsearch/elasticsearch-reference/autoscaling-frozen-existence-decider.md @@ -571,7 +569,6 @@ toc: - file: elasticsearch/elasticsearch-reference/index-lifecycle-management.md - file: elasticsearch/elasticsearch-reference/index-mgmt.md - file: elasticsearch/elasticsearch-reference/index-modules-allocation.md - - file: elasticsearch/elasticsearch-reference/index-modules-analysis.md - file: elasticsearch/elasticsearch-reference/index-modules-mapper.md - file: elasticsearch/elasticsearch-reference/ingest-enriching-data.md - file: elasticsearch/elasticsearch-reference/install-elasticsearch.md
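Note on the Tokenization and Normalization sections added to `text-analysis.md`: the steps they describe can be sketched as a toy pipeline. This is only an illustration under stated assumptions, not how {{es}} analyzers are implemented. The stemmer is a crude suffix stripper, the synonym map holds the single `leap` to `jump` pair from the text, and the names `tokenize`, `stem`, and `analyze` are hypothetical.

```python
import re

# Assumed synonym map: only the pair mentioned in the text.
SYNONYMS = {"leap": "jump"}


def tokenize(text):
    # Break the text into word tokens.
    return re.findall(r"\w+", text)


def stem(token):
    # Toy stemmer (an assumption): strip a plural-style suffix from longer words.
    if token.endswith("es") and len(token) > 4:
        return token[:-2]
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


def analyze(text):
    # Tokenize, then normalize each token: lowercase, stem, map synonyms.
    tokens = []
    for token in tokenize(text):
        token = stem(token.lower())
        tokens.append(SYNONYMS.get(token, token))
    return tokens


indexed = analyze("the quick brown fox jumps")
query = analyze("Foxes leap")

print(indexed)
print(query)
# Every normalized query token is found among the indexed tokens.
print(all(token in indexed for token in query))
```

Applying the same `analyze` function to both the document and the query string is what lets `Foxes leap` line up with `fox` and `jumps` in the indexed phrase.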