Merged
63 changes: 60 additions & 3 deletions manage-data/data-store/text-analysis.md
@@ -15,7 +15,64 @@ mapped_urls:

% Use migrated content from existing pages that map to this page:

% - [ ] ./raw-migrated-files/elasticsearch/elasticsearch-reference/analysis.md
% - [x] ./raw-migrated-files/elasticsearch/elasticsearch-reference/analysis.md
% Notes: Introduce analysis plugins, placed here because in an indexing context it's called from the mapping or the index settings, you can also call it from search but maybe we can just reference it in the context of the search API
% - [ ] ./raw-migrated-files/elasticsearch/elasticsearch-reference/analysis-overview.md
% - [ ] ./raw-migrated-files/elasticsearch/elasticsearch-reference/index-modules-analysis.md
% - [x] ./raw-migrated-files/elasticsearch/elasticsearch-reference/analysis-overview.md
% - [x] ./raw-migrated-files/elasticsearch/elasticsearch-reference/index-modules-analysis.md

% //////////////////////////////
% What is it?
% //////////////////////////////

_Text analysis_ is the process of converting unstructured text, like the body of an email or a product description, into a structured format that’s [optimized for search](/solutions/search/full-text.md).

% //////////////////////////////
% Why would someone use it?
% //////////////////////////////

Text analysis enables {{es}} to perform full-text search, where the search returns all *relevant* results rather than just exact matches. For example, if you search for `Quick fox jumps`, you probably want the document that contains `A quick brown fox jumps over the lazy dog`, and you might also want documents that contain related words like `fast fox` or `foxes leap`.

{{es}} performs text analysis when indexing or searching [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html) fields. If your index does _not_ contain `text` fields, no further setup is needed; you can skip the pages in this section. If you _do_ use `text` fields or your text searches aren’t returning results as expected, configuring text analysis can often help. You should also look into analysis configuration if you’re using {{es}} to:

* Build a search engine
* Mine unstructured data
* Fine-tune search for a specific language
* Perform lexicographic or linguistic research

% //////////////////////////////
% How does it work?
% //////////////////////////////

## Tokenization [tokenization]

Analysis makes full-text search possible through *tokenization*: breaking a text down into smaller chunks, called *tokens*. In most cases, these tokens are individual words.

If you index the phrase `the quick brown fox jumps` as a single string and the user searches for `quick fox`, it isn’t considered a match. However, if you tokenize the phrase and index each word separately, the terms in the query string can be looked up individually. This means they can be matched by searches for `quick fox`, `fox brown`, or other variations.
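
The effect of tokenization can be sketched in a few lines of Python. This is a toy illustration only, not how {{es}} is implemented: the `tokenize`, `build_index`, and `search` helpers are hypothetical names, tokens are split on whitespace, and a document matches only when it contains every query token.

```python
from collections import defaultdict


def tokenize(text):
    # Assumption: split on whitespace only; real analyzers use richer rules.
    return text.split()


def build_index(docs):
    # Map each token to the set of document IDs that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    # Look up each query token individually, then keep the documents
    # that contain all of them, regardless of word order.
    token_sets = [index.get(token, set()) for token in tokenize(query)]
    return set.intersection(*token_sets) if token_sets else set()


docs = {1: "the quick brown fox jumps"}
index = build_index(docs)

print(search(index, "quick fox"))
print(search(index, "fox brown"))
```

Because each word is stored separately, both queries find document `1`, even though neither query appears verbatim in the indexed phrase.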

## Normalization [normalization]

Tokenization enables matching on individual terms, but each token is still matched literally. This means:

* A search for `Quick` would not match `quick`, even though you likely want either term to match the other.
* Although `fox` and `foxes` share the same root word, a search for `foxes` would not match `fox` or vice versa.
* A search for `jumps` would not match `leaps`. While they don’t share a root word, they are synonyms and have a similar meaning.

To solve these problems, text analysis can *normalize* these tokens into a standard format. This allows you to match tokens that are not exactly the same as the search terms, but similar enough to still be relevant. For example:

* `Quick` can be lowercased: `quick`.
* `foxes` can be *stemmed*, or reduced to its root word: `fox`.
* `jump` and `leap` are synonyms and can be indexed as a single word: `jump`.

To ensure search terms match these words as intended, you can apply the same tokenization and normalization rules to the query string. For example, a search for `Foxes leap` can be normalized to a search for `fox jump`.
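
As a rough sketch of these normalization steps, the following Python applies lowercasing, a deliberately naive suffix-stripping stemmer, and a one-entry synonym table. The `SYNONYMS` table and the `stem`, `normalize`, and `analyze` helpers are assumptions made for illustration; the stemmers and synonym filters that ship with {{es}} are considerably more sophisticated.

```python
# Hypothetical synonym table: maps a token to its preferred form.
SYNONYMS = {"leap": "jump"}


def stem(token):
    # Naive stemmer for illustration: strip a trailing "es" or "s"
    # when enough of the word remains afterwards.
    for suffix in ("es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def normalize(token):
    token = token.lower()              # Quick -> quick
    token = stem(token)                # foxes -> fox
    return SYNONYMS.get(token, token)  # leap  -> jump


def analyze(text):
    # Tokenize on whitespace, then normalize each token.
    return [normalize(token) for token in text.split()]


print(analyze("Foxes leap"))
print(analyze("the quick brown fox jumps"))
```

Running the same `analyze` function over both the indexed text and the query string is what lets `Foxes leap` line up with tokens such as `fox` and `jump`.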

## Customize text analysis [analysis-customization]

Text analysis is performed by an [*analyzer*](/manage-data/data-store/text-analysis/anatomy-of-an-analyzer.md), a set of rules that govern the entire process.

{{es}} includes a default analyzer, called the [standard analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html), which works well for most use cases right out of the box.

If you want to tailor your search experience, you can choose a different [built-in analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html) or even [configure a custom one](/manage-data/data-store/text-analysis/create-custom-analyzer.md). A custom analyzer gives you control over each step of the analysis process, including:

* Changes to the text *before* tokenization
* How text is converted to tokens
* Normalization changes made to tokens before indexing or search
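
The three stages listed above can be pictured as a small pipeline. The sketch below is a conceptual model in Python, not the {{es}} API: `make_analyzer`, `strip_html`, `whitespace_tokenizer`, and `lowercase` are hypothetical names chosen for this example.

```python
import re


def strip_html(text):
    # Stage 1: change the text before tokenization (here, drop HTML tags).
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenizer(text):
    # Stage 2: convert the text to tokens.
    return text.split()


def lowercase(tokens):
    # Stage 3: normalize the tokens.
    return [token.lower() for token in tokens]


def make_analyzer(char_filters, tokenizer, token_filters):
    # Compose the three stages into a single callable.
    def analyze(text):
        for char_filter in char_filters:
            text = char_filter(text)
        tokens = tokenizer(text)
        for token_filter in token_filters:
            tokens = token_filter(tokens)
        return tokens

    return analyze


analyzer = make_analyzer([strip_html], whitespace_tokenizer, [lowercase])
print(analyzer("<b>Quick</b> Fox"))
```

Swapping any one stage, for example a different tokenizer, changes the resulting tokens without touching the other stages, which is the kind of control a custom analyzer is meant to provide.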
@@ -11,18 +11,17 @@ Index time
: When a document is indexed, any [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html) field values are analyzed.

Search time
: When running a [full-text search](https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html) on a `text` field, the query string (the text the user is searching for) is analyzed.

Search time is also called *query time*.
: When running a [full-text search](https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html) on a `text` field, the query string (the text the user is searching for) is analyzed. Search time is also called *query time*.

For more details on text analysis at search time, refer to [Text analysis during search](/solutions/search/full-text/text-analysis-during-search.md).

The analyzer, or set of analysis rules, used at index time is called the *index analyzer*. The one used at search time is called the *search analyzer*.

## How the index and search analyzer work together [analysis-same-index-search-analyzer]

In most cases, the same analyzer should be used at index and search time. This ensures the values and query strings for a field are changed into the same form of tokens. In turn, this ensures the tokens match as expected during a search.

::::{dropdown} **Example**
::::{dropdown} Example
A document is indexed with the following value in a `text` field:

```text
@@ -79,7 +78,7 @@ While less common, it sometimes makes sense to use different analyzers at index

Generally, a separate search analyzer should only be specified when using the same form of tokens for field values and query strings would create unexpected or irrelevant search matches.

::::{dropdown} **Example**
::::{dropdown} Example
:name: different-analyzer-ex

{{es}} is used to create a search engine that matches only words that start with a provided prefix. For instance, a search for `tr` should return `tram` or `trope`—but never `taxi` or `bat`.
2 changes: 1 addition & 1 deletion manage-data/data-store/text-analysis/token-graphs.md
@@ -50,7 +50,7 @@ In the following graph, `domain name system` and its synonym, `dns`, both have a

However, queries, such as the [`match`](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html) or [`match_phrase`](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html) query, can use these graphs to generate multiple sub-queries from a single query string.

:::::{dropdown} **Example**
:::::{dropdown} Example
A user runs a search for the following phrase using the `match_phrase` query:

`domain name system is fragile`

This file was deleted.

This file was deleted.

This file was deleted.

3 changes: 0 additions & 3 deletions raw-migrated-files/toc.yml
@@ -546,8 +546,6 @@ toc:
children:
- file: elasticsearch/elasticsearch-reference/_usage_example.md
- file: elasticsearch/elasticsearch-reference/active-directory-realm.md
- file: elasticsearch/elasticsearch-reference/analysis-overview.md
- file: elasticsearch/elasticsearch-reference/analysis.md
- file: elasticsearch/elasticsearch-reference/autoscaling-deciders.md
- file: elasticsearch/elasticsearch-reference/autoscaling-fixed-decider.md
- file: elasticsearch/elasticsearch-reference/autoscaling-frozen-existence-decider.md
@@ -580,7 +578,6 @@ toc:
- file: elasticsearch/elasticsearch-reference/index-lifecycle-management.md
- file: elasticsearch/elasticsearch-reference/index-mgmt.md
- file: elasticsearch/elasticsearch-reference/index-modules-allocation.md
- file: elasticsearch/elasticsearch-reference/index-modules-analysis.md
- file: elasticsearch/elasticsearch-reference/index-modules-mapper.md
- file: elasticsearch/elasticsearch-reference/ingest-enriching-data.md
- file: elasticsearch/elasticsearch-reference/install-elasticsearch.md