Commit f3018ee

draft text analysis page
1 parent 79641ef commit f3018ee

1 file changed

+60
-3
lines changed

manage-data/data-store/text-analysis.md

Lines changed: 60 additions & 3 deletions
@@ -15,7 +15,64 @@ mapped_urls:
1515

1616
% Use migrated content from existing pages that map to this page:
1717

18-
% - [ ] ./raw-migrated-files/elasticsearch/elasticsearch-reference/analysis.md
18+
% - [x] ./raw-migrated-files/elasticsearch/elasticsearch-reference/analysis.md
1919
% Notes: Introduce analysis plugins, placed here because in an indexing context it's called from the mapping or the index settings, you can also call it from search but maybe we can just reference it in the context of the search API
20-
% - [ ] ./raw-migrated-files/elasticsearch/elasticsearch-reference/analysis-overview.md
21-
% - [ ] ./raw-migrated-files/elasticsearch/elasticsearch-reference/index-modules-analysis.md
20+
% - [x] ./raw-migrated-files/elasticsearch/elasticsearch-reference/analysis-overview.md
21+
% - [x] ./raw-migrated-files/elasticsearch/elasticsearch-reference/index-modules-analysis.md
22+
23+
% //////////////////////////////
24+
% What is it?
25+
% //////////////////////////////
26+
27+
_Text analysis_ is the process of converting unstructured text, like the body of an email or a product description, into a structured format that’s [optimized for search](/solutions/search/full-text.md).
28+
29+
% //////////////////////////////
30+
% Why would someone use it?
31+
% //////////////////////////////
32+
33+
Text analysis enables {{es}} to perform full-text search, where the search returns all *relevant* results rather than just exact matches. For example, if you search for `Quick fox jumps`, you probably want the document that contains `A quick brown fox jumps over the lazy dog`, and you might also want documents that contain related words like `fast fox` or `foxes leap`.
34+
35+
{{es}} performs text analysis when indexing or searching [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html) fields. If your index does _not_ contain `text` fields, no further setup is needed; you can skip the pages in this section. If you _do_ use `text` fields or your text searches aren’t returning results as expected, configuring text analysis can often help. You should also look into analysis configuration if you’re using {{es}} to:
36+
37+
* Build a search engine
38+
* Mine unstructured data
39+
* Fine-tune search for a specific language
40+
* Perform lexicographic or linguistic research
41+
42+
% //////////////////////////////
43+
% How does it work?
44+
% //////////////////////////////
45+
46+
## Tokenization [tokenization]
47+
48+
Analysis makes full-text search possible through *tokenization*: breaking a text down into smaller chunks, called *tokens*. In most cases, these tokens are individual words.
49+
50+
If you index the phrase `the quick brown fox jumps` as a single string, a search for `quick fox` doesn’t match it. However, if you tokenize the phrase and index each word separately, each term in the query string can be looked up individually. The phrase can then be matched by searches for `quick fox`, `fox brown`, or other variations.
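
The following Python sketch illustrates the idea with a toy inverted index. It is not how {{es}} is implemented: the whitespace tokenizer, the `build_index` and `search` helpers, and the rule that every query token must be present are all simplifying assumptions made for illustration.

```python
from collections import defaultdict


def tokenize(text):
    """Split text into tokens on whitespace (a deliberately naive tokenizer)."""
    return text.split()


def build_index(docs):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents that contain every token in the query.

    Requiring all tokens is an assumption of this sketch, not a statement
    about how any particular query type behaves.
    """
    token_sets = [index.get(token, set()) for token in tokenize(query)]
    if not token_sets:
        return set()
    return set.intersection(*token_sets)


docs = {1: "the quick brown fox jumps"}
index = build_index(docs)

# Comparing against the whole stored string finds nothing...
whole_string_match = "quick fox" in docs.values()

# ...but looking up each token individually finds the document,
# regardless of the order of the query terms.
token_match = search(index, "quick fox")
reordered_match = search(index, "fox brown")
```

Here `whole_string_match` is `False`, while `token_match` and `reordered_match` both contain document `1`.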
51+
52+
## Normalization [normalization]
53+
54+
Tokenization enables matching on individual terms, but each token is still matched literally. This means:
55+
56+
* A search for `Quick` would not match `quick`, even though you likely want either term to match the other.
57+
* Although `fox` and `foxes` share the same root word, a search for `foxes` would not match `fox` or vice versa.
58+
* A search for `jumps` would not match `leaps`. While they don’t share a root word, they are synonyms and have a similar meaning.
59+
60+
To solve these problems, text analysis can *normalize* these tokens into a standard format. This allows you to match tokens that are not exactly the same as the search terms, but similar enough to still be relevant. For example:
61+
62+
* `Quick` can be lowercased: `quick`.
63+
* `foxes` can be *stemmed*, or reduced to its root word: `fox`.
64+
* `jump` and `leap` are synonyms and can be indexed as a single word: `jump`.
65+
66+
To ensure search terms match these words as intended, you can apply the same tokenization and normalization rules to the query string. For example, a search for `Foxes leap` can be normalized to a search for `fox jump`.
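
As a rough illustration, the sketch below applies the three normalization steps described above (lowercasing, stemming, and synonym mapping) to both the indexed text and the query string. The suffix-stripping `stem` function and the `SYNONYMS` table are hypothetical stand-ins; real stemmers and synonym filters are considerably more sophisticated.

```python
# Hypothetical synonym table: maps a word to the single form it is indexed as.
SYNONYMS = {"leap": "jump"}


def stem(token):
    """Toy stemmer: strip a trailing "es" or "s" from longer words."""
    for suffix in ("es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize on whitespace, then lowercase, stem, and map synonyms."""
    tokens = text.split()
    tokens = [t.lower() for t in tokens]
    tokens = [stem(t) for t in tokens]
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    return tokens


indexed_tokens = analyze("the quick brown fox jumps")
query_tokens = analyze("Foxes leap")

# Because the same rules ran on both sides, every query token
# now appears among the indexed tokens.
is_match = set(query_tokens) <= set(indexed_tokens)
```

With these toy rules, `analyze("Foxes leap")` yields `["fox", "jump"]`, and `is_match` is `True`.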
67+
68+
## Customize text analysis [analysis-customization]
69+
70+
Text analysis is performed by an [*analyzer*](/manage-data/data-store/text-analysis/anatomy-of-an-analyzer.md), a set of rules that govern the entire process.
71+
72+
{{es}} includes a default analyzer, called the [standard analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html), which works well for most use cases right out of the box.
73+
74+
If you want to tailor your search experience, you can choose a different [built-in analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html) or even [configure a custom one](/manage-data/data-store/text-analysis/create-custom-analyzer.md). A custom analyzer gives you control over each step of the analysis process, including:
75+
76+
* Changes to the text *before* tokenization
77+
* How text is converted to tokens
78+
* Normalization changes made to tokens before indexing or search
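
To make the three steps concrete, here is a sketch of an index-settings request body for a custom analyzer, written as a Python dictionary. The analyzer name `my_custom_analyzer` is hypothetical, and the particular choice of the `html_strip` character filter, `standard` tokenizer, and `lowercase` and `asciifolding` token filters is only one possible combination; check the analyzer reference for your {{es}} version before relying on it.

```python
import json

# Sketch of a create-index request body; names and filter choices are
# illustrative assumptions, not a recommended configuration.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {  # hypothetical analyzer name
                    "type": "custom",
                    # Changes to the text before tokenization
                    "char_filter": ["html_strip"],
                    # How text is converted to tokens
                    "tokenizer": "standard",
                    # Normalization changes made to tokens
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    }
}

# Serialize to JSON, the format a create-index request body would use.
request_body = json.dumps(index_settings, indent=2)
print(request_body)
```

Each key in the analyzer definition corresponds to one of the steps listed above, applied in that order.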
