Commit ef8d191

Cleanup, Breakout How full-text search works
1 parent 622fcf3 commit ef8d191

6 files changed

+225
-52
lines changed

solutions/search/full-text.md

Lines changed: 2 additions & 35 deletions
Original file line number | Diff line number | Diff line change
@@ -15,43 +15,10 @@ Built on decades of information retrieval research, full-text search delivers re
1515

1616
You can combine full-text search with [semantic search using vectors](semantic-search.md) to build modern hybrid search applications. While vector search may require additional GPU resources, the full-text component remains cost-effective by leveraging existing CPU infrastructure.
1717

18-
19-
## How full-text search works [full-text-search-how-it-works]
20-
21-
The following diagram illustrates the components of full-text search.
22-
23-
:::{image} ../../images/elasticsearch-reference-full-text-search-overview.svg
24-
:alt: Components of full-text search from analysis to relevance scoring
25-
:width: 550px
26-
:::
27-
28-
At a high level, full-text search involves the following:
29-
30-
* [**Text analysis**](../../manage-data/data-store/text-analysis.md): Analysis consists of a pipeline of sequential transformations. Text is transformed into a format optimized for searching using techniques such as stemming, lowercasing, and stop word elimination. {{es}} contains a number of built-in [analyzers](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html) and tokenizers, including options to analyze specific language text. You can also create custom analyzers.
31-
32-
::::{tip}
33-
Refer to [Test an analyzer](../../manage-data/data-store/text-analysis/test-an-analyzer.md) to learn how to test an analyzer and inspect the tokens and metadata it generates.
34-
35-
::::
36-
37-
* **Inverted index creation**: After analysis is complete, {{es}} builds an inverted index from the resulting tokens. An inverted index is a data structure that maps each token to the documents that contain it. It’s made up of two key components:
38-
39-
* **Dictionary**: A sorted list of all unique terms in the collection of documents in your index.
40-
* **Posting list**: For each term, a list of document IDs where the term appears, along with optional metadata like term frequency and position.
41-
42-
* **Relevance scoring**: Results are ranked by how relevant they are to the given query. The relevance score of each document is represented by a positive floating-point number called the `_score`. The higher the `_score`, the more relevant the document.
43-
44-
The default [similarity algorithm](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html) {{es}} uses for calculating relevance scores is [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25), a variation of the [TF-IDF algorithm](https://en.wikipedia.org/wiki/Tf–idf). BM25 calculates relevance scores based on term frequency, document frequency, and document length. Refer to this [technical blog post](https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables) for a deep dive into BM25.
45-
46-
* **Full-text search query**: Query text is analyzed [the same way as the indexed text](../../manage-data/data-store/text-analysis/index-search-analysis.md), and the resulting tokens are used to search the inverted index.
47-
48-
Query DSL supports a number of [full-text queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html).
49-
50-
As of 8.17, {{esql}} also supports [full-text search](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-functions-operators.html#esql-search-functions) functions.
51-
18+
## Getting started [full-text-search-getting-started]
5219

5320

54-
## Getting started [full-text-search-getting-started]
21+
For a high-level overview of how full-text search works, refer to [How full-text search works](full-text/how-full-text-works.md).
5522

5623
For a hands-on introduction to full-text search, refer to the [full-text search tutorial](get-started.md).
5724

solutions/search/full-text/how-full-text-works.md

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
1+
# How full-text search works [full-text-search-how-it-works]
2+
3+
The following diagram illustrates the components of full-text search.
4+
5+
:::{image} ../../../images/elasticsearch-reference-full-text-search-overview.svg
6+
:alt: Components of full-text search from analysis to relevance scoring
7+
:width: 550px
8+
:::
9+
10+
At a high level, full-text search involves the following:
11+
12+
* [**Text analysis**](../../../manage-data/data-store/text-analysis.md): Analysis consists of a pipeline of sequential transformations. Text is transformed into a format optimized for searching using techniques such as stemming, lowercasing, and stop word elimination. {{es}} contains a number of built-in [analyzers](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html) and tokenizers, including options to analyze specific language text. You can also create custom analyzers.
13+
::::{tip}
14+
Refer to [Test an analyzer](../../../manage-data/data-store/text-analysis/test-an-analyzer.md) to learn how to test an analyzer and inspect the tokens and metadata it generates.
15+
::::
16+
17+
* **Inverted index creation**: After analysis is complete, {{es}} builds an inverted index from the resulting tokens. An inverted index is a data structure that maps each token to the documents that contain it. It’s made up of two key components:
18+
19+
* **Dictionary**: A sorted list of all unique terms in the collection of documents in your index.
20+
* **Posting list**: For each term, a list of document IDs where the term appears, along with optional metadata like term frequency and position.
21+
22+
* **Relevance scoring**: Results are ranked by how relevant they are to the given query. The relevance score of each document is represented by a positive floating-point number called the `_score`. The higher the `_score`, the more relevant the document.
23+
24+
The default [similarity algorithm](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html) {{es}} uses for calculating relevance scores is [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25), a variation of the [TF-IDF algorithm](https://en.wikipedia.org/wiki/Tf–idf). BM25 calculates relevance scores based on term frequency, document frequency, and document length. Refer to this [technical blog post](https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables) for a deep dive into BM25.
25+
26+
* **Full-text search query**: Query text is analyzed [the same way as the indexed text](../../../manage-data/data-store/text-analysis/index-search-analysis.md), and the resulting tokens are used to search the inverted index.
27+
28+
Query DSL supports a number of [full-text queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html).
29+
30+
As of 8.17, {{esql}} also supports [full-text search](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-functions-operators.html#esql-search-functions) functions.
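
The components described above can be illustrated with a minimal Python sketch. It is not how {{es}} is implemented: the `analyze`, `build_index`, and `bm25_search` helpers are hypothetical names, the analyzer only lowercases and splits on whitespace, and the scoring uses the textbook BM25 formula with the commonly cited defaults `k1=1.2` and `b=0.75`.

```python
import math
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase and split on whitespace (no stemming or stop words)."""
    return text.lower().split()


def build_index(docs):
    """Build posting lists: term -> {doc_id: term_frequency}, plus document lengths."""
    postings = defaultdict(dict)
    doc_lengths = {}
    for doc_id, text in docs.items():
        tokens = analyze(text)
        doc_lengths[doc_id] = len(tokens)
        for token in tokens:
            postings[token][doc_id] = postings[token].get(doc_id, 0) + 1
    return dict(postings), doc_lengths


def bm25_search(query, postings, doc_lengths, k1=1.2, b=0.75):
    """Score documents with textbook BM25 and return (doc_id, score), best first."""
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths.values()) / n_docs
    scores = defaultdict(float)
    # The query is analyzed the same way as the indexed text.
    for term in analyze(query):
        plist = postings.get(term, {})
        if not plist:
            continue
        # Rare terms (low document frequency) contribute more to the score.
        idf = math.log(1 + (n_docs - len(plist) + 0.5) / (len(plist) + 0.5))
        for doc_id, tf in plist.items():
            # Longer-than-average documents are penalized through `b`.
            norm = tf + k1 * (1 - b + b * doc_lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick quick fox jumps over the lazy dog",
}
postings, doc_lengths = build_index(docs)
dictionary = sorted(postings)  # the sorted list of unique terms
results = bm25_search("quick fox", postings, doc_lengths)
```

In this sketch, document 1 ranks above document 3 for the query `quick fox`: both contain the two terms, but document 1 is shorter than average, so length normalization favors it even though document 3 repeats `quick`.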

solutions/search/full-text/search-with-synonyms.md

Lines changed: 188 additions & 12 deletions
@@ -6,17 +6,13 @@ mapped_urls:
66

77
# Search with synonyms
88

9-
% What needs to be done: Lift-and-shift
10-
11-
% Use migrated content from existing pages that map to this page:
12-
13-
% - [ ] ./raw-migrated-files/elasticsearch/elasticsearch-reference/search-with-synonyms.md
14-
% - [ ] ./raw-migrated-files/cloud/cloud-enterprise/ece-add-custom-bundle-plugin.md
15-
% Notes: Custom synonyms bundle
16-
179
$$$ece-add-custom-bundle-example-synonyms$$$
18-
% Just link to ECE reference page wherever it ends
10+
::::{note}
11+
Learn about [adding custom synonym bundles](https://www.elastic.co/guide/en/cloud-enterprise/current/ece-add-custom-bundle-plugin.html#ece-add-custom-bundle-example-synonyms) to your Elastic Cloud Enterprise deployment.
12+
::::
13+
1914

15+
% TODO: these bundle links do not belong here
2016

2117
$$$ece-add-custom-bundle-example-LDA$$$
2218

@@ -26,9 +22,189 @@ $$$ece-add-custom-bundle-example-cacerts$$$
2622

2723
$$$ece-add-custom-bundle-example-LDAP$$$
2824

29-
$$$synonyms-store-synonyms$$$
25+
# Search with synonyms [search-with-synonyms]
26+
27+
Synonyms are words or phrases that have the same or similar meaning. They are an important aspect of search, as they can improve the search experience and increase the scope of search results.
28+
29+
Synonyms allow you to:
30+
31+
* **Improve search relevance** by finding relevant documents that use different terms to express the same concept.
32+
* Make **domain-specific vocabulary** more user-friendly, allowing users to use search terms they are more familiar with.
33+
* **Define common misspellings and typos** to transparently handle common mistakes.
34+
35+
Synonyms are grouped together using **synonyms sets**. You can have as many synonyms sets as you need.
36+
37+
In order to use synonyms sets in {{es}}, you need to:
38+
39+
* [Store your synonyms set](#synonyms-store-synonyms)
40+
* [Configure synonyms token filters and analyzers](#synonyms-synonym-token-filters)
41+
42+
43+
## Store your synonyms set [synonyms-store-synonyms]
44+
45+
Your synonyms sets need to be stored in {{es}} so your analyzers can refer to them. There are three ways to store your synonyms sets:
46+
47+
48+
### Synonyms API [synonyms-store-synonyms-api]
49+
50+
You can use the [synonyms APIs](https://www.elastic.co/guide/en/elasticsearch/reference/current/synonyms-apis.html) to manage synonyms sets. This is the most flexible approach, as it allows you to dynamically define and modify synonyms sets.
51+
52+
Changes in your synonyms sets will automatically reload the associated analyzers.
53+
54+
55+
### Synonyms File [synonyms-store-synonyms-file]
56+
57+
You can store your synonyms set in a file.
58+
59+
A synonyms set file needs to be uploaded to all your cluster nodes, and be located in the configuration directory for your {{es}} distribution. If you’re using {{ess}}, you can upload synonyms files using [custom bundles](../../../deploy-manage/deploy/elastic-cloud/upload-custom-plugins-bundles.md).
60+
61+
An example synonyms file:
62+
63+
```markdown
64+
# Blank lines and lines starting with pound are comments.
65+
66+
# Explicit mappings match any token sequence on the left hand side of "=>"
67+
# and replace with all alternatives on the right hand side.
68+
# These types of mappings ignore the expand parameter in the schema.
69+
# Examples:
70+
i-pod, i pod => ipod
71+
sea biscuit, sea biscit => seabiscuit
72+
73+
# Equivalent synonyms may be separated with commas and give
74+
# no explicit mapping. In this case the mapping behavior will
75+
# be taken from the expand parameter in the token filter configuration.
76+
# This allows the same synonym file to be used in different synonym handling strategies.
77+
# Examples:
78+
ipod, i-pod, i pod
79+
foozball , foosball
80+
universe , cosmos
81+
lol, laughing out loud
82+
83+
# If expand==true in the synonym token filter configuration,
84+
# "ipod, i-pod, i pod" is equivalent to the explicit mapping:
85+
ipod, i-pod, i pod => ipod, i-pod, i pod
86+
# If expand==false, "ipod, i-pod, i pod" is equivalent
87+
# to the explicit mapping:
88+
ipod, i-pod, i pod => ipod
89+
90+
# Multiple synonym mapping entries are merged.
91+
foo => foo bar
92+
foo => baz
93+
# is equivalent to
94+
foo => foo bar, baz
95+
```
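
To make the rules in the file comments above concrete, here is a minimal Python sketch of how such a file could be interpreted. It is an illustration only, not the parser {{es}} uses: `parse_synonym_rules` is a hypothetical helper, and it assumes that with `expand=False` an equivalent-synonyms line maps every entry to the first one, as the comments in the example file describe.

```python
def parse_synonym_rules(lines, expand=True):
    """Parse synonym rules into {input phrase: [replacement phrases]}."""
    mapping = {}
    for raw in lines:
        line = raw.strip()
        # Blank lines and lines starting with "#" are comments.
        if not line or line.startswith("#"):
            continue
        if "=>" in line:
            # Explicit mapping: the expand parameter is ignored.
            left, right = line.split("=>", 1)
            inputs = [p.strip() for p in left.split(",")]
            outputs = [p.strip() for p in right.split(",")]
        else:
            # Equivalent synonyms: behavior depends on expand.
            inputs = [p.strip() for p in line.split(",")]
            outputs = inputs if expand else inputs[:1]
        for phrase in inputs:
            # Multiple entries for the same input are merged.
            merged = mapping.setdefault(phrase, [])
            for out in outputs:
                if out not in merged:
                    merged.append(out)
    return mapping


expanded = parse_synonym_rules(["ipod, i-pod, i pod"], expand=True)
contracted = parse_synonym_rules(["ipod, i-pod, i pod"], expand=False)
merged = parse_synonym_rules(["foo => foo bar", "foo => baz"])
```

Here `expanded["i pod"]` holds all three variants, `contracted["i pod"]` holds only `ipod`, and `merged["foo"]` combines both right-hand sides, mirroring the three cases described in the file comments.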
96+
97+
To update an existing synonyms set, upload new files to your cluster. Synonyms set files must be kept in sync on every cluster node.
98+
99+
When a synonyms set is updated, search analyzers that use it need to be refreshed using the [reload search analyzers API](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-reload-analyzers.html).
100+
101+
This manual syncing and reloading makes this approach less flexible than using the [synonyms API](../../../solutions/search/full-text/search-with-synonyms.md#synonyms-store-synonyms-api).
102+
103+
104+
### Inline [synonyms-store-synonyms-inline]
105+
106+
You can test your synonyms by adding them directly inline in your token filter definition.
107+
108+
::::{warning}
109+
Inline synonyms are not recommended for production usage. A large number of inline synonyms increases cluster size unnecessarily and can lead to performance issues.
110+
111+
::::
112+
113+
114+
115+
### Configure synonyms token filters and analyzers [synonyms-synonym-token-filters]
116+
117+
Once your synonyms sets are created, you can start configuring your token filters and analyzers to use them.
118+
119+
::::{warning}
120+
Synonyms sets must exist before they can be added to indices. If an index is created referencing a nonexistent synonyms set, the index will remain in a partially created and inoperable state. The only way to recover from this scenario is to ensure the synonyms set exists, then either delete and re-create the index or close and re-open it.
121+
122+
::::
123+
124+
125+
::::{warning}
126+
Invalid synonym rules can cause errors when applying analyzer changes. For reloadable analyzers, this prevents reloading and applying changes. You must correct errors in the synonym rules and reload the analyzer.
127+
128+
An index with invalid synonym rules cannot be reopened, making it inoperable when:
129+
130+
* A node containing the index starts
131+
* The index is opened from a closed state
132+
* A node restart occurs (which reopens the node assigned shards)
133+
134+
::::
135+
136+
137+
{{es}} uses synonyms as part of the [analysis process](../../../manage-data/data-store/text-analysis.md). You can use two types of [token filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html) to include synonyms:
138+
139+
* [Synonym graph](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-graph-tokenfilter.html): Recommended, as it correctly handles multi-word synonyms (for example, "hurriedly" and "in a hurry").
140+
* [Synonym](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html): Not recommended if you need to use multi-word synonyms.
141+
142+
Check the documentation for each synonym token filter for configuration details and instructions on adding it to an analyzer.
143+
144+
145+
### Test your analyzer [synonyms-test-analyzer]
146+
147+
You can test an analyzer configuration without modifying your index settings. Use the [analyze API](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html) to test your analyzer chain:
148+
149+
```console
150+
GET /_analyze
151+
{
152+
"tokenizer": "standard",
153+
"filter" : [
154+
"lowercase",
155+
{
156+
"type": "synonym_graph",
157+
"synonyms": ["pc => personal computer", "computer, pc, laptop"]
158+
}
159+
],
160+
"text" : "Check how PC synonyms work"
161+
}
162+
```
163+
164+
165+
### Apply synonyms at index or search time [synonyms-apply-synonyms]
166+
167+
Analyzers can be applied at [index time or search time](../../../manage-data/data-store/text-analysis/index-search-analysis.md).
168+
169+
You need to decide when to apply your synonyms:
170+
171+
* Index time: Synonyms are applied when the documents are indexed into {{es}}. This is a less flexible alternative, as changes to your synonyms require [reindexing](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html).
172+
* Search time: Synonyms are applied when a search is executed. This is a more flexible approach, which doesn’t require reindexing. If token filters are configured with `"updateable": true`, search analyzers can be [reloaded](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-reload-analyzers.html) when you make changes to your synonyms.
173+
174+
Synonyms sets created using the [synonyms API](../../../solutions/search/full-text/search-with-synonyms.md#synonyms-store-synonyms-api) can only be used at search time.
30175

31-
$$$synonyms-synonym-token-filters$$$
176+
You can specify the analyzer that contains your synonyms set as a [search time analyzer](../../../manage-data/data-store/text-analysis/specify-an-analyzer.md#specify-search-analyzer) or as an [index time analyzer](../../../manage-data/data-store/text-analysis/specify-an-analyzer.md#specify-index-time-analyzer).
32177

33-
$$$synonyms-store-synonyms-api$$$
178+
The following example adds `my_analyzer` as a search analyzer to the `title` field in an index mapping:
34179

180+
```JSON
181+
{
182+
"mappings": {
183+
"properties": {
184+
"title": {
185+
"type": "text",
186+
"search_analyzer": "my_analyzer"
187+
}
188+
}
189+
},
190+
"settings": {
191+
"analysis": {
192+
"analyzer": {
193+
"my_analyzer": {
194+
"tokenizer": "whitespace",
195+
"filter": [
196+
"synonyms_filter"
197+
]
198+
}
199+
},
200+
"filter": {
201+
"synonyms_filter": {
202+
"type": "synonym",
203+
"synonyms_path": "analysis/synonym-set.txt",
204+
"updateable": true
205+
}
206+
}
207+
}
208+
}
209+
}
210+
```

solutions/search/full-text/text-analysis-during-search.md

Lines changed: 3 additions & 1 deletion
@@ -22,7 +22,9 @@ However, if you use `text` fields or your text searches aren’t returning resul
2222
* Perform lexicographic or linguistic research
2323

2424

25-
## In this section [analysis-toc]
25+
## Learn more [analysis-toc]
26+
27+
Learn more about text analysis in the **Manage Data** section of the documentation:
2628

2729
* [Overview](../../../manage-data/data-store/text-analysis.md)
2830
* [Concepts](../../../manage-data/data-store/text-analysis/concepts.md)

solutions/search/ingest-for-search.md

Lines changed: 1 addition & 4 deletions
@@ -8,9 +8,6 @@ mapped_urls:
88

99
# Ingest for search use cases
1010

11-
% ----
12-
% navigation_title: "Ingest for search use cases"
13-
% ----
1411

1512
$$$elasticsearch-ingest-time-series-data$$$
1613
::::{note}
@@ -40,7 +37,7 @@ You can use these specialized tools to add general content to {{es}} indices.
4037
| Method | Description | Notes |
4138
|--------|-------------|-------|
4239
| [**Web crawler**](https://github.com/elastic/crawler) | Programmatically discover and index content from websites and knowledge bases | Crawl public-facing web content or internal sites accessible via HTTP proxy |
43-
| [**Search connectors**]() | Third-party integrations to popular content sources like databases, cloud storage, and business applications | Choose from a range of Elastic-built connectors or build your own in Python using the Elastic connector framework|
40+
| [**Search connectors**](https://github.com/elastic/connectors) | Third-party integrations to popular content sources like databases, cloud storage, and business applications | Choose from a range of Elastic-built connectors or build your own in Python using the Elastic connector framework|
4441
| [**File upload**](/manage-data/ingest/tools/upload-data-files.md)| One-off manual uploads through the UI | Useful for testing or very small-scale use cases, but not recommended for production workflows |
4542

4643
### Process data at ingest time

solutions/toc.yml

Lines changed: 1 addition & 0 deletions
@@ -595,6 +595,7 @@ toc:
595595
- file: search/search-approaches.md
596596
- file: search/full-text.md
597597
children:
598+
- file: search/full-text/how-full-text-works.md
598599
- file: search/full-text/search-with-synonyms.md
599600
- file: search/full-text/text-analysis-during-search.md
600601
- file: search/full-text/search-relevance.md
