Commit b9cb754

Provide more details about vectorization
1 parent 27cb25c commit b9cb754

1 file changed

+49
-52
lines changed


solutions/search/get-started/semantic-search.md

Lines changed: 49 additions & 52 deletions
Original file line number | Diff line number | Diff line change
@@ -13,68 +13,65 @@ _Semantic search_ is a type of AI-powered search that enables you to use natural
1313
It returns results that match the meaning of a query, as opposed to literal keyword matches.
1414
For example, if you want to search for workplace guidelines on a second income, you could search for "side hustle", which is not a term you're likely to see in a formal HR document.
1515

16+
Semantic search uses {{es}} vector database and vector search technology.
17+
Each _vector_ (or _vector embedding_) is an array of numbers that each represent a different characteristic of the text, such as sentiment, context, and syntactics.
18+
These numeric representations make comparison with other vectors very efficient.
19+
1620
In this guide, you'll learn how to perform semantic search on a small set of sample data.
17-
You'll use the default Learned Sparse Encoder model ([ELSER](/explore-analyze/machine-learning/nlp/ml-nlp-elser.md)), which automatically creates vector embeddings (numeric representations that capture the text meaning) when storing and searching the data.
18-
This model is built to provide great relevance across domains, without the need for additional fine tuning.
21+
You'll create vectors and store them in {{es}}.
22+
Then you'll run a query, which will be transformed into vectors and compared to the stored data.
23+
By playing with a simple use case, you'll take the first steps toward understanding whether this type of search is relevant to your own data.
1924

2025
## Prerequisites
2126

2227
- If you're using [{{es-serverless}}](/solutions/search/serverless-elasticsearch-get-started.md), create a project that is optimized for vectors. To add the sample data, you must have a `developer` or `admin` predefined role or an equivalent custom role.
23-
- If you're [running {{es}} locally](/solutions/search/run-elasticsearch-locally.md), start {{es}} and {{kib}}. To add the sample data, log in with the `elastic` user that has the `superuser` built-in role.
28+
- If you're [running {{es}} locally](/solutions/search/run-elasticsearch-locally.md), start {{es}} and {{kib}}. To add the sample data, log in with a user that has the `superuser` built-in role, such as `elastic`.
2429

2530
To learn about role-based access control, check out [](/deploy-manage/users-roles/cluster-or-deployment-auth/user-roles.md).
2631

2732
<!--
28-
TBD: It seems like semantic search fields exist in all, so what is the value of this "optimized for vectors" option?
33+
TBD: What is the impact of this "optimized for vectors" option?
2934
-->
3035

31-
## Add data
36+
## Create a vector database
3237

33-
% TBD: What type of data is ideal for semantic search?
38+
When you create vectors (or _vectorize_ your data), you convert complex and nuanced documents into multidimensional numerical representations.
39+
You can choose from many different vector embedding models. Some are extremely hardware efficient and can be run with less computational power. Others have a greater “understanding” of the context and can answer questions and lead a threaded conversation.
40+
These examples use the default Learned Sparse Encoder ([ELSER](/explore-analyze/machine-learning/nlp/ml-nlp-elser.md)) model, which provides great relevance across domains without the need for additional fine tuning.
3441

35-
::::{tab-set}
36-
:::{tab-item} {{serverless-short}}
37-
:sync: serverless
38-
There are some small data sets available for learning purposes.
39-
Go to **{{es}} > Home**, select the semantic search workflow, and click **Create a semantic optimized index**.
42+
The way that you store and index vectors has a significant impact on the performance and accuracy of search results.
43+
They must be stored in specialized data structures designed to ensure efficient similarity search and speedy vector distance calculations.
44+
These examples store the vectors in `semantic_text` fields, which provide sensible defaults and automation.
4045

41-
Follow the instructions to install an {{es}} client and copy the code examples.
42-
Alternatively, try out the API requests in the [Console](/explore-analyze/query-filter/tools/console.md).
43-
:::
44-
:::{tab-item} {{stack}}
45-
:sync: stack
46-
There are some small data sets available for learning purposes.
47-
Go to **{{es}} > Home** and click **Create API index**.
48-
Select the semantic search workflow in the guided index flow.
49-
50-
Follow the instructions to install an {{es}} client and copy the code examples.
51-
Alternatively, try out the API requests in the [Console](/explore-analyze/query-filter/tools/console.md).
52-
:::
53-
::::
46+
Try vectorizing a small set of documents.
47+
You can follow the guided index workflow:
5448

55-
:::::{stepper}
49+
- If you're using {{es-serverless}}, go to **{{es}} > Home**, select the semantic search workflow, and click **Create a semantic optimized index**.
50+
- If you're running {{es}} locally, go to **{{es}} > Home** and click **Create API index**. Select the semantic search workflow.
5651

57-
::::{step} Create a `semantic_text` field mapping
52+
Alternatively, run the following API requests in the [Console](/explore-analyze/query-filter/tools/console.md):
5853

59-
You can implement semantic search with varying levels of complexity and customization.
60-
The recommended method is to use `semantic_text` fields, which provide sensible defaults and automation.
61-
For example, it uses [ELSER](/explore-analyze/machine-learning/nlp/ml-nlp-elser.md) (and therefore sparse vectors) by default.
54+
:::::{stepper}
55+
::::{step} Create a semantic_text field mapping
6256

63-
The following example creates a mapping for a single field:
57+
The following example creates a mapping for a single field ("content"):
6458

6559
```console
6660
PUT /semantic-index/_mapping
6761
{
6862
"properties": {
69-
"text": {
63+
"content": {
7064
"type": "semantic_text"
7165
}
7266
}
7367
}
7468
```
7569

76-
Refer to [](elasticsearch://reference/elasticsearch/mapping-reference/semantic-text.md) for details about this field type.
77-
For a broader overview, check out [Mapping embeddings to Elasticsearch field types: semantic_text, dense_vector, sparse_vector](https://www.elastic.co/search-labs/blog/mapping-embeddings-to-elasticsearch-field-types).
70+
When you use `semantic_text` fields, the type of vector is determined by the vector embedding model.
71+
In this case, the default ELSER model will be used to create sparse vectors.
72+
73+
For more details about `semantic_text` fields, refer to [](elasticsearch://reference/elasticsearch/mapping-reference/semantic-text.md).
74+
For a deeper dive, check out [Mapping embeddings to Elasticsearch field types: semantic_text, dense_vector, sparse_vector](https://www.elastic.co/search-labs/blog/mapping-embeddings-to-elasticsearch-field-types).
7875
::::
7976

8077
::::{step} Add documents
@@ -84,30 +81,26 @@ You can use the Elasticsearch bulk API to ingest an array of documents:
8481
```console
8582
POST /_bulk?pretty
8683
{ "index": { "_index": "semantic-index" } }
87-
{"text":"Yellowstone National Park is one of the largest national parks in the United States. It ranges from the Wyoming to Montana and Idaho, and contains an area of 2,219,791 acress across three different states. Its most famous for hosting the geyser Old Faithful and is centered on the Yellowstone Caldera, the largest super volcano on the American continent. Yellowstone is host to hundreds of species of animal, many of which are endangered or threatened. Most notably, it contains free-ranging herds of bison and elk, alongside bears, cougars and wolves. The national park receives over 4.5 million visitors annually and is a UNESCO World Heritage Site."}
84+
{"content":"Yellowstone National Park is one of the largest national parks in the United States. It ranges from the Wyoming to Montana and Idaho, and contains an area of 2,219,791 acress across three different states. Its most famous for hosting the geyser Old Faithful and is centered on the Yellowstone Caldera, the largest super volcano on the American continent. Yellowstone is host to hundreds of species of animal, many of which are endangered or threatened. Most notably, it contains free-ranging herds of bison and elk, alongside bears, cougars and wolves. The national park receives over 4.5 million visitors annually and is a UNESCO World Heritage Site."}
8885
{ "index": { "_index": "semantic-index" } }
89-
{"text":"Yosemite National Park is a United States National Park, covering over 750,000 acres of land in California. A UNESCO World Heritage Site, the park is best known for its granite cliffs, waterfalls and giant sequoia trees. Yosemite hosts over four million visitors in most years, with a peak of five million visitors in 2016. The park is home to a diverse range of wildlife, including mule deer, black bears, and the endangered Sierra Nevada bighorn sheep. The park has 1,200 square miles of wilderness, and is a popular destination for rock climbers, with over 3,000 feet of vertical granite to climb. Its most famous and cliff is the El Capitan, a 3,000 feet monolith along its tallest face."}
86+
{"content":"Yosemite National Park is a United States National Park, covering over 750,000 acres of land in California. A UNESCO World Heritage Site, the park is best known for its granite cliffs, waterfalls and giant sequoia trees. Yosemite hosts over four million visitors in most years, with a peak of five million visitors in 2016. The park is home to a diverse range of wildlife, including mule deer, black bears, and the endangered Sierra Nevada bighorn sheep. The park has 1,200 square miles of wilderness, and is a popular destination for rock climbers, with over 3,000 feet of vertical granite to climb. Its most famous and cliff is the El Capitan, a 3,000 feet monolith along its tallest face."}
9087
{ "index": { "_index": "semantic-index" } }
91-
{"text":"Rocky Mountain National Park is one of the most popular national parks in the United States. It receives over 4.5 million visitors annually, and is known for its mountainous terrain, including Longs Peak, which is the highest peak in the park. The park is home to a variety of wildlife, including elk, mule deer, moose, and bighorn sheep. The park is also home to a variety of ecosystems, including montane, subalpine, and alpine tundra. The park is a popular destination for hiking, camping, and wildlife viewing, and is a UNESCO World Heritage Site."}
88+
{"content":"Rocky Mountain National Park is one of the most popular national parks in the United States. It receives over 4.5 million visitors annually, and is known for its mountainous terrain, including Longs Peak, which is the highest peak in the park. The park is home to a variety of wildlife, including elk, mule deer, moose, and bighorn sheep. The park is also home to a variety of ecosystems, including montane, subalpine, and alpine tundra. The park is a popular destination for hiking, camping, and wildlife viewing, and is a UNESCO World Heritage Site."}
9289
```
9390

9491
The bulk ingestion request might take longer than the default request timeout.
9592
If it times out, wait for the machine learning model loading to complete (typically 1-5 minutes) then retry it.
96-
97-
<!--
98-
TBD: Describe where to look for the downloaded model in Trained Models?
99-
-->
93+
::::
94+
:::::
10095

10196
What just happened? The content was transformed into a sparse vector, which involves two main steps.
10297
First, the content is divided into smaller, manageable chunks to ensure that meaningful segments can be more effectively processed and searched. Then each chunk of text is transformed into a sparse vector representation using text expansion techniques.
103-
By default, `semantic_text` fields leverage ELSER to transform the content.
104-
105-
% TBD: Confirm "Elser model" vs ".elser-2-elasticsearch" terminology.
10698

10799
![Semantic search chunking](https://images.contentstack.io/v3/assets/bltefdd0b53724fa2ce/blt9bbe5e260012b15d/67ffffc8165067d96124b586/animated-gif-semantic-search-chunking.gif)
108100

109-
::::
110-
::::{step} Explore the data
101+
102+
## Explore the data
103+
111104
To familiarize yourself with this data set, open [Discover](/explore-analyze/discover.md) from the navigation menu or by using the [global search field](/explore-analyze/find-and-organize/find-apps-and-objects.md).
112105

113106
In **Discover**, you can click the expand icon ![double arrow icon to open a flyout with the document details](/explore-analyze/images/kibana-expand-icon-2.png "") to show details about any documents in the table.
@@ -118,15 +111,19 @@ In **Discover**, you can click the expand icon ![double arrow icon to open a fly
118111
:::
119112

120113
For more tips, check out [](/explore-analyze/discover/discover-get-started.md).
121-
::::
122-
:::::
114+
123115
<!--
124116
TBD: When you view these documents in Discover they're shown as having "text" field type instead of "semantic_text" is this right?
125-
TBD: Should we call out that the KQL filters in Discover don't seem to work against semantic_text fields yet?
126117
-->
127118

128119
## Test semantic search
129120

121+
<!--
122+
TO-DO: Talk about the pipeline where vectors are required for both the data and search query
123+
% encodes details of searchable information into vectors and then compares vectors to determine which are most similar.
124+
When you run a query, the search engine transforms the query into embeddings, which are numerical representations of data and related contexts. They are stored in vectors. The kNN algorithm, or k-nearest neighbor algorithm, then matches vectors of existing documents (a semantic search concerns text) to the query vectors. The semantic search then generates results and ranks them based on conceptual relevance.
125+
-->
126+
130127
{{es}} provides a variety of query languages for interacting with your data.
131128
For an overview of their features and use cases, check out [](/explore-analyze/query-filter/languages.md).
132129

@@ -155,7 +152,7 @@ POST /semantic-index/_search
155152
"standard": {
156153
"query": {
157154
"semantic": {
158-
"field": "text",
155+
"field": "content",
159156
"query": "best park for rappelling"
160157
}
161158
}
@@ -191,7 +188,7 @@ The search results are sorted by a relevance score, which measures how well each
191188
"_id": "Pp0MtJcBZjjo1YKoXkWH",
192189
"_score": 11.389743,
193190
"_source": {
194-
"text": "Rocky Mountain National Park ..."
191+
"content": "Rocky Mountain National Park ..."
195192
...
196193
}
197194
```
@@ -213,8 +210,8 @@ Copy the following query:
213210

214211
```esql
215212
FROM semantic-index METADATA _score <1>
216-
| WHERE text: "what's the biggest park?" <2>
217-
| KEEP text, _score <3>
213+
| WHERE content: "what's the biggest park?" <2>
214+
| KEEP content, _score <3>
218215
| SORT _score DESC <4>
219216
| LIMIT 1000 <5>
220217
```
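
The added text in this commit describes vectors as numeric representations that can be compared efficiently, and says that a query is transformed into a vector and compared to the stored data. The following Python sketch illustrates that idea for sparse vectors. It is a toy under stated assumptions, not how {{es}} or ELSER is implemented: the token weights are invented for illustration, and the `sparse_dot` and `rank` helpers are hypothetical names that do not appear in the documentation.

```python
# Toy illustration of comparing sparse vectors.
# Assumption: each sparse vector is a {token: weight} dict; all weights
# below are made up and are NOT real ELSER output.

def sparse_dot(query_vec, doc_vec):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    # Iterate over the smaller dict so the cost depends on the shorter vector.
    if len(query_vec) > len(doc_vec):
        query_vec, doc_vec = doc_vec, query_vec
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)


def rank(query_vec, docs):
    """Return (name, score) pairs sorted by descending score."""
    scored = [(name, sparse_dot(query_vec, vec)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical stored vectors for three documents.
docs = {
    "yellowstone": {"geyser": 1.5, "bison": 1.0, "park": 0.5},
    "yosemite": {"granite": 1.5, "climbing": 1.0, "park": 0.5},
    "rocky_mountain": {"peak": 1.5, "hiking": 1.0, "park": 0.5},
}

# Hypothetical vector for a query about climbing.
query = {"climbing": 2.0, "park": 0.5}

ranking = rank(query, docs)
for name, score in ranking:
    print(f"{name}: {score}")
```

Only tokens that appear in both the query and a document contribute to the score, which is why sparse representations can be compared cheaply. A real model assigns weights to many more tokens, including related terms that never appear literally in the text.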
