
Commit e0a4edc

szabosteve, David Roberts, and abdonpijpelink authored
[DOCS] Adds example of semantic search with ELSER (#95992)
Co-authored-by: David Roberts <[email protected]> Co-authored-by: Abdon Pijpelink <[email protected]>
1 parent 59fbd1b commit e0a4edc

File tree: 1 file changed, +222 -2 lines changed

[[semantic-search-elser]]
== Tutorial: semantic search with ELSER

++++
<titleabbrev>Semantic search with ELSER</titleabbrev>
++++

:keywords: {ml-init}, {stack}, {nlp}, ELSER
:description: ELSER is a learned sparse ranking model trained by Elastic.

Elastic Learned Sparse EncodeR - or ELSER - is an NLP model trained by Elastic
that enables you to perform semantic search by using sparse vector
representation. Instead of literal matching on search terms, semantic search
retrieves results based on the intent and the contextual meaning of a search
query.

The instructions in this tutorial show you how to use ELSER to perform semantic
search on your data.

NOTE: Only the first 512 extracted tokens per field are considered during
semantic search with ELSER v1. Refer to
{ml-docs}/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512[this page] for more
information.

[discrete]
[[requirements]]
=== Requirements

To perform semantic search by using ELSER, you must have the NLP model deployed
in your cluster. Refer to the
{ml-docs}/ml-nlp-elser.html[ELSER documentation] to learn how to download and
deploy the model.
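
If the model is already downloaded into your cluster, you can start a
deployment with the start trained model deployment API. The following is only a
minimal sketch that assumes the default `.elser_model_1` model ID; refer to the
ELSER documentation linked above for the full procedure.

[source,console]
----
POST _ml/trained_models/.elser_model_1/deployment/_start
----
// TEST[skip:TBD]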

[discrete]
[[elser-mappings]]
=== Create the index mapping

First, create the mapping of the destination index - the index that contains
the tokens that the model created based on your text. The destination index
must have a field with the <<rank-features, `rank_features`>> field type to
index the ELSER output.

[source,console]
----
PUT my-index
{
  "mappings": {
    "properties": {
      "ml.tokens": {
        "type": "rank_features" <1>
      },
      "text_field": {
        "type": "text" <2>
      }
    }
  }
}
----
// TEST[skip:TBD]
<1> The field that contains the prediction is a `rank_features` field.
<2> The text field from which to create the sparse vector representation.
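
Optionally, you can verify that the fields were mapped as expected with the get
mapping API; this is just a quick check, not a required step:

[source,console]
----
GET my-index/_mapping
----
// TEST[skip:TBD]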

[discrete]
[[inference-ingest-pipeline]]
=== Create an ingest pipeline with an inference processor

Create an <<ingest,ingest pipeline>> with an
<<inference-processor,{infer} processor>> to use ELSER to infer against the data
that is being ingested in the pipeline.

[source,console]
----
PUT _ingest/pipeline/elser-v1-test
{
  "processors": [
    {
      "inference": {
        "model_id": ".elser_model_1",
        "target_field": "ml",
        "field_map": {
          "text": "text_field"
        },
        "inference_config": {
          "text_expansion": { <1>
            "results_field": "tokens"
          }
        }
      }
    }
  ]
}
----
// TEST[skip:TBD]
<1> The `text_expansion` inference type needs to be used in the {infer} ingest
processor.
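
Before you reindex the full data set, you can check that the pipeline produces
tokens by using the simulate pipeline API. The sample document below is
illustrative only; it assumes ELSER is deployed and reuses the `text` field name
expected by the `field_map` above.

[source,console]
----
POST _ingest/pipeline/elser-v1-test/_simulate
{
  "docs": [
    {
      "_source": {
        "text": "How to avoid muscle soreness after running?"
      }
    }
  ]
}
----
// TEST[skip:TBD]

If the model is deployed correctly, the simulated document in the response
contains an `ml.tokens` object with the predicted tokens and their weights.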

[discrete]
[[load-data]]
=== Load data

In this step, you load the data that you later use in the {infer} ingest
pipeline to extract tokens from it.

Use the `msmarco-passagetest2019-top1000` data set, which is a subset of the MS
MARCO Passage Ranking data set. It consists of 200 queries, each accompanied by
a list of relevant text passages. All unique passages, along with their IDs,
have been extracted from that data set and compiled into a
https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].

Download the file and upload it to your cluster using the
{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
in the {ml-app} UI. Assign the name `id` to the first column and `text` to the
second column. The index name is `test-data`. Once the upload is complete, you
can see an index named `test-data` with 182469 documents.
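
To confirm the upload, you can check the document count of the new index, for
example with the count API:

[source,console]
----
GET test-data/_count
----
// TEST[skip:TBD]

The response should report a `count` of 182469.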

[discrete]
[[reindexing-data-elser]]
=== Ingest the data through the {infer} ingest pipeline

Create the tokens from the text by reindexing the data through the {infer}
pipeline that uses ELSER as the inference model.

[source,console]
----
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "test-data"
  },
  "dest": {
    "index": "my-index",
    "pipeline": "elser-v1-test"
  }
}
----
// TEST[skip:TBD]

The call returns a task ID to monitor the progress:

[source,console]
----
GET _tasks/<task_id>
----
// TEST[skip:TBD]

You can also open the Trained Models UI and select the Pipelines tab under
ELSER to follow the progress. It may take a couple of minutes to complete the
process.

[discrete]
[[text-expansion-query]]
=== Semantic search by using the `text_expansion` query

To perform semantic search, use the `text_expansion` query and provide the
query text and the ELSER model ID. The example below uses the query text
"How to avoid muscle soreness after running?":

[source,console]
----
GET my-index/_search
{
   "query":{
      "text_expansion":{
         "ml.tokens":{
            "model_id":".elser_model_1",
            "model_text":"How to avoid muscle soreness after running?"
         }
      }
   }
}
----
// TEST[skip:TBD]

The result is the top 10 documents from the `my-index` index that are closest in
meaning to your query text, sorted by their relevance. The result also contains
the extracted tokens for each of the relevant search results with their weights.

[source,console-result]
----
"hits":[
  {
    "_index":"my-index",
    "_id":"978UAYgBKCQMet06sLEy",
    "_score":18.612831,
    "_ignored":[
      "text.keyword"
    ],
    "_source":{
      "id":7361587,
      "text":"For example, if you go for a run, you will mostly use the muscles in your lower body. Give yourself 2 days to rest those muscles so they have a chance to heal before you exercise them again. Not giving your muscles enough time to rest can cause muscle damage, rather than muscle development.",
      "ml":{
        "tokens":{
          "muscular":0.075696334,
          "mostly":0.52380747,
          "practice":0.23430172,
          "rehab":0.3673556,
          "cycling":0.13947526,
          "your":0.35725075,
          "years":0.69484913,
          "soon":0.005317828,
          "leg":0.41748235,
          "fatigue":0.3157955,
          "rehabilitation":0.13636169,
          "muscles":1.302141,
          "exercises":0.36694175,
          (...)
        },
        "model_id":".elser_model_1"
      }
    }
  },
  (...)
]
----
// NOTCONSOLE
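
The `ml.tokens` object can make responses large. If you prefer leaner results,
one option is to exclude the tokens with standard `_source` filtering; the
sketch below assumes the same index and query as above:

[source,console]
----
GET my-index/_search
{
   "_source": {
      "excludes": [ "ml.tokens" ]
   },
   "query":{
      "text_expansion":{
         "ml.tokens":{
            "model_id":".elser_model_1",
            "model_text":"How to avoid muscle soreness after running?"
         }
      }
   }
}
----
// TEST[skip:TBD]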

[discrete]
[[further-reading]]
=== Further reading

* {ml-docs}/ml-nlp-elser.html[How to download and deploy ELSER]
* {ml-docs}/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512[ELSER v1 limitation]
// TO DO: refer to the ELSER blog post
