Commit e45d880

[DOCS] Gives more details to the load data step of the semantic search tutorials (#113088) (#113095)
Co-authored-by: Liam Thompson <[email protected]>
1 parent abf3c0f commit e45d880

File tree

3 files changed: +39 −21 lines


docs/reference/search/search-your-data/semantic-search-elser.asciidoc

Lines changed: 19 additions & 7 deletions
@@ -117,15 +117,15 @@ All unique passages, along with their IDs, have been extracted from that data se
 https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].

 IMPORTANT: The `msmarco-passagetest2019-top1000` dataset was not utilized to train the model.
-It is only used in this tutorial as a sample dataset that is easily accessible for demonstration purposes.
+We use this sample dataset in the tutorial because it is easily accessible for demonstration purposes.
 You can use a different data set to test the workflow and become familiar with it.

-Download the file and upload it to your cluster using the
-{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
-in the {ml-app} UI.
-Assign the name `id` to the first column and `content` to the second column.
-The index name is `test-data`.
-Once the upload is complete, you can see an index named `test-data` with 182469 documents.
+Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[File Uploader] in the UI.
+After your data is analyzed, click **Override settings**.
+Under **Edit field names**, assign `id` to the first column and `content` to the second.
+Click **Apply**, then **Import**.
+Name the index `test-data`, and click **Import**.
+After the upload is complete, you will see an index named `test-data` with 182,469 documents.

 [discrete]
 [[reindexing-data-elser]]
@@ -161,6 +161,18 @@ GET _tasks/<task_id>

 You can also open the Trained Models UI, select the Pipelines tab under ELSER to follow the progress.

+Reindexing large datasets can take a long time.
+You can test this workflow using only a subset of the dataset.
+Do this by cancelling the reindexing process, and only generating embeddings for the subset that was reindexed.
+The following API request will cancel the reindexing task:
+
+[source,console]
+----
+POST _tasks/<task_id>/_cancel
+----
+// TEST[skip:TBD]
+
+
 [discrete]
 [[text-expansion-query]]
 ==== Semantic search by using the `sparse_vector` query
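
A quick way to confirm the upload step described in this change is to check the document count of the new index from the Kibana Dev Tools console. This is a minimal sketch, not part of the commit; it assumes only the `test-data` index name and document total stated in the tutorial text above:

GET test-data/_count

If the full tsv file was imported, the response should report a `count` of 182,469.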

docs/reference/search/search-your-data/semantic-search-inference.asciidoc

Lines changed: 10 additions & 7 deletions
@@ -67,12 +67,12 @@ It consists of 200 queries, each accompanied by a list of relevant text passages
 All unique passages, along with their IDs, have been extracted from that data set and compiled into a
 https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].

-Download the file and upload it to your cluster using the
-{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
-in the {ml-app} UI.
-Assign the name `id` to the first column and `content` to the second column.
-The index name is `test-data`.
-Once the upload is complete, you can see an index named `test-data` with 182469 documents.
+Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer] in the {ml-app} UI.
+After your data is analyzed, click **Override settings**.
+Under **Edit field names**, assign `id` to the first column and `content` to the second.
+Click **Apply**, then **Import**.
+Name the index `test-data`, and click **Import**.
+After the upload is complete, you will see an index named `test-data` with 182,469 documents.

 [discrete]
 [[reindexing-data-infer]]
@@ -91,7 +91,10 @@ GET _tasks/<task_id>
 ----
 // TEST[skip:TBD]

-You can also cancel the reindexing process if you don't want to wait until the reindexing process is fully complete which might take hours for large data sets:
+Reindexing large datasets can take a long time.
+You can test this workflow using only a subset of the dataset.
+Do this by cancelling the reindexing process, and only generating embeddings for the subset that was reindexed.
+The following API request will cancel the reindexing task:

 [source,console]
 ----
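
The console block this hunk runs into is truncated in the diff, but the cancel request it introduces is the same task management call shown in the ELSER tutorial above; for reference:

POST _tasks/<task_id>/_cancel

`<task_id>` is the ID returned when the reindex was started asynchronously. Documents copied before the cancellation stay in the destination index, which is why the tutorial can continue with just that subset.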

docs/reference/search/search-your-data/semantic-search-semantic-text.asciidoc

Lines changed: 10 additions & 7 deletions
@@ -96,11 +96,12 @@ a list of relevant text passages. All unique passages, along with their IDs,
 have been extracted from that data set and compiled into a
 https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].

-Download the file and upload it to your cluster using the
-{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
-in the {ml-app} UI. Assign the name `id` to the first column and `content` to
-the second column. The index name is `test-data`. Once the upload is complete,
-you can see an index named `test-data` with 182469 documents.
+Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer] in the {ml-app} UI.
+After your data is analyzed, click **Override settings**.
+Under **Edit field names**, assign `id` to the first column and `content` to the second.
+Click **Apply**, then **Import**.
+Name the index `test-data`, and click **Import**.
+After the upload is complete, you will see an index named `test-data` with 182,469 documents.


 [discrete]
@@ -137,8 +138,10 @@ GET _tasks/<task_id>
 ------------------------------------------------------------
 // TEST[skip:TBD]

-It is recommended to cancel the reindexing process if you don't want to wait
-until it is fully complete which might take a long time for an inference endpoint with few assigned resources:
+Reindexing large datasets can take a long time.
+You can test this workflow using only a subset of the dataset.
+Do this by cancelling the reindexing process, and only generating embeddings for the subset that was reindexed.
+The following API request will cancel the reindexing task:

 [source,console]
 ------------------------------------------------------------
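
If you cancel the reindex early in this tutorial as well, you can gauge how much of the dataset was copied before moving on. As a rough sketch (not part of this commit), the task status request already shown in the hunk headers reports progress; for a reindex task, the `status.created` and `status.total` fields in the response indicate how many documents have been written so far out of the total:

GET _tasks/<task_id>

The subset copied before cancellation is the subset that receives embeddings, which is consistent with the note this commit adds about testing the workflow on part of the dataset.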
