
Commit 5153905

[DOCS] Gives more details to the load data step of the semantic search tutorials (#113088)
Co-authored-by: Liam Thompson <[email protected]>
1 parent 99b5ed8 · commit 5153905

3 files changed: +39, -21 lines changed

3 files changed

+39
-21
lines changed

docs/reference/search/search-your-data/semantic-search-elser.asciidoc

Lines changed: 19 additions & 7 deletions
@@ -117,15 +117,15 @@ All unique passages, along with their IDs, have been extracted from that data set
 https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].
 
 IMPORTANT: The `msmarco-passagetest2019-top1000` dataset was not utilized to train the model.
-It is only used in this tutorial as a sample dataset that is easily accessible for demonstration purposes.
+We use this sample dataset in the tutorial because it is easily accessible for demonstration purposes.
 You can use a different data set to test the workflow and become familiar with it.
 
-Download the file and upload it to your cluster using the
-{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
-in the {ml-app} UI.
-Assign the name `id` to the first column and `content` to the second column.
-The index name is `test-data`.
-Once the upload is complete, you can see an index named `test-data` with 182469 documents.
+Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[File Uploader] in the UI.
+After your data is analyzed, click **Override settings**.
+Under **Edit field names**, assign `id` to the first column and `content` to the second.
+Click **Apply**, then **Import**.
+Name the index `test-data`, and click **Import**.
+After the upload is complete, you will see an index named `test-data` with 182,469 documents.
 
 [discrete]
 [[reindexing-data-elser]]
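The upload steps added above create the `test-data` index. As a quick sanity check once the import finishes, a count request along these lines (a minimal sketch; `test-data` is the index name chosen in the tutorial) should report the 182,469 uploaded passages:

[source,console]
----
GET test-data/_count
----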
@@ -161,6 +161,18 @@ GET _tasks/<task_id>
 
 You can also open the Trained Models UI, select the Pipelines tab under ELSER to follow the progress.
 
+Reindexing large datasets can take a long time.
+You can test this workflow using only a subset of the dataset.
+Do this by cancelling the reindexing process and only generating embeddings for the subset that was reindexed.
+The following API request will cancel the reindexing task:
+
+[source,console]
+----
+POST _tasks/<task_id>/_cancel
+----
+// TEST[skip:TBD]
+
+
 [discrete]
 [[text-expansion-query]]
 ==== Semantic search by using the `sparse_vector` query
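For context, the task cancelled above is the `_reindex` call that the tutorial starts asynchronously with `wait_for_completion=false`. A minimal sketch of such a request is shown below, assuming the `test-data` source index from the upload step; the destination index and ingest pipeline names (`my-index`, `elser-v2-test`) are illustrative placeholders rather than values taken from this diff:

[source,console]
----
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "test-data",
    "size": 50
  },
  "dest": {
    "index": "my-index",
    "pipeline": "elser-v2-test"
  }
}
----

Keeping `source.size` small reduces the reindex batch size, so cancelling the task still leaves a usable, partially embedded subset and the task status updates more often.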

docs/reference/search/search-your-data/semantic-search-inference.asciidoc

Lines changed: 10 additions & 7 deletions
@@ -68,12 +68,12 @@ It consists of 200 queries, each accompanied by a list of relevant text passages
 All unique passages, along with their IDs, have been extracted from that data set and compiled into a
 https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].
 
-Download the file and upload it to your cluster using the
-{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
-in the {ml-app} UI.
-Assign the name `id` to the first column and `content` to the second column.
-The index name is `test-data`.
-Once the upload is complete, you can see an index named `test-data` with 182469 documents.
+Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer] in the {ml-app} UI.
+After your data is analyzed, click **Override settings**.
+Under **Edit field names**, assign `id` to the first column and `content` to the second.
+Click **Apply**, then **Import**.
+Name the index `test-data`, and click **Import**.
+After the upload is complete, you will see an index named `test-data` with 182,469 documents.
 
 [discrete]
 [[reindexing-data-infer]]
@@ -92,7 +92,10 @@ GET _tasks/<task_id>
 ----
 // TEST[skip:TBD]
 
-You can also cancel the reindexing process if you don't want to wait until the reindexing process is fully complete which might take hours for large data sets:
+Reindexing large datasets can take a long time.
+You can test this workflow using only a subset of the dataset.
+Do this by cancelling the reindexing process and only generating embeddings for the subset that was reindexed.
+The following API request will cancel the reindexing task:
 
 [source,console]
 ----
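The console block that opens at the end of this hunk is truncated by the diff context; as in the ELSER tutorial above, the request it contains is presumably the task-cancellation call:

[source,console]
----
POST _tasks/<task_id>/_cancel
----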

docs/reference/search/search-your-data/semantic-search-semantic-text.asciidoc

Lines changed: 10 additions & 7 deletions
@@ -96,11 +96,12 @@ a list of relevant text passages. All unique passages, along with their IDs,
 have been extracted from that data set and compiled into a
 https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].
 
-Download the file and upload it to your cluster using the
-{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
-in the {ml-app} UI. Assign the name `id` to the first column and `content` to
-the second column. The index name is `test-data`. Once the upload is complete,
-you can see an index named `test-data` with 182469 documents.
+Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer] in the {ml-app} UI.
+After your data is analyzed, click **Override settings**.
+Under **Edit field names**, assign `id` to the first column and `content` to the second.
+Click **Apply**, then **Import**.
+Name the index `test-data`, and click **Import**.
+After the upload is complete, you will see an index named `test-data` with 182,469 documents.
 
 
 [discrete]
@@ -137,8 +138,10 @@ GET _tasks/<task_id>
 ------------------------------------------------------------
 // TEST[skip:TBD]
 
-It is recommended to cancel the reindexing process if you don't want to wait
-until it is fully complete which might take a long time for an inference endpoint with few assigned resources:
+Reindexing large datasets can take a long time.
+You can test this workflow using only a subset of the dataset.
+Do this by cancelling the reindexing process and only generating embeddings for the subset that was reindexed.
+The following API request will cancel the reindexing task:
 
 [source,console]
 ------------------------------------------------------------
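After cancelling, only the documents that were reindexed before cancellation have embeddings. To see how large that subset is, you can count the documents in the destination index; the name `semantic-embeddings` below is only an assumed example, so substitute whatever destination index your reindex request targets:

[source,console]
----
GET semantic-embeddings/_count
----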
