Commit e45d880

[DOCS] Gives more details to the load data step of the semantic search tutorials (#113088) (#113095)
Co-authored-by: Liam Thompson <[email protected]>
1 parent abf3c0f commit e45d880

File tree

3 files changed: +39 −21 lines


docs/reference/search/search-your-data/semantic-search-elser.asciidoc

Lines changed: 19 additions & 7 deletions
@@ -117,15 +117,15 @@ All unique passages, along with their IDs, have been extracted from that data se
 https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].

 IMPORTANT: The `msmarco-passagetest2019-top1000` dataset was not utilized to train the model.
-It is only used in this tutorial as a sample dataset that is easily accessible for demonstration purposes.
+We use this sample dataset in the tutorial because it is easily accessible for demonstration purposes.
 You can use a different data set to test the workflow and become familiar with it.

-Download the file and upload it to your cluster using the
-{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
-in the {ml-app} UI.
-Assign the name `id` to the first column and `content` to the second column.
-The index name is `test-data`.
-Once the upload is complete, you can see an index named `test-data` with 182469 documents.
+Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[File Uploader] in the UI.
+After your data is analyzed, click **Override settings**.
+Under **Edit field names**, assign `id` to the first column and `content` to the second.
+Click **Apply**, then **Import**.
+Name the index `test-data`, and click **Import**.
+After the upload is complete, you will see an index named `test-data` with 182,469 documents.

 [discrete]
 [[reindexing-data-elser]]
@@ -161,6 +161,18 @@ GET _tasks/<task_id>

 You can also open the Trained Models UI, select the Pipelines tab under ELSER to follow the progress.

+Reindexing large datasets can take a long time.
+You can test this workflow using only a subset of the dataset.
+Do this by cancelling the reindexing process, and only generating embeddings for the subset that was reindexed.
+The following API request will cancel the reindexing task:
+
+[source,console]
+----
+POST _tasks/<task_id>/_cancel
+----
+// TEST[skip:TBD]
+
+
 [discrete]
 [[text-expansion-query]]
 ==== Semantic search by using the `sparse_vector` query
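
A quick way to confirm the upload step described in this change is to check the document count of the new index from the Kibana Dev Tools console. This is a minimal sketch, not part of the commit; it assumes only the `test-data` index name and document total stated in the tutorial text above:

GET test-data/_count

If the full tsv file was imported, the response should report a `count` of 182,469.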

docs/reference/search/search-your-data/semantic-search-inference.asciidoc

Lines changed: 10 additions & 7 deletions
@@ -67,12 +67,12 @@ It consists of 200 queries, each accompanied by a list of relevant text passages
 All unique passages, along with their IDs, have been extracted from that data set and compiled into a
 https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].

-Download the file and upload it to your cluster using the
-{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
-in the {ml-app} UI.
-Assign the name `id` to the first column and `content` to the second column.
-The index name is `test-data`.
-Once the upload is complete, you can see an index named `test-data` with 182469 documents.
+Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer] in the {ml-app} UI.
+After your data is analyzed, click **Override settings**.
+Under **Edit field names**, assign `id` to the first column and `content` to the second.
+Click **Apply**, then **Import**.
+Name the index `test-data`, and click **Import**.
+After the upload is complete, you will see an index named `test-data` with 182,469 documents.

 [discrete]
 [[reindexing-data-infer]]
@@ -91,7 +91,10 @@ GET _tasks/<task_id>
 ----
 // TEST[skip:TBD]

-You can also cancel the reindexing process if you don't want to wait until the reindexing process is fully complete which might take hours for large data sets:
+Reindexing large datasets can take a long time.
+You can test this workflow using only a subset of the dataset.
+Do this by cancelling the reindexing process, and only generating embeddings for the subset that was reindexed.
+The following API request will cancel the reindexing task:

 [source,console]
 ----
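
The console block this hunk runs into is truncated in the diff, but the cancel request it introduces is the same task management call shown in the ELSER tutorial above; for reference:

POST _tasks/<task_id>/_cancel

`<task_id>` is the ID returned when the reindex was started asynchronously. Documents copied before the cancellation stay in the destination index, which is why the tutorial can continue with just that subset.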

docs/reference/search/search-your-data/semantic-search-semantic-text.asciidoc

Lines changed: 10 additions & 7 deletions
@@ -96,11 +96,12 @@ a list of relevant text passages. All unique passages, along with their IDs,
 have been extracted from that data set and compiled into a
 https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].

-Download the file and upload it to your cluster using the
-{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
-in the {ml-app} UI. Assign the name `id` to the first column and `content` to
-the second column. The index name is `test-data`. Once the upload is complete,
-you can see an index named `test-data` with 182469 documents.
+Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer] in the {ml-app} UI.
+After your data is analyzed, click **Override settings**.
+Under **Edit field names**, assign `id` to the first column and `content` to the second.
+Click **Apply**, then **Import**.
+Name the index `test-data`, and click **Import**.
+After the upload is complete, you will see an index named `test-data` with 182,469 documents.


 [discrete]
@@ -137,8 +138,10 @@ GET _tasks/<task_id>
 ------------------------------------------------------------
 // TEST[skip:TBD]

-It is recommended to cancel the reindexing process if you don't want to wait
-until it is fully complete which might take a long time for an inference endpoint with few assigned resources:
+Reindexing large datasets can take a long time.
+You can test this workflow using only a subset of the dataset.
+Do this by cancelling the reindexing process, and only generating embeddings for the subset that was reindexed.
+The following API request will cancel the reindexing task:

 [source,console]
 ------------------------------------------------------------
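
If you cancel the reindex early in this tutorial as well, you can gauge how much of the dataset was copied before moving on. As a rough sketch (not part of this commit), the task status request already shown in the hunk headers reports progress; for a reindex task, the `status.created` and `status.total` fields in the response indicate how many documents have been written so far out of the total:

GET _tasks/<task_id>

The subset copied before cancellation is the subset that receives embeddings, which is consistent with the note this commit adds about testing the workflow on part of the dataset.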
