
Commit 36e95ca

[DOCS] Improve inference API documentation (elastic#115235) (elastic#115525)
Co-authored-by: David Kyle <[email protected]>
1 parent b755d40 commit 36e95ca

File tree: 3 files changed (+104, -11 lines)

docs/reference/inference/inference-apis.asciidoc

Lines changed: 18 additions & 0 deletions
@@ -34,6 +34,24 @@ Elastic –, then create an {infer} endpoint by the <<put-inference-api>>.
 Now use <<semantic-search-semantic-text, semantic text>> to perform
 <<semantic-search, semantic search>> on your data.
 
+
+[discrete]
+[[default-endpoints]]
+=== Default {infer} endpoints
+
+Your {es} deployment contains preconfigured {infer} endpoints that make it easier for you to use them when defining `semantic_text` fields or {infer} processors.
+The following list contains the default {infer} endpoints listed by `inference_id`:
+
+* `.elser-2-elasticsearch`: uses the {ml-docs}/ml-nlp-elser.html[ELSER] built-in trained model for `sparse_embedding` tasks (recommended for English language texts)
+* `.multilingual-e5-small-elasticsearch`: uses the {ml-docs}/ml-nlp-e5.html[E5] built-in trained model for `text_embedding` tasks (recommended for non-English language texts)
+
+Use the `inference_id` of the endpoint in a <<semantic-text,`semantic_text`>> field definition or when creating an <<inference-processor,{infer} processor>>.
+The API call will automatically download and deploy the model, which might take a couple of minutes.
+Default {infer} endpoints have {ml-docs}/ml-nlp-auto-scale.html#nlp-model-adaptive-allocations[adaptive allocations] enabled.
+For these models, the minimum number of allocations is `0`.
+If there is no {infer} activity that uses the endpoint, the number of allocations scales down to `0` automatically after 15 minutes.
+
 include::delete-inference.asciidoc[]
 include::get-inference.asciidoc[]
 include::post-inference.asciidoc[]
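The default-endpoint usage described above can be sketched as a request body. Below is a minimal Python sketch, assuming a hypothetical index name `my-index`; the endpoint ID `.elser-2-elasticsearch` comes from the list above:

```python
import json

# Mapping body for PUT /my-index ("my-index" is a hypothetical example name).
# The `inference_id` references the preconfigured `.elser-2-elasticsearch`
# default endpoint listed above.
mapping = {
    "mappings": {
        "properties": {
            "content": {
                "type": "semantic_text",
                "inference_id": ".elser-2-elasticsearch",
            }
        }
    }
}

print(json.dumps(mapping, indent=2))
```

Because the default endpoint has adaptive allocations enabled with a minimum of `0`, the first indexing request may wait while the model deploys and scales up.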

docs/reference/inference/service-elasticsearch.asciidoc

Lines changed: 84 additions & 10 deletions
@@ -1,12 +1,9 @@
 [[infer-service-elasticsearch]]
 === Elasticsearch {infer} service
 
-Creates an {infer} endpoint to perform an {infer} task with the `elasticsearch`
-service.
+Creates an {infer} endpoint to perform an {infer} task with the `elasticsearch` service.
 
-NOTE: If you use the E5 model through the `elasticsearch` service, the API
-request will automatically download and deploy the model if it isn't downloaded
-yet.
+NOTE: If you use the ELSER or the E5 model through the `elasticsearch` service, the API request will automatically download and deploy the model if it isn't downloaded yet.
 
 
 [discrete]
@@ -56,6 +53,11 @@ These settings are specific to the `elasticsearch` service.
 (Optional, object)
 include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation]
 
+`deployment_id`:::
+(Optional, string)
+The `deployment_id` of an existing trained model deployment.
+When `deployment_id` is used, the `model_id` is optional.
+
 `enabled`::::
 (Optional, Boolean)
 include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-enabled]
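The `deployment_id`/`model_id` relationship described above can be sketched as a request body. A minimal Python sketch, assuming the built-in `.elser_model_2` deployment ID used elsewhere in this commit:

```python
import json

# Request body for PUT _inference/sparse_embedding/<endpoint-name>.
# When `deployment_id` points at an existing trained model deployment,
# `model_id` can be omitted, per the parameter description above.
body = {
    "service": "elasticsearch",
    "service_settings": {
        "deployment_id": ".elser_model_2",
    },
}

assert "model_id" not in body["service_settings"]  # optional in this case
print(json.dumps(body, indent=2))
```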
@@ -71,7 +73,7 @@ include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-min-number]
 `model_id`:::
 (Required, string)
 The name of the model to use for the {infer} task.
-It can be the ID of either a built-in model (for example, `.multilingual-e5-small` for E5) or a text embedding model already
+It can be the ID of either a built-in model (for example, `.multilingual-e5-small` for E5), or a text embedding model already
 {ml-docs}/ml-nlp-import-model.html#ml-nlp-import-script[uploaded through Eland].
 
 `num_allocations`:::
@@ -98,15 +100,44 @@ Returns the document instead of only the index. Defaults to `true`.
 =====
 
 
+[discrete]
+[[inference-example-elasticsearch-elser]]
+==== ELSER via the `elasticsearch` service
+
+The following example shows how to create an {infer} endpoint called `my-elser-model` to perform a `sparse_embedding` task type.
+
+The API request below will automatically download the ELSER model if it isn't already downloaded and then deploy the model.
+
+[source,console]
+------------------------------------------------------------
+PUT _inference/sparse_embedding/my-elser-model
+{
+  "service": "elasticsearch",
+  "service_settings": {
+    "adaptive_allocations": { <1>
+      "enabled": true,
+      "min_number_of_allocations": 1,
+      "max_number_of_allocations": 10
+    },
+    "num_threads": 1,
+    "model_id": ".elser_model_2" <2>
+  }
+}
+------------------------------------------------------------
+// TEST[skip:TBD]
+<1> Adaptive allocations will be enabled with a minimum of 1 and a maximum of 10 allocations.
+<2> The `model_id` must be the ID of one of the built-in ELSER models.
+Valid values are `.elser_model_2` and `.elser_model_2_linux-x86_64`.
+For further details, refer to the {ml-docs}/ml-nlp-elser.html[ELSER model documentation].
+
 [discrete]
 [[inference-example-elasticsearch]]
 ==== E5 via the `elasticsearch` service
 
-The following example shows how to create an {infer} endpoint called
-`my-e5-model` to perform a `text_embedding` task type.
+The following example shows how to create an {infer} endpoint called `my-e5-model` to perform a `text_embedding` task type.
 
-The API request below will automatically download the E5 model if it isn't
-already downloaded and then deploy the model.
+The API request below will automatically download the E5 model if it isn't already downloaded and then deploy the model.
 
 [source,console]
 ------------------------------------------------------------
@@ -185,3 +216,46 @@ PUT _inference/text_embedding/my-e5-model
 }
 ------------------------------------------------------------
 // TEST[skip:TBD]
+
+
+[discrete]
+[[inference-example-existing-deployment]]
+==== Using an existing model deployment with the `elasticsearch` service
+
+The following example shows how to use an existing model deployment when creating an {infer} endpoint.
+
+[source,console]
+------------------------------------------------------------
+PUT _inference/sparse_embedding/use_existing_deployment
+{
+  "service": "elasticsearch",
+  "service_settings": {
+    "deployment_id": ".elser_model_2" <1>
+  }
+}
+------------------------------------------------------------
+// TEST[skip:TBD]
+<1> The `deployment_id` of the existing model deployment.
+
+The API response contains the `model_id`, and the threads and allocations settings from the model deployment:
+
+[source,console-result]
+------------------------------------------------------------
+{
+  "inference_id": "use_existing_deployment",
+  "task_type": "sparse_embedding",
+  "service": "elasticsearch",
+  "service_settings": {
+    "num_allocations": 2,
+    "num_threads": 1,
+    "model_id": ".elser_model_2",
+    "deployment_id": ".elser_model_2"
+  },
+  "chunking_settings": {
+    "strategy": "sentence",
+    "max_chunk_size": 250,
+    "sentence_overlap": 1
+  }
+}
+------------------------------------------------------------
+// NOTCONSOLE
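A client can read the deployment settings back out of that response. Below is a minimal Python sketch that parses the documented response body (the JSON literal is copied from the example response above):

```python
import json

# Example create-endpoint response, copied from the docs above.
response_json = """
{
  "inference_id": "use_existing_deployment",
  "task_type": "sparse_embedding",
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 2,
    "num_threads": 1,
    "model_id": ".elser_model_2",
    "deployment_id": ".elser_model_2"
  },
  "chunking_settings": {
    "strategy": "sentence",
    "max_chunk_size": 250,
    "sentence_overlap": 1
  }
}
"""

resp = json.loads(response_json)
settings = resp["service_settings"]
# Total threads in use = allocations x threads per allocation.
total_threads = settings["num_allocations"] * settings["num_threads"]
print(total_threads)  # 2
```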

docs/reference/inference/service-elser.asciidoc

Lines changed: 2 additions & 1 deletion
@@ -2,6 +2,7 @@
 === ELSER {infer} service
 
 Creates an {infer} endpoint to perform an {infer} task with the `elser` service.
+You can also deploy ELSER by using the <<infer-service-elasticsearch>>.
 
 NOTE: The API request will automatically download and deploy the ELSER model if
 it isn't already downloaded.
@@ -128,7 +129,7 @@ If using the Python client, you can set the `timeout` parameter to a higher value.
 
 [discrete]
 [[inference-example-elser-adaptive-allocation]]
-==== Setting adaptive allocation for the ELSER service
+==== Setting adaptive allocations for the ELSER service
 
 NOTE: For more information on how to optimize your ELSER endpoints, refer to {ml-docs}/ml-nlp-elser.html#elser-recommendations[the ELSER recommendations] section in the model documentation.
 To learn more about model autoscaling, refer to the {ml-docs}/ml-nlp-auto-scale.html[trained model autoscaling] page.
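For the `elser` service, the adaptive allocations object takes the same shape as in the `elasticsearch`-service example earlier in this commit. A minimal sketch of the request body, assuming a hypothetical endpoint name `my-elser-model`:

```python
import json

# Request body for PUT _inference/sparse_embedding/my-elser-model
# (the endpoint name is a hypothetical example). Field names follow
# the adaptive-allocations example shown earlier in this commit.
body = {
    "service": "elser",
    "service_settings": {
        "adaptive_allocations": {
            "enabled": True,
            "min_number_of_allocations": 3,
            "max_number_of_allocations": 10,
        },
        "num_threads": 1,
    },
}

print(json.dumps(body, indent=2))
```

With adaptive allocations enabled, a fixed `num_allocations` is not supplied; the deployment scales between the configured minimum and maximum.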
