Commit 7d10307

[DOCS] Adds adaptive_allocations to inference and trained model API docs (#111476) (#111508)
1 parent ade5b13 commit 7d10307

File tree: 5 files changed, +225 −23 lines

docs/reference/inference/service-elasticsearch.asciidoc

Lines changed: 47 additions & 1 deletion
@@ -51,6 +51,22 @@ include::inference-shared.asciidoc[tag=service-settings]
 These settings are specific to the `elasticsearch` service.
 --
 
+`adaptive_allocations`:::
+(Optional, object)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation]
+
+`enabled`::::
+(Optional, Boolean)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-enabled]
+
+`max_number_of_allocations`::::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-max-number]
+
+`min_number_of_allocations`::::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-min-number]
+
 `model_id`:::
 (Required, string)
 The name of the model to use for the {infer} task.
@@ -59,7 +75,9 @@ It can be the ID of either a built-in model (for example, `.multilingual-e5-smal
 
 `num_allocations`:::
 (Required, integer)
-The total number of allocations this model is assigned across machine learning nodes. Increasing this value generally increases the throughput.
+The total number of allocations this model is assigned across machine learning nodes.
+Increasing this value generally increases the throughput.
+If `adaptive_allocations` is enabled, do not set this value, because it's automatically set.
 
 `num_threads`:::
 (Required, integer)
@@ -137,3 +155,31 @@ PUT _inference/text_embedding/my-msmarco-minilm-model <1>
 <1> Provide a unique identifier for the inference endpoint. The `inference_id` must be unique and must not match the `model_id`.
 <2> The `model_id` must be the ID of a text embedding model which has already been
 {ml-docs}/ml-nlp-import-model.html#ml-nlp-import-script[uploaded through Eland].
+
+[discrete]
+[[inference-example-adaptive-allocation]]
+==== Setting adaptive allocation for E5 via the `elasticsearch` service
+
+The following example shows how to create an {infer} endpoint called
+`my-e5-model` to perform a `text_embedding` task type and configure adaptive
+allocations.
+
+The API request below will automatically download the E5 model if it isn't
+already downloaded and then deploy the model.
+
+[source,console]
+------------------------------------------------------------
+PUT _inference/text_embedding/my-e5-model
+{
+  "service": "elasticsearch",
+  "service_settings": {
+    "adaptive_allocations": {
+      "enabled": true,
+      "min_number_of_allocations": 3,
+      "max_number_of_allocations": 10
+    },
+    "model_id": ".multilingual-e5-small"
+  }
+}
+------------------------------------------------------------
+// TEST[skip:TBD]
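For illustration, the request body above can be assembled and sanity-checked in client code before sending. The following Python sketch is hypothetical (the `validate_service_settings` helper is not part of any Elasticsearch client); it simply encodes the constraints documented on this page: when `adaptive_allocations` is enabled, `num_allocations` must not be set, and the minimum/maximum bounds must be consistent.

```python
import json

def validate_service_settings(settings: dict) -> dict:
    """Hypothetical helper: sanity-check service_settings per the docs above."""
    adaptive = settings.get("adaptive_allocations", {})
    if adaptive.get("enabled"):
        # Docs: do not set num_allocations when adaptive allocations are enabled.
        if "num_allocations" in settings:
            raise ValueError("num_allocations must not be set when adaptive_allocations is enabled")
        lo = adaptive.get("min_number_of_allocations")
        hi = adaptive.get("max_number_of_allocations")
        if lo is not None and lo < 1:
            raise ValueError("min_number_of_allocations must be >= 1")
        if lo is not None and hi is not None and hi < lo:
            raise ValueError("max_number_of_allocations must be >= min_number_of_allocations")
    return settings

# Mirrors the PUT _inference/text_embedding/my-e5-model body shown above.
body = {
    "service": "elasticsearch",
    "service_settings": validate_service_settings({
        "adaptive_allocations": {
            "enabled": True,
            "min_number_of_allocations": 3,
            "max_number_of_allocations": 10,
        },
        "model_id": ".multilingual-e5-small",
    }),
}
payload = json.dumps(body)  # ready to send as the PUT request body
```
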

docs/reference/inference/service-elser.asciidoc

Lines changed: 46 additions & 1 deletion
@@ -48,9 +48,27 @@ include::inference-shared.asciidoc[tag=service-settings]
 These settings are specific to the `elser` service.
 --
 
+`adaptive_allocations`:::
+(Optional, object)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation]
+
+`enabled`::::
+(Optional, Boolean)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-enabled]
+
+`max_number_of_allocations`::::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-max-number]
+
+`min_number_of_allocations`::::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-min-number]
+
 `num_allocations`:::
 (Required, integer)
-The total number of allocations this model is assigned across machine learning nodes. Increasing this value generally increases the throughput.
+The total number of allocations this model is assigned across machine learning nodes.
+Increasing this value generally increases the throughput.
+If `adaptive_allocations` is enabled, do not set this value, because it's automatically set.
 
 `num_threads`:::
 (Required, integer)
@@ -107,3 +125,30 @@ This error usually just reflects a timeout, while the model downloads in the bac
 You can check the download progress in the {ml-app} UI.
 If using the Python client, you can set the `timeout` parameter to a higher value.
 ====
+
+[discrete]
+[[inference-example-elser-adaptive-allocation]]
+==== Setting adaptive allocation for the ELSER service
+
+The following example shows how to create an {infer} endpoint called
+`my-elser-model` to perform a `sparse_embedding` task type and configure
+adaptive allocations.
+
+The request below will automatically download the ELSER model if it isn't
+already downloaded and then deploy the model.
+
+[source,console]
+------------------------------------------------------------
+PUT _inference/sparse_embedding/my-elser-model
+{
+  "service": "elser",
+  "service_settings": {
+    "adaptive_allocations": {
+      "enabled": true,
+      "min_number_of_allocations": 3,
+      "max_number_of_allocations": 10
+    }
+  }
+}
+------------------------------------------------------------
+// TEST[skip:TBD]

docs/reference/ml/ml-shared.asciidoc

Lines changed: 24 additions & 0 deletions
@@ -1,3 +1,27 @@
+tag::adaptive-allocation[]
+Adaptive allocations configuration object.
+If enabled, the number of allocations of the model is set based on the current load the process gets.
+When the load is high, a new model allocation is automatically created (respecting the value of `max_number_of_allocations` if it's set).
+When the load is low, a model allocation is automatically removed (respecting the value of `min_number_of_allocations` if it's set).
+The number of model allocations cannot be scaled down to less than `1` this way.
+If `adaptive_allocations` is enabled, do not set the number of allocations manually.
+end::adaptive-allocation[]
+
+tag::adaptive-allocation-enabled[]
+If `true`, `adaptive_allocations` is enabled.
+Defaults to `false`.
+end::adaptive-allocation-enabled[]
+
+tag::adaptive-allocation-max-number[]
+Specifies the maximum number of allocations to scale to.
+If set, it must be greater than or equal to `min_number_of_allocations`.
+end::adaptive-allocation-max-number[]
+
+tag::adaptive-allocation-min-number[]
+Specifies the minimum number of allocations to scale to.
+If set, it must be greater than or equal to `1`.
+end::adaptive-allocation-min-number[]
+
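The scaling rule these shared descriptions spell out can be modeled as a small pure function. This Python sketch is purely illustrative (the `load` string is a simplified stand-in for the real load signal, and the actual Elasticsearch autoscaler is internal and more nuanced): add one allocation under high load up to `max_number_of_allocations`, remove one under low load down to `min_number_of_allocations`, and never drop below 1.

```python
from typing import Optional

def next_allocation_count(current: int, load: str,
                          min_alloc: Optional[int] = None,
                          max_alloc: Optional[int] = None) -> int:
    """Illustrative model of the adaptive-allocations rule described above."""
    if load == "high":
        # High load: create one more allocation, respecting max_number_of_allocations.
        target = current + 1
        if max_alloc is not None:
            target = min(target, max_alloc)
    elif load == "low":
        # Low load: remove one allocation, respecting min_number_of_allocations,
        # and never scale below 1.
        target = current - 1
        if min_alloc is not None:
            target = max(target, min_alloc)
        target = max(target, 1)
    else:
        target = current
    return target
```

For example, with a minimum of 3 and a maximum of 10, a deployment already at 10 allocations stays at 10 under high load, and one at 3 stays at 3 under low load.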
 tag::aggregations[]
 If set, the {dfeed} performs aggregation searches. Support for aggregations is
 limited and should be used only with low cardinality data. For more information,

docs/reference/ml/trained-models/apis/start-trained-model-deployment.asciidoc

Lines changed: 67 additions & 19 deletions
@@ -30,7 +30,10 @@ must be unique and should not match any other deployment ID or model ID, unless
 it is the same as the ID of the model being deployed. If `deployment_id` is not
 set, it defaults to the `model_id`.
 
-Scaling inference performance can be achieved by setting the parameters
+You can enable adaptive allocations to automatically scale model allocations up
+and down based on the actual resource requirement of the processes.
+
+Manually scaling inference performance can be achieved by setting the parameters
 `number_of_allocations` and `threads_per_allocation`.
 
 Increasing `threads_per_allocation` means more threads are used when an
@@ -58,22 +61,58 @@ include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=model-id]
 [[start-trained-model-deployment-query-params]]
 == {api-query-parms-title}
 
+`deployment_id`::
+(Optional, string)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=deployment-id]
++
+--
+Defaults to `model_id`.
+--
+
+`timeout`::
+(Optional, time)
+Controls the amount of time to wait for the model to deploy. Defaults to 30
+seconds.
+
+`wait_for`::
+(Optional, string)
+Specifies the allocation status to wait for before returning. Defaults to
+`started`. The value `starting` indicates deployment is starting but not yet on
+any node. The value `started` indicates the model has started on at least one
+node. The value `fully_allocated` indicates the deployment has started on all
+valid nodes.
+
+[[start-trained-model-deployment-request-body]]
+== {api-request-body-title}
+
+`adaptive_allocations`::
+(Optional, object)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation]
+
+`enabled`:::
+(Optional, Boolean)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-enabled]
+
+`max_number_of_allocations`:::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-max-number]
+
+`min_number_of_allocations`:::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-min-number]
+
 `cache_size`::
 (Optional, <<byte-units,byte value>>)
 The inference cache size (in memory outside the JVM heap) per node for the
 model. In serverless, the cache is disabled by default. Otherwise, the default value is the size of the model as reported by the
 `model_size_bytes` field in the <<get-trained-models-stats>>. To disable the
 cache, `0b` can be provided.
 
-`deployment_id`::
-(Optional, string)
-include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=deployment-id]
-Defaults to `model_id`.
-
 `number_of_allocations`::
 (Optional, integer)
 The total number of allocations this model is assigned across {ml} nodes.
-Increasing this value generally increases the throughput. Defaults to 1.
+Increasing this value generally increases the throughput. Defaults to `1`.
+If `adaptive_allocations` is enabled, do not set this value, because it's automatically set.
 
 `priority`::
 (Optional, string)
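The `wait_for` values documented in this API form a progression from weakest to strongest (`starting`, then `started`, then `fully_allocated`). As a sketch, a caller could decide whether a reported allocation state satisfies a requested `wait_for` level like this (hypothetical helper, not part of any Elasticsearch client):

```python
# Documented progression of allocation states, weakest to strongest.
ALLOCATION_STATES = ["starting", "started", "fully_allocated"]

def satisfies_wait_for(current_state: str, wait_for: str = "started") -> bool:
    """Hypothetical helper: True if current_state is at least wait_for."""
    return ALLOCATION_STATES.index(current_state) >= ALLOCATION_STATES.index(wait_for)
```
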
@@ -110,18 +149,6 @@ compute-bound process; `threads_per_allocations` must not exceed the number of
 available allocated processors per node. Defaults to 1. Must be a power of 2.
 Max allowed value is 32.
 
-`timeout`::
-(Optional, time)
-Controls the amount of time to wait for the model to deploy. Defaults to 30
-seconds.
-
-`wait_for`::
-(Optional, string)
-Specifies the allocation status to wait for before returning. Defaults to
-`started`. The value `starting` indicates deployment is starting but not yet on
-any node. The value `started` indicates the model has started on at least one
-node. The value `fully_allocated` indicates the deployment has started on all
-valid nodes.
 
 [[start-trained-model-deployment-example]]
 == {api-examples-title}
@@ -182,3 +209,24 @@ The `my_model` trained model can be deployed again with a different ID:
 POST _ml/trained_models/my_model/deployment/_start?deployment_id=my_model_for_search
 --------------------------------------------------
 // TEST[skip:TBD]
+
+
+[[start-trained-model-deployment-adaptive-allocation-example]]
+=== Setting adaptive allocations
+
+The following example starts a new deployment of the `my_model` trained model
+with the ID `my_model_for_search` and enables adaptive allocations with a
+minimum of 3 and a maximum of 10 allocations.
+
+[source,console]
+--------------------------------------------------
+POST _ml/trained_models/my_model/deployment/_start?deployment_id=my_model_for_search
+{
+  "adaptive_allocations": {
+    "enabled": true,
+    "min_number_of_allocations": 3,
+    "max_number_of_allocations": 10
+  }
+}
+--------------------------------------------------
+// TEST[skip:TBD]
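A request like the one in this example combines a path, an optional `deployment_id` query parameter, and a JSON body. The following Python sketch composes those pieces; the endpoint path and parameter names come from this page, while the `build_start_request` helper itself is hypothetical:

```python
import json
from urllib.parse import urlencode

def build_start_request(model_id, deployment_id=None, adaptive=None):
    """Hypothetical helper: compose path, query string, and body for _start."""
    path = f"_ml/trained_models/{model_id}/deployment/_start"
    if deployment_id:
        path += "?" + urlencode({"deployment_id": deployment_id})
    body = json.dumps({"adaptive_allocations": adaptive}) if adaptive else None
    return path, body

# Reproduces the example request above.
path, body = build_start_request(
    "my_model",
    deployment_id="my_model_for_search",
    adaptive={"enabled": True,
              "min_number_of_allocations": 3,
              "max_number_of_allocations": 10},
)
```
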

docs/reference/ml/trained-models/apis/update-trained-model-deployment.asciidoc

Lines changed: 41 additions & 2 deletions
@@ -25,7 +25,11 @@ Requires the `manage_ml` cluster privilege. This privilege is included in the
 == {api-description-title}
 
 You can update a trained model deployment whose `assignment_state` is `started`.
-You can either increase or decrease the number of allocations of such a deployment.
+You can enable adaptive allocations to automatically scale model allocations up
+and down based on the actual resource requirement of the processes.
+Alternatively, you can manually increase or decrease the number of allocations
+of a model deployment.
+
 
 [[update-trained-model-deployments-path-parms]]
 == {api-path-parms-title}
@@ -37,17 +41,34 @@ include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=deployment-id]
 [[update-trained-model-deployment-request-body]]
 == {api-request-body-title}
 
+`adaptive_allocations`::
+(Optional, object)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation]
+
+`enabled`:::
+(Optional, Boolean)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-enabled]
+
+`max_number_of_allocations`:::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-max-number]
+
+`min_number_of_allocations`:::
+(Optional, integer)
+include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-min-number]
+
 `number_of_allocations`::
 (Optional, integer)
 The total number of allocations this model is assigned across {ml} nodes.
 Increasing this value generally increases the throughput.
+If `adaptive_allocations` is enabled, do not set this value, because it's automatically set.
 
 
 [[update-trained-model-deployment-example]]
 == {api-examples-title}
 
 The following example updates the deployment of the
-`elastic__distilbert-base-uncased-finetuned-conll03-english` trained model to have 4 allocations:
+`elastic__distilbert-base-uncased-finetuned-conll03-english` trained model to have 4 allocations:
 
 [source,console]
 --------------------------------------------------
@@ -84,3 +105,21 @@ The API returns the following results:
 }
 }
 ----
+
+The following example updates the deployment of the
+`elastic__distilbert-base-uncased-finetuned-conll03-english` trained model to
+enable adaptive allocations with a minimum of 3 and a maximum of 10
+allocations:
+
+[source,console]
+--------------------------------------------------
+POST _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/deployment/_update
+{
+  "adaptive_allocations": {
+    "enabled": true,
+    "min_number_of_allocations": 3,
+    "max_number_of_allocations": 10
+  }
+}
+--------------------------------------------------
+// TEST[skip:TBD]
