
Commit d46ec49

kosabogi authored and leemthompson committed
[DOCS] Adds adaptive allocations information to Inference APIs (#117546)
* Adds adaptive allocations information to Inference APIs

* Update docs/reference/inference/inference-apis.asciidoc
Co-authored-by: Liam Thompson <[email protected]>

* Update docs/reference/inference/put-inference.asciidoc
Co-authored-by: Liam Thompson <[email protected]>

* Update docs/reference/inference/inference-apis.asciidoc
Co-authored-by: Liam Thompson <[email protected]>

---------

Co-authored-by: Liam Thompson <[email protected]>
1 parent 299e4c7 commit d46ec49

2 files changed (+27, -1 lines)

docs/reference/inference/inference-apis.asciidoc

Lines changed: 13 additions & 0 deletions
@@ -35,6 +35,19 @@ Elastic –, then create an {infer} endpoint by the <<put-inference-api>>.
 Now use <<semantic-search-semantic-text, semantic text>> to perform
 <<semantic-search, semantic search>> on your data.
 
+[discrete]
+[[adaptive-allocations]]
+=== Adaptive allocations
+
+Adaptive allocations allow inference services to dynamically adjust the number of model allocations based on the current load.
+
+When adaptive allocations are enabled:
+
+* The number of allocations scales up automatically when the load increases.
+* Allocations scale down to a minimum of 0 when the load decreases, saving resources.
+
+For more information about adaptive allocations and resources, refer to the {ml-docs}/ml-nlp-auto-scale.html[trained model autoscaling] documentation.
+
 //[discrete]
 //[[default-enpoints]]
 //=== Default {infer} endpoints
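The added section describes the behavior only. For context, here is a minimal sketch of what enabling adaptive allocations looks like when creating an {infer} endpoint with the PUT inference API; the endpoint name, task type, model, and allocation limits are illustrative and not part of this commit.

[source,console]
----
PUT _inference/text_embedding/my-e5-endpoint <1>
{
  "service": "elasticsearch",
  "service_settings": {
    "adaptive_allocations": { <2>
      "enabled": true,
      "min_number_of_allocations": 0,
      "max_number_of_allocations": 4
    },
    "num_threads": 1,
    "model_id": ".multilingual-e5-small"
  }
}
----
<1> Hypothetical endpoint name, task type, and model, chosen for illustration.
<2> With `enabled: true`, the number of allocations scales between `min_number_of_allocations` (0 here, so the model can scale down to zero when idle) and `max_number_of_allocations` as the load changes, instead of using a fixed `num_allocations`.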

docs/reference/inference/put-inference.asciidoc

Lines changed: 14 additions & 1 deletion
@@ -67,4 +67,17 @@ Click the links to review the configuration details of the services:
 * <<infer-service-watsonx-ai>> (`text_embedding`)
 
 The {es} and ELSER services run on a {ml} node in your {es} cluster. The rest of
-the services connect to external providers.
+the services connect to external providers.
+
+[discrete]
+[[adaptive-allocations-put-inference]]
+==== Adaptive allocations
+
+Adaptive allocations allow inference services to dynamically adjust the number of model allocations based on the current load.
+
+When adaptive allocations are enabled:
+
+* The number of allocations scales up automatically when the load increases.
+* Allocations scale down to a minimum of 0 when the load decreases, saving resources.
+
+For more information about adaptive allocations and resources, refer to the {ml-docs}/ml-nlp-auto-scale.html[trained model autoscaling] documentation.
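As with the previous file, the added text is descriptive. A companion sketch using the `elser` service shows the same setting with a nonzero minimum, which keeps at least one allocation ready; again, the endpoint name and limits are illustrative, not part of this commit.

[source,console]
----
PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elser",
  "service_settings": {
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 1, <1>
      "max_number_of_allocations": 10
    },
    "num_threads": 1
  }
}
----
<1> A nonzero minimum avoids cold starts but keeps resources allocated even when idle; setting it to 0 allows scale-to-zero, as described above.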
