
Commit 47acfae

kosabogi, szabosteve, and alaudazzi authored
Adds information about the importance of adaptive allocations (elastic#1454)
### [📸 Preview](https://docs-v3-preview.elastic.dev/elastic/docs-content/pull/1454/explore-analyze/elastic-inference/inference-api)

### Description

This PR updates the Inference integration documentation to:

- Clearly state that not enabling adaptive allocations can result in unnecessary resource usage and higher costs.
- Expand the scope of the page to cover not only third-party service integrations, but also the Elasticsearch service.

### Related issue: elastic#1393

---------

Co-authored-by: István Zoltán Szabó <[email protected]>
Co-authored-by: Arianna Laudazzi <[email protected]>
1 parent 20bf46f commit 47acfae

File tree

1 file changed: +24, -8 lines changed

explore-analyze/elastic-inference/inference-api.md

Lines changed: 24 additions & 8 deletions
@@ -9,15 +9,16 @@ products:
   - id: kibana
 ---
 
-# Integrate with third-party services
+# Inference integrations
 
-{{es}} provides a machine learning [inference API](https://www.elastic.co/docs/api/doc/elasticsearch/v8/operation/operation-inference-get-1) to create and manage inference endpoints to integrate with machine learning models provide by popular third-party services like Amazon Bedrock, Anthropic, Azure AI Studio, Cohere, Google AI, Mistral, OpenAI, Hugging Face, and more.
+{{es}} provides a machine learning [inference API](https://www.elastic.co/docs/api/doc/elasticsearch/v8/operation/operation-inference-get-1) to create and manage inference endpoints that integrate with services such as Elasticsearch (for built-in NLP models like [ELSER](/explore-analyze/machine-learning/nlp/ml-nlp-elser.md) and [E5](/explore-analyze/machine-learning/nlp/ml-nlp-e5.md)), as well as popular third-party services like Amazon Bedrock, Anthropic, Azure AI Studio, Cohere, Google AI, Mistral, OpenAI, Hugging Face, and more.
 
-Learn how to integrate with specific services in the subpages of this section.
+You can create a new inference endpoint:
 
-## Inference endpoints UI [inference-endpoints]
+- using the [Create an inference endpoint API](https://www.elastic.co/docs/api/doc/elasticsearch/v8/operation/operation-inference-put-1)
+- through the [Inference endpoints UI](#add-inference-endpoints).
 
-You can also manage inference endpoints using the UI.
+## Inference endpoints UI [inference-endpoints]
 
 The **Inference endpoints** page provides an interface for managing inference endpoints.
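
For reference, here is a minimal sketch of the API call the first new bullet points to, assuming an endpoint backed by the built-in ELSER model on the `elasticsearch` service; the endpoint name `my-elser-endpoint`, the thread count, and the allocation bounds are illustrative, not part of this change:

```console
PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elasticsearch",
  "service_settings": {
    "model_id": ".elser_model_2",
    "num_threads": 1,
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 0,
      "max_number_of_allocations": 4
    }
  }
}
```

With `min_number_of_allocations` set to `0`, the deployment can scale all the way down when idle, which is the cost-saving behavior this PR documents.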

@@ -33,7 +34,7 @@ Available actions:
 * Copy the inference endpoint ID
 * Delete endpoints
 
-## Add new inference endpoint
+## Add new inference endpoint [add-inference-endpoints]
 
 To add a new inference endpoint using the UI:

@@ -42,18 +43,33 @@ To add a new inference endpoint using the UI:
 1. Provide the required configuration details.
 1. Select **Save** to create the endpoint.
 
+If your inference endpoint uses a model deployed in Elastic’s infrastructure, such as ELSER, E5, or a model uploaded through Eland, you can configure [adaptive allocations](#adaptive-allocations) to dynamically adjust resource usage based on the current demand.
+
 ## Adaptive allocations [adaptive-allocations]
 
 Adaptive allocations allow inference services to dynamically adjust the number of model allocations based on the current load.
+This feature is only supported for models deployed in Elastic’s infrastructure, such as ELSER, E5, or models uploaded through Eland. It is not available for third-party services (for example, Alibaba Cloud, Cohere, or OpenAI), because those models are hosted externally and not deployed within your Elasticsearch cluster.
 
 When adaptive allocations are enabled:
 
 * The number of allocations scales up automatically when the load increases.
 * Allocations scale down to a minimum of 0 when the load decreases, saving resources.
 
-For more information about adaptive allocations and resources, refer to the trained model autoscaling documentation.
+### Allocation scaling behavior
+
+The behavior of allocations depends on several factors:
+
+- Deployment type (Elastic Cloud Hosted, Elastic Cloud Enterprise, or Serverless)
+- Usage level (low, medium, or high)
+- Optimization type ([ingest](/deploy-manage/autoscaling/trained-model-autoscaling.md#ingest-optimized) or [search](/deploy-manage/autoscaling/trained-model-autoscaling.md#search-optimized))
+
+::::{important}
+If you enable adaptive allocations and set the `min_number_of_allocations` to a value greater than `0`, you will be charged for the machine learning resources, even if no inference requests are sent.
+
+However, setting the `min_number_of_allocations` to a value greater than `0` keeps the model always available without scaling delays. Choose the configuration that best fits your workload and availability needs.
+::::
 
-% TO DO: Add a link to trained model autoscaling when the page is available.%
+For more information about adaptive allocations and resources, refer to the [trained model autoscaling](/deploy-manage/autoscaling/trained-model-autoscaling.md) documentation.
 
 ## Default {{infer}} endpoints [default-enpoints]
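
As a companion sketch of the trade-off called out in the `{important}` box above, adaptive allocations for a model already deployed in Elastic’s infrastructure can also be adjusted through the trained model deployment update API; the model ID and allocation bounds below are illustrative:

```console
POST _ml/trained_models/.elser_model_2/deployment/_update
{
  "adaptive_allocations": {
    "enabled": true,
    // a minimum above 0 keeps the model always available, but is billed even when idle
    "min_number_of_allocations": 1,
    "max_number_of_allocations": 8
  }
}
```

Keeping `min_number_of_allocations` at `0` instead lets the deployment scale to zero when there is no load, at the cost of a cold-start delay on the first request after an idle period.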