products:
- id: kibana
---

# Inference integrations

{{es}} provides a machine learning [inference API](https://www.elastic.co/docs/api/doc/elasticsearch/v8/operation/operation-inference-get-1) to create and manage inference endpoints that integrate with services such as Elasticsearch (for built-in NLP models like [ELSER](/explore-analyze/machine-learning/nlp/ml-nlp-elser.md) and [E5](/explore-analyze/machine-learning/nlp/ml-nlp-e5.md)), as well as popular third-party services like Amazon Bedrock, Anthropic, Azure AI Studio, Cohere, Google AI, Mistral, OpenAI, Hugging Face, and more.

You can create a new inference endpoint:

- using the [Create an inference endpoint API](https://www.elastic.co/docs/api/doc/elasticsearch/v8/operation/operation-inference-put-1)
- through the [Inference endpoints UI](#add-inference-endpoints)
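
For example, the following request creates an endpoint through the API. This is a minimal sketch: the endpoint name `my-cohere-endpoint` is hypothetical, and the `service_settings` fields vary by service (refer to the documentation for each service for the exact options).

```console
PUT _inference/text_embedding/my-cohere-endpoint
{
  "service": "cohere",
  "service_settings": {
    "api_key": "<your-cohere-api-key>",
    "model_id": "embed-english-v3.0"
  }
}
```

Once the endpoint exists, you can call it directly to verify the configuration:

```console
POST _inference/text_embedding/my-cohere-endpoint
{
  "input": "How do inference endpoints work?"
}
```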

## Inference endpoints UI [inference-endpoints]

You can manage inference endpoints through the **Inference endpoints** page in the UI.

Available actions:
* Copy the inference endpoint ID
* Delete endpoints
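
If you prefer to script these operations, the inference API offers equivalents. A sketch, reusing the hypothetical endpoint ID from the example above:

```console
GET _inference/_all

DELETE _inference/text_embedding/my-cohere-endpoint
```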

## Add new inference endpoint [add-inference-endpoints]

To add a new inference endpoint using the UI:

1. Provide the required configuration details.
1. Select **Save** to create the endpoint.

If your inference endpoint uses a model deployed in Elastic’s infrastructure, such as ELSER, E5, or a model uploaded through Eland, you can configure [adaptive allocations](#adaptive-allocations) to dynamically adjust resource usage based on the current demand.

## Adaptive allocations [adaptive-allocations]

Adaptive allocations allow inference services to dynamically adjust the number of model allocations based on the current load.
This feature is only supported for models deployed in Elastic’s infrastructure, such as ELSER, E5, or models uploaded through Eland. It is not available for third-party services like Alibaba Cloud, Cohere, or OpenAI, because those models are hosted externally and not deployed within your {{es}} cluster.

When adaptive allocations are enabled:

* The number of allocations scales up automatically when the load increases.
* Allocations scale down to a minimum of 0 when the load decreases, saving resources.
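
A minimal sketch of enabling adaptive allocations when creating an endpoint for the built-in ELSER model (the endpoint name `my-elser-endpoint` and the allocation bounds are illustrative):

```console
PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elasticsearch",
  "service_settings": {
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 0,
      "max_number_of_allocations": 4
    },
    "num_threads": 1,
    "model_id": ".elser_model_2"
  }
}
```

Setting `min_number_of_allocations` to 0 allows the deployment to scale down completely when idle, at the cost of a cold start on the next request.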

### Allocation scaling behavior

The behavior of allocations depends on several factors:

- Platform (Elastic Cloud Hosted, Elastic Cloud Enterprise, or Serverless)
- Usage level (low, medium, or high)
- Optimization type (ingest or search)

The tables below apply when adaptive resource settings are [configured through the UI](/deploy-manage/autoscaling/trained-model-autoscaling.md#enabling-autoscaling-in-kibana-adaptive-resources).

#### Adaptive resources enabled

::::{tab-set}

:::{tab-item} ECH, ECE
| Usage level | Optimization | Allocations |
|-------------|--------------|-------------------------------|
| Low | Ingest | 0 to 2 if available, dynamically |
| Medium | Ingest | 1 to 32 dynamically |
| High | Ingest | 1 to limit set in the Cloud console*, dynamically |
| Low | Search | 1 |
| Medium | Search | 1 to 2 (if threads=16), dynamically |
| High | Search | 1 to limit set in the Cloud console*, dynamically |

\* The Cloud console doesn’t directly set an allocations limit; it only sets a vCPU limit. This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads. For example, a 32 vCPU limit with 16 threads per allocation allows up to 2 allocations.

:::


:::{tab-item} Serverless
| Usage level | Optimization | Allocations |
|-------------|--------------|-------------------------------|
| Low | Ingest | 0 to 2 dynamically |
| Medium | Ingest | 1 to 32 dynamically |
| High | Ingest | 1 to 512 for Search<br>1 to 128 for Security and Observability |
| Low | Search | 0 to 1 dynamically |
| Medium | Search | 1 to 2 (if threads=16), dynamically |
| High | Search | 1 to 32 (if threads=16), dynamically<br>1 to 128 for Security and Observability |
:::

::::

#### Adaptive resources disabled

::::{tab-set}

:::{tab-item} ECH, ECE
| Usage level | Optimization | Allocations |
|-------------|--------------|-------------------------------|
| Low | Ingest | 2 if available, otherwise 1, statically |
| Medium | Ingest | The smaller of 32 or the limit set in the Cloud console*, statically |
| High | Ingest | Maximum available set in the Cloud console*, statically |
| Low | Search | 1 if available, statically |
| Medium | Search | 2 (if threads=16) statically |
| High | Search | Maximum available set in the Cloud console*, statically |

\* The Cloud console doesn’t directly set an allocations limit; it only sets a vCPU limit. This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads. For example, a 32 vCPU limit with 16 threads per allocation allows up to 2 allocations.

:::

:::{tab-item} Serverless
| Usage level | Optimization | Allocations |
|-------------|--------------|-------------------------------|
| Low | Ingest | Exactly 32 |
| Medium | Ingest | 1 to 32 dynamically |
| High | Ingest | 512 for Search<br>No static allocations for Security and Observability |
| Low | Search | 1 statically |
| Medium | Search | 2 statically (if threads=16) |
| High | Search | 32 statically (if threads=16) for Search<br>No static allocations for Security and Observability |
:::

::::

You can also configure adaptive allocations via the API using parameters like `num_allocations`, `min_number_of_allocations`, and `threads_per_allocation`. Refer to [Enable autoscaling through APIs](/deploy-manage/autoscaling/trained-model-autoscaling.md#enabling-autoscaling-through-apis-adaptive-allocations) for details.
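
As an illustration, a deployment created outside the inference API can be switched to adaptive allocations with the update trained model deployment API (a sketch; the model ID `.elser_model_2` is the built-in ELSER v2 model and the bounds are examples):

```console
POST _ml/trained_models/.elser_model_2/deployment/_update
{
  "adaptive_allocations": {
    "enabled": true,
    "min_number_of_allocations": 1,
    "max_number_of_allocations": 8
  }
}
```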

::::{warning}
If you don't use adaptive allocations, the deployment will always consume a fixed amount of resources, regardless of actual usage. This can lead to inefficient resource utilization and higher costs.
::::

For more information about adaptive allocations and resources, refer to the [trained model autoscaling](/deploy-manage/autoscaling/trained-model-autoscaling.md) documentation.

## Default {{infer}} endpoints [default-enpoints]
