products:
- id: kibana
---

# Inference integrations

{{es}} provides a machine learning [inference API](https://www.elastic.co/docs/api/doc/elasticsearch/v8/operation/operation-inference-get-1) to create and manage inference endpoints that integrate with services such as Elasticsearch (for built-in NLP models like [ELSER](/explore-analyze/machine-learning/nlp/ml-nlp-elser.md) and [E5](/explore-analyze/machine-learning/nlp/ml-nlp-e5.md)), as well as popular third-party services like Amazon Bedrock, Anthropic, Azure AI Studio, Cohere, Google AI, Mistral, OpenAI, Hugging Face, and more.

You can create a new inference endpoint:

- using the [Create an inference endpoint API](https://www.elastic.co/docs/api/doc/elasticsearch/v8/operation/operation-inference-put-1)
- through the [Inference endpoints UI](#add-inference-endpoints)
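
For example, the following request creates an endpoint through the API. This is a minimal sketch: the endpoint name `my-cohere-endpoint` is hypothetical, and the `service_settings` fields vary by service (refer to the documentation for each service for the exact options).

```console
PUT _inference/text_embedding/my-cohere-endpoint
{
  "service": "cohere",
  "service_settings": {
    "api_key": "<your-cohere-api-key>",
    "model_id": "embed-english-v3.0"
  }
}
```

Once the endpoint exists, you can call it directly to verify the configuration:

```console
POST _inference/text_embedding/my-cohere-endpoint
{
  "input": "How do inference endpoints work?"
}
```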

## Inference endpoints UI [inference-endpoints]

You can manage inference endpoints through the **Inference endpoints** page in the UI.

Available actions:
* Copy the inference endpoint ID
* Delete endpoints
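
If you prefer to script these operations, the inference API offers equivalents. A sketch, reusing the hypothetical endpoint ID from the example above:

```console
GET _inference/_all

DELETE _inference/text_embedding/my-cohere-endpoint
```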

## Add new inference endpoint [add-inference-endpoints]

To add a new inference endpoint using the UI:

1. Provide the required configuration details.
1. Select **Save** to create the endpoint.

If your inference endpoint uses a model deployed in Elastic’s infrastructure, such as ELSER, E5, or a model uploaded through Eland, you can configure [adaptive allocations](#adaptive-allocations) to dynamically adjust resource usage based on the current demand.

## Adaptive allocations [adaptive-allocations]

Adaptive allocations allow inference services to dynamically adjust the number of model allocations based on the current load.
This feature is only supported for models deployed in Elastic’s infrastructure, such as ELSER, E5, or models uploaded through Eland. It is not available for third-party services like Alibaba Cloud, Cohere, or OpenAI, because those models are hosted externally and not deployed within your {{es}} cluster.

When adaptive allocations are enabled:

* The number of allocations scales up automatically when the load increases.
* Allocations scale down to a minimum of 0 when the load decreases, saving resources.
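
A minimal sketch of enabling adaptive allocations when creating an endpoint for the built-in ELSER model (the endpoint name `my-elser-endpoint` and the allocation bounds are illustrative):

```console
PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elasticsearch",
  "service_settings": {
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 0,
      "max_number_of_allocations": 4
    },
    "num_threads": 1,
    "model_id": ".elser_model_2"
  }
}
```

Setting `min_number_of_allocations` to 0 allows the deployment to scale down completely when idle, at the cost of a cold start on the next request.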

### Allocation scaling behavior

The behavior of allocations depends on several factors:

- Platform (Elastic Cloud Hosted, Elastic Cloud Enterprise, or Serverless)
- Usage level (low, medium, or high)
- Optimization type (ingest or search)

The tables below apply when adaptive resource settings are [configured through the UI](/deploy-manage/autoscaling/trained-model-autoscaling.md#enabling-autoscaling-in-kibana-adaptive-resources).

#### Adaptive resources enabled

::::{tab-set}

:::{tab-item} ECH, ECE
| Usage level | Optimization | Allocations |
|-------------|--------------|-------------------------------|
| Low | Ingest | 0 to 2 if available, dynamically |
| Medium | Ingest | 1 to 32 dynamically |
| High | Ingest | 1 to limit set in the Cloud console*, dynamically |
| Low | Search | 1 |
| Medium | Search | 1 to 2 (if threads=16), dynamically |
| High | Search | 1 to limit set in the Cloud console*, dynamically |

\* The Cloud console doesn’t directly set an allocations limit; it only sets a vCPU limit. This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads. For example, a 32 vCPU limit with 16 threads per allocation allows up to 2 allocations.

:::


:::{tab-item} Serverless
| Usage level | Optimization | Allocations |
|-------------|--------------|-------------------------------|
| Low | Ingest | 0 to 2 dynamically |
| Medium | Ingest | 1 to 32 dynamically |
| High | Ingest | 1 to 512 for Search<br>1 to 128 for Security and Observability |
| Low | Search | 0 to 1 dynamically |
| Medium | Search | 1 to 2 (if threads=16), dynamically |
| High | Search | 1 to 32 (if threads=16), dynamically<br>1 to 128 for Security and Observability |
:::

::::

#### Adaptive resources disabled

::::{tab-set}

:::{tab-item} ECH, ECE
| Usage level | Optimization | Allocations |
|-------------|--------------|-------------------------------|
| Low | Ingest | 2 if available, otherwise 1, statically |
| Medium | Ingest | The smaller of 32 or the limit set in the Cloud console*, statically |
| High | Ingest | Maximum available set in the Cloud console*, statically |
| Low | Search | 1 if available, statically |
| Medium | Search | 2 (if threads=16) statically |
| High | Search | Maximum available set in the Cloud console*, statically |

\* The Cloud console doesn’t directly set an allocations limit; it only sets a vCPU limit. This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads. For example, a 32 vCPU limit with 16 threads per allocation allows up to 2 allocations.

:::

:::{tab-item} Serverless
| Usage level | Optimization | Allocations |
|-------------|--------------|-------------------------------|
| Low | Ingest | Exactly 32 |
| Medium | Ingest | 1 to 32 dynamically |
| High | Ingest | 512 for Search<br>No static allocations for Security and Observability |
| Low | Search | 1 statically |
| Medium | Search | 2 statically (if threads=16) |
| High | Search | 32 statically (if threads=16) for Search<br>No static allocations for Security and Observability |
:::

::::

You can also configure adaptive allocations via the API using parameters like `num_allocations`, `min_number_of_allocations`, and `threads_per_allocation`. Refer to [Enable autoscaling through APIs](/deploy-manage/autoscaling/trained-model-autoscaling.md#enabling-autoscaling-through-apis-adaptive-allocations) for details.
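
As an illustration, a deployment created outside the inference API can be switched to adaptive allocations with the update trained model deployment API (a sketch; the model ID `.elser_model_2` is the built-in ELSER v2 model and the bounds are examples):

```console
POST _ml/trained_models/.elser_model_2/deployment/_update
{
  "adaptive_allocations": {
    "enabled": true,
    "min_number_of_allocations": 1,
    "max_number_of_allocations": 8
  }
}
```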

::::{warning}
If you don't use adaptive allocations, the deployment will always consume a fixed amount of resources, regardless of actual usage. This can lead to inefficient resource utilization and higher costs.
::::

For more information about adaptive allocations and resources, refer to the [trained model autoscaling](/deploy-manage/autoscaling/trained-model-autoscaling.md) documentation.

## Default {{infer}} endpoints [default-enpoints]
