32 changes: 24 additions & 8 deletions explore-analyze/elastic-inference/inference-api.md
@@ -9,15 +9,16 @@ products:
- id: kibana
---

# Integrate with third-party services
# Inference integrations

{{es}} provides a machine learning [inference API](https://www.elastic.co/docs/api/doc/elasticsearch/v8/operation/operation-inference-get-1) to create and manage inference endpoints to integrate with machine learning models provided by popular third-party services like Amazon Bedrock, Anthropic, Azure AI Studio, Cohere, Google AI, Mistral, OpenAI, Hugging Face, and more.
{{es}} provides a machine learning [inference API](https://www.elastic.co/docs/api/doc/elasticsearch/v8/operation/operation-inference-get-1) to create and manage inference endpoints that integrate with services such as Elasticsearch (for built-in NLP models like [ELSER](/explore-analyze/machine-learning/nlp/ml-nlp-elser.md) and [E5](/explore-analyze/machine-learning/nlp/ml-nlp-e5.md)), as well as popular third-party services like Amazon Bedrock, Anthropic, Azure AI Studio, Cohere, Google AI, Mistral, OpenAI, Hugging Face, and more.

Learn how to integrate with specific services in the subpages of this section.
You can create a new inference endpoint:

## Inference endpoints UI [inference-endpoints]
- using the [Create an inference endpoint API](https://www.elastic.co/docs/api/doc/elasticsearch/v8/operation/operation-inference-put-1) (see the example request after this list)
- through the [Inference endpoints UI](#add-inference-endpoints).
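
For example, a request like the following creates an endpoint through the API. This is a minimal sketch rather than a definitive configuration: the endpoint ID (`openai-embeddings`), the model ID, and the API key are placeholder values, and the available `service_settings` vary by service, so check the Create an inference endpoint API reference for the options that apply to your provider.

```console
# Sketch: create a text_embedding inference endpoint backed by OpenAI.
# Replace the endpoint ID, model ID, and API key with your own values.
PUT _inference/text_embedding/openai-embeddings
{
  "service": "openai",
  "service_settings": {
    "api_key": "<your-openai-api-key>",
    "model_id": "text-embedding-3-small"
  }
}
```

After the endpoint is created, you reference it by its ID, for example from a `semantic_text` field mapping.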

You can also manage inference endpoints using the UI.
## Inference endpoints UI [inference-endpoints]

The **Inference endpoints** page provides an interface for managing inference endpoints.

@@ -33,7 +34,7 @@ Available actions:
* Copy the inference endpoint ID
* Delete endpoints

## Add new inference endpoint
## Add new inference endpoint [add-inference-endpoints]

To add a new inference endpoint using the UI:

@@ -42,18 +43,33 @@ To add a new inference endpoint using the UI:
1. Provide the required configuration details.
1. Select **Save** to create the endpoint.

If your inference endpoint uses a model deployed in Elastic’s infrastructure, such as ELSER, E5, or a model uploaded through Eland, you can configure [adaptive allocations](#adaptive-allocations) to dynamically adjust resource usage based on the current demand.

## Adaptive allocations [adaptive-allocations]

Adaptive allocations allow inference services to dynamically adjust the number of model allocations based on the current load.
This feature is only supported for models deployed in Elastic’s infrastructure, such as ELSER, E5, or models uploaded through Eland. It is not available for third-party services (for example, Alibaba Cloud, Cohere, or OpenAI), because those models are hosted externally and not deployed within your Elasticsearch cluster. A configuration sketch is shown after the following list.

When adaptive allocations are enabled:

* The number of allocations scales up automatically when the load increases.
* Allocations scale down to a minimum of 0 when the load decreases, saving resources.
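
The following sketch shows what enabling adaptive allocations can look like when creating an ELSER endpoint with the `elasticsearch` service. The endpoint ID and the allocation limits are placeholder values; check the Create an inference endpoint API reference for the exact settings supported in your deployment.

```console
# Sketch: an ELSER endpoint with adaptive allocations enabled.
# With min_number_of_allocations set to 0, allocations can scale down to zero when the endpoint is idle.
PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elasticsearch",
  "service_settings": {
    "model_id": ".elser_model_2",
    "num_threads": 1,
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 0,
      "max_number_of_allocations": 4
    }
  }
}
```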

For more information about adaptive allocations and resources, refer to the trained model autoscaling documentation.
### Allocation scaling behavior

The behavior of allocations depends on several factors:

- Deployment type (Elastic Cloud Hosted, Elastic Cloud Enterprise, or Serverless)
- Usage level (low, medium, or high)
- Optimization type ([ingest](/deploy-manage/autoscaling/trained-model-autoscaling.md#ingest-optimized) or [search](/deploy-manage/autoscaling/trained-model-autoscaling.md#search-optimized))

For more information about adaptive allocations and resources, refer to the [trained model autoscaling](/deploy-manage/autoscaling/trained-model-autoscaling.md) documentation.

**Review thread on this note:**

**Contributor:**

If we're delegating to the trained model page, should this note still live here or would it make sense to move that too?

**@kosabogi (author), Jun 3, 2025:**

Where do you think it would make more sense to place it?
IMO it should live here because it links to the Trained model autoscaling page, where the adaptive allocations information is located. But maybe I'm misunderstanding your comment? 🤔

**Contributor:**

- I agree it's good to have a cost warning here. But I'd also expect this to be on the main page we're linking to, so I was just wondering if that duplication is OK.
- I'm not sure about the current placement of the note.
  - Here's my thinking: the `::::{note}` goes into details about pricing implications, but we've already sent readers to another page with "For more information about adaptive allocations and resources, refer to the trained model autoscaling documentation." So the flow feels wrong. I'd argue for moving the cost note before the link. And perhaps it should also be `important`, given it deals with real cost implications.
- Nit: I think the wording could be more concise too.

**@kosabogi (author):**

Sorry, I thought you were referring to the sentence before the note - but it totally makes sense now.

My point is that we should include it here because the support ticket that raised this issue was about customers using inference services without realizing the cost implications.
Part of this is already mentioned on the trained model autoscaling page, specifically:

> Note: If you set the minimum number of allocations to 1, you will be charged even if the system is not using those resources.

In this case, I think the duplication is fine, as it's important to warn users about potential costs (according to the issue, this information was missing from this page).

You're absolutely right that placing the note before that sentence interrupts the flow - I’ll move it and rework the wording a bit.

Thanks so much for your suggestions!

::::{note}
If you enable adaptive allocations and set the `min_number_of_allocations` to a value greater than `0`, you will be charged for the machine learning resources associated with your inference endpoint, even if no inference requests are sent.

% TO DO: Add a link to trained model autoscaling when the page is available.%
However, enabling adaptive allocations with a `min_number_of_allocations` greater than `0` helps ensure that the model remains available at all times, without delays due to scaling. This configuration may lead to higher resource usage and associated costs. Consider your workload and availability requirements when choosing the appropriate settings. A configuration sketch follows this note.
::::
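
As an illustration, the following sketch reuses the hypothetical ELSER endpoint from the earlier example, with `min_number_of_allocations` raised to `1` so that one allocation stays warm at all times. The endpoint ID and allocation limits remain placeholders.

```console
# Sketch: keep at least one allocation running at all times.
# This avoids scaling delays but is billed even when no inference requests are sent.
PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elasticsearch",
  "service_settings": {
    "model_id": ".elser_model_2",
    "num_threads": 1,
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 1,
      "max_number_of_allocations": 4
    }
  }
}
```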

## Default {{infer}} endpoints [default-enpoints]
