8 changes: 7 additions & 1 deletion explore-analyze/elastic-inference.md
@@ -7,7 +7,13 @@ navigation_title: Elastic Inference

# Elastic {{infer-cap}}

There are several ways to perform {{infer}} in the {{stack}}. This page provides a brief overview of the different methods:
## Overview

{{infer-cap}} is the process of using an LLM or a trained {{ml}} model to perform predictions or operations - such as text embedding, completion, or reranking - on your data.
You can use {{infer}} at ingest time (for example, to create embeddings from the textual data you ingest) or at search time (for example, to perform [semantic search](/solutions/search/semantic-search.md)).
There are several ways to perform {{infer}} in the {{stack}}:

* [Using the Elastic {{infer-cap}} Service](elastic-inference/eis.md)
* [Using `semantic_text` if you want to perform semantic search](/solutions/search/semantic-search/semantic-search-semantic-text.md)
* [Using the {{infer}} API](elastic-inference/inference-api.md)
* [Trained models deployed in your cluster](machine-learning/nlp/ml-nlp-overview.md)
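For example, once an {{infer}} endpoint exists, a single API call performs {{infer}} on your text. The following is a minimal sketch; `my-embedding-endpoint` is a placeholder for an endpoint ID you have created:

```console
POST _inference/text_embedding/my-embedding-endpoint
{
  "input": "The quick brown fox jumps over the lazy dog"
}
```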
56 changes: 56 additions & 0 deletions explore-analyze/elastic-inference/eis.md
@@ -0,0 +1,56 @@
---
navigation_title: Elastic Inference Service (EIS)
applies_to:
stack: ga 9.0
serverless: ga
---

# Elastic {{infer-cap}} Service [elastic-inference-service-eis]

The Elastic {{infer-cap}} Service (EIS) enables you to leverage AI-powered search as a service without deploying a model in your cluster.
With EIS, you don't need to manage the infrastructure and resources required for {{ml}} {{infer}} by adding, configuring, and scaling {{ml}} nodes.
Instead, you can use {{ml}} models for ingest, search, and chat independently of your {{es}} infrastructure.

## AI features powered by EIS [ai-features-powered-by-eis]

* Your Elastic deployment or project comes with a default [`Elastic Managed LLM` connector](https://www.elastic.co/docs/reference/kibana/connectors-kibana/elastic-managed-llm). This connector is used in the AI Assistant, Attack Discovery, Automatic Import, and Search Playground.

* {applies_to}`stack: preview 9.1` {applies_to}`serverless: preview` You can use [ELSER](/explore-analyze/machine-learning/nlp/ml-nlp-elser.md) to perform semantic search as a service (ELSER on EIS).

## Region and hosting [eis-regions]

EIS requests are currently proxied to AWS Bedrock in AWS US regions, beginning with `us-east-1`.
The request routing does not restrict the location of your deployments.

## ELSER via Elastic {{infer-cap}} Service (ELSER on EIS)

{applies_to}`stack: preview 9.1` {applies_to}`serverless: preview`

ELSER on EIS enables you to use the ELSER model without deploying it on {{ml}} nodes in your own infrastructure, which simplifies the semantic search and hybrid search experience.

### Private preview access

To request private preview access, submit the form provided [here](https://docs.google.com/forms/d/e/1FAIpQLSfp2rLsayhw6pLVQYYp4KM6BFtaaljplWdYowJfflpOICgViA/viewform).

### Limitations

#### Access

This feature is being gradually rolled out to Serverless and Cloud Hosted customers.
It may not be available to all users at launch.

#### Uptime

There are no uptime guarantees during the Technical Preview.
While Elastic will address issues promptly, the feature may be unavailable for extended periods.

#### Throughput and latency

{{infer-cap}} throughput via this endpoint is expected to exceed that of {{infer}} operations on an ML node.
However, throughput and latency are not guaranteed.
Performance may vary during the Technical Preview.

#### Batch size

Batches are limited to a maximum of 16 documents.
This is particularly relevant when using the [_bulk API](https://www.elastic.co/docs/api/doc/elasticsearch/v9/operation/operation-bulk) for data ingestion.
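For example, when ingesting documents into an index backed by an ELSER on EIS endpoint, keep each `_bulk` request to 16 documents or fewer; larger ingest jobs can be split client-side into multiple requests. The index and field names below are placeholders:

```console
POST _bulk
{ "index": { "_index": "my-index" } }
{ "content": "First document to embed" }
{ "index": { "_index": "my-index" } }
{ "content": "Second document to embed" }
```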
53 changes: 27 additions & 26 deletions explore-analyze/elastic-inference/inference-api.md
@@ -9,18 +9,28 @@ products:
- id: kibana
---

# Inference integrations
# {{infer-cap}} integrations

{{es}} provides a machine learning [inference API](https://www.elastic.co/docs/api/doc/elasticsearch/v8/operation/operation-inference-get-1) to create and manage inference endpoints that integrate with services such as Elasticsearch (for built-in NLP models like [ELSER](/explore-analyze/machine-learning/nlp/ml-nlp-elser.md) and [E5](/explore-analyze/machine-learning/nlp/ml-nlp-e5.md)), as well as popular third-party services like Amazon Bedrock, Anthropic, Azure AI Studio, Cohere, Google AI, Mistral, OpenAI, Hugging Face, and more.
{{es}} provides a machine learning [{{infer}} API](https://www.elastic.co/docs/api/doc/elasticsearch/v9/group/endpoint-inference) to create and manage {{infer}} endpoints that integrate with services such as {{es}} (for built-in NLP models like [ELSER](/explore-analyze/machine-learning/nlp/ml-nlp-elser.md) and [E5](/explore-analyze/machine-learning/nlp/ml-nlp-e5.md)), as well as popular third-party services like Amazon Bedrock, Anthropic, Azure AI Studio, Cohere, Google AI, Mistral, OpenAI, Hugging Face, and more.

You can create a new inference endpoint:
You can use the default {{infer}} endpoints your deployment contains or create a new {{infer}} endpoint:

- using the [Create an inference endpoint API](https://www.elastic.co/docs/api/doc/elasticsearch/v8/operation/operation-inference-put-1)
- using the [Create an inference endpoint API](https://www.elastic.co/docs/api/doc/elasticsearch/v9/operation/operation-inference-put)
- through the [Inference endpoints UI](#add-inference-endpoints).
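For example, the following request is a sketch of creating a `text_embedding` endpoint backed by the OpenAI service; the endpoint ID, model, and API key are placeholders:

```console
PUT _inference/text_embedding/my-openai-embeddings
{
  "service": "openai",
  "service_settings": {
    "api_key": "<your-api-key>",
    "model_id": "text-embedding-3-small"
  }
}
```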

## Inference endpoints UI [inference-endpoints]
## Default {{infer}} endpoints [default-enpoints]

Your {{es}} deployment contains preconfigured {{infer}} endpoints, which makes it easier to use them when defining `semantic_text` fields or {{infer}} processors. The following list contains the default {{infer}} endpoints listed by `inference_id`:

- {applies_to}`stack: preview 9.1` {applies_to}`serverless: preview` `.elser-2-elastic`: uses the [ELSER](/explore-analyze/machine-learning/nlp/ml-nlp-elser.md) trained model through the Elastic {{infer-cap}} Service (ELSER on EIS) for `sparse_embedding` tasks (recommended for English language text). The `model_id` is `.elser_model_2`.
- `.elser-2-elasticsearch`: uses the [ELSER](/explore-analyze/machine-learning/nlp/ml-nlp-elser.md) built-in trained model for `sparse_embedding` tasks (recommended for English language text). The `model_id` is `.elser_model_2_linux-x86_64`.
- `.multilingual-e5-small-elasticsearch`: uses the [E5](../../explore-analyze/machine-learning/nlp/ml-nlp-e5.md) built-in trained model for `text_embedding` tasks (recommended for non-English language texts). The `model_id` is `.e5_model_2_linux-x86_64`.

Use the `inference_id` of the endpoint in a [`semantic_text`](elasticsearch://reference/elasticsearch/mapping-reference/semantic-text.md) field definition or when creating an [{{infer}} processor](elasticsearch://reference/enrich-processor/inference-processor.md). The API call automatically downloads and deploys the model, which might take a couple of minutes. Default {{infer}} endpoints have adaptive allocations enabled. For these models, the minimum number of allocations is `0`. If there is no {{infer}} activity that uses the endpoint, the number of allocations scales down to `0` automatically after 15 minutes.
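For example, the following sketch maps a `semantic_text` field to the default `.elser-2-elasticsearch` endpoint; the index and field names are placeholders:

```console
PUT my-semantic-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "semantic_text",
        "inference_id": ".elser-2-elasticsearch"
      }
    }
  }
}
```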

## {{infer-cap}} endpoints UI [inference-endpoints]

The **Inference endpoints** page provides an interface for managing inference endpoints.
The **{{infer-cap}} endpoints** page provides an interface for managing {{infer}} endpoints.

:::{image} /explore-analyze/images/kibana-inference-endpoints-ui.png
:alt: Inference endpoints UI
@@ -29,31 +39,31 @@ The **Inference endpoints** page provides an interface for managing inference en

Available actions:

* Add new endpoint
* View endpoint details
* Copy the inference endpoint ID
* Delete endpoints
- Add new endpoint
- View endpoint details
- Copy the inference endpoint ID
- Delete endpoints

## Add new inference endpoint [add-inference-endpoints]
## Add new {{infer}} endpoint [add-inference-endpoints]

To add a new interference endpoint using the UI:
To add a new {{infer}} endpoint using the UI:

1. Select the **Add endpoint** button.
1. Select a service from the drop-down menu.
1. Provide the required configuration details.
1. Select **Save** to create the endpoint.

If your inference endpoint uses a model deployed in Elastic’s infrastructure, such as ELSER, E5, or a model uploaded through Eland, you can configure [adaptive allocations](#adaptive-allocations) to dynamically adjust resource usage based on the current demand.
If your {{infer}} endpoint uses a model deployed in Elastic’s infrastructure, such as ELSER, E5, or a model uploaded through Eland, you can configure [adaptive allocations](#adaptive-allocations) to dynamically adjust resource usage based on the current demand.

## Adaptive allocations [adaptive-allocations]

Adaptive allocations allow inference services to dynamically adjust the number of model allocations based on the current load.
This feature is only supported for models deployed in Elastic’s infrastructure, such as ELSER, E5, or models uploaded through Eland. It is not available for third-party services (for example, Alibaba Cloud, Cohere, or OpenAI), because those models are hosted externally and not deployed within your Elasticsearch cluster.
Adaptive allocations allow {{infer}} services to dynamically adjust the number of model allocations based on the current load.
This feature is only supported for models deployed in Elastic’s infrastructure, such as ELSER, E5, or models uploaded through Eland. It is not available for models used through the Elastic {{infer-cap}} Service (EIS) or for third-party services (for example, Alibaba Cloud, Cohere, or OpenAI), because those models are not deployed within your {{es}} cluster.

When adaptive allocations are enabled:

* The number of allocations scales up automatically when the load increases.
* Allocations scale down to a minimum of 0 when the load decreases, saving resources.
- The number of allocations scales up automatically when the load increases.
- Allocations scale down to a minimum of 0 when the load decreases, saving resources.
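For example, adaptive allocations can be enabled when creating an endpoint for a model deployed in your cluster. The following sketch uses the built-in ELSER model; the endpoint ID and allocation limits are placeholders:

```console
PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elasticsearch",
  "service_settings": {
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 0,
      "max_number_of_allocations": 4
    },
    "num_threads": 1,
    "model_id": ".elser_model_2_linux-x86_64"
  }
}
```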

### Allocation scaling behavior

@@ -71,15 +81,6 @@ However, setting the `min_number_of_allocations` to a value greater than `0` kee

For more information about adaptive allocations and resources, refer to the [trained model autoscaling](/deploy-manage/autoscaling/trained-model-autoscaling.md) documentation.

## Default {{infer}} endpoints [default-enpoints]

Your {{es}} deployment contains preconfigured {{infer}} endpoints which makes them easier to use when defining `semantic_text` fields or using {{infer}} processors. The following list contains the default {{infer}} endpoints listed by `inference_id`:

* `.elser-2-elasticsearch`: uses the [ELSER](../../explore-analyze/machine-learning/nlp/ml-nlp-elser.md) built-in trained model for `sparse_embedding` tasks (recommended for English language tex). The `model_id` is `.elser_model_2_linux-x86_64`.
* `.multilingual-e5-small-elasticsearch`: uses the [E5](../../explore-analyze/machine-learning/nlp/ml-nlp-e5.md) built-in trained model for `text_embedding` tasks (recommended for non-English language texts). The `model_id` is `.e5_model_2_linux-x86_64`.

Use the `inference_id` of the endpoint in a [`semantic_text`](elasticsearch://reference/elasticsearch/mapping-reference/semantic-text.md) field definition or when creating an [{{infer}} processor](elasticsearch://reference/enrich-processor/inference-processor.md). The API call will automatically download and deploy the model which might take a couple of minutes. Default {{infer}} enpoints have adaptive allocations enabled. For these models, the minimum number of allocations is `0`. If there is no {{infer}} activity that uses the endpoint, the number of allocations will scale down to `0` automatically after 15 minutes.

## Configuring chunking [infer-chunking-config]

{{infer-cap}} endpoints have a limit on the amount of text they can process at once, determined by the model's input capacity. Chunking is the process of splitting the input text into pieces that remain within these limits.
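For example, chunking behavior can be adjusted with `chunking_settings` when creating an {{infer}} endpoint. The following sketch uses the built-in ELSER model; the endpoint ID and the specific limits are placeholders:

```console
PUT _inference/sparse_embedding/my-chunked-endpoint
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1,
    "model_id": ".elser_model_2_linux-x86_64"
  },
  "chunking_settings": {
    "strategy": "sentence",
    "max_chunk_size": 250,
    "sentence_overlap": 1
  }
}
```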
1 change: 1 addition & 0 deletions explore-analyze/toc.yml
@@ -122,6 +122,7 @@ toc:
- file: transforms/transform-limitations.md
- file: elastic-inference.md
children:
- file: elastic-inference/eis.md
- file: elastic-inference/inference-api.md
- file: machine-learning.md
children: