14 changes: 14 additions & 0 deletions explore-analyze/elastic-inference.md
@@ -0,0 +1,14 @@
---
applies_to:
stack: ga
serverless: ga
navigation_title: Elastic Inference
---

# Elastic {{infer-cap}}

There are several ways to perform {{infer}} in the {{stack}}. This page provides a brief overview of the different methods:

* [Using EIS (Elastic Inference Service)](elastic-inference/eis.md)
* [Using the {{infer}} API](elastic-inference/inference-api.md)
* [Trained models deployed in your cluster](machine-learning/nlp/ml-nlp-overview.md)
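
For example, once an {{infer}} endpoint exists, you can run {{infer}} against it with a single request. The following is a minimal sketch; `my-elser-endpoint` is a hypothetical endpoint ID:

```console
POST _inference/sparse_embedding/my-elser-endpoint
{
  "input": "What is semantic search?"
}
```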
10 changes: 10 additions & 0 deletions explore-analyze/elastic-inference/eis.md
@@ -0,0 +1,10 @@
---
applies_to:
stack: ga
serverless: ga
navigation_title: Elastic Inference Service (EIS)
---

# Elastic {{infer-cap}} Service

This page documents the Elastic {{infer-cap}} Service (EIS).
@@ -38,7 +38,7 @@ Creates an {{infer}} endpoint to perform an {{infer}} task with the `elastic` se
::::{note}
The `chat_completion` task type only supports streaming and only through the `_stream` API.

For more information on how to use the `chat_completion` task type, please refer to the [chat completion documentation](/solutions/search/inference-api/chat-completion-inference-api.md).
For more information on how to use the `chat_completion` task type, please refer to the [chat completion documentation](chat-completion-inference-api.md).

::::
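
Building on the note above, a sketch of a streamed `chat_completion` request, assuming an existing endpoint with the hypothetical ID `my-chat-endpoint`:

```console
POST _inference/chat_completion/my-chat-endpoint/_stream
{
  "messages": [
    {
      "role": "user",
      "content": "Say this is a test"
    }
  ]
}
```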

@@ -27,7 +27,7 @@ When adaptive allocations are enabled, the number of allocations of the model is

You can enable adaptive allocations by using:

* the create inference endpoint API for [ELSER](../../../solutions/search/inference-api/elser-inference-integration.md), [E5 and models uploaded through Eland](../../../solutions/search/inference-api/elasticsearch-inference-integration.md) that are used as {{infer}} services.
* the create inference endpoint API for [ELSER](../../elastic-inference/inference-api/elser-inference-integration.md), [E5 and models uploaded through Eland](../../elastic-inference/inference-api/elasticsearch-inference-integration.md) that are used as {{infer}} services.
* the [start trained model deployment](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-ml-start-trained-model-deployment) or [update trained model deployment](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-ml-update-trained-model-deployment) APIs for trained models that are deployed on {{ml}} nodes.

If the new allocations fit on the current {{ml}} nodes, they are immediately started. If more resource capacity is needed for creating new model allocations, then your {{ml}} node will be scaled up if {{ml}} autoscaling is enabled to provide enough resources for the new allocation. The number of model allocations can be scaled down to 0. They cannot be scaled up to more than 32 allocations, unless you explicitly set the maximum number of allocations to more. Adaptive allocations must be set up independently for each deployment and [{{infer}} endpoint](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put).
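
For example, a sketch of enabling adaptive allocations on an existing deployment through the update trained model deployment API — the deployment ID and allocation limits are illustrative:

```console
POST _ml/trained_models/.elser_model_2/deployment/_update
{
  "adaptive_allocations": {
    "enabled": true,
    "min_number_of_allocations": 0,
    "max_number_of_allocations": 8
  }
}
```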
4 changes: 2 additions & 2 deletions explore-analyze/machine-learning/nlp/ml-nlp-e5.md
@@ -13,7 +13,7 @@ EmbEddings from bidirEctional Encoder rEpresentations - or E5 - is a {{nlp}} mo

[Semantic search](../../../solutions/search/semantic-search.md) provides search results based on contextual meaning and user intent, rather than exact keyword matches.

E5 has two versions: one cross-platform version which runs on any hardware and one version which is optimized for Intel® silicon. The **Model Management** > **Trained Models** page shows you which version of E5 is recommended to deploy based on your cluster’s hardware. However, the recommended way to use E5 is through the [{{infer}} API](../../../solutions/search/inference-api/elasticsearch-inference-integration.md) as a service which makes it easier to download and deploy the model and you don’t need to select from different versions.
E5 has two versions: one cross-platform version which runs on any hardware and one version which is optimized for Intel® silicon. The **Model Management** > **Trained Models** page shows you which version of E5 is recommended to deploy based on your cluster’s hardware. However, the recommended way to use E5 is through the [{{infer}} API](../../elastic-inference/inference-api/elasticsearch-inference-integration.md) as a service, which makes it easier to download and deploy the model, and you don’t need to select from different versions.

Refer to the model cards of the [multilingual-e5-small](https://huggingface.co/elastic/multilingual-e5-small) and the [multilingual-e5-small-optimized](https://huggingface.co/elastic/multilingual-e5-small-optimized) models on HuggingFace for further information including licensing.

@@ -44,7 +44,7 @@ PUT _inference/text_embedding/my-e5-model

The API request automatically initiates the model download and then deploys the model.
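
For reference, a request like the one above might look as follows — a sketch; the exact `service_settings` are described in the linked documentation:

```console
PUT _inference/text_embedding/my-e5-model
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1,
    "model_id": ".multilingual-e5-small"
  }
}
```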

Refer to the [`elasticsearch` {{infer}} service documentation](../../../solutions/search/inference-api/elasticsearch-inference-integration.md) to learn more about the available settings.
Refer to the [`elasticsearch` {{infer}} service documentation](../../elastic-inference/inference-api/elasticsearch-inference-integration.md) to learn more about the available settings.

After you create the E5 {{infer}} endpoint, it’s ready to use for semantic search. The easiest way to perform semantic search in the {{stack}} is to [follow the `semantic_text` workflow](../../../solutions/search/semantic-search/semantic-search-semantic-text.md).
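
For instance, a minimal index mapping for that workflow could look like this — a sketch in which `my-index` and the field name are placeholders:

```console
PUT my-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "semantic_text",
        "inference_id": "my-e5-model"
      }
    }
  }
}
```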

6 changes: 3 additions & 3 deletions explore-analyze/machine-learning/nlp/ml-nlp-elser.md
@@ -39,7 +39,7 @@ Enabling trained model autoscaling for your ELSER deployment is recommended. Ref

Compared to the initial version of the model, ELSER v2 offers improved retrieval accuracy and more efficient indexing. This enhancement is attributed to the extension of the training data set, which includes high-quality question and answer pairs, and to the improved FLOPS regularizer, which reduces the cost of computing the similarity between a query and a document.

ELSER v2 has two versions: one cross-platform version which runs on any hardware and one version which is optimized for Intel® silicon. The **Model Management** > **Trained Models** page shows you which version of ELSER v2 is recommended to deploy based on your cluster’s hardware. However, the recommended way to use ELSER is through the [{{infer}} API](../../../solutions/search/inference-api/elser-inference-integration.md) as a service which makes it easier to download and deploy the model and you don’t need to select from different versions.
ELSER v2 has two versions: one cross-platform version which runs on any hardware and one version which is optimized for Intel® silicon. The **Model Management** > **Trained Models** page shows you which version of ELSER v2 is recommended to deploy based on your cluster’s hardware. However, the recommended way to use ELSER is through the [{{infer}} API](../../elastic-inference/inference-api/elser-inference-integration.md) as a service, which makes it easier to download and deploy the model, and you don’t need to select from different versions.

If you want to learn more about the ELSER V2 improvements, refer to [this blog post](https://www.elastic.co/search-labs/blog/introducing-elser-v2-part-1).

@@ -74,7 +74,7 @@ PUT _inference/sparse_embedding/my-elser-model

The API request automatically initiates the model download and then deploys the model. This example uses [autoscaling](ml-nlp-auto-scale.md) through adaptive allocations.
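
A sketch of a request like the one above, with adaptive allocations enabled — the allocation limits are illustrative:

```console
PUT _inference/sparse_embedding/my-elser-model
{
  "service": "elasticsearch",
  "service_settings": {
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 1,
      "max_number_of_allocations": 4
    },
    "num_threads": 1,
    "model_id": ".elser_model_2"
  }
}
```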

Refer to the [ELSER {{infer}} integration documentation](../../../solutions/search/inference-api/elser-inference-integration.md) to learn more about the available settings.
Refer to the [ELSER {{infer}} integration documentation](../../elastic-inference/inference-api/elser-inference-integration.md) to learn more about the available settings.

After you create the ELSER {{infer}} endpoint, it’s ready to use for semantic search. The easiest way to perform semantic search in the {{stack}} is to [follow the `semantic_text` workflow](../../../solutions/search/semantic-search/semantic-search-semantic-text.md).

@@ -306,7 +306,7 @@ To gain the biggest value out of ELSER trained models, consider to follow this l
## Benchmark information [elser-benchmarks]

::::{important}
The recommended way to use ELSER is through the [{{infer}} API](../../../solutions/search/inference-api/elser-inference-integration.md) as a service.
The recommended way to use ELSER is through the [{{infer}} API](../../elastic-inference/inference-api/elser-inference-integration.md) as a service.
::::

The following sections provide information about how ELSER performs on different hardware and compare its performance to {{es}} BM25 and other strong baselines.
2 changes: 1 addition & 1 deletion explore-analyze/machine-learning/nlp/ml-nlp-overview.md
@@ -16,7 +16,7 @@ Elastic offers a wide range of possibilities to leverage natural language proces

You can **integrate NLP models from different providers** such as Cohere, HuggingFace, or OpenAI and use them as a service through the [semantic_text](../../../solutions/search/semantic-search/semantic-search-semantic-text.md) workflow. You can also use [ELSER](ml-nlp-elser.md) (the retrieval model trained by Elastic) and [E5](ml-nlp-e5.md) in the same way.
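
For example, a sketch of registering a provider-hosted model as an {{infer}} endpoint — the endpoint name and model are illustrative, and the API key is a placeholder:

```console
PUT _inference/text_embedding/openai-embeddings
{
  "service": "openai",
  "service_settings": {
    "api_key": "<api_key>",
    "model_id": "text-embedding-3-small"
  }
}
```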

The [{{infer}} API](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-inference) enables you to use the same services with a more complex workflow, for greater control over your configurations settings. This [tutorial](../../../solutions/search/inference-api.md) walks you through the process of using the various services with the {{infer}} API.
The [{{infer}} API](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-inference) enables you to use the same services with a more complex workflow, for greater control over your configuration settings. This [tutorial](../../elastic-inference/inference-api.md) walks you through the process of using the various services with the {{infer}} API.

You can **upload and manage NLP models** using the Eland client and the [{{stack}}](ml-nlp-deploy-models.md). Find the [list of recommended and compatible models here](ml-nlp-model-ref.md). Refer to [*Examples*](ml-nlp-examples.md) to learn more about how to use {{ml}} models deployed in your cluster.

4 changes: 2 additions & 2 deletions explore-analyze/machine-learning/nlp/ml-nlp-rerank.md
@@ -44,7 +44,7 @@ Elastic Rerank is available in Elastic Stack version 8.17+:

## Download and deploy [ml-nlp-rerank-deploy]

To download and deploy Elastic Rerank, use the [create inference API](../../../solutions/search/inference-api/elasticsearch-inference-integration.md) to create an {{es}} service `rerank` endpoint.
To download and deploy Elastic Rerank, use the [create inference API](../../elastic-inference/inference-api/elasticsearch-inference-integration.md) to create an {{es}} service `rerank` endpoint.
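
A sketch of such a request — the endpoint name is hypothetical, and the model ID is assumed to be `.rerank-v1`:

```console
PUT _inference/rerank/my-elastic-rerank
{
  "service": "elasticsearch",
  "service_settings": {
    "model_id": ".rerank-v1",
    "num_threads": 1,
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 1,
      "max_number_of_allocations": 4
    }
  }
}
```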

::::{tip}
Refer to this [Python notebook](https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/search/12-semantic-reranking-elastic-rerank.ipynb) for an end-to-end example using Elastic Rerank.
@@ -280,7 +280,7 @@ For detailed benchmark information, including complete dataset results and metho
**Documentation**:

* [Semantic re-ranking in {{es}} overview](../../../solutions/search/ranking/semantic-reranking.md#semantic-reranking-in-es)
* [Inference API example](../../../solutions/search/inference-api/elasticsearch-inference-integration.md#inference-example-elastic-reranker)
* [Inference API example](../../elastic-inference/inference-api/elasticsearch-inference-integration.md#inference-example-elastic-reranker)

**Blogs**:

21 changes: 21 additions & 0 deletions explore-analyze/toc.yml
@@ -116,6 +116,27 @@ toc:
- file: transforms/transform-examples.md
- file: transforms/transform-painless-examples.md
- file: transforms/transform-limitations.md
- file: elastic-inference.md
children:
- file: elastic-inference/eis.md
- file: elastic-inference/inference-api.md
children:
- file: elastic-inference/inference-api/elastic-inference-service-eis.md
- file: elastic-inference/inference-api/alibabacloud-ai-search-inference-integration.md
- file: elastic-inference/inference-api/amazon-bedrock-inference-integration.md
- file: elastic-inference/inference-api/anthropic-inference-integration.md
- file: elastic-inference/inference-api/azure-ai-studio-inference-integration.md
- file: elastic-inference/inference-api/azure-openai-inference-integration.md
- file: elastic-inference/inference-api/chat-completion-inference-api.md
- file: elastic-inference/inference-api/cohere-inference-integration.md
- file: elastic-inference/inference-api/elasticsearch-inference-integration.md
- file: elastic-inference/inference-api/elser-inference-integration.md
- file: elastic-inference/inference-api/google-ai-studio-inference-integration.md
- file: elastic-inference/inference-api/google-vertex-ai-inference-integration.md
- file: elastic-inference/inference-api/huggingface-inference-integration.md
- file: elastic-inference/inference-api/jinaai-inference-integration.md
- file: elastic-inference/inference-api/mistral-inference-integration.md
- file: elastic-inference/inference-api/openai-inference-integration.md
- file: machine-learning.md
children:
- file: machine-learning/setting-up-machine-learning.md
@@ -28,7 +28,7 @@ If you set the minimum number of allocations to 1, you will be charged even if t

You can enable adaptive allocations by using:

* the create inference endpoint API for [ELSER](../../../solutions/search/inference-api/elser-inference-integration.md), [E5 and models uploaded through Eland](../../../solutions/search/inference-api/elasticsearch-inference-integration.md) that are used as inference services.
* the create inference endpoint API for [ELSER](../../../explore-analyze/elastic-inference/inference-api/elser-inference-integration.md), [E5 and models uploaded through Eland](../../../explore-analyze/elastic-inference/inference-api/elasticsearch-inference-integration.md) that are used as inference services.
* the [start trained model deployment](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-ml-start-trained-model-deployment) or [update trained model deployment](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-ml-update-trained-model-deployment) APIs for trained models that are deployed on machine learning nodes.

If the new allocations fit on the current machine learning nodes, they are immediately started. If more resource capacity is needed for creating new model allocations, then your machine learning node will be scaled up if machine learning autoscaling is enabled to provide enough resources for the new allocation. The number of model allocations can be scaled down to 0. They cannot be scaled up to more than 32 allocations, unless you explicitly set the maximum number of allocations to more. Adaptive allocations must be set up independently for each deployment and [inference endpoint](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put).
@@ -23,7 +23,7 @@ The following examples use the:
* `amazon.titan-embed-text-v1` model for [Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/model-ids.md)
* `ops-text-embedding-zh-001` model for [AlibabaCloud AI](https://help.aliyun.com/zh/open-search/search-platform/developer-reference/text-embedding-api-details)

You can use any Cohere and OpenAI models, they are all supported by the {{infer}} API. For a list of recommended models available on HuggingFace, refer to [the supported model list](../../../solutions/search/inference-api/huggingface-inference-integration.md#inference-example-hugging-face-supported-models).
You can use any Cohere and OpenAI models; they are all supported by the {{infer}} API. For a list of recommended models available on HuggingFace, refer to [the supported model list](../../../explore-analyze/elastic-inference/inference-api/huggingface-inference-integration.md).
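
For example, a sketch of configuring a HuggingFace-hosted model as an {{infer}} endpoint — both values are placeholders for your own endpoint URL and access token:

```console
PUT _inference/text_embedding/hugging-face-embeddings
{
  "service": "hugging_face",
  "service_settings": {
    "api_key": "<access_token>",
    "url": "<url_endpoint>"
  }
}
```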

Click the name of the service you want to use on any of the widgets below to review the corresponding instructions.

2 changes: 1 addition & 1 deletion solutions/search/hybrid-semantic-text.md
@@ -14,7 +14,7 @@ This tutorial demonstrates how to perform hybrid search, combining semantic sear

In hybrid search, semantic search retrieves results based on the meaning of the text, while full-text search focuses on exact word matches. By combining both methods, hybrid search delivers more relevant results, particularly in cases where relying on a single approach may not be sufficient.

The recommended way to use hybrid search in the {{stack}} is following the `semantic_text` workflow. This tutorial uses the [`elasticsearch` service](inference-api/elasticsearch-inference-integration.md) for demonstration, but you can use any service and their supported models offered by the {{infer-cap}} API.
The recommended way to use hybrid search in the {{stack}} is to follow the `semantic_text` workflow. This tutorial uses the [`elasticsearch` service](../../explore-analyze/elastic-inference/inference-api/elasticsearch-inference-integration.md) for demonstration, but you can use any service and its supported models offered by the {{infer-cap}} API.
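
As a sketch of the end result, a hybrid query can combine a full-text `match` query with a `semantic` query through reciprocal rank fusion — the index and field names here are placeholders:

```console
GET my-index/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "match": {
                "content": "muscle soreness after running"
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "semantic": {
                "field": "semantic_content",
                "query": "muscle soreness after running"
              }
            }
          }
        }
      ]
    }
  }
}
```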


## Create an index mapping [hybrid-search-create-index-mapping]