[DOCS] Documents trained model auto-scaling #2795
Merged
Commits (34):
5cfef7a Removes ELSER auto-scaling limitation. (szabosteve)
56873a0 [DOCS] Documents ELSER auto-scale. (szabosteve)
c62ebd5 [DOCS] Further edits. (szabosteve)
51609ce Adds intro text. (szabosteve)
9d408db Fixes bullet list. (szabosteve)
abc7f5a Merge branch 'main' into elser-auto-scale (szabosteve)
e9b26e8 [DOCS] Fine-tunes adaptive resources docs. (szabosteve)
b2b075f Merge branch 'elser-auto-scale' of github.com:szabosteve/stack-docs i… (szabosteve)
b6912d8 [DOCS] Adds screenshot. (szabosteve)
b36e365 [DOCS] Splits autoscaling content to new page. (szabosteve)
59b924c Adds reference to E5 page. (szabosteve)
78116b1 [DOCS] Adds IDs. (szabosteve)
54323dc Merge branch 'main' into elser-auto-scale (elasticmachine)
83ae90a [DOCS] Adds link to pricing calculator. (szabosteve)
7c2c455 Merge branch 'elser-auto-scale' of github.com:szabosteve/stack-docs i… (szabosteve)
8b411c3 [DOCS] Adds available resources matrix. (szabosteve)
7305bc9 [DOCS] Removes strings. (szabosteve)
d40bd92 [DOCS] Fixes typo. (szabosteve)
81efc0f [DOCS] Rephrases sentence. (szabosteve)
5b3c377 [DOCS] Removes discrete flag. (szabosteve)
64d0d4f [DOCS] Rescales image. (szabosteve)
ebd9083 [DOCS] Fine-tunes phrasing. (szabosteve)
fd7293c Update docs/en/stack/ml/nlp/ml-nlp-autoscaling.asciidoc (szabosteve)
e306f66 [DOCS] Fixes typo. (szabosteve)
55f97df [DOCS] Addresses feedback. (szabosteve)
94e4b40 [DOCS] Fix hyphenation. (szabosteve)
83e47eb [DOCS] Fixes typo. (szabosteve)
771e508 [DOCS] Fixes another typo. (szabosteve)
20c1b57 [DOCS] Addresses feedback. (szabosteve)
7e60639 Changes section title. (szabosteve)
f489618 Apply suggestions from code review (szabosteve)
e7bc129 Update docs/en/stack/ml/nlp/ml-nlp-autoscaling.asciidoc (szabosteve)
ffbafd0 Adds a Note about Obs and Sec project behavior. (szabosteve)
486ad39 Apply suggestions from code review (szabosteve)
Binary file modified (+33 KB): docs/en/stack/ml/nlp/images/ml-nlp-deployment-id-elser-v2.png
docs/en/stack/ml/nlp/ml-nlp-autoscaling.asciidoc (new file, @@ -0,0 +1,153 @@):
[[ml-nlp-auto-scale]]
= Trained model autoscaling

You can enable autoscaling for each of your trained model deployments.
Autoscaling allows {es} to automatically adjust the resources the deployment can use based on the workload demand.

There are two ways to enable autoscaling:

* through APIs by enabling adaptive allocations
* in {kib} by enabling adaptive resources

IMPORTANT: To fully leverage model autoscaling, it is highly recommended to enable {cloud}/ec-autoscaling.html[deployment autoscaling].


[discrete]
[[nlp-model-adaptive-allocations]]
== Enabling autoscaling through APIs - adaptive allocations

Model allocations are independent units of work for NLP tasks.

If you set the number of threads and allocations for a model manually, they remain constant even when not all the available resources are fully used or when the load on the model requires more resources.
Instead of setting the number of allocations manually, you can enable adaptive allocations to set the number of allocations based on the load on the process.
This can help you to manage performance and cost more easily.
(Refer to the https://cloud.elastic.co/pricing[pricing calculator] to learn more about the possible costs.)

When adaptive allocations are enabled, the number of allocations of the model is set automatically based on the current load.
When the load is high, a new model allocation is automatically created.
When the load is low, a model allocation is automatically removed.

You can enable adaptive allocations by using:

* the create inference endpoint API for {ref}/infer-service-elser.html[ELSER] and {ref}/infer-service-elasticsearch.html[E5 and models uploaded through Eland] that are used as {infer} services.
* the {ref}/start-trained-model-deployment.html[start trained model deployment] or {ref}/update-trained-model-deployment.html[update trained model deployment] APIs for trained models that are deployed on {ml} nodes.

If the new allocations fit on the current {ml} nodes, they are started immediately.
If more resource capacity is needed for creating new model allocations and {ml} autoscaling is enabled, the {ml} node is scaled up to provide enough resources for the new allocations.
The number of model allocations can be scaled down to 0.
It cannot be scaled up to more than 32 allocations unless you explicitly set a higher maximum number of allocations.
Adaptive allocations must be set up independently for each deployment and {infer} endpoint.
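
For example, the following is a minimal sketch of enabling adaptive allocations while creating an ELSER {infer} endpoint through the create inference endpoint API; the endpoint name and the allocation bounds are illustrative values, not recommendations:

[source,console]
----
PUT _inference/sparse_embedding/my-elser-endpoint <1>
{
  "service": "elser",
  "service_settings": {
    "num_threads": 1,
    "adaptive_allocations": { <2>
      "enabled": true,
      "min_number_of_allocations": 1,
      "max_number_of_allocations": 4
    }
  }
}
----
<1> `my-elser-endpoint` is a hypothetical endpoint name; choose your own.
<2> The allocation bounds are example values; with them, the number of allocations scales between 1 and 4 based on the load.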


[discrete]
[[optimize-use-case]]
=== Optimize for typical use cases by using adaptive allocations

You can optimize your model deployment for typical use cases, such as search and ingest.

When you optimize for ingest, the throughput is higher, which increases the number of {infer} requests that can be performed in parallel.
When you optimize for search, the latency is lower during search processes.

* If you want to optimize for ingest, set the number of threads to `1` (`"num_threads": 1`).
* If you want to optimize for search, set the number of threads to greater than `1`.
Increasing the number of threads makes the search processes more performant, as the sketch below shows.
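
Here is a hedged sketch of starting a search-optimized deployment on a {ml} node through the start trained model deployment API; the model ID, deployment ID, and thread count are example values:

[source,console]
----
POST _ml/trained_models/.elser_model_2/deployment/_start?deployment_id=elser_search&threads_per_allocation=4&number_of_allocations=1 <1>
----
<1> `threads_per_allocation` must be a power of 2. An ingest-optimized deployment would use `threads_per_allocation=1` instead.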


[discrete]
[[nlp-model-adaptive-resources]]
== Enabling autoscaling in {kib} - adaptive resources

You can enable adaptive resources for your models when starting or updating the model deployment.
Adaptive resources make it possible for {es} to scale up or down the available resources based on the load on the process.
This can help you to manage performance and cost more easily.
When adaptive resources are enabled, the number of vCPUs that the model deployment uses is set automatically based on the current load.
When the load is high, the number of vCPUs that the process can use is automatically increased.
When the load is low, the number of vCPUs that the process can use is automatically decreased.

You can choose from three levels of resource usage for your trained model deployment.
Refer to the tables in the <<auto-scaling-matrix>> section to find out the settings for the level you selected.

[role="screenshot"]
image::images/ml-nlp-deployment-id-elser-v2.png["ELSER deployment with adaptive resources enabled.",width=640]

[discrete]
[[auto-scaling-matrix]]
== Model deployment resource matrix

The resources that a trained model deployment uses depend on three factors:

* your cluster environment (Serverless, Cloud, or on-premises)
* the use case you optimize the model deployment for (ingest or search)
* whether adaptive resources are enabled or disabled (dynamic or static resources)

If you use {es} on-premises, the adaptive resources behavior is fully dynamic and depends heavily on the hardware configuration.

The following tables show the number of allocations, threads, and vCPUs that are available in Cloud when adaptive resources are enabled or disabled.

[discrete]
=== Deployments in Cloud optimized for ingest

For ingest-optimized deployments, the number of model allocations is maximized.

[discrete]
==== Adaptive resources enabled

[cols="4*", options="header"]
|==========
| Level | Allocations | Threads | vCPUs
| Low | 0 to 2 if available, dynamically | 1 | 0 to 2 if available, dynamically
| Medium | 1 to 32, dynamically | 1 | 1 to the smaller of 32 or the limit set in the Cloud console, dynamically
| High | 1 to the limit set in the Cloud console ^*^, dynamically | 1 | 1 to the limit set in the Cloud console, dynamically
|==========

^*^ The Cloud console doesn't directly set an allocations limit; it only sets a vCPU limit.
This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads.
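For example, assuming an illustrative vCPU limit of 16 in the Cloud console and 1 thread per allocation, the deployment can scale up to 16 / 1 = 16 allocations.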

[discrete]
==== Adaptive resources disabled

[cols="4*", options="header"]
|==========
| Level | Allocations | Threads | vCPUs
| Low | 2 if available, otherwise 1, statically | 1 | 2 if available
| Medium | the smaller of 32 or the limit set in the Cloud console, statically | 1 | 32 if available
| High | Maximum available set in the Cloud console ^*^, statically | 1 | Maximum available set in the Cloud console, statically
|==========

^*^ The Cloud console doesn't directly set an allocations limit; it only sets a vCPU limit.
This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads.

[discrete]
=== Deployments in Cloud optimized for search

For search-optimized deployments, the number of threads is maximized.
The maximum number of threads that can be claimed depends on your hardware configuration.

[discrete]
==== Adaptive resources enabled

[cols="4*", options="header"]
|==========
| Level | Allocations | Threads | vCPUs
| Low | 1 | 2 | 2
| Medium | 1 to 2 (if threads=16), dynamically | maximum that the hardware allows (for example, 16) | 1 to 32, dynamically
| High | 1 to the limit set in the Cloud console ^*^, dynamically | maximum that the hardware allows (for example, 16) | 1 to the limit set in the Cloud console, dynamically
|==========

^*^ The Cloud console doesn't directly set an allocations limit; it only sets a vCPU limit.
This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads.

[discrete]
==== Adaptive resources disabled

[cols="4*", options="header"]
|==========
| Level | Allocations | Threads | vCPUs
| Low | 1 if available, statically | 2 | 2 if available
| Medium | 2 (if threads=16), statically | maximum that the hardware allows (for example, 16) | 32 if available
| High | Maximum available set in the Cloud console ^*^, statically | maximum that the hardware allows (for example, 16) | Maximum available set in the Cloud console, statically
|==========

^*^ The Cloud console doesn't directly set an allocations limit; it only sets a vCPU limit.
This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads.