[Serverless] Adds Trained model autoscaling page #139
Merged: kosabogi merged 21 commits into elastic:main from kosabogi:add-serverless-model-autoscaling-doc on Nov 12, 2024
Changes from 15 commits
c0564eb Adds Trained model autoscaling page (kosabogi)
d8dde76 Update serverless/pages/ml-nlp-auto-scale.mdx (kosabogi)
f6b650d Update serverless/pages/ml-nlp-auto-scale.mdx (kosabogi)
4683ddc Update serverless/pages/ml-nlp-auto-scale.mdx (kosabogi)
f91c731 Update serverless/pages/ml-nlp-auto-scale.mdx (kosabogi)
f17b2a4 Update serverless/pages/ml-nlp-auto-scale.mdx (kosabogi)
26fb9e0 Update serverless/pages/ml-nlp-auto-scale.mdx (kosabogi)
c8b8b3b Update serverless/pages/ml-nlp-auto-scale.mdx (kosabogi)
55117e6 Update serverless/pages/ml-nlp-auto-scale.mdx (kosabogi)
ac3a201 Update serverless/pages/ml-nlp-auto-scale.mdx (kosabogi)
18487de Update serverless/pages/ml-nlp-auto-scale.mdx (kosabogi)
e430ce7 Update serverless/pages/ml-nlp-auto-scale.mdx (kosabogi)
27b445c Update serverless/pages/ml-nlp-auto-scale.mdx (kosabogi)
8f7a9d2 Changes paragraph placement (kosabogi)
1beb41c Update serverless/pages/ml-nlp-auto-scale.mdx (kosabogi)
031c91f Update serverless/pages/ml-nlp-auto-scale.mdx (kosabogi)
9fdee40 Updates document based on feedback (kosabogi)
ad99fd3 Merge remote-tracking branch 'upstream/main' into add-serverless-mode… (kosabogi)
1b3caa8 Merge branch 'main' into add-serverless-model-autoscaling-doc (colleenmcginnis)
37ee519 mdx to asciidoc (colleenmcginnis)
aaf7ae9 Updates table (kosabogi)
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -63,6 +63,9 @@ | |
}, | ||
{ | ||
"slug": "/serverless/regions" | ||
}, | ||
{ | ||
"slug": "/serverless/general/ml-nlp-auto-scale" | ||
} | ||
] | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,128 @@ | ||
--- | ||
slug: /serverless/general/ml-nlp-auto-scale | ||
title: Trained model autoscaling | ||
tags: ['serverless'] | ||
--- | ||
|
||
You can enable autoscaling for each of your trained model deployments. | ||
Autoscaling allows Elasticsearch to automatically adjust the resources the model deployment can use based on the workload demand. | ||
|
||
There are two ways to enable autoscaling: | ||
|
||
- through APIs by enabling adaptive allocations | ||
- in Kibana by enabling adaptive resources | ||
|
||
|
||
Trained model autoscaling is available for Search, Observability, and Security projects on serverless deployments. However, these projects handle processing power differently, which impacts their costs and resource limits. | ||
|
||
Security and Observability projects are only charged for data ingestion and retention. They are not charged for processing power (vCU usage), which is used for more complex operations, like running advanced search models. For example, in Search projects, models such as ELSER require significant processing power to provide more accurate search results. | ||
|
||
## Enabling autoscaling through APIs - adaptive allocations | ||
|
||
Model allocations are independent units of work for NLP tasks. | ||
If you set the number of threads and allocations for a model manually, they remain constant even when the available resources are not fully used or when the load on the model requires more resources. | ||
|
||
Instead of setting the number of allocations manually, you can enable adaptive allocations to set the number of allocations based on the load on the process. | ||
This can help you to manage performance and cost more easily. | ||
(Refer to the [pricing calculator](https://cloud.elastic.co/pricing) to learn more about the possible costs.) | ||
|
||
When adaptive allocations are enabled, the number of allocations of the model is set automatically based on the current load. | ||
When the load is high, a new model allocation is automatically created. | ||
|
||
When the load is low, a model allocation is automatically removed. | ||
You can explicitly set the minimum and maximum number of allocations; autoscaling will occur within these limits. | ||
|
||
<DocCallOut color="primary" title="Note"> | ||
If you set the minimum number of allocations to 1, you will be charged even if the system is not using those resources. | ||
</DocCallOut> | ||
|
||
You can enable adaptive allocations by using: | ||
|
||
- the create inference endpoint API for [ELSER](https://www.elastic.co/guide/en/elasticsearch/reference/master/infer-service-elser.html), [E5 and models uploaded through Eland](https://www.elastic.co/guide/en/elasticsearch/reference/master/infer-service-elasticsearch.html) that are used as inference services. | ||
- the [start trained model deployment](https://www.elastic.co/guide/en/elasticsearch/reference/master/start-trained-model-deployment.html) or [update trained model deployment](https://www.elastic.co/guide/en/elasticsearch/reference/master/update-trained-model-deployment.html) APIs for trained models that are deployed on machine learning nodes. | ||
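
As a rough illustration of the first option, the request body for the create inference endpoint API could be assembled as shown here. This is a minimal sketch, not the project's own tooling: the helper name `elser_endpoint_body` and the endpoint ID `my-elser-endpoint` are hypothetical, and the field names (`adaptive_allocations`, `min_number_of_allocations`, `max_number_of_allocations`, `num_threads`) are assumed to match the linked ELSER service reference, so verify them against the version you run.

```python
import json


def elser_endpoint_body(min_allocations: int, max_allocations: int,
                        num_threads: int = 1) -> dict:
    """Build a request body that enables adaptive allocations (hypothetical helper)."""
    if min_allocations < 0 or max_allocations < min_allocations:
        raise ValueError("expected 0 <= min_allocations <= max_allocations")
    return {
        "service": "elser",
        "service_settings": {
            "adaptive_allocations": {
                "enabled": True,
                "min_number_of_allocations": min_allocations,
                "max_number_of_allocations": max_allocations,
            },
            "num_threads": num_threads,
        },
    }


# Hypothetical endpoint ID; the body would be sent with a request such as
# PUT _inference/sparse_embedding/my-elser-endpoint
print(json.dumps(elser_endpoint_body(1, 4), indent=2))
```

The helper only builds and validates the body; sending it is left to whichever HTTP or Elasticsearch client you already use.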
|
||
If the new allocations fit on the current machine learning nodes, they are immediately started. | ||
If creating new model allocations requires more capacity and machine learning autoscaling is enabled, your machine learning node is scaled up to provide enough resources for the new allocation. | ||
|
||
The number of model allocations can be scaled down to 0. | ||
It cannot be scaled up beyond 32 allocations unless you explicitly set a higher maximum number of allocations. | ||
Adaptive allocations must be set up independently for each deployment and [inference endpoint](https://www.elastic.co/guide/en/elasticsearch/reference/master/put-inference-api.html). | ||
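
The bounds described in this paragraph can be sketched as a small helper. This is illustrative only: `clamp_allocations` is a hypothetical name, the defaults of 0 and 32 are taken from the text above, and the actual scaling decision is made by Elasticsearch, not by client code.

```python
from typing import Optional

DEFAULT_MIN_ALLOCATIONS = 0   # from the text: allocations can scale down to 0
DEFAULT_MAX_ALLOCATIONS = 32  # from the text: ceiling unless a higher maximum is set


def clamp_allocations(desired: int,
                      min_allocations: Optional[int] = None,
                      max_allocations: Optional[int] = None) -> int:
    """Return the allocation count after applying the configured limits."""
    low = DEFAULT_MIN_ALLOCATIONS if min_allocations is None else min_allocations
    high = DEFAULT_MAX_ALLOCATIONS if max_allocations is None else max_allocations
    if low < 0 or high < low:
        raise ValueError("expected 0 <= min_allocations <= max_allocations")
    return max(low, min(desired, high))


print(clamp_allocations(40))                      # 32: default ceiling applies
print(clamp_allocations(40, max_allocations=64))  # 40: explicit higher maximum
print(clamp_allocations(0, min_allocations=1))    # 1: explicit minimum applies
```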
|
||
When you create inference endpoints on Serverless using Kibana, adaptive allocations are automatically turned on, and there is no option to disable them. | ||
|
||
### Optimizing for typical use cases | ||
|
||
You can optimize your model deployment for typical use cases, such as search and ingest. | ||
When you optimize for ingest, the throughput will be higher, which increases the number of inference requests that can be performed in parallel. | ||
When you optimize for search, the latency will be lower during search processes. | ||
|
||
- If you want to optimize for ingest, set the number of threads to `1` (`"threads_per_allocation": 1`). | ||
- If you want to optimize for search, set the number of threads to a value greater than `1`. | ||
Increasing the number of threads lowers the latency of search requests. | ||
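
The two choices above might translate into deployment settings roughly as follows. This is a sketch under stated assumptions: `deployment_settings` is a hypothetical helper, `threads_per_allocation` is the setting named in the list, and the default of 4 search threads is an arbitrary example value, not a recommendation from the source.

```python
def deployment_settings(optimize_for: str, search_threads: int = 4) -> dict:
    """Pick threads_per_allocation for a use case (hypothetical helper)."""
    if optimize_for == "ingest":
        # Ingest: one thread per allocation, favoring throughput.
        return {"threads_per_allocation": 1}
    if optimize_for == "search":
        # Search: more than one thread per allocation, favoring latency.
        if search_threads <= 1:
            raise ValueError("search-optimized deployments need more than 1 thread")
        return {"threads_per_allocation": search_threads}
    raise ValueError("optimize_for must be 'ingest' or 'search'")


print(deployment_settings("ingest"))
print(deployment_settings("search"))
```

Check the start trained model deployment reference for any constraints on valid thread counts before using a specific value.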
|
||
## Enabling autoscaling in Kibana - adaptive resources | ||
|
||
You can enable adaptive resources for your models when starting or updating the model deployment. | ||
Adaptive resources make it possible for Elasticsearch to scale up or down the available resources based on the load on the process. | ||
This can help you to manage performance and cost more easily. | ||
When adaptive resources are enabled, the number of vCUs that the model deployment uses is set automatically based on the current load. | ||
|
||
When the load is high, the number of vCUs that the process can use is automatically increased. | ||
When the load is low, the number of vCUs that the process can use is automatically decreased. | ||
|
||
You can choose from three levels of resource usage for your trained model deployment; autoscaling will occur within the selected level's range. | ||
|
||
Refer to the tables in the "Model deployment resource matrix" section to find out the settings for the level you selected. | ||
|
||
<DocImage size="xxl" url="../images/ml-nlp-deployment.png" alt="ML model deployment with adaptive resources enabled." /> | ||
|
||
Search projects are given access to more processing resources, while Security and Observability projects have lower limits. This difference is reflected in the UI configuration: Search projects have higher resource limits compared to Security and Observability projects to accommodate their more complex operations. | ||
|
||
On Serverless, adaptive allocations are automatically enabled for all project types. | ||
However, the "Adaptive resources" control is not displayed in Kibana for Observability and Security projects. | ||
|
||
## Model deployment resource matrix | ||
|
||
The resources used by trained model deployments depend on three factors: | ||
|
||
- your cluster environment (Serverless, Cloud, or on-premises) | ||
- the use case you optimize the model deployment for (ingest or search) | ||
- whether model autoscaling is enabled with adaptive allocations/resources to have dynamic resources, or disabled for static resources | ||
|
||
The following tables show you the number of allocations, threads, and vCUs available on Serverless when adaptive resources are enabled or disabled. | ||
|
||
### Deployments on serverless optimized for ingest | ||
|
||
For ingest-optimized deployments, we maximize the number of model allocations. | ||
|
||
#### Adaptive resources enabled | ||
|
||
| Level | Allocations | Threads | vCUs | | ||
|--------|------------------------------------------------------|---------|------------------------------------------------------| | ||
| Low | 0 to 2 dynamically | 1 | 0 to 2 dynamically | | ||
| Medium | 1 to 32 dynamically | 1 | 1 to 32 dynamically | | ||
| High | 1 to 512 for Search <br /> 1 to 128 for Security and Observability | 1 | 1 to 512 for Search <br /> 1 to 128 for Security and Observability | | ||
|
||
#### Adaptive resources disabled (Search only) | ||
|
||
| Level | Allocations | Threads | vCUs | | ||
|--------|------------------------------------------------------|---------|------------------------------------------------------| | ||
| Low | Exactly 2 | 1 | 2 | | ||
| Medium | Exactly 32 | 1 | 32 | | ||
| High | 512 for Search <br /> No static allocations for Security and Observability | 1 | 512 for Search <br /> No static allocations for Security and Observability | | ||
|
||
### Deployments on serverless optimized for search | ||
|
||
For search-optimized deployments, we maximize the number of threads. | ||
|
||
#### Adaptive resources enabled | ||
|
||
| Level | Allocations | Threads | vCUs | | ||
|--------|------------------------------------------------------|---------|------------------------------------------------------| | ||
| Low | 0 to 1 dynamically | Always 2 | 0 to 2 dynamically | | ||
| Medium | 1 to 2 (if threads=16), dynamically | Maximum (for example, 16) | 1 to 32 dynamically | | ||
| High | 1 to 32 (if threads=16), dynamically | Maximum (for example, 16) | 1 to 512 for Search <br /> 1 to 128 for Security and Observability | | ||
|
||
#### Adaptive resources disabled | ||
|
||
| Level | Allocations | Threads | vCUs | | ||
|--------|---------------------------------------------------------|------------------------|------------------------------------------------------| | ||
| Low | 1 statically | Always 2 | 2 | | ||
| Medium | 2 statically (if threads=16) | Maximum (for example, 16) | 32 | | ||
| High | 32 statically (if threads=16) for Search <br /> No static allocations for Security and Observability | Maximum (for example, 16) | 512 for Search <br /> No static allocations for Security and Observability | |
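
In the static rows of these tables, the vCU figure equals the number of allocations multiplied by the number of threads. The source does not state this rule explicitly; it is inferred from the table values, so treat it as an observation. The sketch below checks it against the Search values copied from the "Adaptive resources disabled" tables, with `vcus` as a hypothetical helper name.

```python
# (optimized for, level, allocations, threads, vCUs) copied from the
# "Adaptive resources disabled" tables, Search project values.
STATIC_SEARCH_ROWS = [
    ("ingest", "Low", 2, 1, 2),
    ("ingest", "Medium", 32, 1, 32),
    ("ingest", "High", 512, 1, 512),
    ("search", "Low", 1, 2, 2),
    ("search", "Medium", 2, 16, 32),
    ("search", "High", 32, 16, 512),
]


def vcus(allocations: int, threads: int) -> int:
    """Inferred relationship: one vCU per thread per allocation."""
    return allocations * threads


for use_case, level, allocations, threads, expected in STATIC_SEARCH_ROWS:
    matches = vcus(allocations, threads) == expected
    print(f"{use_case:6} {level:6} {allocations:3} x {threads:2} = "
          f"{vcus(allocations, threads):3} (table: {expected}, match: {matches})")
```

The search-optimized Medium and High rows assume 16 threads, as in the tables' own "if threads=16" example.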