[DOCS] Documents trained model auto-scaling #2795
Merged
Commits (34):
5cfef7a Removes ELSER auto-scaling limitation. (szabosteve)
56873a0 [DOCS] Documents ELSER auto-scale. (szabosteve)
c62ebd5 [DOCS] Further edits. (szabosteve)
51609ce Adds intro text. (szabosteve)
9d408db Fixes bullet list. (szabosteve)
abc7f5a Merge branch 'main' into elser-auto-scale (szabosteve)
e9b26e8 [DOCS] Fine-tunes adaptive resources docs. (szabosteve)
b2b075f Merge branch 'elser-auto-scale' of github.com:szabosteve/stack-docs i… (szabosteve)
b6912d8 [DOCS] Adds screenshot. (szabosteve)
b36e365 [DOCS] Splits autoscaling content to new page. (szabosteve)
59b924c Adds reference to E5 page. (szabosteve)
78116b1 [DOCS] Adds IDs. (szabosteve)
54323dc Merge branch 'main' into elser-auto-scale (elasticmachine)
83ae90a [DOCS] Adds link to pricing calculator. (szabosteve)
7c2c455 Merge branch 'elser-auto-scale' of github.com:szabosteve/stack-docs i… (szabosteve)
8b411c3 [DOCS] Adds available resources matrix. (szabosteve)
7305bc9 [DOCS] Removes strings. (szabosteve)
d40bd92 [DOCS] Fixes typo. (szabosteve)
81efc0f [DOCS] Rephrases sentence. (szabosteve)
5b3c377 [DOCS] Removes discrete flag. (szabosteve)
64d0d4f [DOCS] Rescales image. (szabosteve)
ebd9083 [DOCS] Fine-tunes phrasing. (szabosteve)
fd7293c Update docs/en/stack/ml/nlp/ml-nlp-autoscaling.asciidoc (szabosteve)
e306f66 [DOCS] Fixes typo. (szabosteve)
55f97df [DOCS] Addresses feedback. (szabosteve)
94e4b40 [DOCS] Fix hyphenation. (szabosteve)
83e47eb [DOCS] Fixes typo. (szabosteve)
771e508 [DOCS] Fixes another typo. (szabosteve)
20c1b57 [DOCS] Addresses feedback. (szabosteve)
7e60639 Changes section title. (szabosteve)
f489618 Apply suggestions from code review (szabosteve)
e7bc129 Update docs/en/stack/ml/nlp/ml-nlp-autoscaling.asciidoc (szabosteve)
ffbafd0 Adds a Note about Obs and Sec project behavior. (szabosteve)
486ad39 Apply suggestions from code review (szabosteve)
Binary file modified (+33 KB): docs/en/stack/ml/nlp/images/ml-nlp-deployment-id-elser-v2.png
docs/en/stack/ml/nlp/ml-nlp-autoscaling.asciidoc (new file, @@ -0,0 +1,153 @@):
[[ml-nlp-auto-scale]]
= Trained model autoscaling

You can enable autoscaling for each of your trained model deployments.
Autoscaling allows {es} to automatically adjust the resources the deployment can use based on the workload demand.

There are two ways to enable autoscaling:

* through APIs by enabling adaptive allocations
* in {kib} by enabling adaptive resources

IMPORTANT: To fully leverage model autoscaling, it is highly recommended to enable {cloud}/ec-autoscaling.html[deployment autoscaling].


[discrete]
[[nlp-model-adaptive-allocations]]
== Enabling autoscaling through APIs - adaptive allocations

Model allocations are independent units of work for NLP tasks.

If you set the number of threads and allocations for a model manually, they remain constant even when not all the available resources are fully used or when the load on the model requires more resources.
Instead of setting the number of allocations manually, you can enable adaptive allocations to set the number of allocations based on the load on the process.
This can help you to manage performance and cost more easily.
(Refer to the https://cloud.elastic.co/pricing[pricing calculator] to learn more about the possible costs.)

When adaptive allocations are enabled, the number of allocations of the model is set automatically based on the current load.
When the load is high, a new model allocation is automatically created.
When the load is low, a model allocation is automatically removed.

You can enable adaptive allocations by using:

* the create inference endpoint API for {ref}/infer-service-elser.html[ELSER] and {ref}/infer-service-elasticsearch.html[E5 and models uploaded through Eland] that are used as {infer} services.
* the {ref}/start-trained-model-deployment.html[start trained model deployment] or {ref}/update-trained-model-deployment.html[update trained model deployment] APIs for trained models that are deployed on {ml} nodes.

If the new allocations fit on the current {ml} nodes, they are started immediately.
If more resource capacity is needed for creating new model allocations and {ml} autoscaling is enabled, the {ml} node is scaled up to provide enough resources for the new allocations.
The number of model allocations can be scaled down to 0.
It cannot be scaled up to more than 32 allocations unless you explicitly set a higher maximum number of allocations.
Adaptive allocations must be set up independently for each deployment and {infer} endpoint.
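
For example, the following is a minimal sketch of enabling adaptive allocations while creating an ELSER {infer} endpoint through the create inference endpoint API; the endpoint name and the allocation bounds are illustrative values, not recommendations:

[source,console]
----
PUT _inference/sparse_embedding/my-elser-endpoint <1>
{
  "service": "elser",
  "service_settings": {
    "num_threads": 1,
    "adaptive_allocations": { <2>
      "enabled": true,
      "min_number_of_allocations": 1,
      "max_number_of_allocations": 4
    }
  }
}
----
<1> `my-elser-endpoint` is a hypothetical endpoint name; choose your own.
<2> The allocation bounds are example values; with them, the number of allocations scales between 1 and 4 based on the load.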


[discrete]
[[optimize-use-case]]
=== Optimize for typical use cases by using adaptive allocations

You can optimize your model deployment for typical use cases, such as search and ingest.

When you optimize for ingest, the throughput is higher, which increases the number of {infer} requests that can be performed in parallel.
When you optimize for search, the latency is lower during search processes.

* If you want to optimize for ingest, set the number of threads to `1` (`"num_threads": 1`).
* If you want to optimize for search, set the number of threads to greater than `1`.
Increasing the number of threads makes the search processes more performant, as the sketch below shows.
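
Here is a hedged sketch of starting a search-optimized deployment on a {ml} node through the start trained model deployment API; the model ID, deployment ID, and thread count are example values:

[source,console]
----
POST _ml/trained_models/.elser_model_2/deployment/_start?deployment_id=elser_search&threads_per_allocation=4&number_of_allocations=1 <1>
----
<1> `threads_per_allocation` must be a power of 2. An ingest-optimized deployment would use `threads_per_allocation=1` instead.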


[discrete]
[[nlp-model-adaptive-resources]]
== Enabling autoscaling in {kib} - adaptive resources

You can enable adaptive resources for your models when starting or updating the model deployment.
Adaptive resources make it possible for {es} to scale up or down the available resources based on the load on the process.
This can help you to manage performance and cost more easily.
When adaptive resources are enabled, the number of vCPUs that the model deployment uses is set automatically based on the current load.
When the load is high, the number of vCPUs that the process can use is automatically increased.
When the load is low, the number of vCPUs that the process can use is automatically decreased.

You can choose from three levels of resource usage for your trained model deployment.
Refer to the tables in the <<auto-scaling-matrix>> section to find out the settings for the level you selected.

[role="screenshot"]
image::images/ml-nlp-deployment-id-elser-v2.png["ELSER deployment with adaptive resources enabled.",width=640]

[discrete]
[[auto-scaling-matrix]]
== Model deployment resource matrix

The resources that a trained model deployment uses depend on three factors:

* your cluster environment (Serverless, Cloud, or on-premises)
* the use case you optimize the model deployment for (ingest or search)
* whether adaptive resources are enabled or disabled (dynamic or static resources)

If you use {es} on-premises, the adaptive resources behavior is fully dynamic and depends heavily on the hardware configuration.

The following tables show the number of allocations, threads, and vCPUs that are available in Cloud when adaptive resources are enabled or disabled.

[discrete]
=== Deployments in Cloud optimized for ingest

For ingest-optimized deployments, the number of model allocations is maximized.

[discrete]
==== Adaptive resources enabled

[cols="4*", options="header"]
|==========
| Level | Allocations | Threads | vCPUs
| Low | 0 to 2 if available, dynamically | 1 | 0 to 2 if available, dynamically
| Medium | 1 to 32, dynamically | 1 | 1 to the smaller of 32 or the limit set in the Cloud console, dynamically
| High | 1 to the limit set in the Cloud console ^*^, dynamically | 1 | 1 to the limit set in the Cloud console, dynamically
|==========

^*^ The Cloud console doesn't directly set an allocations limit; it only sets a vCPU limit.
This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads.
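For example, assuming an illustrative vCPU limit of 16 in the Cloud console and 1 thread per allocation, the deployment can scale up to 16 / 1 = 16 allocations.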

[discrete]
==== Adaptive resources disabled

[cols="4*", options="header"]
|==========
| Level | Allocations | Threads | vCPUs
| Low | 2 if available, otherwise 1, statically | 1 | 2 if available
| Medium | the smaller of 32 or the limit set in the Cloud console, statically | 1 | 32 if available
| High | Maximum available set in the Cloud console ^*^, statically | 1 | Maximum available set in the Cloud console, statically
|==========

^*^ The Cloud console doesn't directly set an allocations limit; it only sets a vCPU limit.
This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads.

[discrete]
=== Deployments in Cloud optimized for search

For search-optimized deployments, the number of threads is maximized.
The maximum number of threads that can be claimed depends on your hardware configuration.

[discrete]
==== Adaptive resources enabled

[cols="4*", options="header"]
|==========
| Level | Allocations | Threads | vCPUs
| Low | 1 | 2 | 2
| Medium | 1 to 2 (if threads=16), dynamically | maximum that the hardware allows (for example, 16) | 1 to 32, dynamically
| High | 1 to the limit set in the Cloud console ^*^, dynamically | maximum that the hardware allows (for example, 16) | 1 to the limit set in the Cloud console, dynamically
|==========

^*^ The Cloud console doesn't directly set an allocations limit; it only sets a vCPU limit.
This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads.

[discrete]
==== Adaptive resources disabled

[cols="4*", options="header"]
|==========
| Level | Allocations | Threads | vCPUs
| Low | 1 if available, statically | 2 | 2 if available
| Medium | 2 (if threads=16), statically | maximum that the hardware allows (for example, 16) | 32 if available
| High | Maximum available set in the Cloud console ^*^, statically | maximum that the hardware allows (for example, 16) | Maximum available set in the Cloud console, statically
|==========

^*^ The Cloud console doesn't directly set an allocations limit; it only sets a vCPU limit.
This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads.