Adds information about cooldown periods for trained model autoscaling in Serverless (#2498)

kosabogi · web-flow · commit 1eb144caa6c8 · 2025-08-25T14:13:25.000+02:00
This PR adds information about cooldown periods for trained model autoscaling in serverless projects.

### Changes

- [Autoscaling]([deploy-manage/autoscaling.md](https://docs-v3-preview.elastic.dev/elastic/docs-content/pull/2498/deploy-manage/autoscaling))
- [Trained model autoscaling]([deploy-manage/autoscaling/trained-model-autoscaling.md](https://docs-v3-preview.elastic.dev/elastic/docs-content/pull/2498/deploy-manage/autoscaling/trained-model-autoscaling))
- [Elasticsearch billing dimensions]([deploy-manage/cloud-organization/billing/elasticsearch-billing-dimensions.md](https://docs-v3-preview.elastic.dev/elastic/docs-content/pull/2498/deploy-manage/cloud-organization/billing/elasticsearch-billing-dimensions))

Related issue: elastic/docs-content-internal#177
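
The new cooldown section names the `xpack.ml.trained_models.adaptive_allocations.scale_to_zero_time` cluster setting for {{ech}}, {{eck}}, and {{ece}} deployments. As a rough illustration only, the Python sketch below builds a request body for the cluster update settings API (`PUT _cluster/settings`). The helper name `scale_to_zero_settings_body` is hypothetical, and the sketch assumes the setting can be updated dynamically as a persistent cluster setting, which this PR does not state. The 1-minute minimum comes from the added docs text.

```python
import json

# Setting name as written in the docs text added by this PR.
SETTING = "xpack.ml.trained_models.adaptive_allocations.scale_to_zero_time"
MIN_MINUTES = 1  # minimum stated in the added docs text


def scale_to_zero_settings_body(minutes: int) -> dict:
    """Build a body for PUT _cluster/settings (hypothetical helper).

    Assumption: the setting is dynamically updatable as a persistent
    cluster setting. Check the Elasticsearch settings reference first.
    """
    if minutes < MIN_MINUTES:
        raise ValueError(f"{SETTING} must be at least {MIN_MINUTES} minute")
    # Elasticsearch time values take a unit suffix, for example "m" for minutes.
    return {"persistent": {SETTING: f"{minutes}m"}}


if __name__ == "__main__":
    # Print the request that could be sent from Kibana Dev Tools or an HTTP client.
    print("PUT _cluster/settings")
    print(json.dumps(scale_to_zero_settings_body(30), indent=2))
```

This does not apply to {{serverless-short}} projects, where the added text says the period is fixed.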
diff --git a/deploy-manage/autoscaling.md b/deploy-manage/autoscaling.md
@@ -39,7 +39,9 @@ Cluster autoscaling supports:
 The available resources of self-managed deployments are static, so trained model autoscaling is not applicable. However, available resources are still segmented based on the settings described in this section.
 :::
 
-Trained model autoscaling automatically adjusts the resources allocated to trained model deployments based on demand. This feature is available on all cloud deployments (ECE, ECK, ECH) and {{serverless-short}}. See [Trained model autoscaling](/deploy-manage/autoscaling/trained-model-autoscaling.md) for details.
+Trained model autoscaling automatically adjusts the resources allocated to trained model deployments based on demand. This feature is available on all cloud deployments (ECE, ECK, ECH) and {{serverless-short}}. Refer to [Trained model autoscaling](/deploy-manage/autoscaling/trained-model-autoscaling.md) for details. 
+
+To ensure availability and avoid unnecessary scaling, trained model deployments operate with defined [cooldown periods](/deploy-manage/autoscaling/trained-model-autoscaling.md#cooldown-periods).
 
 Trained model autoscaling supports:
 * Scaling trained model deployments
diff --git a/deploy-manage/autoscaling/trained-model-autoscaling.md b/deploy-manage/autoscaling/trained-model-autoscaling.md
@@ -22,7 +22,7 @@ There are two ways to enable autoscaling:
 * through APIs by enabling adaptive allocations
 * in {{kib}} by enabling adaptive resources
 
-For {{serverless-short}} projects, trained model autoscaling is automatically enabled and cannot be disabled.
+For {{serverless-short}} projects, trained model autoscaling is always enabled and cannot be turned off. 
 
 ::::{important}
 To fully leverage model autoscaling in {{ech}}, {{ece}}, and {{eck}}, it is highly recommended to enable [{{es}} deployment autoscaling](../../deploy-manage/autoscaling.md).
@@ -36,6 +36,16 @@ The available resources of self-managed deployments are static, so trained model
 
 {{serverless-full}} Security and Observability projects are only charged for data ingestion and retention. They are not charged for processing power (VCU usage), which is used for more complex operations, like running advanced search models. For example, in Search projects, models such as ELSER require significant processing power to provide more accurate search results.
 
+## Cooldown periods [cooldown-periods]
+
+Trained model deployments remain active for 24 hours after the last inference request. After that, they scale down to zero. When scaled up again, they stay active for 5 minutes before they can scale down. These cooldown periods prevent unnecessary scaling and ensure models are available when needed.
+
+::::{important}
+During these cooldown periods, you will continue to be billed for the active resources.
+::::
+
+For {{ech}}, {{eck}}, and {{ece}} deployments, you can change how long a deployment waits before it scales down to zero with the `xpack.ml.trained_models.adaptive_allocations.scale_to_zero_time` cluster setting (minimum 1 minute). For {{serverless-short}} projects, this period is fixed and cannot be changed.
+
 ## Enabling autoscaling through APIs - adaptive allocations [enabling-autoscaling-through-apis-adaptive-allocations]
 
 $$$nlp-model-adaptive-resources$$$
diff --git a/deploy-manage/cloud-organization/billing/elasticsearch-billing-dimensions.md b/deploy-manage/cloud-organization/billing/elasticsearch-billing-dimensions.md
@@ -44,6 +44,9 @@ You can control costs using the following strategies:
 
 * **Search Power setting:** [Search Power](../../deploy/elastic-cloud/project-settings.md#elasticsearch-manage-project-search-power-settings) controls the speed of searches against your data. With Search Power, you can improve search performance by adding more resources for querying, or you can reduce provisioned resources to cut costs.
 * **Search boost window**: By limiting the number of days of [time series data](../../../solutions/search/ingest-for-search.md#elasticsearch-ingest-time-series-data) that are available for caching, you can reduce the number of search VCUs required.
+* **Machine learning trained model autoscaling:** [Trained model autoscaling](/deploy-manage/autoscaling/trained-model-autoscaling.md) is always enabled and cannot be disabled, ensuring efficient resource usage, reduced costs, and optimal performance without manual configuration.
+
+  Trained model deployments automatically scale down to zero allocations after 24 hours without any inference requests. When they scale up again, they remain active for 5 minutes before they can scale down. During these cooldown periods, you will continue to be billed for the active resources.
  
 * **Indexing Strategies:** Consider your indexing strategies and how they might impact overall VCU usage and costs: