Commit 91d4dce

Freshness updates.
1 parent 8330c33 commit 91d4dce

File tree

1 file changed

+61
-40
lines changed

Lines changed: 61 additions & 40 deletions
@@ -1,50 +1,51 @@
 ---
 title: Manage and optimize costs
 titleSuffix: Azure Machine Learning
-description: Learn tips to optimize your cost when building machine learning models in Azure Machine Learning
+description: Use these tips to optimize your cost when you build machine learning models in Azure Machine Learning.
 ms.reviewer: None
 author: ssalgadodev
 ms.author: ssalgado
 ms.custom: subject-cost-optimization
 ms.service: azure-machine-learning
 ms.subservice: core
 ms.topic: how-to
-ms.date: 08/06/2024
+ms.date: 09/02/2024
+#customer intent: As a data scientist or engineer, I want to optimize my cost for training machine learning models.
 ---
 
 # Manage and optimize Azure Machine Learning costs
 
-This article shows you how to manage and optimize costs when training and deploying machine learning models to Azure Machine Learning.
+This article shows you how to manage and optimize costs when you train and deploy machine learning models to Azure Machine Learning.
 
 Use the following tips to help you manage and optimize your compute resource costs.
 
+- Use the Azure Machine Learning compute cluster
 - Configure your training clusters for autoscaling
 - Configure your managed online endpoints for autoscaling
 - Set quotas on your subscription and workspaces
 - Set termination policies on your training job
 - Use low-priority virtual machines (VM)
 - Schedule compute instances to shut down and start up automatically
 - Use an Azure Reserved VM Instance
-- Train locally
 - Parallelize training
 - Set data retention and deletion policies
 - Deploy resources to the same region
-- Delete failed deployments if computes are created for them
+- Delete failed deployments
 
-For information on planning and monitoring costs, see the [plan to manage costs for Azure Machine Learning](concept-plan-manage-cost.md) guide.
+For information on planning and monitoring costs, see [Plan to manage costs for Azure Machine Learning](concept-plan-manage-cost.md).
 
 > [!IMPORTANT]
 > Items marked (preview) in this article are currently in public preview.
-> The preview version is provided without a service level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities.
+> The preview version is provided without a service level agreement. We don't recommend preview versions for production workloads. Certain features might not be supported or might have constrained capabilities.
 > For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).
 
-## Use Azure Machine Learning compute cluster (AmlCompute)
+## Use the Azure Machine Learning compute cluster
 
-With constantly changing data, you need fast and streamlined model training and retraining to maintain accurate models. However, continuous training comes at a cost, especially for deep learning models on GPUs.
+With constantly changing data, you need fast and streamlined model training and retraining to maintain accurate models. However, continuous training comes at a cost, especially for deep learning models on GPUs.
 
-Azure Machine Learning users can use the managed Azure Machine Learning compute cluster, also called AmlCompute. AmlCompute supports various GPU and CPU options. The AmlCompute is internally hosted on behalf of your subscription by Azure Machine Learning. It provides the same enterprise grade security, compliance, and governance at Azure IaaS cloud scale.
+Azure Machine Learning users can use the managed Azure Machine Learning compute cluster, also called *AmlCompute*. AmlCompute supports various GPU and CPU options. It's hosted internally on behalf of your subscription by Azure Machine Learning and provides the same enterprise-grade security, compliance, and governance at Azure IaaS cloud scale.
 
-Because these compute pools are inside of Azure's IaaS infrastructure, you can deploy, scale, and manage your training with the same security and compliance requirements as the rest of your infrastructure. These deployments occur in your subscription and obey your governance rules. Learn more about [Azure Machine Learning compute](how-to-create-attach-compute-cluster.md).
+Because these compute pools are inside of Azure's IaaS infrastructure, you can deploy, scale, and manage your training with the same security and compliance requirements as the rest of your infrastructure. These deployments occur in your subscription and obey your governance rules. For more information, see [Plan to manage costs for Azure Machine Learning](concept-plan-manage-cost.md).
 
 ## Configure training clusters for autoscaling
 
@@ -56,73 +57,93 @@ AmlCompute clusters are designed to scale dynamically based on your workload. Th
 
 You can also configure the amount of time the node is idle before scale down. By default, idle time before scale down is set to 120 seconds.
 
-+ If you perform less iterative experimentation, reduce this time to save costs.
-+ If you perform highly iterative dev/test experimentation, you might need to increase the time so you aren't paying for constant scaling up and down after each change to your training script or environment.
+- If you perform less iterative experimentation, reduce this time to save costs.
+- If you perform highly iterative dev/test experimentation, you might need to increase the time so that you don't pay for constant scaling up and down after each change to your training script or environment.
 
-AmlCompute clusters can be configured for your changing workload requirements in Azure portal, using the [AmlCompute SDK class](/python/api/azure-ai-ml/azure.ai.ml.entities.amlcompute), [AmlCompute CLI](/cli/azure/ml/compute#az-ml-compute-create), with the [REST APIs](https://github.com/Azure/azure-rest-api-specs/tree/master/specification/machinelearningservices/resource-manager/Microsoft.MachineLearningServices/stable).
+You can configure AmlCompute clusters for your changing workload requirements by using:
 
-## Configure your managed online endpoints for autoscaling
+- The Azure portal
+- The [AmlCompute SDK class](/python/api/azure-ai-ml/azure.ai.ml.entities.amlcompute)
+- The [AmlCompute CLI](/cli/azure/ml/compute#az-ml-compute-create)
+- The [REST APIs](https://github.com/Azure/azure-rest-api-specs/tree/master/specification/machinelearningservices/resource-manager/Microsoft.MachineLearningServices/stable)
 
-Autoscale automatically runs the right amount of resources to handle the load on your application. [Managed online endpoints](concept-endpoints-online.md) supports autoscaling through integration with the Azure Monitor autoscale feature.
+## Configure managed online endpoints for autoscaling
 
-Azure Monitor autoscaling supports a rich set of rules. You can configure metrics-based scaling (for instance, CPU utilization >70%), schedule-based scaling (for example, scaling rules for peak business hours), or a combination. For more information, see [Autoscale online endpoints](how-to-autoscale-endpoints.md).
+Autoscale automatically runs the right amount of resources to handle the load on your application. Managed online endpoints support autoscaling through integration with the Azure Monitor autoscale feature. For more information, see [Online endpoints and deployments for real-time inference](concept-endpoints-online.md).
+
+Azure Monitor autoscaling supports a rich set of rules:
+
+- Metrics-based scaling, for instance, CPU utilization >70%
+- Schedule-based scaling, for example, scaling rules for peak business hours
+- A combination of the two
+
+For more information, see [Autoscale online endpoints](how-to-autoscale-endpoints.md).
 
 ## Set quotas on resources
 
-AmlCompute comes with a [quota (or limit) configuration](how-to-manage-quotas.md#azure-machine-learning-compute). This quota is by VM family (for example, Dv2 series, NCv3 series) and varies by region for each subscription. Subscriptions start with small defaults to get you going, but use this setting to control the amount of Amlcompute resources available to be spun up in your subscription.
+AmlCompute comes with a quota, or limit, configuration. This quota is by VM family, for example, Dv2 series or NCv3 series, and varies by region for each subscription. Subscriptions start with small defaults. Use this setting to control the amount of AmlCompute resources available to be spun up in your subscription. For more information, see [Azure Machine Learning Compute](how-to-manage-quotas.md#azure-machine-learning-compute).
+
+You can also configure workspace-level quota by VM family for each workspace within a subscription. This approach gives you more granular control over the costs that each workspace might incur and restricts certain VM families. For more information, see [Workspace-level quotas](how-to-manage-quotas.md#workspace-level-quotas).
 
-Also configure [workspace level quota by VM family](how-to-manage-quotas.md#workspace-level-quotas), for each workspace within a subscription. Doing so allows you to have more granular control on the costs that each workspace might potentially incur and restrict certain VM families.
+To set quotas at the workspace level:
 
-To set quotas at the workspace level, start in the [Azure portal](https://portal.azure.com). Select any workspace in your subscription, and select **Usages + quotas** in the left pane. Then select the **Configure quotas** tab to view the quotas. You need privileges at the subscription scope to set the quota, since it's a setting that affects multiple workspaces.
+1. Open the [Azure portal](https://portal.azure.com) and then select any workspace in your subscription.
+1. Select **Support + troubleshooting** > **Usages + quotas** in the workspace menu.
+1. Select **View quota** to view quotas in Azure Machine Learning studio.
+1. On that page, find your subscription and region to set quotas.
 
-## Set job autotermination policies
+Because this setting affects multiple workspaces, you need privileges at the subscription scope to set the quota.
 
-In some cases, you should configure your training runs to limit their duration or terminate them early. For example, when you're using Azure Machine Learning's built-in hyperparameter tuning or automated machine learning.
+## Set job termination policies
+
+In some cases, you should configure your training runs to limit their duration or terminate them early. For example, when you use Azure Machine Learning's built-in hyperparameter tuning or automated machine learning.
 
 Here are a few options that you have:
-* Define a parameter called `max_run_duration_seconds` in your RunConfiguration to control the maximum duration a run can extend to on the compute you choose (either local or remote cloud compute).
-* For [hyperparameter tuning](how-to-tune-hyperparameters.md#early-termination), define an early termination policy from a Bandit policy, a Median stopping policy, or a Truncation selection policy. To further control hyperparameter sweeps, use parameters such as `max_total_runs` or `max_duration_minutes`.
-* For [automated machine learning](how-to-configure-auto-train.md#exit-criteria), set similar termination policies using the `enable_early_stopping` flag. Also use properties such as `iteration_timeout_minutes` and `experiment_timeout_minutes` to control the maximum duration of a job or for the entire experiment.
 
-## <a id="low-pri-vm"></a> Use low-priority VMs
+- Define a parameter called `max_run_duration_seconds` in your RunConfiguration to control the maximum duration a run can extend to on the compute you choose, either local or remote cloud compute.
+- For *hyperparameter tuning*, define an early termination policy from a Bandit policy, a Median stopping policy, or a Truncation selection policy. To further control hyperparameter sweeps, use parameters such as `max_total_runs` or `max_duration_minutes`. For more information, see [Specify early termination policy](how-to-tune-hyperparameters.md#early-termination).
+- For automated machine learning, set similar termination policies using the `enable_early_stopping` flag. Also use properties such as `iteration_timeout_minutes` and `experiment_timeout_minutes` to control the maximum duration of a job or of the entire experiment. For more information, see [Exit criteria](how-to-configure-auto-train.md#exit-criteria).
+
+## <a id="low-pri-vm"></a>Use low-priority virtual machines
 
-Azure allows you to use excess unutilized capacity as Low-Priority VMs across virtual machine scale sets, Batch, and the Machine Learning service. These allocations are pre-emptible but come at a reduced price compared to dedicated VMs. In general, we recommend using Low-Priority VMs for Batch workloads. You should also use them where interruptions are recoverable either through resubmits (for Batch Inferencing) or through restarts (for deep learning training with checkpointing).
+Azure allows you to use excess unused capacity as Low-Priority VMs across virtual machine scale sets, Batch, and the Machine Learning service. These allocations are preemptible but come at a reduced price compared to dedicated VMs. In general, we recommend that you use Low-Priority VMs for Batch workloads. You should also use them where interruptions are recoverable either through resubmits for Batch Inferencing or through restarts for deep learning training with checkpointing.
 
-Low-Priority VMs have a single quota separate from the dedicated quota value, which is by VM family. Learn [more about AmlCompute quotas](how-to-manage-quotas.md).
+Low-Priority VMs have a single quota separate from the dedicated quota value, which is by VM family. For more information about AmlCompute quotas, see [Manage and increase quotas](how-to-manage-quotas.md).
 
-Low-Priority VMs don't work for compute instances, since they need to support interactive notebook experiences.
+Low-Priority VMs don't work for compute instances, since they need to support interactive notebook experiences.
 
 ## Schedule compute instances
 
 When you create a [compute instance](concept-compute-instance.md), the VM stays on so it's available for your work.
-* [Enable idle shutdown (preview)](how-to-create-compute-instance.md#configure-idle-shutdown) to save on cost when the VM has been idle for a specified time period.
-* Or [set up a schedule](how-to-create-compute-instance.md#schedule-automatic-start-and-stop) to automatically start and stop the compute instance (preview) to save cost when you aren't planning to use it.
+
+- Enable idle shutdown (preview) to save on cost when the VM is idle for a specified time period. See [Configure idle shutdown](how-to-create-compute-instance.md#configure-idle-shutdown).
+- Set up a schedule to automatically start and stop the compute instance (preview) when not in use to save cost. See [Schedule automatic start and stop](how-to-create-compute-instance.md#schedule-automatic-start-and-stop).
 
 ## Use reserved instances
 
 Another way to save money on compute resources is Azure Reserved VM Instance. With this offering, you commit to one-year or three-year terms. These discounts range up to 72% of the pay-as-you-go prices and are applied directly to your monthly Azure bill.
 
-Azure Machine Learning Compute supports reserved instances inherently. If you purchase a one-year or three-year reserved instance, we'll automatically apply discount against your Azure Machine Learning managed compute.
+Azure Machine Learning Compute supports reserved instances inherently. If you purchase a one-year or three-year reserved instance, we automatically apply the discount against your Azure Machine Learning managed compute.
 
 ## Parallelize training
 
-One of the key methods of optimizing cost and performance is by parallelizing the workload with the help of a parallel component in Azure Machine Learning. A parallel component allows you to use many smaller nodes to execute the task in parallel, hence allowing you to scale horizontally. There's an overhead for parallelization. Depending on the workload and the degree of parallelism that can be achieved, this may or may not be an option. For more details, follow this link for [ParallelComponent](/python/api/azure-ai-ml/azure.ai.ml.entities.parallelcomponent) documentation.
+One of the key methods to optimize cost and performance is to parallelize the workload with the help of a parallel component in Azure Machine Learning. A parallel component allows you to use many smaller nodes to run the task in parallel, which allows you to scale horizontally. There's an overhead for parallelization. Depending on the workload and the degree of parallelism that can be achieved, this approach might or might not be an option. For more information, see [ParallelComponent Class](/python/api/azure-ai-ml/azure.ai.ml.entities.parallelcomponent).
 
-## Set data retention & deletion policies
+## Set data retention and deletion policies
 
-Every time a pipeline is executed, intermediate datasets are generated at each step. Over time, these intermediate datasets take up space in your storage account. Consider setting up policies to manage your data throughout its lifecycle to archive and delete your datasets. For more information, see [optimize costs by automating Azure Blob Storage access tiers](/azure/storage/blobs/lifecycle-management-overview).
+Every time a pipeline runs, intermediate datasets are generated at each step. Over time, these intermediate datasets take up space in your storage account. Consider setting up policies to manage your data throughout its lifecycle to archive and delete your datasets. For more information, see [Optimize costs by automatically managing the data lifecycle](/azure/storage/blobs/lifecycle-management-overview).
 
 ## Deploy resources to the same region
 
-Computes located in different regions may experience network latency and increased data transfer costs. Azure network costs are incurred from outbound bandwidth from Azure data centers. To help reduce network costs, deploy all your resources in the region. Provisioning your Azure Machine Learning workspace and dependent resources in the same region as your data can help lower cost and improve performance.
+Computes located in different regions can experience network latency and increased data transfer costs. Azure network costs are incurred from outbound bandwidth from Azure data centers. To help reduce network costs, deploy all your resources in the same region. Provisioning your Azure Machine Learning workspace and dependent resources in the same region as your data can help lower cost and improve performance.
 
-For hybrid cloud scenarios like those using ExpressRoute, it can sometimes be more cost effective to move all resources to Azure to optimize network costs and latency.
+For hybrid cloud scenarios like those that use Azure ExpressRoute, it can sometimes be more cost effective to move all resources to Azure to optimize network costs and latency.
 
-## Delete failed deployments if computes are created for them
+## Delete failed deployments
 
-Managed online endpoint uses VMs for the deployments. If you submitted request to create an online deployment and it failed, it may have passed the stage when compute is created. In that case, the failed deployment would incur charges. If you finished debugging or investigation for the failure, you may delete the failed deployments to save the cost.
+Managed online endpoints use VMs for deployments. If you submit a request to create an online deployment and it fails, the request might have passed the stage where compute is created. In that case, the failed deployment incurs charges. When you finish debugging or investigating the failure, delete the failed deployments to save costs.
 
-## Next steps
+## Related content
 
 - [Plan to manage costs for Azure Machine Learning](concept-plan-manage-cost.md)
 - [Manage budgets, costs, and quota for Azure Machine Learning at organizational scale](/azure/cloud-adoption-framework/ready/azure-best-practices/optimize-ai-machine-learning-cost)
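Editor's note on the early-termination policies mentioned in the job-termination section of this diff: the idea behind a Median stopping policy can be sketched in plain Python. This is a hedged illustration of the concept only, not the Azure Machine Learning sweep-job implementation; the `should_stop` function, its signature, and the "higher is better" metric convention are all hypothetical.

```python
from statistics import median

def should_stop(run_metrics, other_runs_metrics):
    """Toy median stopping rule: stop a run when its best metric so far is
    worse than the median of the other runs' running averages at the same
    evaluation step.

    run_metrics: this run's metric history (higher is better).
    other_runs_metrics: metric histories of the other runs in the sweep.
    """
    step = len(run_metrics)
    if step == 0:
        return False
    # Running average, at the same step, of every run that got that far.
    averages = [
        sum(m[:step]) / step
        for m in other_runs_metrics
        if len(m) >= step
    ]
    if not averages:
        return False  # nothing to compare against yet
    return max(run_metrics) < median(averages)
```

A lagging run whose metrics trail the sweep's median running average gets cut early, which is how such policies stop paying compute for unpromising hyperparameter configurations while letting competitive runs continue.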
