You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: articles/machine-learning/concept-plan-manage-cost.md
+19-16Lines changed: 19 additions & 16 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -44,7 +44,7 @@ The following screenshot shows the cost estimation by using the calculator:
44
44
45
45
As you add new resources to your workspace, return to this calculator and add the same resource here to update your cost estimates.
46
46
47
-
While the Enterprise edition is in preview, there is no ML surcharge. When the Enterprise edition becomes generally available, it will have a machine learning surcharge (for training and inferencing). For more details,[Azure Machine Learning pricing](https://azure.microsoft.com/pricing/details/machine-learning/).
47
+
While the Enterprise edition is in preview, there is no ML surcharge. When the Enterprise edition becomes generally available, it will have a surcharge (for training and inferencing). For more information, see[Azure Machine Learning pricing](https://azure.microsoft.com/pricing/details/machine-learning/).
48
48
49
49
## Get cost alerts
50
50
@@ -54,27 +54,30 @@ Create [budgets](../cost-management/tutorial-acm-create-budgets.md) to manage co
54
54
55
55
As you use resources with Azure Machine Learning, you incur costs. Azure resource usage unit costs vary by time intervals (seconds, minutes, hours, and days) or by request unit usage. As soon as usage of Azure Machine Learning starts, costs are incurred. View these costs in the [cost analysis](../cost-management/quick-acm-cost-analysis.md) pane in the Azure portal.
56
56
57
-
View costs in graphs and tables for different time intervals. Some examples are by day, current, prior month, and year. Also view costs against budgets and forecasted costs. Switching to longer views over time helps you identify spending trends and see where overspending might have occurred. If you've created budgets, see where they exceeded.
57
+
You can view costs in graphs and tables for different time intervals. You can also view costs against budgets and forecasted costs. Switching to longer views over time helps identify spending trends and see where overspending might have occurred. If you've created budgets, see where they exceeded.
58
58
59
59
You won't see a separate service area for Machine Learning. Instead you'll see the various resources you've added to your Machine Learning workspaces.
60
60
61
61
## Use Azure Machine Learning compute cluster (AmlCompute)
62
62
63
63
With constantly changing data, you need fast and streamlined model training and retraining to maintain accurate models. However, continuous training comes at a cost, especially for deep learning models on GPUs.
64
64
65
-
Azure Machine Learning users can use the managed Azure Machine Learning compute cluster, also called AmlCompute. AmlCompute supports a variety of GPU and CPU options. The AmlCompute is internally hosted on behalf of your subscription by Azure Machine Learning, but provides the same enterprise grade security, compliance and governance at Azure IaaS cloud scale.
65
+
Azure Machine Learning users can use the managed Azure Machine Learning compute cluster, also called AmlCompute. AmlCompute supports a variety of GPU and CPU options. The AmlCompute is internally hosted on behalf of your subscription by Azure Machine Learning. It provides the same enterprise grade security, compliance and governance at Azure IaaS cloud scale.
66
66
67
67
Because these compute pools are inside of Azure's IaaS infrastructure, you can deploy, scale, and manage your training with the same security and compliance requirements as the rest of your infrastructure. These deployments occur in your subscription and obey your governance rules. Learn more about [Azure Machine Learning Compute](how-to-set-up-training-targets.md#amlcompute).
68
68
69
69
## Configure training clusters for autoscaling
70
70
71
-
Autoscaling clusters based on the requirements of your workload helps reduce your costs so you only use what you need.
72
-
AmlCompute clusters are designed to autoscale dynamically based on the requirements of your workload. The cluster can be scaled up to the maximum number of nodes provisioned and within the quota designated for the subscription. As each run completes, the cluster will release nodes and autoscale to your designated minimum node count.
71
+
Autoscaling clusters based on the requirements of your workload helps reduce your costs so you only use what you need.
73
72
74
-
In addition to setting the minimum and maximum number of nodes, tweak the amount of time the node is idle before scale down. By default, idle time before scale down is set to 120 seconds.
73
+
AmlCompute clusters are designed to scale dynamically based on your workload. The cluster can be scaled up to the maximum number of nodes you configure. As each run completes, the cluster will release nodes and scale to your configured minimum node count.
You can also configure the amount of time the node is idle before scale down. By default, idle time before scale down is set to 120 seconds.
75
78
76
79
+ If you perform less iterative experimentation, reduce this time to save costs.
77
-
+ If you perform highly iterative dev/test experimentation, you might need to increase this so you aren't paying for constant scaling up and down after each change to your training script or environment.
80
+
+ If you perform highly iterative dev/test experimentation, you might need to increase the time so you aren't paying for constant scaling up and down after each change to your training script or environment.
78
81
79
82
AmlCompute clusters can be configured for your changing workload requirements in Azure portal, using the [AmlCompute SDK class](https://docs.microsoft.com/python/api/azureml-core/azureml.core.compute.amlcompute.amlcompute?view=azure-ml-py), [AmlCompute CLI](https://docs.microsoft.com/cli/azure/ext/azure-cli-ml/ml/computetarget/create?view=azure-cli-latest#ext-azure-cli-ml-az-ml-computetarget-create-amlcompute), with the [REST APIs](https://github.com/Azure/azure-rest-api-specs/tree/master/specification/machinelearningservices/resource-manager/Microsoft.MachineLearningServices/stable).
80
83
@@ -84,24 +87,24 @@ az ml computetarget create amlcompute --name testcluster --vm-size Standard_NC6
84
87
85
88
## Set quotas on resources
86
89
87
-
Much like other Azure compute resources, AmlCompute comes with an inherent[quota (or limit) configuration](how-to-manage-quotas.md#azure-machine-learning-compute). This quota is by VM family (for example, Dv2 series, NCv3 series) and varies by region for each subscription. Subscriptions start with small defaults to get you going, but use this setting to control the amount of Amlcompute resources available to be spun up in your subscription.
90
+
AmlCompute comes with a[quota (or limit) configuration](how-to-manage-quotas.md#azure-machine-learning-compute). This quota is by VM family (for example, Dv2 series, NCv3 series) and varies by region for each subscription. Subscriptions start with small defaults to get you going, but use this setting to control the amount of Amlcompute resources available to be spun up in your subscription.
88
91
89
-
Also configure [workspace level quota by VM family](how-to-manage-quotas.md#workspace-level-quota), for each workspace within a subscription. This allows you to have more granular control on the costs that each workspace might potentially incur and restrict certain VM families.
92
+
Also configure [workspace level quota by VM family](how-to-manage-quotas.md#workspace-level-quota), for each workspace within a subscription. Doing so allows you to have more granular control on the costs that each workspace might potentially incur and restrict certain VM families.
90
93
91
-
To set quotas at the workspace level, start in the [Azure portal](https://portal.azure.com). Select any workspace in your subscription, and select **Usages + quotas** in the left pane. Then select the **Configure quotas** tab to view the quotas. You need privileges at the subscription scope to set this quota, since it's a setting that affects multiple workspaces.
94
+
To set quotas at the workspace level, start in the [Azure portal](https://portal.azure.com). Select any workspace in your subscription, and select **Usages + quotas** in the left pane. Then select the **Configure quotas** tab to view the quotas. You need privileges at the subscription scope to set the quota, since it's a setting that affects multiple workspaces.
92
95
93
-
## Set run auto-termination policies
96
+
## Set run autotermination policies
94
97
95
-
Configure your training runs to limit their duration or to terminate them early in case of certain conditions especially when you are using Azure Machine Learning's built-in Hyperparameter Tuning or Automated Machine Learning capabilities.
98
+
In some cases, you should configure your training runs to limit their duration or terminate them early. For example, when you are using Azure Machine Learning's built-in hyperparameter tuning or automated machine learning.
96
99
97
100
Here are a few options that you have:
98
101
* Define a parameter called `max_run_duration_seconds` in your RunConfiguration to control the maximum duration a run can extend to on the compute you choose (either local or remote cloud compute).
99
-
* For [hyperparameter tuning](how-to-tune-hyperparameters.md#early-termination), define an early termination policy from a Bandit policy, a Median stopping policy or a Truncation selection policy. In addition, also use parameters such as `max_total_runs` or `max_duration_minutes` to further control the various hyperparameter sweeps.
102
+
* For [hyperparameter tuning](how-to-tune-hyperparameters.md#early-termination), define an early termination policy from a Bandit policy, a Median stopping policy, or a Truncation selection policy. To further control hyperparameter sweeps, use parameters such as `max_total_runs` or `max_duration_minutes`.
100
103
* For [automated machine learning](how-to-configure-auto-train.md#exit), set similar termination policies using the `enable_early_stopping` flag. Also use properties such as `iteration_timeout_minutes` and `experiment_timeout_minutes` to control the maximum duration of a run or for the entire experiment.
101
104
102
105
## Use low-priority VMs
103
106
104
-
Azure allows you to use excess unutilized capacity as Low-Priority VMs across virtual machine scale sets, Batch, and the Machine Learning service. These allocations are pre-emptible but come at a reduced price compared to dedicated VMs. In general, we recommend using Low-Priority VMs for Batch workloads or where interruptions are recoverable either through resubmits (for Batch Inferencing) or through restarts (for deep learning training with checkpointing).
107
+
Azure allows you to use excess unutilized capacity as Low-Priority VMs across virtual machine scale sets, Batch, and the Machine Learning service. These allocations are pre-emptible but come at a reduced price compared to dedicated VMs. In general, we recommend using Low-Priority VMs for Batch workloads. You should also use them where interruptions are recoverable either through resubmits (for Batch Inferencing) or through restarts (for deep learning training with checkpointing).
105
108
106
109
Low-Priority VMs have a single quota separate from the dedicated quota value, which is by VM family. Learn [more about AmlCompute quotas](how-to-manage-quotas.md).
107
110
@@ -127,9 +130,9 @@ Set the priority of your VM in any of these ways:
127
130
128
131
## Use reserved instances
129
132
130
-
Azure Reserved VM Instance provides another way to get huge savings on compute resources by committing to one-year or three-year terms. These discounts range up to 72% of the pay-as-you-go prices and are applied directly to your monthly Azure bill.
133
+
Another way to save money on compute resources is Azure Reserved VM Instance. With this offering, you commit to one-year or three-year terms. These discounts range up to 72% of the pay-as-you-go prices and are applied directly to your monthly Azure bill.
131
134
132
-
Azure Machine Learning Compute supports reserved instances inherently. So ifyou have purchased a one-year or three-year reserved instance we will automatically apply that reserved instance discount against the managed compute that is used within Azure Machine Learning without requiring any additional setup from your end.
135
+
Azure Machine Learning Compute supports reserved instances inherently. If you purchase a one-year or three-year reserved instance, we will automatically apply discount against your Azure Machine Learning managed compute.
With Azure Machine Learning, you can train your model on a variety of resources or environments, collectively referred to as [__compute targets__](concept-azure-machine-learning-architecture.md#compute-targets). A compute target can be a local machine or a cloud resource, such as an Azure Machine Learning Compute, Azure HDInsight or a remote virtual machine. You can also create compute targets for model deployment as described in ["Where and how to deploy your models"](how-to-deploy-and-where.md).
18
+
With Azure Machine Learning, you can train your model on a variety of resources or environments, collectively referred to as [__compute targets__](concept-azure-machine-learning-architecture.md#compute-targets). A compute target can be a local machine or a cloud resource, such as an Azure Machine Learning Compute, Azure HDInsight, or a remote virtual machine. You can also create compute targets for model deployment as described in ["Where and how to deploy your models"](how-to-deploy-and-where.md).
19
19
20
-
You can create and manage a compute target using the Azure Machine Learning SDK, Azure Machine Learning studio, Azure CLI or Azure Machine Learning VS Code extension. If you have compute targets that were created through another service (for example, an HDInsight cluster), you can use them by attaching them to your Azure Machine Learning workspace.
20
+
You can create and manage a compute target using the Azure Machine Learning SDK, Azure Machine Learning studio, Azure CLI, or Azure Machine Learning VS Code extension. If you have compute targets that were created through another service (for example, an HDInsight cluster), you can use them by attaching them to your Azure Machine Learning workspace.
21
21
22
22
In this article, you learn how to use various compute targets for model training. The steps for all compute targets follow the same workflow:
23
23
1.__Create__ a compute target if you don't already have one.
@@ -30,7 +30,7 @@ In this article, you learn how to use various compute targets for model training
30
30
31
31
## Compute targets for training
32
32
33
-
Azure Machine Learning has varying support across different compute targets. A typical model development lifecycle starts with dev/experimentation on a small amount of data. At this stage, we recommend using a local environment. For example, your local computer or a cloud-based VM. As you scale up your training on larger data sets, or do distributed training, we recommend using Azure Machine Learning Compute to create a single- or multi-node cluster that autoscales each time you submit a run. You can also attach your own compute resource, although support for various scenarios may vary as detailed below:
33
+
Azure Machine Learning has varying support across different compute targets. A typical model development lifecycle starts with dev/experimentation on a small amount of data. At this stage, we recommend using a local environment. For example, your local computer or a cloud-based VM. As you scale up your training on larger data sets, or perform distributed training, we recommend using Azure Machine Learning Compute to create a single- or multi-node cluster that autoscales each time you submit a run. You can also attach your own compute resource, although support for various scenarios may vary as detailed below:
@@ -58,7 +58,7 @@ For more information, see [Train ML Models with estimators](how-to-train-ml-mode
58
58
59
59
With ML pipelines, you can optimize your workflow with simplicity, speed, portability, and reuse. When building pipelines with Azure Machine Learning, you can focus on your expertise, machine learning, rather than on infrastructure and automation.
60
60
61
-
ML pipelines are constructed from multiple **steps**, which are distinct computational units in the pipeline. Each step can run independently and use isolated compute resources. This allows multiple data scientists to work on the same pipeline at the same time without over-taxing compute resources, and also makes it easy to use different compute types/sizes for each step.
61
+
ML pipelines are constructed from multiple **steps**, which are distinct computational units in the pipeline. Each step can run independently and use isolated compute resources. This approach allows multiple data scientists to work on the same pipeline at the same time without over-taxing compute resources, and also makes it easy to use different compute types/sizes for each step.
62
62
63
63
> [!TIP]
64
64
> ML Pipelines can use run configuration or estimators when training models.
@@ -94,10 +94,11 @@ You can use Azure Machine Learning Compute to distribute the training process ac
94
94
Azure Machine Learning Compute has default limits, such as the number of cores that can be allocated. For more information, see [Manage and request quotas for Azure resources](https://docs.microsoft.com/azure/machine-learning/how-to-manage-quotas).
95
95
96
96
> [!TIP]
97
-
> Clusters can generally scale upto 100 nodes as long as you have enough quota for the number of cores required. By default clusters are setup with inter-node communication enabled between the nodes of the cluster to support MPI jobs for example. However you can scale your clusters to 1000s of nodes by simply [raising a support ticket](https://portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade/newsupportrequest), and requesting to whitelist your subscription, or workspace, or a specific cluster for disabling inter-node communication.
98
-
>
97
+
> Clusters can generally scale up to 100 nodes as long as you have enough quota for the number of cores required. By default clusters are setup with inter-node communication enabled between the nodes of the cluster to support MPI jobs for example. However you can scale your clusters to 1000s of nodes by simply [raising a support ticket](https://portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade/newsupportrequest), and requesting to whitelist your subscription, or workspace, or a specific cluster for disabling inter-node communication.
98
+
99
+
Azure Machine Learning Compute can be reused across runs. The compute can be shared with other users in the workspace and is retained between runs, automatically scaling nodes up or down based on the number of runs submitted, and the max_nodes set on your cluster. The min_nodes setting controls the minimum nodes available.
99
100
100
-
Azure Machine Learning Compute can be reused across runs. The compute can be shared with other users in the workspace and is retained between runs, automatically scaling nodes up or down based on the number of runs submitted, and the max_nodes set on your cluster.
1.**Create and attach**: To create a persistent Azure Machine Learning Compute resource in Python, specify the **vm_size** and **max_nodes** properties. Azure Machine Learning then uses smart defaults for the other properties. The compute autoscales down to zero nodes when it isn't used. Dedicated VMs are created to run your jobs as needed.
103
104
@@ -478,7 +479,7 @@ az ml run submit-hyperdrive -e <experiment> -c <runconfig> --hyperdrive-configur
478
479
479
480
Note the *arguments* section in runconfig and*parameter space*in HyperDrive config. They contain the command-line arguments to be passed to training script. The value in runconfig stays the same for each iteration, while the rangein HyperDrive config is iterated over. Do not specify the same argument in both files.
480
481
481
-
For more details on these ```az ml```CLI commandsand full set of arguments, see
482
+
For more details on these ```az ml```CLI commands, see
0 commit comments