When you train a machine learning model in an Azure Machine Learning (AML) cluster, with or without AML pipelines, debugging can be challenging because part of the code runs on a remote machine. You could train models locally with the AML SDK, but the code and behavior for local runs are not exactly the same as for remote runs. In this article, we describe a few scenarios for staying productive when developing AML solutions locally.
Note: the best tool for AML local development is probably the Azure Machine Learning VSCode extension. At the time of this writing, however, version 0.6.19 doesn't work on an Ubuntu machine, or on a Windows machine connected to Ubuntu over remote SSH.
If the code can run entirely locally without having to communicate with a remote service or cluster, then you can debug it as usual.
For example, in the image-classification-tensorflow sample, `train.py` contains the code to split the data into training and test sets and to train the model. It depends on Keras but not on a remote service or cluster. Meanwhile, `train_aml.py` accesses the Azure ML Datastore and Dataset. You can debug `train.py` without any special setup.
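A train/test split like this is a purely local computation, so you can step through it with an ordinary breakpoint. Here is a minimal sketch of such a step; the function and the 80/20 ratio are illustrative, not the sample's actual code:

```python
import random

def split_data(samples, test_fraction=0.2, seed=42):
    """Shuffle and split samples into train and test lists; runs entirely locally."""
    rng = random.Random(seed)
    shuffled = list(samples)  # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_data(range(100))
print(len(train), len(test))  # 80 20
```

Because nothing here talks to Azure, a breakpoint inside `split_data` behaves exactly the same locally as the logic would in a remote run.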
You can still debug code that has remote dependencies, as documented for that sample. However, if the code doesn't fully support local runs, it will hit errors that don't occur in remote runs.
Azure ML supports a local compute target for both training and inferencing. Here is an example of training a scikit-learn model locally using the Azure ML SDK. Most of the code remains the same for both local and remote training. However, local runs have limitations; for example:
- can't run AML pipelines locally
- can't mount AML Datastore or Datasets locally
- a local run doesn't have run context, such as a parent or child run
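Submitting such a local run differs from a remote run mainly in the compute target. The following is a minimal sketch using the v1 Python SDK; the workspace config file, script name, and conda specification file are assumptions for illustration:

```python
# Sketch: train locally with the Azure ML (v1) SDK by targeting "local" compute.
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()  # reads the config.json downloaded from the portal

# Build the environment from a conda spec (file name is an assumption)
env = Environment.from_conda_specification("sklearn-env", "conda_env.yml")

src = ScriptRunConfig(
    source_directory=".",
    script="train.py",
    compute_target="local",  # run on this machine instead of a remote cluster
    environment=env,
)

run = Experiment(ws, "sklearn-local").submit(src)
run.wait_for_completion(show_output=True)
```

Switching to remote training is then largely a matter of changing `compute_target` to the name of an AML compute cluster.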
You can trigger an AML experiment to run remotely from a local machine, or trigger an AML pipeline, which runs only in an AML compute cluster. In either case, the training code runs in the remote cluster, but you can still debug the code before submitting the run and after it returns. For example, to publish and run the data preprocessing pipeline in the image-classification-tensorflow sample:
- In a terminal, activate the Conda environment and go to the root directory of the sample.
- Ensure the variables for AML are set correctly in your local `.env`.
- Place a breakpoint in `build_data_processing_pipeline.py`, then run:

  ```shell
  python -m debugpy --listen 5678 --wait-for-client ml_service/pipelines/build_data_processing_pipeline.py
  ```

- In VSCode, create a launch configuration to attach to the debugger, and press F5:

  ```json
  "configurations": [
      {
          "name": "Python: Attach",
          "cwd": "${workspaceFolder}/samples/image-classification-tensorflow",
          "type": "python",
          "request": "attach",
          "connect": {
              "host": "localhost",
              "port": 5678
          }
      }
  ]
  ```

- Place a breakpoint in `run_data_processing_pipeline.py`, then run, replacing the `--aml_pipeline_name` value with your own:

  ```shell
  python -m debugpy --listen 5678 --wait-for-client ml_service/pipelines/run_data_processing_pipeline.py --aml_pipeline_name flower-data-processing-pipeline
  ```
You can modify the code to supply additional AML pipeline parameters as shown in this example.
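For instance, a published pipeline can be submitted with extra pipeline parameters. A minimal sketch with the v1 SDK follows; the experiment name and the parameter name and value are placeholder assumptions, not the sample's actual parameters:

```python
# Sketch: submit a published AML pipeline with pipeline parameters (SDK v1).
from azureml.core import Workspace
from azureml.pipeline.core import PublishedPipeline

ws = Workspace.from_config()

# Find the published pipeline by name (name taken from the sample above)
pipelines = [p for p in PublishedPipeline.list(ws)
             if p.name == "flower-data-processing-pipeline"]

run = pipelines[0].submit(
    ws,
    experiment_name="flower-data-processing",       # assumption
    pipeline_parameters={"model_name": "flower-classifier"},  # assumption
)
run.wait_for_completion(show_output=True)
```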
This scenario is not about local debugging, but it can also make development more productive. When you train remotely in an AML compute cluster, the cluster automatically scales down when idle, so it might take a while for the cluster to spin up every time you submit a run. You can reduce this spin-up time by attaching your own VM as a compute target. If you use a VM for development, it's already running anyway, saving both cost and time.
If you attach your own VM, you probably want to configure Conda and Docker to store environments and images on an attached data disk rather than the OS disk, because the OS disk is typically small. Additionally, AML pulls down environments into `~/.azureml`; link this directory to a data disk as well.
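One way to set this up is sketched below; the `/datadisk` mount point is an assumption, so point it at wherever your VM's data disk is mounted:

```shell
# Sketch: keep the AML environment cache (~/.azureml) on a data disk.
DATA_DISK="${DATA_DISK:-/datadisk}"
CACHE="$DATA_DISK/azureml-cache"

mkdir -p "$CACHE" || true

# Move an existing cache to the data disk once, then leave a symlink behind
if [ -d "$HOME/.azureml" ] && [ ! -L "$HOME/.azureml" ]; then
    mv "$HOME/.azureml" "$CACHE/.azureml" || true
fi
if [ ! -d "$HOME/.azureml" ] || [ -L "$HOME/.azureml" ]; then
    ln -sfn "$CACHE/.azureml" "$HOME/.azureml"
fi

# Conda can be redirected similarly:
#   conda config --add envs_dirs "$DATA_DISK/conda/envs"
#   conda config --add pkgs_dirs "$DATA_DISK/conda/pkgs"
# Docker: set "data-root" in /etc/docker/daemon.json to a data-disk path,
# then restart the Docker daemon.
```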