Commit c322cc0

Merge pull request #105156 from mx-iao/minxia/training-troubleshooting
Updating Azure Machine Learning troubleshooting article
2 parents c2f8570 + 6893745 commit c322cc0

1 file changed: +70 -64 lines changed

articles/machine-learning/resource-known-issues.md

Lines changed: 70 additions & 64 deletions
@@ -16,7 +16,75 @@ ms.date: 11/04/2019
1616

1717
This article helps you find and correct errors or failures encountered when using Azure Machine Learning.
1818

19-
## Outage: SR-IOV upgrade to NCv3 machines in AmlCompute
19+
## SDK installation issues
20+
21+
**Error message: Cannot uninstall 'PyYAML'**
22+
23+
Azure Machine Learning SDK for Python: PyYAML is a distutils-installed project, so pip cannot accurately determine which files belong to it if there is a partial uninstall. To continue installing the SDK while ignoring this error, use:
24+
25+
```bash
26+
pip install --upgrade azureml-sdk[notebooks,automl] --ignore-installed PyYAML
27+
```
28+
29+
**Error message: `ERROR: No matching distribution found for azureml-dataprep-native`**
30+
31+
Anaconda's Python 3.7.4 distribution has a bug that breaks the azureml-sdk install. The problem is discussed in this [GitHub issue](https://github.com/ContinuumIO/anaconda-issues/issues/11195).
32+
To work around it, create a new conda environment with this command:
33+
```bash
34+
conda create -n <env-name> python=3.7.3
35+
```
36+
This creates a conda environment that uses Python 3.7.3, which doesn't have the install issue present in 3.7.4.
37+
38+
## Training and experimentation issues
39+
40+
### Metric Document is too large
41+
Azure Machine Learning has internal limits on the size of metric objects that can be logged at once from a training run. If you encounter a "Metric Document is too large" error when logging a list-valued metric, try splitting the list into smaller chunks, for example:
42+
43+
```python
44+
run.log_list("my metric name", my_metric[:N])
45+
run.log_list("my metric name", my_metric[N:])
46+
```
47+
48+
Internally, Azure ML concatenates the blocks with the same metric name into a contiguous list.
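
For lists of arbitrary length, the same splitting can be done in a loop. The following is a minimal sketch, not part of the SDK: `log_list_in_chunks` and `RecordingRun` are hypothetical names introduced for illustration, and the chunk size is an assumption you would need to tune, since the actual size limit is not stated here. The helper only relies on the `log_list(name, value)` call shown above.

```python
def log_list_in_chunks(run, name, values, chunk_size):
    """Log `values` under `name` in slices of at most `chunk_size` items.

    `run` is any object exposing log_list(name, value), such as an
    Azure ML run. `chunk_size` is a caller-chosen assumption.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    values = list(values)
    for start in range(0, len(values), chunk_size):
        # Each call logs one slice under the same metric name.
        run.log_list(name, values[start:start + chunk_size])


class RecordingRun:
    """Hypothetical stand-in for a run object; it only records calls."""

    def __init__(self):
        self.calls = []

    def log_list(self, name, value):
        self.calls.append((name, list(value)))
```

With a real run object, the call would look like `log_list_in_chunks(run, "my metric name", my_metric, chunk_size=N)`. `RecordingRun` is only useful for checking the slicing locally without submitting a run.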
49+
50+
### ModuleErrors (No module named)
51+
If you encounter ModuleErrors while submitting experiments in Azure ML, the training script expects a package that has not been added to the environment used for the run. Once you provide the package name, Azure ML installs the package in the environment used for your training run.
52+
53+
If you are using [Estimators](concept-azure-machine-learning-architecture.md#estimators) to submit experiments, you can specify a package name via the `pip_packages` or `conda_packages` parameter of the estimator, depending on the source you want to install the package from. You can also specify a yml file with all your dependencies using `conda_dependencies_file`, or list all your pip requirements in a txt file using the `pip_requirements_file` parameter. If you have your own Azure ML Environment object and want it to override the default image used by the estimator, you can specify that environment via the `environment` parameter of the estimator constructor.
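
To decide which names to list, it can help to check a training script's imports before submitting. The sketch below is an illustration using only the Python standard library, not an Azure ML feature; `top_level_imports` and `missing_modules` are hypothetical helper names. It assumes the environment you run it in roughly mirrors the one used for training, and it reports module names, which can differ from package names (for example, the `sklearn` module is provided by the `scikit-learn` package).

```python
import ast
import importlib.util


def top_level_imports(source):
    """Return the set of top-level module names imported by `source`."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 skips relative imports such as "from . import x".
            names.add(node.module.split(".")[0])
    return names


def missing_modules(source):
    """Return imported modules that cannot be found in this environment."""
    missing = []
    for name in sorted(top_level_imports(source)):
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing
```

For example, `missing_modules(open("train.py").read())` returns the imports of a script named `train.py` (a placeholder file name) that are not importable locally. Those are candidates to map to package names and pass through the parameters described above.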
54+
55+
Azure ML also provides framework-specific estimators for TensorFlow, PyTorch, Chainer, and SKLearn. Using these estimators ensures that the core framework dependencies are installed on your behalf in the environment used for training. You have the option to specify extra dependencies as described above.
56+
57+
The Docker images maintained by Azure ML, and their contents, can be seen in [AzureML Containers](https://github.com/Azure/AzureML-Containers).
58+
Framework-specific dependencies are listed in the respective framework documentation - [Chainer](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.dnn.chainer?view=azure-ml-py#remarks), [PyTorch](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.dnn.pytorch?view=azure-ml-py#remarks), [TensorFlow](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.dnn.tensorflow?view=azure-ml-py#remarks), [SKLearn](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.sklearn.sklearn?view=azure-ml-py#remarks).
59+
60+
> [!Note]
61+
> If you think a particular package is common enough to be added to the images and environments maintained by Azure ML, please raise a GitHub issue in [AzureML Containers](https://github.com/Azure/AzureML-Containers).
62+
63+
### NameError (Name not defined), AttributeError (Object has no attribute)
64+
This exception should come from your training scripts. You can look at the log files in the Azure portal to get more information about the specific undefined name or attribute error. From the SDK, you can use `run.get_details()` to look at the error message. This also lists all the log files generated for your run. Review your training script and fix the error before resubmitting your run.
65+
66+
### Horovod has been shut down
67+
In most cases, "AbortedError: Horovod has been shut down" means there was an underlying exception in one of the processes that caused Horovod to shut down. Each rank in the MPI job gets its own dedicated log file in Azure ML. These logs are named `70_driver_logs`. In case of distributed training, the log names are suffixed with `_rank` to make it easier to differentiate the logs. To find the exact error that caused Horovod to shut down, go through all the log files and look for `Traceback` at the end of the driver_log files. One of these files will give you the actual underlying exception.
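
If you have downloaded the log files to a local folder, the search for `Traceback` can be scripted. This is a rough sketch using only the standard library. `find_tracebacks` is a hypothetical helper, and the default glob pattern `*driver_log*` and the 200-line tail are assumptions, so adjust them to the log file names your run actually produced.

```python
from pathlib import Path


def find_tracebacks(log_dir, pattern="*driver_log*", tail_lines=200):
    """Map each matching log file name to its last traceback block, if any.

    Only the final `tail_lines` lines of each file are inspected, because
    the underlying exception is expected near the end of the log.
    """
    results = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        if not path.is_file():
            continue
        lines = path.read_text(errors="replace").splitlines()
        tail = lines[-tail_lines:]
        # Indices of lines that open a Python traceback.
        starts = [i for i, line in enumerate(tail)
                  if "Traceback (most recent call last)" in line]
        if starts:
            # Keep everything from the last traceback header to the end.
            results[path.name] = "\n".join(tail[starts[-1]:])
    return results
```

Calling `find_tracebacks("path/to/downloaded/logs")` (a placeholder path) returns a dictionary keyed by file name. Files without a traceback are omitted, which narrows the search to the ranks that actually failed.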
68+
69+
### SR-IOV availability on NCv3 machines in AmlCompute for distributed training
70+
Azure Compute has been rolling out an [SR-IOV upgrade](https://azure.microsoft.com/updates/sriov-availability-on-ncv3-virtual-machines-sku/) of NCv3 machines, which customers can leverage with Azure ML's managed compute offering (AmlCompute). The updates enable support for the entire MPI stack and the use of the InfiniBand RDMA network for improved multi-node distributed training performance, particularly for deep learning.
71+
72+
View the [update schedule](https://azure.microsoft.com/updates/sr-iov-availability-schedule-on-ncv3-virtual-machines-sku/) to see when support will be rolled out for your region.
73+
74+
### Run or experiment deletion
75+
Experiments can be archived by using the [Experiment.archive](https://docs.microsoft.com/python/api/azureml-core/azureml.core.experiment(class)?view=azure-ml-py#archive--)
76+
method, or from the Experiment tab view in Azure Machine Learning studio client via the "Archive experiment" button. This action hides the experiment from list queries and views, but does not delete it.
77+
78+
Permanent deletion of individual experiments or runs is not currently supported. For more information on deleting Workspace assets, see [Export or delete your Machine Learning service workspace data](how-to-export-delete-data.md).
79+
80+
## Azure Machine Learning Compute issues
81+
Known issues with using Azure Machine Learning Compute (AmlCompute).
82+
83+
### Trouble creating AmlCompute
84+
85+
There is a rare chance that some users who created their Azure Machine Learning workspace from the Azure portal before the GA release might not be able to create AmlCompute in that workspace. You can either raise a support request against the service or create a new workspace through the portal or the SDK to unblock yourself immediately.
86+
87+
### Outage: SR-IOV upgrade to NCv3 machines in AmlCompute
2088

2189
Azure Compute will be updating the NCv3 SKUs starting early November 2019 to support all MPI implementations and versions, and RDMA verbs for InfiniBand-equipped virtual machines. This will require a short downtime - [read more about the SR-IOV upgrade](https://azure.microsoft.com/updates/sriov-availability-on-ncv3-virtual-machines-sku).
2290

@@ -40,29 +108,6 @@ Before the fix, you can connect the dataset to any data transformation module (S
40108
Below image shows how:
41109
![visulize-data](./media/resource-known-issues/aml-visualize-data.png)
42110

43-
## SDK installation issues
44-
45-
**Error message: Cannot uninstall 'PyYAML'**
46-
47-
Azure Machine Learning SDK for Python: PyYAML is a distutils installed project. Therefore, we cannot accurately determine which files belong to it if there is a partial uninstall. To continue installing the SDK while ignoring this error, use:
48-
49-
```Python
50-
pip install --upgrade azureml-sdk[notebooks,automl] --ignore-installed PyYAML
51-
```
52-
53-
**Error message: `ERROR: No matching distribution found for azureml-dataprep-native`**
54-
55-
Anaconda's Python 3.7.4 distribution has a bug that breaks azureml-sdk install. This issue is discussed in this [GitHub Issue](https://github.com/ContinuumIO/anaconda-issues/issues/11195)
56-
This can be worked around by creating a new Conda Environment using this command:
57-
```bash
58-
conda create -n <env-name> python=3.7.3
59-
```
60-
Which creates a Conda Environment using Python 3.7.3, which doesn't have the install issue present in 3.7.4.
61-
62-
## Trouble creating Azure Machine Learning Compute
63-
64-
There is a rare chance that some users who created their Azure Machine Learning workspace from the Azure portal before the GA release might not be able to create Azure Machine Learning Compute in that workspace. You can either raise a support request against the service or create a new workspace through the Portal or the SDK to unblock yourself immediately.
65-
66111
## Image building failure
67112

68113
Image building failure when deploying web service. Workaround is to add "pynacl==1.2.1" as a pip dependency to Conda file for image configuration.
@@ -254,38 +299,6 @@ kubectl get secret/azuremlfessl -o yaml
254299
>[!Note]
255300
>Kubernetes stores the secrets in base-64 encoded format. You will need to base-64 decode the `cert.pem` and `key.pem` components of the secrets prior to providing them to `attach_config.enable_ssl`.
256301
257-
## Recommendations for error fix
258-
Based on general observation, here are Azure ML recommendations to fix some of the common errors in Azure ML.
259-
260-
### Metric Document is too large
261-
Azure Machine Learning has internal limits on the size of metric objects that can be logged at once from a training run. If you encounter "Metric Document is too large" error when logging a list-valued metric, try splitting the list into smaller chunks, for example:
262-
263-
```python
264-
run.log_list("my metric name", my_metric[:N])
265-
run.log_list("my metric name", my_metric[N:])
266-
```
267-
268-
Internally, the run history service concatenates the blocks with same metric name into a contiguous list.
269-
270-
### ModuleErrors (No module named)
271-
If you are running into ModuleErrors while submitting experiments in Azure ML, it means that the training script is expecting a package to be installed but it isn't added. Once you provide the package name, Azure ML will install the package in the environment used for your training.
272-
273-
If you are using [Estimators](concept-azure-machine-learning-architecture.md#estimators) to submit experiments, you can specify a package name via `pip_packages` or `conda_packages` parameter in the estimator based on from which source you want to install the package. You can also specify a yml file with all your dependencies using `conda_dependencies_file`or list all your pip requirements in a txt file using `pip_requirements_file` parameter.
274-
275-
Azure ML also provides framework-specific estimators for Tensorflow, PyTorch, Chainer and SKLearn. Using these estimators will make sure that the framework dependencies are installed on your behalf in the environment used for training. You have the option to specify extra dependencies as described above.
276-
277-
Azure ML maintained docker images and their contents can be seen in [AzureML Containers](https://github.com/Azure/AzureML-Containers).
278-
Framework-specific dependencies are listed in the respective framework documentation - [Chainer](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.dnn.chainer?view=azure-ml-py#remarks), [PyTorch](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.dnn.pytorch?view=azure-ml-py#remarks), [TensorFlow](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.dnn.tensorflow?view=azure-ml-py#remarks), [SKLearn](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.sklearn.sklearn?view=azure-ml-py#remarks).
279-
280-
> [!Note]
281-
> If you think a particular package is common enough to be added in Azure ML maintained images and environments please raise a GitHub issue in [AzureML Containers](https://github.com/Azure/AzureML-Containers).
282-
283-
### NameError (Name not defined), AttributeError (Object has no attribute)
284-
This exception should come from your training scripts. You can look at the log files from Azure portal to get more information about the specific name not defined or attribute error. From the SDK, you can use `run.get_details()` to look at the error message. This will also list all the log files generated for your run. Please make sure to take a look at your training script, fix the error before retrying.
285-
286-
### Horovod is shut down
287-
In most cases, this exception means there was an underlying exception in one of the processes that caused horovod to shut down. Each rank in the MPI job gets it own dedicated log file in Azure ML. These logs are named `70_driver_logs`. In case of distributed training, the log names are suffixed with `_rank` to make it easy to differentiate the logs. To find the exact error that caused horovod shutdown, go through all the log files and look for `Traceback` at the end of the driver_log files. One of these files will give you the actual underlying exception.
288-
289302
## Labeling projects issues
290303

291304
Known issues with labeling projects.
@@ -306,14 +319,7 @@ To load all labeled images, choose the **First** button. The **First** button wi
306319

307320
Delete the label by clicking on the cross mark next to it.
308321

309-
## Run or experiment deletion
310-
311-
Experiments can be archived by using [Experiment.archive](https://docs.microsoft.com/python/api/azureml-core/azureml.core.experiment(class)?view=azure-ml-py#archive--)
312-
method, or from Experiment tab view in Azure Machine Learning studio client. This action hides the experiment from list queries and views, but does not delete it.
313-
314-
Permanent deletion of individual experiments or runs is not currently supported. For more information on deleting Workspace assets, see [Export or delete your Machine Learning service workspace data](how-to-export-delete-data.md).
315-
316322
## Moving the workspace
317323

318324
> [!WARNING]
319-
> Moving your Azure Machine Learning workspace to a different subscription, or moving the owning subscription to a new tenant, is not supported. Doing so may cause errors.
325+
> Moving your Azure Machine Learning workspace to a different subscription, or moving the owning subscription to a new tenant, is not supported. Doing so may cause errors.
