articles/machine-learning/resource-known-issues.md
70 additions & 64 deletions
@@ -16,7 +16,75 @@ ms.date: 11/04/2019
16
16
17
17
This article helps you find and correct errors or failures encountered when using Azure Machine Learning.
18
18
19
-
## Outage: SR-IOV upgrade to NCv3 machines in AmlCompute
19
+
## SDK installation issues
20
+
21
+
**Error message: Cannot uninstall 'PyYAML'**
22
+
23
+
Azure Machine Learning SDK for Python: PyYAML is a distutils installed project. Therefore, we cannot accurately determine which files belong to it if there is a partial uninstall. To continue installing the SDK while ignoring this error, use:
**Error message: `ERROR: No matching distribution found for azureml-dataprep-native`**
30
+
31
+
Anaconda's Python 3.7.4 distribution has a bug that breaks the azureml-sdk installation. This issue is discussed in this [GitHub Issue](https://github.com/ContinuumIO/anaconda-issues/issues/11195).
32
+
To work around this issue, create a new Conda environment with this command:
33
+
```bash
34
+
conda create -n <env-name> python=3.7.3
35
+
```
36
+
This command creates a Conda environment that uses Python 3.7.3, which doesn't have the install issue present in 3.7.4.
37
+
38
+
## Training and experimentation issues
39
+
40
+
### Metric Document is too large
41
+
Azure Machine Learning has internal limits on the size of metric objects that can be logged at once from a training run. If you encounter a "Metric Document is too large" error when logging a list-valued metric, try splitting the list into smaller chunks, for example:
42
+
43
+
```python
44
+
run.log_list("my metric name", my_metric[:N])
45
+
run.log_list("my metric name", my_metric[N:])
46
+
```
47
+
48
+
Internally, Azure ML concatenates the blocks with the same metric name into a contiguous list.
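
When the list is long enough that two halves may still exceed the limit, the same idea can be applied in a loop. The following is a minimal sketch, not an official SDK helper: `log_list_in_chunks` and `chunk_size` are hypothetical names, and a suitable chunk size is an assumption you would need to tune, since the exact size limit is not stated here. It relies only on `run.log_list`, as used in the snippet above.

```python
def log_list_in_chunks(run, name, values, chunk_size):
    """Log `values` under one metric name, at most `chunk_size` items per call.

    `run` is expected to be an azureml.core.Run (or any object that exposes
    log_list(name, value)). Returns the number of log_list calls made.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    calls = 0
    for start in range(0, len(values), chunk_size):
        # Every chunk reuses the same metric name so the chunks are
        # concatenated into one list, as described above.
        run.log_list(name, values[start:start + chunk_size])
        calls += 1
    return calls
```

For example, `log_list_in_chunks(run, "my metric name", my_metric, 250)` would issue one call per 250 values; the value 250 is purely illustrative.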
49
+
50
+
### ModuleErrors (No module named)
51
+
If you run into ModuleErrors while submitting experiments in Azure ML, the training script expects a package that has not been added to the training environment. Once you provide the package name, Azure ML installs the package in the environment used for your training run.
52
+
53
+
If you are using [Estimators](concept-azure-machine-learning-architecture.md#estimators) to submit experiments, you can specify a package name via the `pip_packages` or `conda_packages` parameter of the estimator, depending on the source you want to install the package from. You can also specify a yml file with all your dependencies using `conda_dependencies_file`, or list all your pip requirements in a txt file using the `pip_requirements_file` parameter. If you have your own Azure ML Environment object and want to override the default image used by the estimator, you can specify that environment via the `environment` parameter of the estimator constructor.
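
As an illustration, the sketch below passes extra packages to a generic estimator. It assumes the `Estimator` class from `azureml.train.estimator` in the SDK version current at the time of writing. The package names, the `build_estimator` helper, and its arguments are hypothetical placeholders, not values taken from this article.

```python
# Hypothetical extra dependencies; replace them with the packages that your
# training script actually imports.
EXTRA_PIP_PACKAGES = ["lightgbm"]
EXTRA_CONDA_PACKAGES = ["scikit-learn"]


def build_estimator(source_directory, entry_script, compute_target):
    """Create an estimator whose training environment includes the extras."""
    # Imported lazily so this sketch can be loaded without the SDK installed.
    from azureml.train.estimator import Estimator

    return Estimator(
        source_directory=source_directory,
        entry_script=entry_script,
        compute_target=compute_target,
        pip_packages=EXTRA_PIP_PACKAGES,      # installed with pip
        conda_packages=EXTRA_CONDA_PACKAGES,  # installed with conda
    )
```

The estimator returned by this helper can then be submitted as usual, and the listed packages should be installed in the environment that is built for the run.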
54
+
55
+
Azure ML also provides framework-specific estimators for TensorFlow, PyTorch, Chainer, and SKLearn. Using these estimators ensures that the core framework dependencies are installed on your behalf in the environment used for training. You can specify extra dependencies as described above.
56
+
57
+
Azure ML-maintained Docker images and their contents can be seen in [AzureML Containers](https://github.com/Azure/AzureML-Containers).
58
+
Framework-specific dependencies are listed in the respective framework documentation - [Chainer](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.dnn.chainer?view=azure-ml-py#remarks), [PyTorch](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.dnn.pytorch?view=azure-ml-py#remarks), [TensorFlow](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.dnn.tensorflow?view=azure-ml-py#remarks), [SKLearn](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.sklearn.sklearn?view=azure-ml-py#remarks).
59
+
60
+
> [!Note]
61
+
> If you think a particular package is common enough to be added to Azure ML-maintained images and environments, please raise a GitHub issue in [AzureML Containers](https://github.com/Azure/AzureML-Containers).
62
+
63
+
### NameError (Name not defined), AttributeError (Object has no attribute)
64
+
These exceptions should come from your training script. You can look at the log files in the Azure portal to get more information about the specific name that is not defined or the attribute error. From the SDK, you can use `run.get_details()` to look at the error message. This also lists all the log files generated for your run. Review your training script and fix the error before resubmitting your run.
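
The dictionary returned by `run.get_details()` can be condensed into the few fields that matter when debugging. The sketch below is an assumption-laden illustration: `summarize_run_details` is a hypothetical helper, and the key names `status`, `error`, and `logFiles`, as well as the nested error shape, are assumptions about the returned dictionary that may differ between SDK versions. Every lookup is therefore defensive.

```python
def summarize_run_details(details):
    """Return (status, error message, sorted log file names) from a details dict."""
    status = details.get("status")

    message = None
    error = details.get("error")
    if isinstance(error, dict):
        # Assumed shape: {"error": {"error": {"message": "..."}}}
        inner = error.get("error")
        if isinstance(inner, dict):
            message = inner.get("message")
    if message is None and error:
        # Unknown shape: fall back to the raw value rather than guessing.
        message = str(error)

    log_files = sorted((details.get("logFiles") or {}).keys())
    return status, message, log_files
```

With a real run, this would be called as `summarize_run_details(run.get_details())`.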
65
+
66
+
### Horovod has been shut down
67
+
In most cases, if you encounter "AbortedError: Horovod has been shut down", this exception means there was an underlying exception in one of the processes that caused Horovod to shut down. Each rank in the MPI job gets its own dedicated log file in Azure ML. These logs are named `70_driver_logs`. In case of distributed training, the log names are suffixed with `_rank` to make it easier to differentiate the logs. To find the exact error that caused Horovod to shut down, go through all the log files and look for `Traceback` at the end of the driver_log files. One of these files will give you the actual underlying exception.
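
Searching many per-rank log files by hand is tedious, so the search can be scripted. The following is a minimal sketch that assumes the run's log files have already been downloaded to a local directory. The `find_tracebacks` helper and the `*driver_log*` file name pattern are assumptions for illustration; adjust the pattern to the actual log names of your run.

```python
from pathlib import Path


def find_tracebacks(log_dir, pattern="*driver_log*"):
    """Map each matching log file name to its text from the last 'Traceback' on."""
    hits = {}
    for path in sorted(Path(log_dir).rglob(pattern)):
        if not path.is_file():
            continue
        text = path.read_text(errors="replace")
        index = text.rfind("Traceback")
        if index != -1:
            # Keep only the tail, which holds the underlying exception.
            hits[path.name] = text[index:].strip()
    return hits
```

Ranks that merely reported the Horovod shutdown typically produce no entry, so the returned mapping narrows the search to the files that contain an actual traceback.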
68
+
69
+
### SR-IOV availability on NCv3 machines in AmlCompute for distributed training
70
+
Azure Compute has been rolling out an [SR-IOV upgrade](https://azure.microsoft.com/updates/sriov-availability-on-ncv3-virtual-machines-sku/) of NCv3 machines, which customers can leverage with Azure ML's managed compute offering (AmlCompute). The updates will enable support for the entire MPI stack and the use of the InfiniBand RDMA network for improved multi-node distributed training performance, particularly for deep learning.
71
+
72
+
View the [update schedule](https://azure.microsoft.com/updates/sr-iov-availability-schedule-on-ncv3-virtual-machines-sku/) to see when support will be rolled out for your region.
73
+
74
+
### Run or experiment deletion
75
+
Experiments can be archived by using the [Experiment.archive](https://docs.microsoft.com/python/api/azureml-core/azureml.core.experiment(class)?view=azure-ml-py#archive--)
76
+
method, or from the Experiment tab view in Azure Machine Learning studio client via the "Archive experiment" button. This action hides the experiment from list queries and views, but does not delete it.
77
+
78
+
Permanent deletion of individual experiments or runs is not currently supported. For more information on deleting Workspace assets, see [Export or delete your Machine Learning service workspace data](how-to-export-delete-data.md).
79
+
80
+
## Azure Machine Learning Compute issues
81
+
Known issues with using Azure Machine Learning Compute (AmlCompute).
82
+
83
+
### Trouble creating AmlCompute
84
+
85
+
There is a rare chance that some users who created their Azure Machine Learning workspace from the Azure portal before the GA release might not be able to create AmlCompute in that workspace. You can either raise a support request against the service or create a new workspace through the portal or the SDK to unblock yourself immediately.
86
+
87
+
### Outage: SR-IOV upgrade to NCv3 machines in AmlCompute
20
88
21
89
Azure Compute will be updating the NCv3 SKUs starting early November 2019 to support all MPI implementations and versions, and RDMA verbs for InfiniBand-equipped virtual machines. This will require a short downtime - [read more about the SR-IOV upgrade](https://azure.microsoft.com/updates/sriov-availability-on-ncv3-virtual-machines-sku).
22
90
@@ -40,29 +108,6 @@ Before the fix, you can connect the dataset to any data transformation module (S
Azure Machine Learning SDK for Python: PyYAML is a distutils installed project. Therefore, we cannot accurately determine which files belong to it if there is a partial uninstall. To continue installing the SDK while ignoring this error, use:
**Error message: `ERROR: No matching distribution found for azureml-dataprep-native`**
54
-
55
-
Anaconda's Python 3.7.4 distribution has a bug that breaks azureml-sdk install. This issue is discussed in this [GitHub Issue](https://github.com/ContinuumIO/anaconda-issues/issues/11195)
56
-
This can be worked around by creating a new Conda Environment using this command:
57
-
```bash
58
-
conda create -n <env-name> python=3.7.3
59
-
```
60
-
Which creates a Conda Environment using Python 3.7.3, which doesn't have the install issue present in 3.7.4.
There is a rare chance that some users who created their Azure Machine Learning workspace from the Azure portal before the GA release might not be able to create Azure Machine Learning Compute in that workspace. You can either raise a support request against the service or create a new workspace through the Portal or the SDK to unblock yourself immediately.
65
-
66
111
## Image building failure
67
112
68
113
Image building failure when deploying web service. Workaround is to add "pynacl==1.2.1" as a pip dependency to Conda file for image configuration.
@@ -254,38 +299,6 @@ kubectl get secret/azuremlfessl -o yaml
254
299
>[!Note]
255
300
>Kubernetes stores the secrets in base-64 encoded format. You will need to base-64 decode the `cert.pem` and `key.pem` components of the secrets prior to providing them to `attach_config.enable_ssl`.
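
The decoding step can be done with the Python standard library. This is a minimal sketch: `decode_secret_data` is a hypothetical helper, and it assumes you have already parsed the secret's YAML output into a dictionary whose `data` mapping contains the `cert.pem` and `key.pem` entries.

```python
import base64


def decode_secret_data(data):
    """Decode every base-64 value in a secret's `data` mapping to bytes."""
    return {key: base64.b64decode(value) for key, value in data.items()}


def write_pem_files(data, cert_path="cert.pem", key_path="key.pem"):
    """Write the decoded certificate and key to local files."""
    decoded = decode_secret_data(data)
    with open(cert_path, "wb") as cert_file:
        cert_file.write(decoded["cert.pem"])
    with open(key_path, "wb") as key_file:
        key_file.write(decoded["key.pem"])
```

The resulting files hold the decoded PEM content that the note above says to provide to `attach_config.enable_ssl`.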
256
301
257
-
## Recommendations for error fix
258
-
Based on general observation, here are Azure ML recommendations to fix some of the common errors in Azure ML.
259
-
260
-
### Metric Document is too large
261
-
Azure Machine Learning has internal limits on the size of metric objects that can be logged at once from a training run. If you encounter "Metric Document is too large" error when logging a list-valued metric, try splitting the list into smaller chunks, for example:
262
-
263
-
```python
264
-
run.log_list("my metric name", my_metric[:N])
265
-
run.log_list("my metric name", my_metric[N:])
266
-
```
267
-
268
-
Internally, the run history service concatenates the blocks with same metric name into a contiguous list.
269
-
270
-
### ModuleErrors (No module named)
271
-
If you are running into ModuleErrors while submitting experiments in Azure ML, it means that the training script is expecting a package to be installed but it isn't added. Once you provide the package name, Azure ML will install the package in the environment used for your training.
272
-
273
-
If you are using [Estimators](concept-azure-machine-learning-architecture.md#estimators) to submit experiments, you can specify a package name via `pip_packages` or `conda_packages` parameter in the estimator based on from which source you want to install the package. You can also specify a yml file with all your dependencies using `conda_dependencies_file`or list all your pip requirements in a txt file using `pip_requirements_file` parameter.
274
-
275
-
Azure ML also provides framework-specific estimators for Tensorflow, PyTorch, Chainer and SKLearn. Using these estimators will make sure that the framework dependencies are installed on your behalf in the environment used for training. You have the option to specify extra dependencies as described above.
276
-
277
-
Azure ML maintained docker images and their contents can be seen in [AzureML Containers](https://github.com/Azure/AzureML-Containers).
278
-
Framework-specific dependencies are listed in the respective framework documentation - [Chainer](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.dnn.chainer?view=azure-ml-py#remarks), [PyTorch](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.dnn.pytorch?view=azure-ml-py#remarks), [TensorFlow](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.dnn.tensorflow?view=azure-ml-py#remarks), [SKLearn](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.sklearn.sklearn?view=azure-ml-py#remarks).
279
-
280
-
> [!Note]
281
-
> If you think a particular package is common enough to be added in Azure ML maintained images and environments please raise a GitHub issue in [AzureML Containers](https://github.com/Azure/AzureML-Containers).
282
-
283
-
### NameError (Name not defined), AttributeError (Object has no attribute)
284
-
This exception should come from your training scripts. You can look at the log files from Azure portal to get more information about the specific name not defined or attribute error. From the SDK, you can use `run.get_details()` to look at the error message. This will also list all the log files generated for your run. Please make sure to take a look at your training script, fix the error before retrying.
285
-
286
-
### Horovod is shut down
287
-
In most cases, this exception means there was an underlying exception in one of the processes that caused horovod to shut down. Each rank in the MPI job gets it own dedicated log file in Azure ML. These logs are named `70_driver_logs`. In case of distributed training, the log names are suffixed with `_rank` to make it easy to differentiate the logs. To find the exact error that caused horovod shutdown, go through all the log files and look for `Traceback` at the end of the driver_log files. One of these files will give you the actual underlying exception.
288
-
289
302
## Labeling projects issues
290
303
291
304
Known issues with labeling projects.
@@ -306,14 +319,7 @@ To load all labeled images, choose the **First** button. The **First** button wi
306
319
307
320
Delete the label by clicking on the cross mark next to it.
308
321
309
-
## Run or experiment deletion
310
-
311
-
Experiments can be archived by using [Experiment.archive](https://docs.microsoft.com/python/api/azureml-core/azureml.core.experiment(class)?view=azure-ml-py#archive--)
312
-
method, or from Experiment tab view in Azure Machine Learning studio client. This action hides the experiment from list queries and views, but does not delete it.
313
-
314
-
Permanent deletion of individual experiments or runs is not currently supported. For more information on deleting Workspace assets, see [Export or delete your Machine Learning service workspace data](how-to-export-delete-data.md).
315
-
316
322
## Moving the workspace
317
323
318
324
> [!WARNING]
319
-
> Moving your Azure Machine Learning workspace to a different subscription, or moving the owning subscription to a new tenant, is not supported. Doing so may cause errors.
325
+
> Moving your Azure Machine Learning workspace to a different subscription, or moving the owning subscription to a new tenant, is not supported. Doing so may cause errors.