Commit 32f2dec

Update resource-known-issues.md
1 parent 2bc0b58 commit 32f2dec

1 file changed: +13 -5 lines changed

articles/machine-learning/resource-known-issues.md

Lines changed: 13 additions & 5 deletions
@@ -66,14 +66,26 @@ This exception should come from your training scripts. You can look at the log f
 ### Horovod has been shut down
 In most cases if you encounter "AbortedError: Horovod has been shut down" this exception means there was an underlying exception in one of the processes that caused Horovod to shut down. Each rank in the MPI job gets its own dedicated log file in Azure ML. These logs are named `70_driver_logs`. In case of distributed training, the log names are suffixed with `_rank` to make it easier to differentiate the logs. To find the exact error that caused Horovod to shut down, go through all the log files and look for `Traceback` at the end of the driver_log files. One of these files will give you the actual underlying exception.

+### SR-IOV availability on NCv3 machines in AmlCompute for distributed training
+Azure Compute has been rolling out an [SR-IOV upgrade](https://azure.microsoft.com/en-us/updates/sriov-availability-on-ncv3-virtual-machines-sku/) of NCv3 machines, which customers can leverage with Azure ML's managed compute offering (AmlCompute). The updates will enable the support of the entire MPI stack and the use of Infiniband RDMA network for improved multi-node distributed training performance, particularly for deep learning.
+
+View the [update schedule](https://azure.microsoft.com/en-us/updates/sr-iov-availability-schedule-on-ncv3-virtual-machines-sku/) to see when support will be rolled out for your region.
+
 ### Run or experiment deletion

 Experiments can be archived by using the [Experiment.archive](https://docs.microsoft.com/python/api/azureml-core/azureml.core.experiment(class)?view=azure-ml-py#archive--)
 method, or from the Experiment tab view in Azure Machine Learning studio client via the "Archive experiment" button. This action hides the experiment from list queries and views, but does not delete it.

 Permanent deletion of individual experiments or runs is not currently supported. For more information on deleting Workspace assets, see [Export or delete your Machine Learning service workspace data](how-to-export-delete-data.md).

-## Outage: SR-IOV upgrade to NCv3 machines in AmlCompute
+## Azure Machine Learning Compute issues
+Known issues with using Azure Machine Learning Compute (AmlCompute).
+
+### Trouble creating AmlCompute
+
+There is a rare chance that some users who created their Azure Machine Learning workspace from the Azure portal before the GA release might not be able to create AmlCompute in that workspace. You can either raise a support request against the service or create a new workspace through the portal or the SDK to unblock yourself immediately.
+
+### Outage: SR-IOV upgrade to NCv3 machines in AmlCompute

 Azure Compute will be updating the NCv3 SKUs starting early November 2019 to support all MPI implementations and versions, and RDMA verbs for InfiniBand-equipped virtual machines. This will require a short downtime - [read more about the SR-IOV upgrade](https://azure.microsoft.com/updates/sriov-availability-on-ncv3-virtual-machines-sku).
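Editorial note on the Horovod entry in the hunk above: the advice to look for `Traceback` at the end of each per-rank driver log can be automated once the logs are downloaded locally. The following is a minimal sketch, not part of the commit. The helper name `find_tracebacks`, the glob pattern `70_driver_log*`, and the 50-line tail window are assumptions chosen for illustration; the exact log file names may differ in your run.

```python
from pathlib import Path


def find_tracebacks(log_dir, pattern="70_driver_log*", tail_lines=50):
    """Map each matching log file name to the text from its last 'Traceback'.

    Only the final `tail_lines` lines of each file are inspected, and files
    without a 'Traceback' in that window are left out of the result.
    The file-name pattern is an assumption, not a documented convention.
    """
    hits = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        lines = path.read_text(errors="replace").splitlines()
        tail = lines[-tail_lines:]
        # Indices of lines in the tail that mention a Python traceback.
        starts = [i for i, line in enumerate(tail) if "Traceback" in line]
        if starts:
            # Keep everything from the last traceback to the end of the file.
            hits[path.name] = "\n".join(tail[starts[-1]:])
    return hits
```

Calling `find_tracebacks("path/to/downloaded/logs")` returns a dictionary keyed by log file name, so the rank whose process actually failed can be read off without opening every file by hand.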

@@ -97,10 +109,6 @@ Before the fix, you can connect the dataset to any data transformation module (S
 Below image shows how:
 ![visulize-data](./media/resource-known-issues/aml-visualize-data.png)

-## Trouble creating Azure Machine Learning Compute
-
-There is a rare chance that some users who created their Azure Machine Learning workspace from the Azure portal before the GA release might not be able to create Azure Machine Learning Compute in that workspace. You can either raise a support request against the service or create a new workspace through the Portal or the SDK to unblock yourself immediately.
-
 ## Image building failure

 Image building failure when deploying web service. Workaround is to add "pynacl==1.2.1" as a pip dependency to Conda file for image configuration.
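Editorial note on the image-building workaround above: in a Conda environment file, pip dependencies sit in a nested `pip:` list under `dependencies:`. The sketch below renders such a file as text to show where `pynacl==1.2.1` would go. It is not part of the commit. The helper `conda_env_yaml`, the environment name `myenv`, and the `python=3.6` pin are hypothetical placeholders, and only the `pynacl==1.2.1` pin comes from the document.

```python
def conda_env_yaml(name, conda_packages, pip_packages):
    """Render a minimal Conda environment file with a nested pip section."""
    lines = [f"name: {name}", "dependencies:"]
    lines += [f"  - {pkg}" for pkg in conda_packages]
    # pip-installed packages are listed under a nested "pip:" entry.
    lines.append("  - pip:")
    lines += [f"    - {pkg}" for pkg in pip_packages]
    return "\n".join(lines) + "\n"


# Hypothetical environment name and Python pin; only the pynacl pin
# is taken from the known-issues text.
print(conda_env_yaml("myenv", ["python=3.6"], ["pynacl==1.2.1"]))
# Prints:
# name: myenv
# dependencies:
#   - python=3.6
#   - pip:
#     - pynacl==1.2.1
```

The printed text can be saved as the Conda file referenced by the image configuration. If you already have such a file, adding the single `- pynacl==1.2.1` line under its existing `pip:` section has the same effect.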

0 commit comments
