Commit 1edb041

Merge pull request #183076 from MicrosoftDocs/repo_sync_working_branch
Confirm merge from repo_sync_working_branch to master to sync with https://github.com/MicrosoftDocs/azure-docs (branch master)
2 parents: 2423a3d + 94bdc8c

File tree

3 files changed: +59 −29 lines


articles/cognitive-services/Translator/containers/translator-container-configuration.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -15,7 +15,7 @@ recommendations: false
 
 # Configure Translator Docker containers (preview)
 
-Cognitive Services provides each container with a common configuration framework. You can easily configure your Translator containers and you to build Translator application architecture optimized for robust cloud capabilities and edge locality.
+Cognitive Services provides each container with a common configuration framework. You can easily configure your Translator containers to build a Translator application architecture optimized for robust cloud capabilities and edge locality.
 
 The **Translator** container runtime environment is configured using the `docker run` command arguments. This container has several required settings, along with a few optional settings. The container-specific settings are the billing settings.
 
```
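The hunk's closing context line says the Translator container is configured through `docker run` arguments, with the billing settings being the required, container-specific ones. As a minimal sketch of what such a launch looks like when scripted from Python: the image path is an assumption (confirm it against the Translator container docs), and the `Billing`/`ApiKey` values are placeholders you would take from your own Translator resource.

```python
import subprocess

# Assumed image path -- not taken from this commit; verify against the docs.
IMAGE = "mcr.microsoft.com/azure-cognitive-services/translator/text-translation"

subprocess.run(
    [
        "docker", "run", "--rm", "-it",
        "-p", "5000:5000",      # expose the container's REST endpoint
        "--memory", "12g",      # example resource limits; tune per workload
        "--cpus", "4",
        IMAGE,
        "Eula=accept",                              # required setting
        "Billing=<your-translator-endpoint-uri>",   # required billing setting
        "ApiKey=<your-translator-resource-key>",    # required billing setting
    ],
    check=True,  # raise if the container fails to start
)
```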

articles/machine-learning/how-to-train-distributed-gpu.md

Lines changed: 55 additions & 25 deletions
````diff
@@ -258,31 +258,61 @@ For single-node training (including single-node multi-GPU), you can run your code
 - MASTER_PORT
 - NODE_RANK
 
-To run multi-node Lightning training on Azure ML, you can largely follow the [per-node-launch guide](#per-node-launch):
-
-- Define the `PyTorchConfiguration` and specify the `node_count`. Don't specify `process_count`, as Lightning internally handles launching the worker processes for each node.
-- For PyTorch jobs, Azure ML handles setting the MASTER_ADDR, MASTER_PORT, and NODE_RANK environment variables required by Lightning.
-- Lightning will handle computing the world size from the Trainer flags `--gpus` and `--num_nodes` and manage rank and local rank internally.
-
-```python
-from azureml.core import ScriptRunConfig, Experiment
-from azureml.core.runconfig import PyTorchConfiguration
-
-nnodes = 2
-args = ['--max_epochs', 50, '--gpus', 2, '--accelerator', 'ddp', '--num_nodes', nnodes]
-distr_config = PyTorchConfiguration(node_count=nnodes)
-
-run_config = ScriptRunConfig(
-    source_directory='./src',
-    script='train.py',
-    arguments=args,
-    compute_target=compute_target,
-    environment=pytorch_env,
-    distributed_job_config=distr_config,
-)
-
-run = Experiment(ws, 'experiment_name').submit(run_config)
-```
+To run multi-node Lightning training on Azure ML, follow the [per-node-launch](#per-node-launch) guidance, but note that currently, the `ddp` strategy works only when you run an experiment using multiple nodes, with one GPU per node.
+
+To run an experiment using multiple nodes with multiple GPUs:
+
+- Define `MpiConfiguration` and specify `node_count`. Don't specify `process_count` because Lightning internally handles launching the worker processes for each node.
+- For PyTorch jobs, Azure ML handles setting the MASTER_ADDR, MASTER_PORT, and NODE_RANK environment variables that Lightning requires:
+
+```python
+import os
+
+def set_environment_variables_for_nccl_backend(single_node=False, master_port=6105):
+    if not single_node:
+        master_node_params = os.environ["AZ_BATCH_MASTER_NODE"].split(":")
+        os.environ["MASTER_ADDR"] = master_node_params[0]
+
+        # Do not overwrite master port with that defined in AZ_BATCH_MASTER_NODE
+        if "MASTER_PORT" not in os.environ:
+            os.environ["MASTER_PORT"] = str(master_port)
+    else:
+        os.environ["MASTER_ADDR"] = os.environ["AZ_BATCHAI_MPI_MASTER_NODE"]
+        os.environ["MASTER_PORT"] = "54965"
+
+    os.environ["NCCL_SOCKET_IFNAME"] = "^docker0,lo"
+    try:
+        os.environ["NODE_RANK"] = os.environ["OMPI_COMM_WORLD_RANK"]
+        # additional variables
+        os.environ["MASTER_ADDRESS"] = os.environ["MASTER_ADDR"]
+        os.environ["LOCAL_RANK"] = os.environ["OMPI_COMM_WORLD_LOCAL_RANK"]
+        os.environ["WORLD_SIZE"] = os.environ["OMPI_COMM_WORLD_SIZE"]
+    except KeyError:
+        # Fails when used with PyTorch configuration instead of MPI
+        pass
+```
+
+- Lightning handles computing the world size from the Trainer flags `--gpus` and `--num_nodes` and manages rank and local rank internally:
+
+```python
+from azureml.core import ScriptRunConfig, Experiment
+from azureml.core.runconfig import MpiConfiguration
+
+nnodes = 2
+args = ['--max_epochs', 50, '--gpus', 2, '--accelerator', 'ddp_spawn', '--num_nodes', nnodes]
+distr_config = MpiConfiguration(node_count=nnodes)
+
+run_config = ScriptRunConfig(
+    source_directory='./src',
+    script='train.py',
+    arguments=args,
+    compute_target=compute_target,
+    environment=pytorch_env,
+    distributed_job_config=distr_config,
+)
+
+run = Experiment(ws, 'experiment_name').submit(run_config)
+```
 
 ### Hugging Face Transformers
 
````
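To show how the two added snippets connect: the environment-variable helper runs inside the training script itself, before Lightning initializes distributed training, while the `ScriptRunConfig` snippet submits that script. Below is a minimal sketch of a matching `train.py` — not part of the commit. `LitModel` and the `nccl_env` module are hypothetical stand-ins (the helper is the one shown in the diff above), and the flag names mirror the `args` list passed through `ScriptRunConfig`, using Lightning 1.x-era Trainer flags.

```python
import argparse

import pytorch_lightning as pl

# Hypothetical imports: LitModel is your LightningModule; nccl_env holds the
# set_environment_variables_for_nccl_backend helper from the diff above.
from model import LitModel
from nccl_env import set_environment_variables_for_nccl_backend


def main():
    # Flag names mirror the `args` list submitted via ScriptRunConfig.
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_epochs", type=int, default=50)
    parser.add_argument("--gpus", type=int, default=2)
    parser.add_argument("--accelerator", type=str, default="ddp_spawn")
    parser.add_argument("--num_nodes", type=int, default=2)
    args = parser.parse_args()

    # Populate MASTER_ADDR, MASTER_PORT, and NODE_RANK from the MPI
    # environment before the Trainer starts distributed training.
    set_environment_variables_for_nccl_backend(single_node=(args.num_nodes == 1))

    trainer = pl.Trainer(
        max_epochs=args.max_epochs,
        gpus=args.gpus,
        accelerator=args.accelerator,
        num_nodes=args.num_nodes,
    )
    trainer.fit(LitModel())


if __name__ == "__main__":
    main()
```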

articles/storage/blobs/storage-blobs-static-site-github-actions.md

Lines changed: 3 additions & 3 deletions
````diff
@@ -87,7 +87,7 @@ In the example above, replace the placeholders with your subscription ID and resource group name.
 
 on:
   push:
-    branches: [ master ]
+    branches: [ main ]
 ```
 
 1. Rename your workflow `Blob storage website CI` and add the checkout and login actions. These actions will check out your site code and authenticate with Azure using the `AZURE_CREDENTIALS` GitHub secret you created earlier.
@@ -97,7 +97,7 @@ In the example above, replace the placeholders with your subscription ID and resource group name.
 
 on:
   push:
-    branches: [ master ]
+    branches: [ main ]
 
 jobs:
   build:
@@ -131,7 +131,7 @@ In the example above, replace the placeholders with your subscription ID and resource group name.
 
 on:
   push:
-    branches: [ master ]
+    branches: [ main ]
 
 jobs:
   build:
````
