Commit 1edb041

Merge pull request #183076 from MicrosoftDocs/repo_sync_working_branch
Confirm merge from repo_sync_working_branch to master to sync with https://github.com/MicrosoftDocs/azure-docs (branch master)
2 parents: 2423a3d + 94bdc8c

File tree

3 files changed: +59 −29 lines


articles/cognitive-services/Translator/containers/translator-container-configuration.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -15,7 +15,7 @@ recommendations: false
 
 # Configure Translator Docker containers (preview)
 
-Cognitive Services provides each container with a common configuration framework. You can easily configure your Translator containers and you to build Translator application architecture optimized for robust cloud capabilities and edge locality.
+Cognitive Services provides each container with a common configuration framework. You can easily configure your Translator containers to build a Translator application architecture optimized for robust cloud capabilities and edge locality.
 
 The **Translator** container runtime environment is configured using the `docker run` command arguments. This container has several required settings, along with a few optional settings. The container-specific settings are the billing settings.
 
```
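The hunk's closing context line says the Translator container is configured through `docker run` arguments, with the billing settings being the required, container-specific ones. As a minimal sketch of what such a launch looks like when scripted from Python: the image path is an assumption (confirm it against the Translator container docs), and the `Billing`/`ApiKey` values are placeholders you would take from your own Translator resource.

```python
import subprocess

# Assumed image path -- not taken from this commit; verify against the docs.
IMAGE = "mcr.microsoft.com/azure-cognitive-services/translator/text-translation"

subprocess.run(
    [
        "docker", "run", "--rm", "-it",
        "-p", "5000:5000",      # expose the container's REST endpoint
        "--memory", "12g",      # example resource limits; tune per workload
        "--cpus", "4",
        IMAGE,
        "Eula=accept",                              # required setting
        "Billing=<your-translator-endpoint-uri>",   # required billing setting
        "ApiKey=<your-translator-resource-key>",    # required billing setting
    ],
    check=True,  # raise if the container fails to start
)
```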

articles/machine-learning/how-to-train-distributed-gpu.md

Lines changed: 55 additions & 25 deletions
````diff
@@ -258,31 +258,61 @@ For single-node training (including single-node multi-GPU), you can run your code
 - MASTER_PORT
 - NODE_RANK
 
-To run multi-node Lightning training on Azure ML, you can largely follow the [per-node-launch guide](#per-node-launch):
-
-- Define the `PyTorchConfiguration` and specify the `node_count`. Don't specify `process_count`, as Lightning internally handles launching the worker processes for each node.
-- For PyTorch jobs, Azure ML handles setting the MASTER_ADDR, MASTER_PORT, and NODE_RANK environment variables required by Lightning.
-- Lightning will handle computing the world size from the Trainer flags `--gpus` and `--num_nodes` and manage rank and local rank internally.
-
-```python
-from azureml.core import ScriptRunConfig, Experiment
-from azureml.core.runconfig import PyTorchConfiguration
-
-nnodes = 2
-args = ['--max_epochs', 50, '--gpus', 2, '--accelerator', 'ddp', '--num_nodes', nnodes]
-distr_config = PyTorchConfiguration(node_count=nnodes)
-
-run_config = ScriptRunConfig(
-    source_directory='./src',
-    script='train.py',
-    arguments=args,
-    compute_target=compute_target,
-    environment=pytorch_env,
-    distributed_job_config=distr_config,
-)
-
-run = Experiment(ws, 'experiment_name').submit(run_config)
-```
+To run multi-node Lightning training on Azure ML, follow the [per-node-launch](#per-node-launch) guidance, but note that currently, the `ddp` strategy works only when you run an experiment using multiple nodes, with one GPU per node.
+
+To run an experiment using multiple nodes with multiple GPUs:
+
+- Define `MpiConfiguration` and specify `node_count`. Don't specify `process_count` because Lightning internally handles launching the worker processes for each node.
+- For PyTorch jobs, Azure ML handles setting the MASTER_ADDR, MASTER_PORT, and NODE_RANK environment variables that Lightning requires:
+
+```python
+import os
+
+def set_environment_variables_for_nccl_backend(single_node=False, master_port=6105):
+    if not single_node:
+        master_node_params = os.environ["AZ_BATCH_MASTER_NODE"].split(":")
+        os.environ["MASTER_ADDR"] = master_node_params[0]
+
+        # Do not overwrite master port with that defined in AZ_BATCH_MASTER_NODE
+        if "MASTER_PORT" not in os.environ:
+            os.environ["MASTER_PORT"] = str(master_port)
+    else:
+        os.environ["MASTER_ADDR"] = os.environ["AZ_BATCHAI_MPI_MASTER_NODE"]
+        os.environ["MASTER_PORT"] = "54965"
+
+    os.environ["NCCL_SOCKET_IFNAME"] = "^docker0,lo"
+    try:
+        os.environ["NODE_RANK"] = os.environ["OMPI_COMM_WORLD_RANK"]
+        # additional variables
+        os.environ["MASTER_ADDRESS"] = os.environ["MASTER_ADDR"]
+        os.environ["LOCAL_RANK"] = os.environ["OMPI_COMM_WORLD_LOCAL_RANK"]
+        os.environ["WORLD_SIZE"] = os.environ["OMPI_COMM_WORLD_SIZE"]
+    except KeyError:
+        # Fails when used with PyTorch configuration instead of MPI
+        pass
+```
+
+- Lightning handles computing the world size from the Trainer flags `--gpus` and `--num_nodes` and manages rank and local rank internally:
+
+```python
+from azureml.core import ScriptRunConfig, Experiment
+from azureml.core.runconfig import MpiConfiguration
+
+nnodes = 2
+args = ['--max_epochs', 50, '--gpus', 2, '--accelerator', 'ddp_spawn', '--num_nodes', nnodes]
+distr_config = MpiConfiguration(node_count=nnodes)
+
+run_config = ScriptRunConfig(
+    source_directory='./src',
+    script='train.py',
+    arguments=args,
+    compute_target=compute_target,
+    environment=pytorch_env,
+    distributed_job_config=distr_config,
+)
+
+run = Experiment(ws, 'experiment_name').submit(run_config)
+```
 
 ### Hugging Face Transformers
 
````
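To show how the two added snippets connect: the environment-variable helper runs inside the training script itself, before Lightning initializes distributed training, while the `ScriptRunConfig` snippet submits that script. Below is a minimal sketch of a matching `train.py` — not part of the commit. `LitModel` and the `nccl_env` module are hypothetical stand-ins (the helper is the one shown in the diff above), and the flag names mirror the `args` list passed through `ScriptRunConfig`, using Lightning 1.x-era Trainer flags.

```python
import argparse

import pytorch_lightning as pl

# Hypothetical imports: LitModel is your LightningModule; nccl_env holds the
# set_environment_variables_for_nccl_backend helper from the diff above.
from model import LitModel
from nccl_env import set_environment_variables_for_nccl_backend


def main():
    # Flag names mirror the `args` list submitted via ScriptRunConfig.
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_epochs", type=int, default=50)
    parser.add_argument("--gpus", type=int, default=2)
    parser.add_argument("--accelerator", type=str, default="ddp_spawn")
    parser.add_argument("--num_nodes", type=int, default=2)
    args = parser.parse_args()

    # Populate MASTER_ADDR, MASTER_PORT, and NODE_RANK from the MPI
    # environment before the Trainer starts distributed training.
    set_environment_variables_for_nccl_backend(single_node=(args.num_nodes == 1))

    trainer = pl.Trainer(
        max_epochs=args.max_epochs,
        gpus=args.gpus,
        accelerator=args.accelerator,
        num_nodes=args.num_nodes,
    )
    trainer.fit(LitModel())


if __name__ == "__main__":
    main()
```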

articles/storage/blobs/storage-blobs-static-site-github-actions.md

Lines changed: 3 additions & 3 deletions
````diff
@@ -87,7 +87,7 @@ In the example above, replace the placeholders with your subscription ID and resource group name.
 
 on:
   push:
-    branches: [ master ]
+    branches: [ main ]
 ```
 
 1. Rename your workflow `Blob storage website CI` and add the checkout and login actions. These actions will check out your site code and authenticate with Azure using the `AZURE_CREDENTIALS` GitHub secret you created earlier.
@@ -97,7 +97,7 @@ In the example above, replace the placeholders with your subscription ID and resource group name.
 
 on:
   push:
-    branches: [ master ]
+    branches: [ main ]
 
 jobs:
   build:
@@ -131,7 +131,7 @@ In the example above, replace the placeholders with your subscription ID and resource group name.
 
 on:
   push:
-    branches: [ master ]
+    branches: [ main ]
 
 jobs:
   build:
````
