Commit 8a8a18b

Fix DDP documentation and script bugs from conda-to-venv migration (#955)
- Remove `--use-mlflow` from TORCHRUN_ARGS in container sbatch (crashes torchrun)
- Fix undefined `${ENROOT_IMAGE}` variable in enroot image script
- Fix Kubernetes template: rename fsdp→ddp, fix torchrun path, fix positional args
- Update READMEs: replace stale conda/fsdp references with venv/ddp
- Fix MLflow default URI documentation to match actual code default
- Fix script filenames in READMEs to match actual files on disk
1 parent e12905d commit 8a8a18b

File tree

6 files changed: +30 −31 lines

3.test_cases/pytorch/ddp/README.md
Lines changed: 3 additions & 3 deletions

@@ -2,7 +2,7 @@
 Isolated environments are crucial for reproducible machine learning because they encapsulate specific software versions and dependencies, ensuring models are consistently retrainable, shareable, and deployable without compatibility issues.

-[Anaconda](https://www.anaconda.com/) leverages conda environments to create distinct spaces for projects, allowing different Python versions and libraries to coexist without conflicts by isolating updates to their respective environments. [Docker](https://www.docker.com/), a containerization platform, packages applications and their dependencies into containers, ensuring they run seamlessly across any Linux server by providing OS-level virtualization and encapsulating the entire runtime environment.
+Python [venv](https://docs.python.org/3/library/venv.html) creates lightweight virtual environments to isolate project dependencies, ensuring reproducibility without conflicts between different projects. [Docker](https://www.docker.com/), a containerization platform, packages applications and their dependencies into containers, ensuring they run seamlessly across any Linux server by providing OS-level virtualization and encapsulating the entire runtime environment.

 This example showcases [PyTorch DDP](https://pytorch.org/tutorials/beginner/ddp_series_theory.html) environment setup utilizing these approaches for efficient environment management. The implementation supports both CPU and GPU computation:

@@ -42,7 +42,7 @@ To enable MLFlow logging, add the `--use_mlflow` flag when running the training
 torchrun --nproc_per_node=N ddp.py --total_epochs=10 --save_every=1 --batch_size=32 --use_mlflow
 ```

-By default, MLFlow will connect to `http://localhost:5000`. To use a different tracking server, specify the `--tracking_uri`:
+By default, MLFlow will log to `file://$HOME/mlruns`. To use a different tracking server, specify the `--tracking_uri`:
 ```bash
 torchrun --nproc_per_node=N ddp.py --total_epochs=10 --save_every=1 --batch_size=32 --use_mlflow --tracking_uri=http://localhost:5000
 ```
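As a quick illustration of the corrected default: when no `--tracking_uri` is supplied, runs land in a local file store under the home directory. The `TRACKING_URI` variable and fallback expression below are hypothetical, for demonstration only; `ddp.py` implements its default internally.

```shell
# Hypothetical sketch: fall back to file://$HOME/mlruns when no URI is given
TRACKING_URI="${TRACKING_URI:-file://$HOME/mlruns}"
echo "$TRACKING_URI"
```

When an explicit `--tracking_uri` is passed, it takes precedence over the default.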
@@ -68,4 +68,4 @@ The MLFlow UI provides:
 ## Deployment

-We provide guides for both Slurm and Kubernetes. However, please note that the Conda example is only compatible with Slurm. For detailed instructions, proceed to the [slurm](slurm) or [kubernetes](kubernetes) subdirectory.
+We provide guides for both Slurm and Kubernetes. However, please note that the venv example is only compatible with Slurm. For detailed instructions, proceed to the [slurm](slurm) or [kubernetes](kubernetes) subdirectory.

3.test_cases/pytorch/ddp/kubernetes/README.md
Lines changed: 13 additions & 13 deletions

@@ -24,34 +24,34 @@ Build the container image:
 export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
 export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
 export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
-docker build -t ${REGISTRY}fsdp:pytorch2.2-cpu ..
+docker build -t ${REGISTRY}ddp:latest ..
 ```

 Push the container image to the Elastic Container Registry in your account:
 ```bash
 # Create registry if needed
-REGISTRY_COUNT=$(aws ecr describe-repositories | grep \"fsdp\" | wc -l)
+REGISTRY_COUNT=$(aws ecr describe-repositories | grep \"ddp\" | wc -l)
 if [ "$REGISTRY_COUNT" == "0" ]; then
-    aws ecr create-repository --repository-name fsdp
+    aws ecr create-repository --repository-name ddp
 fi

 # Login to registry
 echo "Logging in to $REGISTRY ..."
 aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY

 # Push image to registry
-docker image push ${REGISTRY}fsdp:pytorch2.2-cpu
+docker image push ${REGISTRY}ddp:latest
 ```
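To sanity-check how `REGISTRY` composes into the image URI, the variables can be expanded by hand. The account ID and region below are placeholder values, not real ones:

```shell
ACCOUNT=123456789012   # placeholder; the real value comes from `aws sts get-caller-identity`
AWS_REGION=us-east-1   # placeholder; the real value comes from `aws ec2 describe-availability-zones`
REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
# REGISTRY ends with a slash, so the repository:tag concatenates cleanly
echo "${REGISTRY}ddp:latest"
# prints 123456789012.dkr.ecr.us-east-1.amazonaws.com/ddp:latest
```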
 Create manifest and launch PyTorchJob:
 ```bash
-export IMAGE_URI=${REGISTRY}fsdp:pytorch2.2-cpu
+export IMAGE_URI=${REGISTRY}ddp:latest
 export INSTANCE_TYPE=
 export NUM_NODES=2
 export CPU_PER_NODE=4
-cat fsdp.yaml-template | envsubst > fsdp.yaml
+cat ddp-custom-container.yaml-template | envsubst > ddp.yaml

-kubectl apply -f ./fsdp.yaml
+kubectl apply -f ./ddp.yaml
 ```
 Check the status of your training job:
@@ -62,17 +62,17 @@ kubectl get pods
 ```text
 NAME   STATE     AGE
-fsdp   Running   16s
+ddp    Running   16s

 NAME                    READY   STATUS    RESTARTS   AGE
 etcd-7787559c74-w9gwx   1/1     Running   0          18s
-fsdp-worker-0           1/1     Running   0          18s
-fsdp-worker-1           1/1     Running   0          18s
+ddp-worker-0            1/1     Running   0          18s
+ddp-worker-1            1/1     Running   0          18s
 ```

 Each of the pods produces job logs.
 ```bash
-kubectl logs fsdp-worker-0
+kubectl logs ddp-worker-0
 ```

 ```text
@@ -102,7 +102,7 @@ Epoch 4990 | Training snapshot saved at /fsx/snapshot.pt
 Stop the training job:
 ```bash
-kubectl delete -f ./fsdp.yaml
+kubectl delete -f ./ddp.yaml
 ```

-Note: Prior to running a new job, please stop any currently running or completed fsdp job.
+Note: Prior to running a new job, please stop any currently running or completed ddp job.

3.test_cases/pytorch/ddp/kubernetes/ddp-custom-container.yaml-template
Lines changed: 5 additions & 5 deletions

@@ -54,7 +54,7 @@ spec:
 apiVersion: "kubeflow.org/v1"
 kind: PyTorchJob
 metadata:
-  name: fsdp
+  name: ddp
 spec:
   elasticPolicy:
     rdzvBackend: etcd

@@ -77,7 +77,7 @@ spec:
       template:
         metadata:
           labels:
-            app: fsdp
+            app: ddp
         spec:
           volumes:
             - name: shmem

@@ -96,12 +96,12 @@ spec:
           image: ${IMAGE_URI}
           imagePullPolicy: Always
           command:
-            - /opt/conda/bin/torchrun
+            - torchrun
             - --nproc_per_node=$CPU_PER_NODE
             - --nnodes=$NUM_NODES
             - /workspace/ddp.py
-            - "5000"
-            - "10"
+            - --total_epochs=5000
+            - --save_every=10
             - --batch_size=32
             - --checkpoint_path=/fsx/snapshot.pt
           env:
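The command-list fix matters because `torchrun` forwards everything after the script path to the script itself; the old bare positionals (`"5000"`, `"10"`) assumed `ddp.py` declares positional parameters, which it no longer does. Echoing the assembled command with placeholder values makes the ordering visible:

```shell
CPU_PER_NODE=4 NUM_NODES=2   # placeholder values for the template variables
# launcher flags first, then the script path, then flags parsed by ddp.py itself
echo torchrun --nproc_per_node=$CPU_PER_NODE --nnodes=$NUM_NODES \
  /workspace/ddp.py --total_epochs=5000 --save_every=10 --batch_size=32
```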

3.test_cases/pytorch/ddp/slurm/2.create-enroot-image.sh
Lines changed: 1 addition & 1 deletion

@@ -5,7 +5,7 @@ set -ex
 # SPDX-License-Identifier: MIT-0

 # Remove old sqsh file if exists
-if [ -f ${ENROOT_IMAGE}.sqsh ] ; then
+if [ -f pytorch.sqsh ] ; then
     rm pytorch.sqsh
 fi
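The original guard was broken because `ENROOT_IMAGE` was never defined anywhere in the script: `${ENROOT_IMAGE}.sqsh` expanded to just `.sqsh`, so a stale `pytorch.sqsh` was never detected or removed. The expansion can be demonstrated directly:

```shell
unset ENROOT_IMAGE                          # the script never set this variable
echo "guard tested: '${ENROOT_IMAGE}.sqsh'"
# prints: guard tested: '.sqsh'
```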

3.test_cases/pytorch/ddp/slurm/3.container-train.sbatch
Lines changed: 0 additions & 1 deletion

@@ -23,7 +23,6 @@ declare -a TORCHRUN_ARGS=(
     --rdzv_id=$SLURM_JOB_ID
     --rdzv_backend=c10d
     --rdzv_endpoint=$(hostname)
-    --use-mlflow
 )
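The removed flag crashed the launcher because every element of `TORCHRUN_ARGS` is expanded before the training script's path, so `--use-mlflow` was parsed by `torchrun` itself (which has no such option) rather than by `ddp.py` (whose flag is spelled `--use_mlflow` anyway). Keeping launcher flags and script flags in separate arrays makes the split visible; the values below are placeholders:

```shell
declare -a TORCHRUN_ARGS=(--rdzv_id=42 --rdzv_backend=c10d)  # consumed by torchrun
declare -a SCRIPT_ARGS=(--use_mlflow --batch_size=32)        # consumed by ddp.py
# arguments after the script path are forwarded to the script, not the launcher
echo torchrun "${TORCHRUN_ARGS[@]}" ddp.py "${SCRIPT_ARGS[@]}"
# prints: torchrun --rdzv_id=42 --rdzv_backend=c10d ddp.py --use_mlflow --batch_size=32
```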

3.test_cases/pytorch/ddp/slurm/README.md
Lines changed: 8 additions & 8 deletions

@@ -9,29 +9,29 @@ The guide assumes that you have the following:
 We recommend that you setup a Slurm cluster using the templates in the architectures [directory](../../1.architectures).

-## 2. Submit training job using conda environment on Slurm
+## 2. Submit training job using virtual environment on Slurm

-In this step, you will create PyTorch virtual environment using conda.
+In this step, you will create PyTorch virtual environment using Python venv.
 This method is only available on Slurm because it runs the training job without
 using a container.

 ```bash
-bash 0.create-conda-env.sh
+bash 0.create-venv.sh
 ```

-It will prepare `miniconda3` and `pt` `pt` includes `torchrun`
+It will create a Python virtual environment named `pt` that includes `torchrun`

 Submit DDP training job with:

 ```bash
-sbatch 1.conda-train.sbatch
+sbatch 1.venv-train.sbatch
 ```

 Output of the training job can be found in `logs` directory:

 ```bash
-# cat logs/cpu-ddp-conda_xxx.out
+# cat logs/ddp-venv_xxx.out
 Node IP: 10.1.96.108
 [2024-03-12 08:22:45,549] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
 [2024-03-12 08:22:45,549] torch.distributed.run: [WARNING]
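A hedged sketch of what a venv-creation step like `0.create-venv.sh` amounts to (the environment name `pt` comes from the README above; the exact script contents are assumed, not copied):

```shell
python3 -m venv pt          # create a virtual environment named "pt"
. pt/bin/activate           # activate it in the current shell
# inside the venv, `pip install torch` would provide the torchrun launcher
python -c 'import sys; print(sys.prefix != sys.base_prefix)'
# prints: True  (confirms the interpreter is running inside the venv)
```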
@@ -90,13 +90,13 @@ It will pull `pytorch/pytorch` container, then create [squashfs](https://www.ker
 Submit DDP training job using the image with:

 ```bash
-sbatch 4.container-train.sbatch
+sbatch 3.container-train.sbatch
 ```

 Output of the training job can be found in `logs` directory:

 ```bash
-# cat logs/cpu-ddp-container.out
+# cat logs/ddp-container_xxx.out
 Node IP: 10.1.96.108
 [2024-03-12 08:22:45,549] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
 [2024-03-12 08:22:45,549] torch.distributed.run: [WARNING]
