Commit 8a8a18b

Fix DDP documentation and script bugs from conda-to-venv migration (#955)
- Remove `--use-mlflow` from TORCHRUN_ARGS in container sbatch (crashes torchrun)
- Fix undefined `${ENROOT_IMAGE}` variable in enroot image script
- Fix Kubernetes template: rename fsdp→ddp, fix torchrun path, fix positional args
- Update READMEs: replace stale conda/fsdp references with venv/ddp
- Fix MLflow default URI documentation to match actual code default
- Fix script filenames in READMEs to match actual files on disk
1 parent e12905d commit 8a8a18b

File tree

6 files changed: +30 −31 lines

3.test_cases/pytorch/ddp/README.md
Lines changed: 3 additions & 3 deletions

@@ -2,7 +2,7 @@
 Isolated environments are crucial for reproducible machine learning because they encapsulate specific software versions and dependencies, ensuring models are consistently retrainable, shareable, and deployable without compatibility issues.

-[Anaconda](https://www.anaconda.com/) leverages conda environments to create distinct spaces for projects, allowing different Python versions and libraries to coexist without conflicts by isolating updates to their respective environments. [Docker](https://www.docker.com/), a containerization platform, packages applications and their dependencies into containers, ensuring they run seamlessly across any Linux server by providing OS-level virtualization and encapsulating the entire runtime environment.
+Python [venv](https://docs.python.org/3/library/venv.html) creates lightweight virtual environments to isolate project dependencies, ensuring reproducibility without conflicts between different projects. [Docker](https://www.docker.com/), a containerization platform, packages applications and their dependencies into containers, ensuring they run seamlessly across any Linux server by providing OS-level virtualization and encapsulating the entire runtime environment.

 This example showcases [PyTorch DDP](https://pytorch.org/tutorials/beginner/ddp_series_theory.html) environment setup utilizing these approaches for efficient environment management. The implementation supports both CPU and GPU computation:

@@ -42,7 +42,7 @@ To enable MLFlow logging, add the `--use_mlflow` flag when running the training
 torchrun --nproc_per_node=N ddp.py --total_epochs=10 --save_every=1 --batch_size=32 --use_mlflow
 ```

-By default, MLFlow will connect to `http://localhost:5000`. To use a different tracking server, specify the `--tracking_uri`:
+By default, MLFlow will log to `file://$HOME/mlruns`. To use a different tracking server, specify the `--tracking_uri`:
 ```bash
 torchrun --nproc_per_node=N ddp.py --total_epochs=10 --save_every=1 --batch_size=32 --use_mlflow --tracking_uri=http://localhost:5000
 ```
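As a quick illustration of the corrected default: when no `--tracking_uri` is supplied, runs land in a local file store under the home directory. The `TRACKING_URI` variable and fallback expression below are hypothetical, for demonstration only; `ddp.py` implements its default internally.

```shell
# Hypothetical sketch: fall back to file://$HOME/mlruns when no URI is given
TRACKING_URI="${TRACKING_URI:-file://$HOME/mlruns}"
echo "$TRACKING_URI"
```

When an explicit `--tracking_uri` is passed, it takes precedence over the default.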
@@ -68,4 +68,4 @@ The MLFlow UI provides:
 ## Deployment

-We provide guides for both Slurm and Kubernetes. However, please note that the Conda example is only compatible with Slurm. For detailed instructions, proceed to the [slurm](slurm) or [kubernetes](kubernetes) subdirectory.
+We provide guides for both Slurm and Kubernetes. However, please note that the venv example is only compatible with Slurm. For detailed instructions, proceed to the [slurm](slurm) or [kubernetes](kubernetes) subdirectory.

3.test_cases/pytorch/ddp/kubernetes/README.md
Lines changed: 13 additions & 13 deletions

@@ -24,34 +24,34 @@ Build the container image:
 export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
 export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
 export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
-docker build -t ${REGISTRY}fsdp:pytorch2.2-cpu ..
+docker build -t ${REGISTRY}ddp:latest ..
 ```

 Push the container image to the Elastic Container Registry in your account:
 ```bash
 # Create registry if needed
-REGISTRY_COUNT=$(aws ecr describe-repositories | grep \"fsdp\" | wc -l)
+REGISTRY_COUNT=$(aws ecr describe-repositories | grep \"ddp\" | wc -l)
 if [ "$REGISTRY_COUNT" == "0" ]; then
-    aws ecr create-repository --repository-name fsdp
+    aws ecr create-repository --repository-name ddp
 fi

 # Login to registry
 echo "Logging in to $REGISTRY ..."
 aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY

 # Push image to registry
-docker image push ${REGISTRY}fsdp:pytorch2.2-cpu
+docker image push ${REGISTRY}ddp:latest
 ```
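To sanity-check how `REGISTRY` composes into the image URI, the variables can be expanded by hand. The account ID and region below are placeholder values, not real ones:

```shell
ACCOUNT=123456789012   # placeholder; the real value comes from `aws sts get-caller-identity`
AWS_REGION=us-east-1   # placeholder; the real value comes from `aws ec2 describe-availability-zones`
REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
# REGISTRY ends with a slash, so the repository:tag concatenates cleanly
echo "${REGISTRY}ddp:latest"
# prints 123456789012.dkr.ecr.us-east-1.amazonaws.com/ddp:latest
```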
 Create manifest and launch PyTorchJob:
 ```bash
-export IMAGE_URI=${REGISTRY}fsdp:pytorch2.2-cpu
+export IMAGE_URI=${REGISTRY}ddp:latest
 export INSTANCE_TYPE=
 export NUM_NODES=2
 export CPU_PER_NODE=4
-cat fsdp.yaml-template | envsubst > fsdp.yaml
+cat ddp-custom-container.yaml-template | envsubst > ddp.yaml

-kubectl apply -f ./fsdp.yaml
+kubectl apply -f ./ddp.yaml
 ```
 Check the status of your training job:
@@ -62,17 +62,17 @@ kubectl get pods
 ```text
 NAME   STATE     AGE
-fsdp   Running   16s
+ddp    Running   16s

 NAME                    READY   STATUS    RESTARTS   AGE
 etcd-7787559c74-w9gwx   1/1     Running   0          18s
-fsdp-worker-0           1/1     Running   0          18s
-fsdp-worker-1           1/1     Running   0          18s
+ddp-worker-0            1/1     Running   0          18s
+ddp-worker-1            1/1     Running   0          18s
 ```

 Each of the pods produces job logs.
 ```bash
-kubectl logs fsdp-worker-0
+kubectl logs ddp-worker-0
 ```

 ```text
@@ -102,7 +102,7 @@ Epoch 4990 | Training snapshot saved at /fsx/snapshot.pt
 Stop the training job:
 ```bash
-kubectl delete -f ./fsdp.yaml
+kubectl delete -f ./ddp.yaml
 ```

-Note: Prior to running a new job, please stop any currently running or completed fsdp job.
+Note: Prior to running a new job, please stop any currently running or completed ddp job.

3.test_cases/pytorch/ddp/kubernetes/ddp-custom-container.yaml-template
Lines changed: 5 additions & 5 deletions

@@ -54,7 +54,7 @@ spec:
 apiVersion: "kubeflow.org/v1"
 kind: PyTorchJob
 metadata:
-  name: fsdp
+  name: ddp
 spec:
   elasticPolicy:
     rdzvBackend: etcd

@@ -77,7 +77,7 @@ spec:
       template:
         metadata:
           labels:
-            app: fsdp
+            app: ddp
         spec:
           volumes:
             - name: shmem

@@ -96,12 +96,12 @@ spec:
           image: ${IMAGE_URI}
           imagePullPolicy: Always
           command:
-            - /opt/conda/bin/torchrun
+            - torchrun
             - --nproc_per_node=$CPU_PER_NODE
             - --nnodes=$NUM_NODES
             - /workspace/ddp.py
-            - "5000"
-            - "10"
+            - --total_epochs=5000
+            - --save_every=10
             - --batch_size=32
             - --checkpoint_path=/fsx/snapshot.pt
           env:
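The command-list fix matters because `torchrun` forwards everything after the script path to the script itself; the old bare positionals (`"5000"`, `"10"`) assumed `ddp.py` declares positional parameters, which it no longer does. Echoing the assembled command with placeholder values makes the ordering visible:

```shell
CPU_PER_NODE=4 NUM_NODES=2   # placeholder values for the template variables
# launcher flags first, then the script path, then flags parsed by ddp.py itself
echo torchrun --nproc_per_node=$CPU_PER_NODE --nnodes=$NUM_NODES \
  /workspace/ddp.py --total_epochs=5000 --save_every=10 --batch_size=32
```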

3.test_cases/pytorch/ddp/slurm/2.create-enroot-image.sh
Lines changed: 1 addition & 1 deletion

@@ -5,7 +5,7 @@ set -ex
 # SPDX-License-Identifier: MIT-0

 # Remove old sqsh file if exists
-if [ -f ${ENROOT_IMAGE}.sqsh ] ; then
+if [ -f pytorch.sqsh ] ; then
     rm pytorch.sqsh
 fi
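The original guard was broken because `ENROOT_IMAGE` was never defined anywhere in the script: `${ENROOT_IMAGE}.sqsh` expanded to just `.sqsh`, so a stale `pytorch.sqsh` was never detected or removed. The expansion can be demonstrated directly:

```shell
unset ENROOT_IMAGE                          # the script never set this variable
echo "guard tested: '${ENROOT_IMAGE}.sqsh'"
# prints: guard tested: '.sqsh'
```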

3.test_cases/pytorch/ddp/slurm/3.container-train.sbatch
Lines changed: 0 additions & 1 deletion

@@ -23,7 +23,6 @@ declare -a TORCHRUN_ARGS=(
     --rdzv_id=$SLURM_JOB_ID
     --rdzv_backend=c10d
     --rdzv_endpoint=$(hostname)
-    --use-mlflow
 )
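The removed flag crashed the launcher because every element of `TORCHRUN_ARGS` is expanded before the training script's path, so `--use-mlflow` was parsed by `torchrun` itself (which has no such option) rather than by `ddp.py` (whose flag is spelled `--use_mlflow` anyway). Keeping launcher flags and script flags in separate arrays makes the split visible; the values below are placeholders:

```shell
declare -a TORCHRUN_ARGS=(--rdzv_id=42 --rdzv_backend=c10d)  # consumed by torchrun
declare -a SCRIPT_ARGS=(--use_mlflow --batch_size=32)        # consumed by ddp.py
# arguments after the script path are forwarded to the script, not the launcher
echo torchrun "${TORCHRUN_ARGS[@]}" ddp.py "${SCRIPT_ARGS[@]}"
# prints: torchrun --rdzv_id=42 --rdzv_backend=c10d ddp.py --use_mlflow --batch_size=32
```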

3.test_cases/pytorch/ddp/slurm/README.md
Lines changed: 8 additions & 8 deletions

@@ -9,29 +9,29 @@ The guide assumes that you have the following:
 We recommend that you setup a Slurm cluster using the templates in the architectures [directory](../../1.architectures).

-## 2. Submit training job using conda environment on Slurm
+## 2. Submit training job using virtual environment on Slurm

-In this step, you will create PyTorch virtual environment using conda.
+In this step, you will create PyTorch virtual environment using Python venv.
 This method is only available on Slurm because it runs the training job without
 using a container.

 ```bash
-bash 0.create-conda-env.sh
+bash 0.create-venv.sh
 ```

-It will prepare `miniconda3` and `pt` `pt` includes `torchrun`
+It will create a Python virtual environment named `pt` that includes `torchrun`

 Submit DDP training job with:

 ```bash
-sbatch 1.conda-train.sbatch
+sbatch 1.venv-train.sbatch
 ```

 Output of the training job can be found in `logs` directory:

 ```bash
-# cat logs/cpu-ddp-conda_xxx.out
+# cat logs/ddp-venv_xxx.out
 Node IP: 10.1.96.108
 [2024-03-12 08:22:45,549] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
 [2024-03-12 08:22:45,549] torch.distributed.run: [WARNING]
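A hedged sketch of what a venv-creation step like `0.create-venv.sh` amounts to (the environment name `pt` comes from the README above; the exact script contents are assumed, not copied):

```shell
python3 -m venv pt          # create a virtual environment named "pt"
. pt/bin/activate           # activate it in the current shell
# inside the venv, `pip install torch` would provide the torchrun launcher
python -c 'import sys; print(sys.prefix != sys.base_prefix)'
# prints: True  (confirms the interpreter is running inside the venv)
```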
@@ -90,13 +90,13 @@ It will pull `pytorch/pytorch` container, then create [squashfs](https://www.ker
 Submit DDP training job using the image with:

 ```bash
-sbatch 4.container-train.sbatch
+sbatch 3.container-train.sbatch
 ```

 Output of the training job can be found in `logs` directory:

 ```bash
-# cat logs/cpu-ddp-container.out
+# cat logs/ddp-container_xxx.out
 Node IP: 10.1.96.108
 [2024-03-12 08:22:45,549] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
 [2024-03-12 08:22:45,549] torch.distributed.run: [WARNING]
