Isolated environments are crucial for reproducible machine learning because they encapsulate specific software versions and dependencies, ensuring models are consistently retrainable, shareable, and deployable without compatibility issues.
Anaconda uses conda environments to create a distinct space for each project, so different Python versions and libraries can coexist without conflict: updates in one environment do not affect any other. Docker, a containerization platform, packages an application and its dependencies into a container that runs consistently on any Linux server, providing OS-level virtualization and encapsulating the entire runtime environment.
This example demonstrates a PyTorch DDP environment setup that uses these approaches for environment management. The implementation supports both CPU and GPU computation:
- CPU Training: Uses the GLOO backend for distributed training on CPU nodes
- GPU Training: Automatically switches to NCCL backend when GPUs are available, providing optimized multi-GPU training
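The backend switch described above can be sketched as a small helper. Note that `select_backend` is a hypothetical name for illustration; the actual script would pass in `torch.cuda.is_available()`:

```python
def select_backend(cuda_available: bool) -> str:
    """Pick the process-group backend for torch.distributed.init_process_group.

    NCCL is optimized for GPU-to-GPU communication, while GLOO
    works on CPU-only nodes.
    """
    return "nccl" if cuda_available else "gloo"
```

In the training script this would typically feed into something like `dist.init_process_group(backend=select_backend(torch.cuda.is_available()))`.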
To run the training with GPUs, use torchrun with the appropriate number of GPUs:
```
torchrun --nproc_per_node=N ddp.py --total_epochs=10 --save_every=1 --batch_size=32
```

where `N` is the number of GPUs you want to use.
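As a rough illustration of the command line above, the script's argument parser might look like the following sketch; the types, defaults, and required flags are assumptions, not taken from ddp.py itself:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Flags match the torchrun invocation above; defaults are illustrative.
    parser = argparse.ArgumentParser(description="DDP training entry point")
    parser.add_argument("--total_epochs", type=int, required=True,
                        help="Total number of training epochs")
    parser.add_argument("--save_every", type=int, required=True,
                        help="Checkpoint interval, in epochs")
    parser.add_argument("--batch_size", type=int, default=32,
                        help="Per-process batch size")
    return parser
```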
This implementation includes MLFlow integration for experiment tracking and model management. MLFlow helps you track metrics, parameters, and artifacts during training, making it easier to compare different runs and manage model versions.
- Install MLFlow:

```
pip install mlflow
```

- Start the MLFlow tracking server:

```
mlflow ui
```

To enable MLFlow logging, add the `--use_mlflow` flag when running the training script:
```
torchrun --nproc_per_node=N ddp.py --total_epochs=10 --save_every=1 --batch_size=32 --use_mlflow
```

By default, MLFlow will connect to http://localhost:5000. To use a different tracking server, specify the `--tracking_uri`:
```
torchrun --nproc_per_node=N ddp.py --total_epochs=10 --save_every=1 --batch_size=32 --use_mlflow --tracking_uri=http://localhost:5000
```

MLFlow will track:
- Training metrics (loss per epoch)
- Model hyperparameters
- Model checkpoints
- Training configuration
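A minimal sketch of what this logging could look like inside the training loop. Both `mlflow_params` and `log_epoch` are hypothetical helper names, and the tracked key names are assumptions based on the flags above; the lazy import keeps runs without `--use_mlflow` free of the MLFlow dependency:

```python
def mlflow_params(config: dict) -> dict:
    """Select the hyperparameters to log as MLFlow params (assumed key names)."""
    tracked = ("total_epochs", "save_every", "batch_size")
    return {key: config[key] for key in tracked if key in config}


def log_epoch(use_mlflow: bool, epoch: int, loss: float) -> None:
    """Log the per-epoch training loss when --use_mlflow is enabled."""
    if not use_mlflow:
        return
    import mlflow  # imported lazily so MLFlow remains an optional dependency
    mlflow.log_metric("loss", loss, step=epoch)
```

On the rank-0 process, `mlflow.log_params(mlflow_params(vars(args)))` would record the configuration once at startup, and `log_epoch` would be called at the end of each epoch.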
- Open your browser and navigate to http://localhost:5000 (or your specified tracking URI)
The MLFlow UI provides:
- Experiment comparison
- Metric visualization
- Parameter tracking
- Model artifact management
- Run history
We provide guides for both Slurm and Kubernetes; note, however, that the Conda example is only compatible with Slurm. For detailed instructions, proceed to the `slurm` or `kubernetes` subdirectory.