The torchrun-hpc command is a wrapper script that launches and runs distributed PyTorch on HPC systems. It provides HPC-optimized functionality for PyTorch distributed training across various schedulers (SLURM, LSF, Flux) and handles the complexities of multi-node, multi-GPU training setups.
```
torchrun-hpc [options] command [args...]

torchrun-hpc [-h] [--verbose] [-N NODES] [-n PROCS_PER_NODE] [--gpus-per-proc GPUS_PER_PROC]
             [-q QUEUE] [-t TIME_LIMIT] [-g GPUS_AT_LEAST] [--gpumem-at-least GPUMEM_AT_LEAST]
             [--exclusive] [--local] [--comm-backend JOB_COMM_PROTOCOL]
             [-x KEY=VALUE [KEY=VALUE ...]] [--bg] [--batch-script BATCH_SCRIPT]
             [--scheduler {local,flux,slurm,lsf}]
             [-l [LAUNCH_DIR]] [-o OUTPUT_SCRIPT] [--setup-only] [--dry-run]
             [--account ACCOUNT] [--dependency DEPENDENCY] [-J JOB_NAME]
             [--reservation RESERVATION] [--save-hostlist]
             [-p KEY=VALUE [KEY=VALUE ...]] [--out OUT_LOG_FILE] [--err ERR_LOG_FILE]
             [--color-stderr] [-r RDV] [--fraction-max-gpu-mem FRACTION_MAX_GPU_MEM]
             [-u] command [args...]
```

| Argument | Description |
|---|---|
| command | Command to be executed (typically a Python script) |
| args | Arguments to pass to the command |
| Option | Short Form | Description |
|---|---|---|
| --help | -h | Show help message and exit |
| --verbose | -v | Run in verbose mode; also saves the hostlist as if --save-hostlist were set |
| Option | Short Form | Description | Values |
|---|---|---|---|
| --rdv | -r | Specifies the rendezvous protocol to use | mpi or tcp |
| --fraction-max-gpu-mem | | Use torch.cuda.set_per_process_memory_fraction to limit GPU memory allocation | Float (0.0-1.0) |
| --unswap-rocr-hip-vis-dev | -u | Undo moving ROCR_VISIBLE_DEVICES into the HIP_VISIBLE_DEVICES env variable | Flag |
- Rendezvous (--rdv): Controls how distributed processes discover and connect to each other
  - mpi: Use MPI for rendezvous (good for HPC environments)
  - tcp: Use TCP/IP for rendezvous (standard PyTorch default)
- GPU Memory Fraction: Useful for preventing OOM errors or sharing GPUs (see the sketch after this list)
- AMD GPU Support: The -u flag improves behavior with HuggingFace Accelerate and TorchTitan on AMD GPUs
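As noted in the table above, --fraction-max-gpu-mem relies on torch.cuda.set_per_process_memory_fraction. The following is a minimal sketch of roughly what that amounts to per rank; the fraction value and the use of LOCAL_RANK as the device index are illustrative assumptions, not the wrapper's exact implementation.

```python
import os

import torch

# Illustrative value; torchrun-hpc takes this from --fraction-max-gpu-mem
FRACTION = 0.8

if torch.cuda.is_available():
    # Assumption: each rank is pinned to the device matching its LOCAL_RANK
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    # Cap this process's CUDA allocator at FRACTION of total device memory
    torch.cuda.set_per_process_memory_fraction(FRACTION, device=local_rank)
```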
These options determine the number of nodes, accelerators, and ranks for the job.
| Option | Short Form | Description | Notes |
|---|---|---|---|
| --nodes | -N | Specifies the number of requested nodes | |
| --procs-per-node | -n | Specifies the number of requested processes per node | Mutually exclusive with -g |
| --gpus-per-proc | | Specifies the number of requested GPUs per process | Default: 1 |
| --queue | -q | Specifies the queue to use | |
| --time-limit | -t | Set a time limit for the job in minutes | |
| --gpus-at-least | -g | Specifies the total number of accelerators requested | Mutually exclusive with -n and -N |
| --gpumem-at-least | | Constraint for accelerator memory needed (in GB) | System must be registered with launcher |
| --exclusive | | Request exclusive access from the scheduler | |
| --local | | Run locally (one process without batch scheduler) | |
| --comm-backend | | Indicate primary communication protocol | Options: MPI, *CCL (NCCL, RCCL) |
| --xargs | -x | Specify scheduler and launch arguments | Format: KEY=VALUE |
- Will override any known key
- Use format: --xargs k1=v1 k2=v2 or --xargs k1=v1 --xargs k2=v2
- Double dash -- needed if this is the last argument
- Arguments with a leading tilde ~ will be removed if found
Arguments that determine when a job will run.
| Option | Description | Notes |
|---|---|---|
| --bg | Run job in background | Launcher won't wait for job start; uses timestamped directory by default |
| --batch-script | Launch a user-provided batch script | |
| --scheduler | Override default batch scheduler | Options: None, local, LocalScheduler, flux, FluxScheduler, slurm, SlurmScheduler, lsf, LSFScheduler |
Batch scheduler script parameters.
| Option | Short Form | Description | Notes |
|---|---|---|---|
| --launch-dir | -l | Control launch directory creation | See detailed behavior below |
| --output-script | -o | Output job setup script file | Uses a temporary file if not specified |
| --setup-only | | Only write the job setup script without scheduling | |
| --dry-run | | Output results without side effects | |
| --account | | Specify account/bank for the job | |
| --dependency | | Specify scheduler dependency | |
| --job-name | -J | Specify job name | |
| --reservation | | Add reservation argument | Typically for DAT runs |
| --save-hostlist | | Write hostlist to hpc_launcher_hostlist.txt | |
- No argument: Creates a timestamped launch directory
- With argument: Creates a directory named [LAUNCH_DIR]
- Argument = ".": Creates the launch script in the current directory
- Not set + blocking job: Runs without creating files
- Not set + non-blocking job: Creates the launch file and logs in the current directory
- Note: Double dash -- needed if this is the last argument
Provide system parameters from the CLI; these override built-in system descriptions and autodetection.
| Option | Short Form | Description | Format |
|---|---|---|---|
| --system-params | -p | Specify system parameters | KEY=VALUE pairs |

```
-p cores_per_node=128 gpus_per_node=8 gpu_arch=ampere mem_per_gpu=80 numa_domains=4 scheduler=slurm
```

Available parameters:

- cores_per_node: Integer value for CPU cores per node
- gpus_per_node: Integer value for GPUs per node
- gpu_arch: String value for GPU architecture
- mem_per_gpu: Float value for memory per GPU
- numa_domains: Integer value for NUMA domains
- scheduler: String value for scheduler type
Note: Double dash -- needed if this is the last argument
Control output and error logging.
| Option | Description |
|---|---|
| --out | Capture standard output to a log file (console only if not specified) |
| --err | Capture standard error to a log file (console only if not specified) |
| --color-stderr | Use terminal colors to color stderr in red (doesn't affect output files) |
```bash
# Single node, 4 GPUs
torchrun-hpc -N 1 -n 4 train.py --epochs 100

# Multi-node training (2 nodes, 4 GPUs each)
torchrun-hpc -N 2 -n 4 train.py --batch-size 256

# Local testing without scheduler
torchrun-hpc --local -N 2 -n 2 test_script.py
```

```bash
# MPI rendezvous (recommended for HPC)
torchrun-hpc -r mpi -N 4 -n 8 train.py

# TCP rendezvous (standard PyTorch)
torchrun-hpc -r tcp -N 2 -n 4 train.py

# TCP is useful for cloud environments or mixed networks
torchrun-hpc --rdv tcp -N 2 -n 4 cloud_train.py
```

```bash
# Limit each process to 80% of GPU memory
torchrun-hpc --fraction-max-gpu-mem 0.8 -N 2 -n 4 train.py

# Useful for avoiding OOM errors
torchrun-hpc --fraction-max-gpu-mem 0.75 -N 1 -n 4 large_model.py

# Share GPUs between multiple jobs
torchrun-hpc --fraction-max-gpu-mem 0.5 -N 1 -n 2 shared_gpu_train.py
```

```bash
# For AMD GPUs, favor using ROCR_VISIBLE_DEVICES instead of HIP_VISIBLE_DEVICES
torchrun-hpc -u -N 2 -n 4 accelerate_train.py
```

```bash
# Request specific total GPU count
torchrun-hpc -g 16 distributed_train.py

# Request GPUs with minimum memory
torchrun-hpc --gpumem-at-least 80 large_model_train.py

# Exclusive node access for performance
torchrun-hpc --exclusive -N 4 -n 4 performance_critical.py
```

```bash
# Submit to specific queue with time limit
torchrun-hpc -q gpu_queue -t 480 -N 4 -n 4 long_train.py

# Background job with custom name
torchrun-hpc --bg -J "BERT_finetune" -N 2 -n 4 bert_train.py

# Job with dependencies
torchrun-hpc --dependency afterok:12345 -N 2 -n 4 continue_train.py

# Use specific account
torchrun-hpc --account ml_research -N 8 -n 2 research_train.py

# DAT reservation
torchrun-hpc --reservation dat_2024 -N 16 -n 8 dat_experiment.py
```

```bash
# NCCL for NVIDIA GPUs
torchrun-hpc --comm-backend NCCL -N 4 -n 4 nvidia_train.py

# RCCL for AMD GPUs
torchrun-hpc --comm-backend RCCL -N 4 -n 4 amd_train.py

# MPI backend
torchrun-hpc --comm-backend MPI -N 2 -n 4 mpi_train.py
```

```bash
# Generate script without running
torchrun-hpc -l --setup-only -o torch_job.sh -N 2 -n 4 train.py

# Dry run to preview
torchrun-hpc --dry-run -N 8 -n 4 train.py --lr 0.001

# Custom launch directory
torchrun-hpc -l experiment_001 -N 2 -n 4 experiment.py

# Save hostlist for debugging
torchrun-hpc -l --save-hostlist -N 4 -n 4 debug_train.py
```

```bash
# Override GPU detection
torchrun-hpc -p gpus_per_node=4 gpu_arch=a100 -N 2 train.py

# Custom system configuration
torchrun-hpc -p cores_per_node=128 mem_per_gpu=80 -N 2 -n 2 custom_train.py

# Force specific scheduler
torchrun-hpc --scheduler slurm -N 2 -n 4 train.py
```

```bash
# Separate output and error logs
torchrun-hpc -l --out output.log --err error.log -N 2 -n 4 train.py

# Colored error output for debugging
torchrun-hpc --color-stderr --verbose -N 1 -n 4 debug_train.py

# Full logging setup
torchrun-hpc \
  -l \
  --verbose \
  --save-hostlist \
  --out train_out.log \
  --err train_err.log \
  -N 2 -n 4 train.py
```

```bash
# Full production training setup
torchrun-hpc \
  --verbose \
  -N 16 \
  -n 4 \
  --gpus-per-proc 1 \
  -r mpi \
  --fraction-max-gpu-mem 0.9 \
  --comm-backend NCCL \
  -q production \
  -t 1440 \
  --exclusive \
  --bg \
  -l production_run_$(date +%Y%m%d_%H%M%S) \
  --account ai_training \
  -J "GPT_Training" \
  --save-hostlist \
  -p gpu_arch=a100 mem_per_gpu=80 \
  --out gpt_out.log \
  --err gpt_err.log \
  train_gpt.py \
  --model-size 13B \
  --batch-size 2048 \
  --learning-rate 1e-4 \
  --warmup-steps 1000 \
  --max-steps 100000
```

The command sets standard PyTorch distributed environment variables:
| Variable | Description |
|---|---|
| WORLD_SIZE | Total number of processes |
| RANK | Global rank of the process |
| LOCAL_RANK | Local rank on the node |
| MASTER_ADDR | Address of the rank 0 node |
| MASTER_PORT | Port for communication |
| NODE_RANK | Rank of the current node |
Your PyTorch script should handle distributed initialization:
```python
import os
import socket
import sys

import torch
import torch.distributed as dist


def main():
    args = sys.argv[1:]
    torch_dist_initialized = dist.is_initialized()

    # Determine which accelerators are visible to this process
    avail_gpus = []
    for e in ["CUDA_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES", "HIP_VISIBLE_DEVICES"]:
        gpus = os.getenv(e)
        if gpus:
            avail_gpus = gpus.split(",")

    local_rank = int(os.environ["LOCAL_RANK"])
    print(f"Local Rank: {local_rank}")

    if torch_dist_initialized:
        print(
            f"Device mesh: rank={dist.get_rank()} and local rank is {local_rank} and avail_gpus = {avail_gpus},",
        )
        print(f"{socket.gethostname()} reporting it is rank {dist.get_rank()} of {dist.get_world_size()}")
    else:
        print(f"{socket.gethostname()} reporting it is rank 0 of 1")

    # Set the device
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)

    # Create model and move to device
    model = YourModel().cuda(local_rank)

    # Wrap with DDP
    model = torch.nn.parallel.DistributedDataParallel(
        model,
        device_ids=[local_rank],
        output_device=local_rank,
    )

    # Your training code here
    train(model)


if __name__ == "__main__":
    main()
```
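If the process group is not already initialized when your script starts, a minimal sketch of explicit initialization using the environment variables listed above follows. The choice of the nccl backend is an assumption (appropriate for NVIDIA GPUs; ROCm builds also use the "nccl" backend string and route it to RCCL), and whether torchrun-hpc initializes the group for you may depend on your configuration.

```python
import os

import torch.distributed as dist

if not dist.is_initialized():
    dist.init_process_group(
        backend="nccl",        # assumption: GPU training; use "gloo" or "mpi" otherwise
        init_method="env://",  # reads MASTER_ADDR and MASTER_PORT set by the launcher
        world_size=int(os.environ["WORLD_SIZE"]),
        rank=int(os.environ["RANK"]),
    )
```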
- NCCL Errors: Try setting NCCL debug environment variables

  ```bash
  torchrun-hpc -x NCCL_DEBUG=INFO -N 2 -n 4 train.py
  ```

- OOM Errors: Use memory fraction limiting

  ```bash
  torchrun-hpc --fraction-max-gpu-mem 0.8 -N 2 -n 4 train.py
  ```

- Rendezvous Failures: Switch between MPI and TCP

  ```bash
  # Try TCP if MPI fails
  torchrun-hpc -r tcp -N 2 -n 4 train.py
  ```

- AMD GPU Issues: Use the unswap flag to set ROCR_VISIBLE_DEVICES vs HIP_VISIBLE_DEVICES

  ```bash
  torchrun-hpc -u -N 2 -n 4 train.py
  ```
- Use MPI rendezvous (-r mpi) for stable HPC environments
- Match processes to GPUs: set -n equal to GPUs per node
- Test locally first: use the --local flag for debugging
- Save setup scripts: use --setup-only to review job configuration
- Monitor GPU memory: use --fraction-max-gpu-mem to prevent OOM
- Use exclusive nodes for performance-critical training
- Enable verbose mode (-v) for debugging distributed issues
- Save hostlists for multi-node debugging
- Set appropriate time limits to avoid job termination
- Use dry-run (--dry-run) to verify complex commands
- HPC Scheduler Integration: Native support for SLURM, LSF, Flux
- Rendezvous Options: Choice between MPI and TCP
- Resource Management: HPC-specific resource allocation
- GPU Memory Control: Built-in memory fraction limiting
- AMD GPU Support: Special handling for ROCm environments
- launch - General purpose HPC job launcher
- PyTorch Distributed Documentation: https://pytorch.org/docs/stable/distributed.html
- HPC-launcher Repository: https://github.com/LBANN/HPC-launcher
- LBANN Documentation: https://lbann.readthedocs.io
Generated from torchrun-hpc -h output