The torchrun-hpc command is a wrapper script that launches and runs distributed PyTorch on HPC systems. It provides HPC-optimized functionality for PyTorch distributed training across various schedulers (SLURM, LSF, Flux) and handles the complexities of multi-node, multi-GPU training setups.
```
torchrun-hpc [options] command [args...]

torchrun-hpc [-h] [--verbose] [-N NODES] [-n PROCS_PER_NODE] [--gpus-per-proc GPUS_PER_PROC]
             [-q QUEUE] [-t TIME_LIMIT] [-g GPUS_AT_LEAST] [--gpumem-at-least GPUMEM_AT_LEAST]
             [--exclusive] [--local] [--comm-backend JOB_COMM_PROTOCOL]
             [-x KEY=VALUE [KEY=VALUE ...]] [--bg] [--batch-script BATCH_SCRIPT]
             [--scheduler {local,flux,slurm,lsf}]
             [-l [LAUNCH_DIR]] [-o OUTPUT_SCRIPT] [--setup-only] [--dry-run]
             [--account ACCOUNT] [--dependency DEPENDENCY] [-J JOB_NAME]
             [--reservation RESERVATION] [--save-hostlist]
             [-p KEY=VALUE [KEY=VALUE ...]] [--out OUT_LOG_FILE] [--err ERR_LOG_FILE]
             [--color-stderr] [-r RDV] [--fraction-max-gpu-mem FRACTION_MAX_GPU_MEM]
             [-u] command [args...]
```

| Argument | Description |
|---|---|
| command | Command to be executed (typically a Python script) |
| args | Arguments to pass to the command |
| Option | Short Form | Description |
|---|---|---|
| --help | -h | Show help message and exit |
| --verbose | -v | Run in verbose mode; also saves the hostlist as if --save-hostlist were set |
| Option | Short Form | Description | Values |
|---|---|---|---|
| --rdv | -r | Specifies the rendezvous protocol to use | mpi or tcp |
| --fraction-max-gpu-mem | | Use torch.cuda.set_per_process_memory_fraction to limit GPU memory allocation | Float (0.0-1.0) |
| --unswap-rocr-hip-vis-dev | -u | Undo moving ROCR_VISIBLE_DEVICES into the HIP_VISIBLE_DEVICES env variable | Flag |
- Rendezvous (--rdv): Controls how distributed processes discover and connect to each other
  - mpi: Use MPI for rendezvous (good for HPC environments)
  - tcp: Use TCP/IP for rendezvous (standard PyTorch default)
- GPU Memory Fraction: Useful for preventing OOM errors or sharing GPUs (see the sketch after this list)
- AMD GPU Support: The -u flag improves behavior with HuggingFace Accelerate and TorchTitan on AMD GPUs
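As noted in the table above, --fraction-max-gpu-mem relies on torch.cuda.set_per_process_memory_fraction. The following is a minimal sketch of roughly what that amounts to per rank; the fraction value and the use of LOCAL_RANK as the device index are illustrative assumptions, not the wrapper's exact implementation.

```python
import os

import torch

# Illustrative value; torchrun-hpc takes this from --fraction-max-gpu-mem
FRACTION = 0.8

if torch.cuda.is_available():
    # Assumption: each rank is pinned to the device matching its LOCAL_RANK
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    # Cap this process's CUDA allocator at FRACTION of total device memory
    torch.cuda.set_per_process_memory_fraction(FRACTION, device=local_rank)
```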
These options determine the number of nodes, accelerators, and ranks for the job.
| Option | Short Form | Description | Notes |
|---|---|---|---|
| --nodes | -N | Specifies the number of requested nodes | |
| --procs-per-node | -n | Specifies the number of requested processes per node | Mutually exclusive with -g |
| --gpus-per-proc | | Specifies the number of requested GPUs per process | Default: 1 |
| --queue | -q | Specifies the queue to use | |
| --time-limit | -t | Set a time limit for the job in minutes | |
| --gpus-at-least | -g | Specifies the total number of accelerators requested | Mutually exclusive with -n and -N |
| --gpumem-at-least | | Constraint for accelerator memory needed (in GB) | System must be registered with launcher |
| --exclusive | | Request exclusive access from the scheduler | |
| --local | | Run locally (one process without batch scheduler) | |
| --comm-backend | | Indicate primary communication protocol | Options: MPI, *CCL (NCCL, RCCL) |
| --xargs | -x | Specify scheduler and launch arguments | Format: KEY=VALUE |
- Will override any known key
- Use format: --xargs k1=v1 k2=v2 or --xargs k1=v1 --xargs k2=v2
- Double dash -- needed if this is the last argument
- Arguments with a leading tilde ~ will be removed if found
Arguments that determine when a job will run.
| Option | Description | Notes |
|---|---|---|
| --bg | Run job in background | Launcher won't wait for job start; uses timestamped directory by default |
| --batch-script | Launch a user-provided batch script | |
| --scheduler | Override default batch scheduler | Options: None, local, LocalScheduler, flux, FluxScheduler, slurm, SlurmScheduler, lsf, LSFScheduler |
Batch scheduler script parameters.
| Option | Short Form | Description | Notes |
|---|---|---|---|
| --launch-dir | -l | Control launch directory creation | See detailed behavior below |
| --output-script | -o | Output job setup script file | Uses a temporary file if not specified |
| --setup-only | | Only write the job setup script without scheduling | |
| --dry-run | | Output results without side effects | |
| --account | | Specify account/bank for the job | |
| --dependency | | Specify scheduler dependency | |
| --job-name | -J | Specify job name | |
| --reservation | | Add reservation argument | Typically for DAT runs |
| --save-hostlist | | Write hostlist to hpc_launcher_hostlist.txt | |
- No argument: Creates a timestamped launch directory
- With argument: Creates a directory named [LAUNCH_DIR]
- Argument = ".": Creates the launch script in the current directory
- Not set + blocking job: Runs without creating files
- Not set + non-blocking job: Creates the launch file and logs in the current directory
- Note: Double dash -- needed if this is the last argument
Provide system parameters from the CLI; these override built-in system descriptions and autodetection.
| Option | Short Form | Description | Format |
|---|---|---|---|
| --system-params | -p | Specify system parameters | KEY=VALUE pairs |

```
-p cores_per_node=128 gpus_per_node=8 gpu_arch=ampere mem_per_gpu=80 numa_domains=4 scheduler=slurm
```

Available parameters:

- cores_per_node: Integer value for CPU cores per node
- gpus_per_node: Integer value for GPUs per node
- gpu_arch: String value for GPU architecture
- mem_per_gpu: Float value for memory per GPU
- numa_domains: Integer value for NUMA domains
- scheduler: String value for scheduler type
Note: Double dash -- needed if this is the last argument
Control output and error logging.
| Option | Description |
|---|---|
| --out | Capture standard output to a log file (console only if not specified) |
| --err | Capture standard error to a log file (console only if not specified) |
| --color-stderr | Use terminal colors to color stderr in red (doesn't affect output files) |
```bash
# Single node, 4 GPUs
torchrun-hpc -N 1 -n 4 train.py --epochs 100

# Multi-node training (2 nodes, 4 GPUs each)
torchrun-hpc -N 2 -n 4 train.py --batch-size 256

# Local testing without scheduler
torchrun-hpc --local -N 2 -n 2 test_script.py
```

```bash
# MPI rendezvous (recommended for HPC)
torchrun-hpc -r mpi -N 4 -n 8 train.py

# TCP rendezvous (standard PyTorch)
torchrun-hpc -r tcp -N 2 -n 4 train.py

# TCP is useful for cloud environments or mixed networks
torchrun-hpc --rdv tcp -N 2 -n 4 cloud_train.py
```

```bash
# Limit each process to 80% of GPU memory
torchrun-hpc --fraction-max-gpu-mem 0.8 -N 2 -n 4 train.py

# Useful for avoiding OOM errors
torchrun-hpc --fraction-max-gpu-mem 0.75 -N 1 -n 4 large_model.py

# Share GPUs between multiple jobs
torchrun-hpc --fraction-max-gpu-mem 0.5 -N 1 -n 2 shared_gpu_train.py
```

```bash
# For AMD GPUs, favor using ROCR_VISIBLE_DEVICES instead of HIP_VISIBLE_DEVICES
torchrun-hpc -u -N 2 -n 4 accelerate_train.py
```

```bash
# Request specific total GPU count
torchrun-hpc -g 16 distributed_train.py

# Request GPUs with minimum memory
torchrun-hpc --gpumem-at-least 80 large_model_train.py

# Exclusive node access for performance
torchrun-hpc --exclusive -N 4 -n 4 performance_critical.py
```

```bash
# Submit to specific queue with time limit
torchrun-hpc -q gpu_queue -t 480 -N 4 -n 4 long_train.py

# Background job with custom name
torchrun-hpc --bg -J "BERT_finetune" -N 2 -n 4 bert_train.py

# Job with dependencies
torchrun-hpc --dependency afterok:12345 -N 2 -n 4 continue_train.py

# Use specific account
torchrun-hpc --account ml_research -N 8 -n 2 research_train.py

# DAT reservation
torchrun-hpc --reservation dat_2024 -N 16 -n 8 dat_experiment.py
```

```bash
# NCCL for NVIDIA GPUs
torchrun-hpc --comm-backend NCCL -N 4 -n 4 nvidia_train.py

# RCCL for AMD GPUs
torchrun-hpc --comm-backend RCCL -N 4 -n 4 amd_train.py

# MPI backend
torchrun-hpc --comm-backend MPI -N 2 -n 4 mpi_train.py
```

```bash
# Generate script without running
torchrun-hpc -l --setup-only -o torch_job.sh -N 2 -n 4 train.py

# Dry run to preview
torchrun-hpc --dry-run -N 8 -n 4 train.py --lr 0.001

# Custom launch directory
torchrun-hpc -l experiment_001 -N 2 -n 4 experiment.py

# Save hostlist for debugging
torchrun-hpc -l --save-hostlist -N 4 -n 4 debug_train.py
```

```bash
# Override GPU detection
torchrun-hpc -p gpus_per_node=4 gpu_arch=a100 -N 2 train.py

# Custom system configuration
torchrun-hpc -p cores_per_node=128 mem_per_gpu=80 -N 2 -n 2 custom_train.py

# Force specific scheduler
torchrun-hpc --scheduler slurm -N 2 -n 4 train.py
```

```bash
# Separate output and error logs
torchrun-hpc -l --out output.log --err error.log -N 2 -n 4 train.py

# Colored error output for debugging
torchrun-hpc --color-stderr --verbose -N 1 -n 4 debug_train.py

# Full logging setup
torchrun-hpc \
  -l \
  --verbose \
  --save-hostlist \
  --out train_out.log \
  --err train_err.log \
  -N 2 -n 4 train.py
```

```bash
# Full production training setup
torchrun-hpc \
  --verbose \
  -N 16 \
  -n 4 \
  --gpus-per-proc 1 \
  -r mpi \
  --fraction-max-gpu-mem 0.9 \
  --comm-backend NCCL \
  -q production \
  -t 1440 \
  --exclusive \
  --bg \
  -l production_run_$(date +%Y%m%d_%H%M%S) \
  --account ai_training \
  -J "GPT_Training" \
  --save-hostlist \
  -p gpu_arch=a100 mem_per_gpu=80 \
  --out gpt_out.log \
  --err gpt_err.log \
  train_gpt.py \
  --model-size 13B \
  --batch-size 2048 \
  --learning-rate 1e-4 \
  --warmup-steps 1000 \
  --max-steps 100000
```

The command sets standard PyTorch distributed environment variables:
| Variable | Description |
|---|---|
| WORLD_SIZE | Total number of processes |
| RANK | Global rank of the process |
| LOCAL_RANK | Local rank on the node |
| MASTER_ADDR | Address of the rank 0 node |
| MASTER_PORT | Port for communication |
| NODE_RANK | Rank of the current node |
Your PyTorch script should handle distributed initialization:
```python
import os
import socket
import sys

import torch
import torch.distributed as dist


def main():
    args = sys.argv[1:]
    torch_dist_initialized = dist.is_initialized()

    # Determine which accelerators are visible to this process
    avail_gpus = []
    for e in ["CUDA_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES", "HIP_VISIBLE_DEVICES"]:
        gpus = os.getenv(e)
        if gpus:
            avail_gpus = gpus.split(",")

    local_rank = int(os.environ["LOCAL_RANK"])
    print(f"Local Rank: {local_rank}")

    if torch_dist_initialized:
        print(
            f"Device mesh: rank={dist.get_rank()} and local rank is {local_rank} and avail_gpus = {avail_gpus},",
        )
        print(f"{socket.gethostname()} reporting it is rank {dist.get_rank()} of {dist.get_world_size()}")
    else:
        print(f"{socket.gethostname()} reporting it is rank 0 of 1")

    # Set the device
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)

    # Create model and move to device
    model = YourModel().cuda(local_rank)

    # Wrap with DDP
    model = torch.nn.parallel.DistributedDataParallel(
        model,
        device_ids=[local_rank],
        output_device=local_rank,
    )

    # Your training code here
    train(model)


if __name__ == "__main__":
    main()
```
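If the process group is not already initialized when your script starts, a minimal sketch of explicit initialization using the environment variables listed above follows. The choice of the nccl backend is an assumption (appropriate for NVIDIA GPUs; ROCm builds also use the "nccl" backend string and route it to RCCL), and whether torchrun-hpc initializes the group for you may depend on your configuration.

```python
import os

import torch.distributed as dist

if not dist.is_initialized():
    dist.init_process_group(
        backend="nccl",        # assumption: GPU training; use "gloo" or "mpi" otherwise
        init_method="env://",  # reads MASTER_ADDR and MASTER_PORT set by the launcher
        world_size=int(os.environ["WORLD_SIZE"]),
        rank=int(os.environ["RANK"]),
    )
```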
- NCCL Errors: Try setting NCCL debug environment variables

  ```bash
  torchrun-hpc -x NCCL_DEBUG=INFO -N 2 -n 4 train.py
  ```

- OOM Errors: Use memory fraction limiting

  ```bash
  torchrun-hpc --fraction-max-gpu-mem 0.8 -N 2 -n 4 train.py
  ```

- Rendezvous Failures: Switch between MPI and TCP

  ```bash
  # Try TCP if MPI fails
  torchrun-hpc -r tcp -N 2 -n 4 train.py
  ```

- AMD GPU Issues: Use the unswap flag to set ROCR_VISIBLE_DEVICES vs HIP_VISIBLE_DEVICES

  ```bash
  torchrun-hpc -u -N 2 -n 4 train.py
  ```
- Use MPI rendezvous (-r mpi) for stable HPC environments
- Match processes to GPUs: set -n equal to GPUs per node
- Test locally first: use the --local flag for debugging
- Save setup scripts: use --setup-only to review job configuration
- Monitor GPU memory: use --fraction-max-gpu-mem to prevent OOM
- Use exclusive nodes for performance-critical training
- Enable verbose mode (-v) for debugging distributed issues
- Save hostlists for multi-node debugging
- Set appropriate time limits to avoid job termination
- Use dry-run (--dry-run) to verify complex commands
- HPC Scheduler Integration: Native support for SLURM, LSF, Flux
- Rendezvous Options: Choice between MPI and TCP
- Resource Management: HPC-specific resource allocation
- GPU Memory Control: Built-in memory fraction limiting
- AMD GPU Support: Special handling for ROCm environments
- launch - General purpose HPC job launcher
- PyTorch Distributed Documentation: https://pytorch.org/docs/stable/distributed.html
- HPC-launcher Repository: https://github.com/LBANN/HPC-launcher
- LBANN Documentation: https://lbann.readthedocs.io
Generated from torchrun-hpc -h output