forked from PufferAI/PufferLib
Guide to ensure sufficient GPU utilization over NYU Torch cluster
Aditya Gupta edited this page Mar 19, 2026
This script prevents cloud/cluster GPU instances from being reclaimed during low-utilization periods (e.g., between training steps or during data loading). Rather than generating constant dummy load, it monitors live GPU utilization via nvidia-smi and performs matrix multiplications only when utilization drops below a configurable threshold (default: 50%). This minimizes the impact on training throughput (SPS, steps per second).
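A minimal sketch of that watch-and-burn loop is shown below. The function names, the polling interval, the burst duration, and the 4096×4096 matrix size are illustrative assumptions, not the exact contents of the script in this repository; the real file may differ.

```python
import subprocess
import time


def parse_utilization(smi_output: str) -> list[int]:
    """Parse the output of
    `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`,
    which is one integer percentage per GPU, one per line."""
    return [int(line.strip()) for line in smi_output.strip().splitlines() if line.strip()]


def query_utilization() -> list[int]:
    """Ask nvidia-smi for the current utilization of every visible GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_utilization(out)


def keep_alive(threshold: int = 50, poll_interval: float = 5.0,
               burst_seconds: float = 2.0) -> None:
    """Poll utilization; when any GPU dips below `threshold` percent,
    run a short burst of matrix multiplications to keep it busy."""
    import torch  # deferred so the parsing helpers above import without a GPU

    x = torch.randn(4096, 4096, device="cuda")
    while True:
        if min(query_utilization()) < threshold:
            end = time.time() + burst_seconds
            while time.time() < end:
                x = x @ x
                x = x / x.norm()  # renormalize so values stay bounded
            torch.cuda.synchronize()
        time.sleep(poll_interval)
```

Because the load is generated only in short bursts after a below-threshold reading, the keep-alive traffic mostly fills gaps (data loading, checkpointing) rather than competing with the training kernels themselves.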
This Python script, together with the accompanying shell script, can be added to your project directory and used when launching sbatch commands for training runs.
These files are sample templates; modify them to match your own directory layout and cluster settings.
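One way to wire the watcher into an sbatch launch is to start it in the background alongside the training process, as in the sketch below. The file names (`gpu_keepalive.py`, `train.py`) and all Slurm resource values are hypothetical placeholders; substitute your own paths and your cluster's partition/account settings.

```shell
#!/bin/bash
#SBATCH --job-name=train-keepalive
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00

# Hypothetical paths -- adjust to your own project directory.
python gpu_keepalive.py &          # background utilization watcher
KEEPALIVE_PID=$!

python train.py "$@"               # your actual training entry point

kill "$KEEPALIVE_PID"              # stop the watcher once training exits
```

Running the watcher in the background of the same job (rather than as a separate job) ensures it sees exactly the GPUs allocated to the training run and dies with the allocation.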