Skip to content

Guide to ensure sufficient GPU utilization over NYU Torch cluster

Aditya Gupta edited this page Mar 19, 2026 · 1 revision

This script prevents cloud/cluster GPU instances from being reclaimed during low-utilization periods (e.g., between training steps or during data loading). Rather than generating constant dummy load, it monitors live GPU utilization via nvidia-smi and only performs matrix multiplications when utilization drops below a configurable threshold (default: 50%). This minimizes impact on training throughput (SPS).

This python file, along with the Script file can be added to your project directory and can be used while launching sbatch commands for training runs.

These files serve as sample examples which can be followed. Please modify the files to suit your own custom directory settings.

Clone this wiki locally