forked from PufferAI/PufferLib
Guide to ensure sufficient GPU utilization over NYU Torch cluster
Aditya Gupta edited this page Mar 19, 2026
This script prevents cloud/cluster GPU instances from being reclaimed during low-utilization periods (e.g., between training steps or during data loading). Rather than generating constant dummy load, it monitors live GPU utilization via nvidia-smi and performs matrix multiplications only when utilization drops below a configurable threshold (default: 50%). This minimizes the impact on training throughput (SPS, steps per second).
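A minimal sketch of that watch-and-burn loop is shown below. The function names, the polling interval, the burst duration, and the 4096×4096 matrix size are illustrative assumptions, not the exact contents of the script in this repository; the real file may differ.

```python
import subprocess
import time


def parse_utilization(smi_output: str) -> list[int]:
    """Parse the output of
    `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`,
    which is one integer percentage per GPU, one per line."""
    return [int(line.strip()) for line in smi_output.strip().splitlines() if line.strip()]


def query_utilization() -> list[int]:
    """Ask nvidia-smi for the current utilization of every visible GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_utilization(out)


def keep_alive(threshold: int = 50, poll_interval: float = 5.0,
               burst_seconds: float = 2.0) -> None:
    """Poll utilization; when any GPU dips below `threshold` percent,
    run a short burst of matrix multiplications to keep it busy."""
    import torch  # deferred so the parsing helpers above import without a GPU

    x = torch.randn(4096, 4096, device="cuda")
    while True:
        if min(query_utilization()) < threshold:
            end = time.time() + burst_seconds
            while time.time() < end:
                x = x @ x
                x = x / x.norm()  # renormalize so values stay bounded
            torch.cuda.synchronize()
        time.sleep(poll_interval)
```

Because the load is generated only in short bursts after a below-threshold reading, the keep-alive traffic mostly fills gaps (data loading, checkpointing) rather than competing with the training kernels themselves.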
This Python script, together with the accompanying shell script, can be added to your project directory and used when launching sbatch commands for training runs.
These files are sample templates; modify them to match your own directory layout and cluster settings.
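One way to wire the watcher into an sbatch launch is to start it in the background alongside the training process, as in the sketch below. The file names (`gpu_keepalive.py`, `train.py`) and all Slurm resource values are hypothetical placeholders; substitute your own paths and your cluster's partition/account settings.

```shell
#!/bin/bash
#SBATCH --job-name=train-keepalive
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00

# Hypothetical paths -- adjust to your own project directory.
python gpu_keepalive.py &          # background utilization watcher
KEEPALIVE_PID=$!

python train.py "$@"               # your actual training entry point

kill "$KEEPALIVE_PID"              # stop the watcher once training exits
```

Running the watcher in the background of the same job (rather than as a separate job) ensures it sees exactly the GPUs allocated to the training run and dies with the allocation.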