Description
GPU training performance degrades severely because KMP_AFFINITY is set incorrectly.
Symptoms
- GPU idles ~80% of the time
- No I/O or CPU bottleneck: 7 of 8 CPU cores are idle
- Training is inexplicably slow despite hardware being underutilized
Environment
accelerate env output
- `Accelerate` version: 1.12.0
- Platform: Linux-6.14.0-1010-aws-x86_64-with-glibc2.39
- `accelerate` bash location: /home/ubuntu/prj/s2g-lmdm/.pixi/envs/default/bin/accelerate
- Python version: 3.10.0
- Numpy version: 1.26.4
- PyTorch version: 2.7.1
- PyTorch accelerator: CUDA
- System RAM: 61.94 GB
- GPU type: NVIDIA L40S
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: NO
- mixed_precision: no
- use_cpu: False
- debug: False
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: True
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
The data is on a fast local NVMe disk.
Reason
The main training process and all DataLoader workers are pinned to CPU core 0 because KMP_AFFINITY is set incorrectly.
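A quick way to confirm the pinning from inside the training script (an illustrative check, not part of the original report) is to ask the kernel which cores the process may run on. This is a minimal sketch for Linux:

```python
import os

# Linux-only: report the CPU cores this process is allowed to run on.
# With the bug, a GPU training process typically shows only [0].
allowed_cores = os.sched_getaffinity(0)  # 0 = the current process
print(f"Allowed CPU cores: {sorted(allowed_cores)}")
```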
The Bug
In src/accelerate/utils/launch.py, ACCELERATE_USE_CPU is set as a string:

current_env["ACCELERATE_USE_CPU"] = str(args.cpu or args.use_cpu)

Then on line 164, it is checked directly:

if current_env["ACCELERATE_USE_CPU"]:
    current_env["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

The problem: str(False) returns the string "False", and in Python all non-empty strings are truthy:

>>> bool("False")
True

This means KMP_AFFINITY is always set, even for GPU training.
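To make the failure mode concrete, here is a minimal standalone sketch (not code from launch.py) of the round trip through a string-valued flag:

```python
import os

# Environment-style values are always strings, so a boolean must be parsed back.
os.environ["ACCELERATE_USE_CPU"] = str(False)  # stores the string "False"
value = os.environ["ACCELERATE_USE_CPU"]

print(bool(value))                             # True  -> the buggy truthiness check fires
print(value.strip().lower() in ("1", "true"))  # False -> a real flag parser would not
```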
Impact
This affects GPU training with MKL-enabled PyTorch builds since commit 4f3abb7. KMP_AFFINITY is read by Intel OpenMP (libiomp5), which is bundled with MKL. Most conda/mamba PyTorch installations use MKL by default.
PyTorch builds without MKL (some pip builds using OpenBLAS/GNU OpenMP) are likely not affected.
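One way to check whether a given PyTorch build is affected (an illustrative check, not from the report) is to ask PyTorch which BLAS/OpenMP runtime it was built with:

```python
import torch

# True when PyTorch links MKL, which bundles Intel OpenMP (libiomp5),
# the runtime that actually honors KMP_AFFINITY.
print(torch.backends.mkl.is_available())

# Prints parallelization backend details, including the OpenMP runtime in use.
print(torch.__config__.parallel_info())
```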
Reproduction
# Start any GPU training with accelerate
accelerate launch examples/cv_example.py --data_dir images/
# In another terminal, check CPU affinity of training processes
for pid in $(pgrep -f "cv_example.py"); do taskset -cp $pid; done
# Expected: 0-7 (or however many cores you have)
# Actual: 0 (everything except the accelerate launcher process is pinned to core 0)

Suggested Fix
The codebase already has the right utility for this: in state.py, the same variable is parsed correctly with parse_flag_from_env(). The same pattern should be used in launch.py.
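A minimal, self-contained sketch of that pattern (variable names mirror launch.py, but this is not the submitted patch; the str_to_bool helper here stands in for what parse_flag_from_env does internally):

```python
def str_to_bool(value: str) -> bool:
    # Parse common truthy strings, mirroring the behavior of parse_flag_from_env.
    return value.strip().lower() in ("1", "true", "yes", "on")

current_env = {}
use_cpu = False  # what args.cpu or args.use_cpu evaluates to for a GPU run

current_env["ACCELERATE_USE_CPU"] = str(use_cpu)  # the string "False"

# Buggy check: every non-empty string is truthy, so this branch always runs.
# if current_env["ACCELERATE_USE_CPU"]: ...

# Fixed check: parse the string back into a boolean first.
if str_to_bool(current_env["ACCELERATE_USE_CPU"]):
    current_env["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

print("KMP_AFFINITY" in current_env)  # False -> GPU processes are no longer pinned
```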
A pull request has been submitted.