
Severe GPU training performance degradation due to incorrectly set KMP_AFFINITY #3911

@hexfaker

Description


GPU training performance severely degrades due to incorrectly set KMP_AFFINITY.

Symptoms

  • GPU idles ~80% of the time
  • No I/O or CPU bottlenecks; 7 of 8 CPU cores are idle
  • Training is inexplicably slow despite hardware being underutilized
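The first symptom can also be confirmed from inside Python. A diagnostic sketch (assumption: Linux, where os.sched_getaffinity is available):

```python
# Diagnostic sketch (Linux-only): show which CPU cores the current
# process is allowed to run on. With this bug present, a training
# process launched via `accelerate launch` reports only [0].
import os

allowed = os.sched_getaffinity(0)  # 0 = the calling process
print(f"process may run on cores: {sorted(allowed)}")
```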

Environment

accelerate env output
- `Accelerate` version: 1.12.0
- Platform: Linux-6.14.0-1010-aws-x86_64-with-glibc2.39
- `accelerate` bash location: /home/ubuntu/prj/s2g-lmdm/.pixi/envs/default/bin/accelerate
- Python version: 3.10.0
- Numpy version: 1.26.4
- PyTorch version: 2.7.1
- PyTorch accelerator: CUDA
- System RAM: 61.94 GB
- GPU type: NVIDIA L40S
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: NO
        - mixed_precision: no
        - use_cpu: False
        - debug: False
        - num_processes: 1
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: all
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: True
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

The data is on a fast local NVMe disk.

Reason

The main training process and all DataLoader workers are pinned to CPU core 0 because KMP_AFFINITY is incorrectly set.

The Bug

In src/accelerate/utils/launch.py, ACCELERATE_USE_CPU is set as a string:

current_env["ACCELERATE_USE_CPU"] = str(args.cpu or args.use_cpu)

Then on line 164, it's checked directly:

if current_env["ACCELERATE_USE_CPU"]:
    current_env["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

The problem: str(False) returns the string "False", and in Python, all non-empty strings are truthy:

>>> bool("False")
True

This means KMP_AFFINITY is always set, even for GPU training.
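The whole logic error can be reproduced standalone (a plain dict stands in for `current_env` here):

```python
# Minimal reproduction of the buggy branch: the env value is the
# *string* "False", which is truthy, so the CPU-only branch runs
# even though use_cpu is False.
env = {}
env["ACCELERATE_USE_CPU"] = str(False)  # -> the string "False"

if env["ACCELERATE_USE_CPU"]:  # any non-empty string is truthy
    env["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

print(env)  # KMP_AFFINITY is set despite GPU training
```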

Impact

This affects GPU training with MKL-enabled PyTorch builds since commit 4f3abb7. KMP_AFFINITY is read by Intel OpenMP (libiomp5), which is bundled with MKL. Most conda/mamba PyTorch installations use MKL by default.

PyTorch builds without MKL (some pip builds using OpenBLAS/GNU OpenMP) are likely not affected.
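To tell which camp a given machine is in, one can check whether the Intel OpenMP runtime is discoverable at all. A sketch (assumption: `find_library` search behavior varies by platform, so a None result is only suggestive):

```python
# libiomp5 is the Intel OpenMP runtime that reads KMP_AFFINITY.
# If it cannot be found on the loader path, this bug likely has
# no effect (e.g. GNU OpenMP / libgomp builds).
from ctypes.util import find_library

iomp5 = find_library("iomp5")
print(f"libiomp5 present: {iomp5 is not None}")
```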

Reproduction

# Start any GPU training with accelerate
accelerate launch examples/cv_example.py --data_dir images/

# In another terminal, check CPU affinity of training processes
for pid in $(pgrep -f "cv_example.py"); do taskset -cp $pid; done

# Expected: 0-7 (or however many cores you have)
# Actual: 0 (all processes except the accelerate launcher pinned to core 0)

Suggested Fix

The codebase already has the right utility for this: in state.py, the same variable is correctly parsed with parse_flag_from_env(). launch.py should use the same pattern.
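In spirit, the fix looks like the following. This is a simplified re-implementation for illustration, not accelerate's exact parse_flag_from_env code:

```python
# Sketch of the fix: parse the env var as a boolean flag instead of
# truth-testing a non-empty string. (Illustrative helper; accelerate's
# real parse_flag_from_env lives in accelerate.utils.)
import os

def parse_flag_from_env(key: str, default: bool = False) -> bool:
    value = os.environ.get(key, str(default))
    return value.strip().lower() in ("1", "true", "yes", "on")

os.environ["ACCELERATE_USE_CPU"] = str(False)

if parse_flag_from_env("ACCELERATE_USE_CPU"):
    os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

print("KMP_AFFINITY set:", "KMP_AFFINITY" in os.environ)  # now False for GPU runs
```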

A pull request has been submitted.
