Description
GPU training performance degrades severely because KMP_AFFINITY is set incorrectly.
Symptoms
- GPU idles ~80% of the time
- No I/O or CPU bottleneck: 7 of 8 CPU cores are idle
- Training is inexplicably slow despite hardware being underutilized
Environment
accelerate env output
- `Accelerate` version: 1.12.0
- Platform: Linux-6.14.0-1010-aws-x86_64-with-glibc2.39
- `accelerate` bash location: /home/ubuntu/prj/s2g-lmdm/.pixi/envs/default/bin/accelerate
- Python version: 3.10.0
- Numpy version: 1.26.4
- PyTorch version: 2.7.1
- PyTorch accelerator: CUDA
- System RAM: 61.94 GB
- GPU type: NVIDIA L40S
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: NO
- mixed_precision: no
- use_cpu: False
- debug: False
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: True
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
The data is on a fast local NVMe disk.
Reason
The main training process and all DataLoader workers are pinned to CPU core 0 because KMP_AFFINITY is set incorrectly.
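A quick way to confirm the pinning from inside the training script (an illustrative check, not part of the original report) is to ask the kernel which cores the process may run on. This is a minimal sketch for Linux:

```python
import os

# Linux-only: report the CPU cores this process is allowed to run on.
# With the bug, a GPU training process typically shows only [0].
allowed_cores = os.sched_getaffinity(0)  # 0 = the current process
print(f"Allowed CPU cores: {sorted(allowed_cores)}")
```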
The Bug
In src/accelerate/utils/launch.py, ACCELERATE_USE_CPU is set as a string:

current_env["ACCELERATE_USE_CPU"] = str(args.cpu or args.use_cpu)

Then on line 164, it is checked directly:

if current_env["ACCELERATE_USE_CPU"]:
    current_env["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

The problem: str(False) returns the string "False", and in Python all non-empty strings are truthy:

>>> bool("False")
True

This means KMP_AFFINITY is always set, even for GPU training.
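To make the failure mode concrete, here is a minimal standalone sketch (not code from launch.py) of the round trip through a string-valued flag:

```python
import os

# Environment-style values are always strings, so a boolean must be parsed back.
os.environ["ACCELERATE_USE_CPU"] = str(False)  # stores the string "False"
value = os.environ["ACCELERATE_USE_CPU"]

print(bool(value))                             # True  -> the buggy truthiness check fires
print(value.strip().lower() in ("1", "true"))  # False -> a real flag parser would not
```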
Impact
This affects GPU training with MKL-enabled PyTorch builds since commit 4f3abb7. KMP_AFFINITY is read by Intel OpenMP (libiomp5), which is bundled with MKL. Most conda/mamba PyTorch installations use MKL by default.
PyTorch builds without MKL (some pip builds using OpenBLAS/GNU OpenMP) are likely not affected.
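One way to check whether a given PyTorch build is affected (an illustrative check, not from the report) is to ask PyTorch which BLAS/OpenMP runtime it was built with:

```python
import torch

# True when PyTorch links MKL, which bundles Intel OpenMP (libiomp5),
# the runtime that actually honors KMP_AFFINITY.
print(torch.backends.mkl.is_available())

# Prints parallelization backend details, including the OpenMP runtime in use.
print(torch.__config__.parallel_info())
```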
Reproduction
# Start any GPU training with accelerate
accelerate launch examples/cv_example.py --data_dir images/
# In another terminal, check CPU affinity of training processes
for pid in $(pgrep -f "cv_example.py"); do taskset -cp $pid; done
# Expected: 0-7 (or however many cores you have)
# Actual: 0 (everything except the accelerate launcher process is pinned to core 0)

Suggested Fix
The codebase already has the right utility for this: in state.py, the same variable is parsed correctly with parse_flag_from_env(). The same pattern should be used in launch.py.
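A minimal, self-contained sketch of that pattern (variable names mirror launch.py, but this is not the submitted patch; the str_to_bool helper here stands in for what parse_flag_from_env does internally):

```python
def str_to_bool(value: str) -> bool:
    # Parse common truthy strings, mirroring the behavior of parse_flag_from_env.
    return value.strip().lower() in ("1", "true", "yes", "on")

current_env = {}
use_cpu = False  # what args.cpu or args.use_cpu evaluates to for a GPU run

current_env["ACCELERATE_USE_CPU"] = str(use_cpu)  # the string "False"

# Buggy check: every non-empty string is truthy, so this branch always runs.
# if current_env["ACCELERATE_USE_CPU"]: ...

# Fixed check: parse the string back into a boolean first.
if str_to_bool(current_env["ACCELERATE_USE_CPU"]):
    current_env["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

print("KMP_AFFINITY" in current_env)  # False -> GPU processes are no longer pinned
```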
A pull request has been submitted.