
PaliGemma fine-tuning - error with distributed training #20496

@vdrvar


Bug description

I'm having an issue while adapting the fine-tuning logic from this HF tutorial:

https://github.com/NielsRogge/Transformers-Tutorials/blob/master/PaliGemma/Fine_tune_PaliGemma_for_image_%3EJSON.ipynb

I can't get distributed training to run on multiple GPUs: when I run the training script with a config that includes GPUs 0 and 1, I get a Segmentation fault (core dumped) error. I am also using QLoRA.

Please advise.

What version are you seeing the problem on?

master

How to reproduce the bug

# Create trainer
trainer = L.Trainer(
    accelerator="gpu",
    devices=[0, 1],  # use devices from config
    strategy="ddp",
    ...
)
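The log below stops right after NCCL registers both ranks, with no traceback. A common first step (not part of the original report; `train.py` stands in for whatever script launches the Trainer) is to enable PyTorch's and NCCL's standard debug switches so the crash site becomes visible:

```shell
# Standard PyTorch/NCCL debug switches for diagnosing a DDP segfault
# (an assumed diagnostic setup, not from the original report)
export NCCL_DEBUG=INFO                 # log NCCL init and collective setup per rank
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # extra consistency checks in torch.distributed
export PYTHONFAULTHANDLER=1            # dump Python tracebacks when the process faults
```

With these set, rerunning the script (e.g. `python train.py`) usually shows which rank and which collective the crash happens in.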

Error messages and logs

`low_cpu_mem_usage` was None, now default to True since model is quantized.
Downloading shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 9709.04it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.17s/it]
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

Segmentation fault (core dumped)

Environment

pyproject.toml:

transformers = "^4.44.2"
torch = "^2.4.1"
lightning = "^2.4.0"
peft = "^0.13.2"
accelerate = "^1.1.1"
bitsandbytes = "^0.45.0"

More info

No response

cc @justusschock @lantiga
