Labels: bug (Something isn't working), strategy: ddp (DistributedDataParallel), ver: 2.4.x
Description
Bug description
I'm having an issue while adapting the fine-tuning logic from this HF tutorial:
I can't get distributed training to run on multiple GPUs: when I run the training script with a config that includes GPUs 0 and 1, I get a Segmentation fault (core dumped) error. I am also using QLoRA.
Please advise.
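For context, my setup roughly follows the sketch below. The model ID, LoRA hyperparameters, target modules, and precision are placeholders rather than my exact config values, and the tokenization/dataloader code is omitted:

import torch
import lightning as L
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL_ID = "meta-llama/Llama-2-7b-hf"  # placeholder, not the exact model from my config

# 4-bit quantization config (QLoRA-style)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, quantization_config=bnb_config)

# Attach LoRA adapters on top of the quantized base model
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],  # placeholder module names
)
model = get_peft_model(model, lora_config)


class LitQLoRA(L.LightningModule):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def training_step(self, batch, batch_idx):
        # batch is a dict of input_ids / attention_mask / labels tensors
        outputs = self.model(**batch)
        self.log("train_loss", outputs.loss)
        return outputs.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.model.parameters(), lr=2e-4)


lit_model = LitQLoRA(model)

trainer = L.Trainer(
    accelerator="gpu",
    devices=[0, 1],          # devices from config
    strategy="ddp",
    precision="bf16-mixed",  # placeholder precision setting
    max_epochs=1,
)
# trainer.fit(lit_model, train_dataloader)  # train_dataloader built from my dataset (omitted here)

The segfault appears right after both DDP processes register (see the log below).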
What version are you seeing the problem on?
master
How to reproduce the bug
import lightning as L

# Create trainer
trainer = L.Trainer(
    accelerator="gpu",
    devices=[0, 1],  # Use devices from config
    strategy="ddp",
    ...
)

Error messages and logs
`low_cpu_mem_usage` was None, now default to True since model is quantized.
Downloading shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 9709.04it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.17s/it]
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
Segmentation fault (core dumped)
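If it helps with debugging, I can re-run with Python's faulthandler enabled to try to capture a traceback at the point of the crash; a minimal sketch (train.py stands in for my training script):

# At the top of the training script, before the Trainer is created
import faulthandler
faulthandler.enable()  # dumps a Python traceback for every thread on SIGSEGV

or equivalently, run the script with python -X faulthandler train.py.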
Environment
pyproject.toml:
transformers = "^4.44.2"
torch = "^2.4.1"
lightning = "^2.4.0"
peft = "^0.13.2"
accelerate = "^1.1.1"
bitsandbytes = "^0.45.0"
More info
No response