QLoRA + DeepSpeed ZeRO-3 #1637
AnirudhVIyer started this conversation in General
Replies: 1 comment
-
It's very hard to say what the issue could be based on the info you provide. This thread contains a couple of tips from users who encountered the same issue. Most notably, if you're using float16 (aka fp16, half precision) training, that is a likely culprit. Otherwise, you would have to try some of the other tips or test out different hyperparameters.
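For what it's worth, moving off fp16 in practice usually comes down to the precision section of the DeepSpeed config. Below is a minimal sketch, assuming the usual Hugging Face accelerate/DeepSpeed integration; the file name, the "auto" placeholders, and the comments are mine, not something from this thread:

```python
import json

# Minimal sketch of a ZeRO-3 config that trains in bf16 instead of fp16.
# bf16 keeps the fp32 exponent range, so DeepSpeed never needs the dynamic
# loss scaler that raises "Current loss scale already at minimum".
# (bf16 generally requires Ampere-or-newer GPUs.)
ds_config = {
    "zero_optimization": {"stage": 3},
    "fp16": {"enabled": False},  # no fp16 -> no loss scaling at all
    "bf16": {"enabled": True},   # bfloat16 mixed precision instead
    "train_micro_batch_size_per_gpu": "auto",  # filled in by the HF integration
    "gradient_accumulation_steps": "auto",
}

# Write the config out; the path can then be referenced from the accelerate
# config (deepspeed_config_file) or passed via TrainingArguments(deepspeed=...).
with open("ds_zero3_bf16.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```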
-
While running a fine-tuning job with QLoRA + DeepSpeed ZeRO-3, my training begins but stops abruptly with the following error:
[2024-04-09 19:38:43,632] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2, reducing to 1
9%|▉ | 128/1402 [03:23<29:15, 1.38s/it]
9%|▉ | 129/1402 [03:25<29:16, 1.38s/it]
9%|▉ | 130/1402 [03:26<28:59, 1.37s/it]
9%|▉ | 131/1402 [03:29<33:51, 1.60s/it]
Traceback (most recent call last):
  File "/opt/ml/code/train.py", line 397, in
    main(args)
  File "/opt/ml/code/train.py", line 318, in main
    accelerator.backward(loss)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/accelerator.py", line 1995, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/utils/deepspeed.py", line 175, in backward
    self.engine.step()
  File "/opt/conda/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2169, in step
    self._take_model_step(lr_kwargs)
  File "/opt/conda/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
    self.optimizer.step()
  File "/opt/conda/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2024, in step
    if self._overflow_check_and_loss_scale_update():
  File "/opt/conda/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1972, in _overflow_check_and_loss_scale_update
    self._update_scale(self.overflow)
  File "/opt/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2359, in _update_scale
    self.loss_scaler.update_scale(has_overflow)
  File "/opt/conda/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale
    raise Exception(
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
Has anyone faced this issue? I am using Python 3.9 and torch 2.2, with the following packages:
torch==2.2.2
transformers==4.39.0
accelerate==0.28.0
deepspeed==0.14.0
datasets==2.10.1
bitsandbytes==0.43.0
trl==0.8.1
nltk
evaluate
peft==0.10.0
rouge-score
tensorboard
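The exception above is raised by DeepSpeed's fp16 dynamic loss scaler, and in a QLoRA run the 4-bit compute dtype is picked when the quantized model is loaded. Below is a hypothetical sketch of that part of the setup, not the actual train.py; the model id, LoRA hyperparameters, and target modules are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Hypothetical QLoRA model setup (placeholders throughout, not from train.py).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    # Using bf16 as the compute dtype (rather than fp16) keeps the run off
    # the fp16 loss-scaling path that raised the exception above.
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder model id
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # placeholder target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```

Keeping bnb_4bit_compute_dtype, torch_dtype, and the DeepSpeed precision section consistent (e.g. all bf16) sidesteps fp16 loss scaling entirely.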