Fix: Multiple PEFT methods have issues with models loaded in float16 or bfloat16 #2433
Conversation
As a user, it should be possible to manually cast the base model to a lower precision dtype, float16 or bfloat16, and still have the different PEFT methods work correctly. Currently, this is not the case for many PEFT methods, as can be replicated by the added tests.

To understand the problem, it helps to take a step back. By default, PEFT will treat the adapter weights with high precision, i.e. with float32. When the base model is lower precision, the user needs to pass inputs in lower precision too, as otherwise self.base_layer(x) would fail. However, this low precision input clashes with the high precision adapter weights.

The solution implemented in this PR is to cast the input to a higher dtype [1]. That way, the whole adapter operation is conducted in high precision. Only once that has finished is the final result cast back to the original dtype if necessary. This should lead to better results thanks to the high precision, but it may require more memory. Note that this is how LoRA is implemented, so the changes in this PR bring the other methods more in line with what LoRA does.

If the user does not want the adapter to be in float32, they can always pass autocast_adapter_dtype=False when calling get_peft_model or PeftModel.from_pretrained etc. This is also tested.

Besides adjusting the forward method to account for these changes, the merge and unmerge methods also often had to be adjusted, as they did not correctly account for the base model dtype. Now, those methods should always conserve the original dtype of the base model.

Note that if, for whatever reason, the input casting in [1] is not desired, users can use the disable_input_dtype_casting context manager to disable it (more context information on this feature can be found in PR #2353). I updated the corresponding code to be agnostic to the specific PEFT method (beforehand, it was only for LoRA). This is why I had to move the _cast_input_dtype method from LoRA to the BaseTunerLayer.

An independent bug in VeRA was discovered through the new tests, where the weights for vera_lambda_d and vera_lambda_b were re-assigned after merging in an incorrect shape. This is now also fixed.

EDIT

While working on this, I made a few more changes:

- model.merge_adapter(safe_merge=...) was not supported, even though the safe_merge argument was documented. I added the argument now.
- Another method did not support safe_merge; this is now added (but it's a no-op, as there is no merging, just swapping parameters).
- The config tests used the lewtun/tiny-random-OPTForCausalLM-delta model; I now moved to peft-internal-testing/tiny-opt-lora-revision for those tests. This is better anyway, since we want to rely less on models from private accounts.
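To illustrate the casting pattern described above (upcast the input for the adapter computation, cast the result back at the end), here is a minimal sketch. This is not PEFT's actual code; the class and attribute names are invented for illustration, assuming a simple LoRA-style adapter on top of a bfloat16 linear layer.

```python
import torch
import torch.nn as nn

class ToyAdapterLinear(nn.Module):
    """Illustrative only: adapter weights stay in float32 while the base layer is bfloat16."""

    def __init__(self, base_layer: nn.Linear, r: int = 8):
        super().__init__()
        self.base_layer = base_layer  # e.g. loaded in bfloat16
        # adapter weights are kept in high precision (float32), as PEFT does by default
        self.adapter_in = nn.Linear(base_layer.in_features, r, bias=False, dtype=torch.float32)
        self.adapter_out = nn.Linear(r, base_layer.out_features, bias=False, dtype=torch.float32)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        result = self.base_layer(x)                         # runs in the base model's low-precision dtype
        orig_dtype = result.dtype
        x_high = x.to(self.adapter_in.weight.dtype)         # [1] cast the input up to the adapter dtype
        delta = self.adapter_out(self.adapter_in(x_high))   # whole adapter op in float32
        return result + delta.to(orig_dtype)                # cast back only at the very end

base = nn.Linear(16, 16, dtype=torch.bfloat16)
layer = ToyAdapterLinear(base)
out = layer(torch.randn(2, 16, dtype=torch.bfloat16))
assert out.dtype == torch.bfloat16  # the original dtype is conserved
```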
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Several more or less random ops are not implemented for MacOS.
There is probably no reason to have this code at all.
Pinging the contributors of the different methods if they could be so kind as to review the changes that affect them:
If I forgot anyone, please feel free to ping them. If there is general agreement, my plan is to merge this shortly after the next PEFT release (v0.15.0), so aiming for next week. That way, if this does introduce regressions, we have more time to fix them.
@BenjaminBossan AFAIK the changes seem okay for LoKr 🤗. Also, is precision loss the reason to convert the inputs to float32 rather than downcasting the adapter weights? 🤔
Hi @BenjaminBossan, I recently found an issue with DoRA training where the magnitude vector isn’t being updated when training on A100, although it works fine on H100. The torchtune team also found a similar issue. Could this be related to the casting issue? Also, could you point me to any test cases to verify this pull request? Additionally, I recently ran into a problem using DoRA with DeepSpeed ZeRO-3. I’m not sure if this is related as well, but I’ll open an issue soon with steps to reproduce the error. Thanks a lot!
Thanks for checking.
Exactly, although I'm not sure how much of a difference it really makes. It's just been like that for LoRA from the start and I think it makes sense to consolidate the other methods.
I haven't worked with those GPUs, so I can't say if there could be a relationship.
Is it this one: meta-pytorch/torchtune#2250? If yes, and if it's indeed an underflow issue, then this PR could indeed fix it.
The tests I added are here: https://github.com/huggingface/peft/pull/2433/files#diff-8e23036752ec7b8a68889069e72c29171c53652df5e7d390afa4d223f22c77d2. You should be able to check out my branch and then run them with pytest (see the example below) to check them. Note, however, that those are very simple tests, no multi GPU etc.
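A plausible invocation, assuming the new dtype tests live in tests/test_custom_models.py (the file referenced in the commands quoted below) and that the -k filter matches them:

```bash
pytest tests/test_custom_models.py -k "float16 or bfloat16" -v
```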
Okay, feel free to ping me once the issue is done.
Hi @BenjaminBossan, I ran the DoRA tests using the test file you wrote on H100, and the following cases did not pass: I also wanted to verify that the magnitude vector is being updated correctly, so I tried modifying the test_training_works function. Am I missing something when running pytest tests/test_custom_models.py -k "test_training_works" -v -s, or could you help me write a simple test function to check whether the magnitude vector is being updated correctly? Thank you!
Hmm, strange, does the error only occur on H100 or was that the only device you tested? Could you please also show the full error message?
Please note that this is a very specific test inside the
To check this, I think the test here and/or the tests below it are best suited: tests/test_gpu_examples.py, line 855 (at commit e5e7b73). You can run it with pytest by selecting that test (see the example below).
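A plausible invocation, assuming the -k filter below selects the DoRA tests in that file:

```bash
pytest tests/test_gpu_examples.py -k dora -v
```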
Just to do a quick check, I added these lines of code:

...
model = get_peft_model(model, config)
# capture the DoRA magnitude vector before training
mv_before = model.base_model.model.model.decoder.layers[0].self_attn.v_proj.lora_magnitude_vector["default"].weight.data.clone()
...
trainer.train()
# after training, the magnitude vector should have changed
mv_after = model.base_model.model.model.decoder.layers[0].self_attn.v_proj.lora_magnitude_vector["default"].weight.data
assert not torch.allclose(mv_before, mv_after, atol=1e-5, rtol=1e-5)

WDYT?
This specific test used a learning rate that is too high, resulting in nan weights. Then, when weights are compared to assert that they're different, the test passes trivially because nan != nan. The lr is now reduced and there is a sanity check that none of the weights contain non-finite values. See discussion in huggingface#2433 (comment) ff.
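As a standalone illustration of the nan behavior the commit message refers to (this is not code from the PR, just a minimal check):

```python
import torch

nan_weights = torch.full((3,), float("nan"))
# torch.allclose treats NaNs as unequal by default (equal_nan=False), so the
# "weights have changed" assertion passes trivially even if both tensors are all-NaN.
assert not torch.allclose(nan_weights, nan_weights.clone(), atol=1e-5, rtol=1e-5)
```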
@BenjaminBossan The changes for (IA)^3 seem fine!
@BenjaminBossan I have a question regarding LoRA. After this PR: the model is bfloat16 and the LoRA weights are stored as bfloat16; PEFT would now, by default, upcast the input x to float32 and the LoRA weights to float32, correct? Before this PR: PEFT did not do any up/down casting and used them as is, by default. Is the above summary correct? Also, can you do some preliminary tests to see what the performance cost of the autocast to float32 and the LoRA adapter application is, if any? If there is no underflow issue for some cases of adapters, it would be good to let users know in the code/docs that autocast resolves the accuracy issue but at XX cost, so users are a little bit more informed.
No, for LoRA there is no change in this PR (a few small changes around merging and MHA, but otherwise no change). The situation for LoRA has always been: by default, the LoRA weights are loaded in float32. If the input is lower precision, it is upcast to float32 when the LoRA part is calculated (see src/peft/tuners/tuners_utils.py, line 321, and src/peft/tuners/lora/layer.py, lines 724 to 744, at commit e5e7b73). This PR just makes this behavior consistent between the different layer types and PEFT methods. I worked on this because I found that many PEFT methods break when the model is loaded in a lower dtype.
I did an experiment comparing LoRA in float32 (the default) with LoRA in lower precision (autocast_adapter_dtype=False). (edit: values updated after finishing the full run) There are small but noticeable differences, with lower LoRA precision resulting in slightly faster training, less memory, and a smaller file size, at the cost of slightly higher loss. Of course, results will vary a lot depending on the use case, and longer/bigger training runs could be more prone to under/overflow. At the end of the day, users can easily change the behavior by passing autocast_adapter_dtype=False. Regarding the documentation, yes, we can certainly highlight this option better. I'll check if I can find a good place in the docs to put this. Edit: Added some documentation.
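For reference, a minimal sketch of how a user opts out of the float32 autocasting; the model id and LoRA settings below are placeholders, only the autocast_adapter_dtype argument is the point here:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# placeholder model; any causal LM loaded in bfloat16 works the same way
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", torch_dtype=torch.bfloat16)
config = LoraConfig(r=8, target_modules=["q_proj", "v_proj"])

# default: adapter weights are kept in float32 and the input is upcast for the LoRA path;
# with autocast_adapter_dtype=False the adapter weights stay in the base model's bfloat16
model = get_peft_model(model, config, autocast_adapter_dtype=False)
```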
@BenjaminBossan The structure of Bone is very simple, so it shouldn't have much impact. Do I only need to test it with
No need for you to test, just a quick check if the code changes for Bone look right to you. We already run the tests and I also did a full training run with Bone and a bfloat16 base model and it worked.
Hi, the changes for LN Tuning seem fine to me.
Note that model.merge_adapter(safe_merge=True) did not work so far: even though the argument was documented, it was not actually there. This is now fixed.
For some reason, the model used in config tests caused issues, move to a different one. This new one is from peft-internal, so that's better anyway.
LGTM, minor nitpicks
src/peft/tuners/boft/layer.py (outdated)

-    self.get_base_layer().weight.data = orig_weight
+    self.get_base_layer().weight.data = orig_weight.to(orig_dtype)
Suggested change:

-    self.get_base_layer().weight.data = orig_weight.to(orig_dtype)
+    base_layer.weight.data = orig_weight.to(orig_dtype)
src/peft/tuners/bone/layer.py (outdated)

     self.base_layer.weight.data = orig_weight.to(orig_dtype)
 else:
     if self.bone_fn == "bat":
         delta_weight = self.get_delta_weight(active_adapter, self.base_layer.weight.data)
-        self.base_layer.weight.data += delta_weight
+        self.base_layer.weight.data += delta_weight.to(orig_dtype)
     else:
         delta_weight = self.get_delta_weight_bone(active_adapter, self.base_layer.weight.data)
-        self.base_layer.weight.data = delta_weight
+        self.base_layer.weight.data = delta_weight.to(orig_dtype)
We can use base_layer here as well, right?
Thanks everyone for your feedback. The PR is now merged, but the next PEFT release will not be very soon. So if you happen to find an issue, please report it so that we can fix it before the next release. |