Orthogonal subspace learning #2
base: main
Conversation
src/peft/utils/svd_utils.py
Outdated
    return optimizer


def wrap_model_with_svd(model: nn.Module, svd_config: dict[str, int] | None = None) -> nn.Module:
Should this be renamed to wrap_model_with_osf as well?
src/peft/utils/svd_utils.py
Outdated
    return config


def create_svd_model_class(base_cls: type) -> type:
I think this one may need to be updated as well
src/peft/utils/__init__.py
Outdated
    "set_peft_model_state_dict",
    "shift_tokens_right",
    "transpose",
    "wrap_model_with_svd",
You may need to update all instances of SVD across the PR
tests/test_svd_utils.py
Outdated
@@ -0,0 +1,39 @@
import torch
This file seems like it may also need to be updated.
src/peft/tuners/osf/model.py
Outdated
import torch.nn as nn

from peft.tuners.tuners_utils import BaseTuner
from peft.utils.osf_utils import (
I think you updated the module name, but the file itself is still svd_utils.
src/peft/utils/svd_utils.py
Outdated
    dV.copy_(local_dV)


def auto_generate_target_svd_config(model: nn.Module) -> dict[str, int]:
It seems like you might have updated this name to auto_generate_target_osf_config but the change didn't make it into your PR.
__all__ = ["OSFConfig", "OSFModel"]

register_peft_method(
    name="osf",
You might also need to register your method as a new PEFT type in peft.utils.peft_types.PeftType, otherwise this won't work
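As a rough sketch of what that registration could look like (names inferred from the diff; argument names beyond `name` are assumptions, not the actual API):

```python
import enum


class PeftType(str, enum.Enum):
    # ... existing members elided for brevity ...
    LORA = "LORA"
    OSF = "OSF"  # new member for orthogonal subspace fine-tuning


# The register_peft_method(name="osf", ...) call from the diff would then be able
# to resolve PeftType.OSF. Keyword arguments beyond `name` are assumptions here:
# register_peft_method(name="osf", config_cls=OSFConfig, model_cls=OSFModel)
```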
src/peft/utils/svd_utils.py
Outdated
def auto_generate_target_svd_config(model: nn.Module) -> dict[str, int]:
    """Create a mapping from parameter names to ``top_k`` based on layer size."""
    target_patterns = [
We will need to refactor this out similar to how we did it in the original PR: https://github.com/Red-Hat-AI-Innovation-Team/mini_trainer/pull/1/files#diff-09721c27e1a636c47222f5c7994cccbad3067007fe4c454e43a04e9bd3bd8b67R504
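For reference, a minimal sketch of what the renamed/refactored helper could look like, based only on the docstring above; the pattern list and the top_k heuristic are placeholders, not the PR's actual values:

```python
import torch.nn as nn


def auto_generate_target_osf_config(model: nn.Module) -> dict[str, int]:
    """Sketch: map each targeted 2D weight's parameter name to a ``top_k`` based on its size."""
    target_patterns = ["q_proj", "k_proj", "v_proj", "o_proj"]  # placeholder patterns
    config: dict[str, int] = {}
    for name, param in model.named_parameters():
        if param.ndim == 2 and any(pattern in name for pattern in target_patterns):
            config[name] = min(param.shape) // 2  # placeholder heuristic: keep half the spectrum
    return config
```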
Force-pushed 2d435a5 to 372a375
Makes it easier to track rate limiting issues.
- The warning message was missing spaces between sentences.
- Added ' around strings for clarity.
- For one warning, which extended another warning, put it at the start instead of the end, because the other warning can be quite long, leading to users missing the addition.

For more context on this warning, see huggingface#2254
- default
- mini
- bat

Results are pretty close to the corresponding experiments with Bone, which is what we expected.
…ce#2763) Explain how to use multiple adapters (e.g. 2 LoRA adapters) at the same time, as the API is not quite intuitive and there are some footguns around trainable parameters. This question has come up multiple times in the past (for recent examples, check huggingface#2749 and huggingface#2756). Thus it's a good idea to properly document this. --------- Co-authored-by: Steven Liu <[email protected]>
Resolves huggingface#2783.

Most PEFT layers (BaseTunerLayers) expose the in_features and out_features attributes. Therefore, other packages like diffusers may expect these attributes to exist. However, there were a few PEFT methods where these attributes were missing:
- LoHa
- LoKr
- LN Tuning
- Trainable Tokens

The layers of these methods now also expose the attributes.

Implementation: To avoid code duplication, I factored out the whole code block in LoRA layers that extracts these attributes, since LoRA has the most exhaustive list of checks. The new utility function has the exact same functionality and can now be used by other PEFT methods. I updated the four PEFT methods mentioned above to use this new function, but I did not update PEFT methods that already handled it, as there wasn't really a need (they check one or two layer types at most, so there is little duplication).
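A hedged sketch of what such a shared helper might look like; the function name and the exact set of checks are illustrative, not the actual PEFT utility:

```python
import torch.nn as nn


def get_in_out_features(base_layer: nn.Module) -> tuple[int | None, int | None]:
    """Sketch: extract (in_features, out_features) from common base layer types."""
    if isinstance(base_layer, nn.Linear):
        return base_layer.in_features, base_layer.out_features
    if isinstance(base_layer, nn.Embedding):
        return base_layer.num_embeddings, base_layer.embedding_dim
    if isinstance(base_layer, (nn.Conv1d, nn.Conv2d, nn.Conv3d)):
        return base_layer.in_channels, base_layer.out_channels
    # Fall back to attributes the wrapped layer may already expose (e.g. quantized linear layers).
    return getattr(base_layer, "in_features", None), getattr(base_layer, "out_features", None)
```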
Right now, get_model_status() and get_layer_status() only report on BaseTunerLayers, but it would be helpful if they could also report auxiliary modules. This PR now includes those. To facilitate this, a few attributes and methods were added to AuxiliaryTrainingWrapper and subclasses to make them more similar to BaseTunerLayer (e.g. the adapter_layer_names attribute). These attributes and methods were assumed to be present in the code that determines the model and layer status.
Discussed internally
This PR adds the PEFT version to the adapter_config.json. This can be useful in the future -- for instance when we change the state dict format of a PEFT method, we can convert it in a backwards compatible way based on the PEFT version being used. It can also be useful for debugging by providing an easy way to see the PEFT version that was used to train a PEFT adapter.

Notes:
- In huggingface#2038, we made a change to PEFT configs so that even if new arguments are added to a config, it can still be loaded with older PEFT versions (forward compatibility). Before that change, adding the PEFT version would have been quite disruptive, as it would make all PEFT configs incompatible with older PEFT versions. Said PR was included in the 0.14.0 release from Dec 2024, so we can expect the vast majority of PEFT users to use this version or a more recent one.
- If the PEFT version is a dev version, the version tag is ambiguous. Therefore, I added some code to try to determine the commit hash. This works if users installed PEFT with git+...@<HASH>.
- Unit testing that the function to determine the hash works with these types of installs is not trivial. Therefore, I just patched the function to return a fixed hash. I did, however, test it locally and it works:

python -m pip install git+https://github.com/huggingface/diffusers.git@5e181eddfe7e44c1444a2511b0d8e21d177850a0
python -c "from peft.config import _get_commit_hash; print(_get_commit_hash('diffusers'))"

- Also note that I tried to make the retrieval of the hash super robust by adding a broad try ... except. If there is an error there, e.g. due to a busted install path, we never want this to fail, but rather just accept that the hash cannot be determined (we add @unknown in this case).
- If users installed a dev version of PEFT in a different way, e.g. using git clone && pip install ., the commit hash will not be detected. I think this is fine, I really don't want to start shelling out with git just for this purpose.
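For illustration, one way such a lookup can work (a sketch under assumptions, not necessarily PEFT's actual _get_commit_hash): pip records a PEP 610 direct_url.json for git+... installs, which contains the resolved commit id.

```python
import json
from importlib.metadata import distribution


def get_commit_hash(package_name: str) -> str | None:
    """Sketch: return the VCS commit id pip recorded for a git+... install, if any."""
    try:
        direct_url = distribution(package_name).read_text("direct_url.json")
        if direct_url is None:
            return None  # not installed from a direct URL (e.g. a regular PyPI install)
        return json.loads(direct_url).get("vcs_info", {}).get("commit_id")
    except Exception:
        # Never fail hard here; an undetectable hash is acceptable.
        return None
```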
Resolves huggingface#2772

Fixes several edge cases with unusual layer names or target modules.
1. As huggingface#2772 stated, if "weight" is part of a layer name, it would be treated incorrectly when creating the PEFT state_dict.
2. Similarly, when the adapter name itself is part of a layer name.

Some of these errors would pass silently, which is especially bad (e.g. a weight not being loaded but no error raised). I also added some tests that were not failing before, but to cover some yet uncovered cases or to lay out some basic functionality.

While working on this, I also noticed that it was possible to target a BaseTunerLayer with modules_to_save and trainable_token_indices (e.g. the lora_A and lora_B nn.Linear would be replaced with ModulesToSaveWrapper). I don't think this is ever desired, so we now raise an error if this is detected.
Add `<Tip>`s, converted to the new syntax, to docstrings. --------- Co-authored-by: nemo <[email protected]>
The reset_sessions function is removed, but it's also no longer necessary to call it for the purpose we used it for. Moreover, the deprecated use_auth_token argument is fully removed now, so everywhere we used to pass it, it is now removed, unless a user passes it explicitly. Also, remove the deprecated local_dir_use_symlinks argument.
Implements the paper "Exploring Sparsity for Parameter Efficient Fine Tuning Using Wavelets" (https://arxiv.org/abs/2505.12532). WaveFT enables fine-grained control over the number of trainable parameters by directly learning a sparse set of coefficients in the wavelet domain of residual matrices. Experiments show that it works well in the text-to-image generation space.
When using add_weighted_adapter, so far, there was an implicit assumption that all weights are positive. This PR allows negative weights to be passed. --------- Co-authored-by: Valentin Teutschbein <[email protected]>
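A usage sketch of merging with a negative weight; the model checkpoint, adapter names, and combination type are illustrative, not taken from the PR:

```python
from transformers import AutoModelForCausalLM

from peft import LoraConfig, get_peft_model

# "facebook/opt-125m" and the adapter names are illustrative.
base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model = get_peft_model(base, LoraConfig(r=8), adapter_name="task")
model.add_adapter("style", LoraConfig(r=8))

# Negative weights are now allowed, e.g. to subtract one adapter's contribution.
model.add_weighted_adapter(
    adapters=["task", "style"],
    weights=[1.0, -0.5],
    adapter_name="task_minus_style",
    combination_type="linear",
)
model.set_adapter("task_minus_style")
```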
A seed was accidentally chosen that results in a test failing with XPU. Signed-off-by: jiqing-feng <[email protected]>
While memory usage correlates with the number of trainable params, having this number directly makes it easier to see that methods are using similar numbers of trainable params and outliers can be inspected easily.
Check if PEFT triggers transformers FutureWarning or DeprecationWarning by converting these warnings into failures.
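For illustration, a minimal sketch of how such warnings can be escalated to failures with pytest's filterwarnings marker; the actual PR may implement this differently:

```python
import pytest

# One generic way to turn these warnings into test failures, e.g. in a test module
# or conftest.py; the actual PR may scope this more narrowly.
pytestmark = [
    pytest.mark.filterwarnings("error::FutureWarning"),
    pytest.mark.filterwarnings("error::DeprecationWarning"),
]
```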
This PR adds the set_requires_grad method to PEFT models (both PeftModel and BaseTuner). As the name suggests, this is a method to set the requires_grad attribute of the specified PEFT adapters.

For more general context, this is mostly relevant when dealing with multiple adapters. As is, users can already set the active adapter(s) with set_adapter, which automatically adjusts the requires_grad attribute too, so that only the active adapters will have grads enabled. However, there can be situations where activity status and requires_grad may differ. Right now, users would need to manually set requires_grad to deal with that, which is error prone (e.g. forgetting modules_to_save). This PR closes this gap in the API.

As this functionality is quite general purpose, I added a set_requires_grad function to functional.py for easier integration.

Note: The set_requires_grad method will raise an error when called with prompt learning methods like prompt tuning. This is because these methods don't have a universal base class (BaseTuner and BaseTunerLayer) that would allow adding this API. Moreover, they only support a single adapter at a time, hence there is not much need to have this method in the first place. A side effect of not supporting prompt learning is that on the PeftModel, we are free to allow set_requires_grad to accept more than one adapter, which would normally be difficult, because prompt learning only allows one adapter.
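A heavily hedged usage sketch; the exact signature of set_requires_grad is assumed, and the adapter names are hypothetical:

```python
# `model` is assumed to be a PeftModel with two LoRA adapters loaded; the exact
# signature of set_requires_grad is an assumption based on the description above.
model.set_adapter("adapter_a")                            # active adapter, grads follow activity
model.set_requires_grad("adapter_b", requires_grad=True)  # also train the inactive adapter_b
```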
A new initialization method was added to prompt tuning in huggingface#2815. This PR adds an experiment config for this method to the MetaMathQA benchmark. Testing locally, this got a test accuracy of 36%, compared to 25% with random initialization.
Resolves huggingface#2809 Some models like Gemma3 apply a scalar to the embedding output. It needs to be taken into account when using trainable tokens or LoRA applied to the embedding layer.
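A toy sketch of the pitfall, with made-up class and attribute names (Gemma-style models scale embedding outputs by a factor such as sqrt(hidden_size)): any delta injected at the embedding layer has to see the same scaling as the base embedding output.

```python
import torch
import torch.nn as nn


class ScaledEmbeddingWithDelta(nn.Module):
    """Toy stand-in: a base embedding whose output is scaled, plus an adapter-style delta."""

    def __init__(self, vocab_size: int = 100, hidden_size: int = 16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.embed_scale = hidden_size**0.5  # scalar applied by the base model (Gemma-style)
        self.delta = nn.Parameter(torch.zeros(vocab_size, hidden_size))  # trainable update

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # The delta must see the same scaling as the base embedding output,
        # i.e. (embed + delta) * scale rather than embed * scale + delta.
        return (self.embed(input_ids) + self.delta[input_ids]) * self.embed_scale
```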
This is to fix an oversight from huggingface#2797, where the LoftQ test was slightly refactored but one test was not updated accordingly.
Force-pushed 372a375 to 00073fe
Note: Diffusers is left as is for now, might need an update later.
The "LoRA Without Regret" blog post (https://thinkingmachines.ai/blog/lora/) mentions that targeting the MLP part of the transformer is more effective than targeting the attention modules. This experiment tests this by targeting ["gate_proj", "up_proj", "down_proj"] instead of the default layers (["q_proj", "v_proj"]). I chose a rank to match the parameter count we would get when targeting the attention modules with rank 32, which is rank 10.

Testing on my machine, there is indeed a nice improvement in the test score:

| metric               | target attention | target MLP |
|----------------------|------------------|------------|
| test accuracy        | 48.2%            | 51.3%      |
| # trainable params   | 9175040          | 9461760    |
| peak memory reserved | 20.74 GB         | 23.02 GB   |

There is, however, also a marked increase in memory usage, despite matching parameter count. Since the operations are different, this may not be a surprise, but let's wait for the final verdict once this experiment runs on our AWS instance.

Note: I also tested higher and lower ranks when targeting the MLP. The effect on memory usage was negligible, but it did improve the score:

| metric             | rank 8  | rank 10 | rank 12  | rank 32  |
|--------------------|---------|---------|----------|----------|
| test accuracy      | 50.3%   | 51.3%   | 52.2%    | 54.8%    |
| # trainable params | 7569408 | 9461760 | 11354112 | 30277632 |

In the end, I chose only to add the rank 10 experiment to match the number of trainable parameters.
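The corresponding LoRA config boils down to the following sketch:

```python
from peft import LoraConfig

# Rank 10 on the MLP projections roughly matches the trainable-parameter count of
# rank 32 on the default attention targets (q_proj, v_proj); other hyperparameters
# of the experiment are not shown here.
config = LoraConfig(
    r=10,
    target_modules=["gate_proj", "up_proj", "down_proj"],
)
```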
Implements DeLoRA: "Decoupling Angles and Strength in Low-rank Adaptation" (https://huggingface.co/papers/2503.18225). Similar to DoRA, DeLoRA decouples the angular learning from the adaptation strength, but it also allows limiting the norm of the change. This way, DeLoRA promises to reduce the risk of catastrophic forgetting and to be more robust to hyper-parameter settings such as the learning rate.
Adds an option to the LoRA config, ensure_weight_tying, which, if enabled, ensures that if the embedding and LM head are tied, they share the ModulesToSaveWrapper. This ensures that their weights work correctly even after merging them.
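A usage sketch; the flag name comes from the description above, while the target modules and modules_to_save values are illustrative:

```python
from peft import LoraConfig

config = LoraConfig(
    target_modules=["q_proj", "v_proj"],  # illustrative targets
    modules_to_save=["lm_head"],          # wrapped in a ModulesToSaveWrapper
    ensure_weight_tying=True,             # keep tied embedding / LM head in one shared wrapper
)
```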
- Rebasing to make use of the simplified BaseTuner implementation and adding more experiment results
- Fixing style, quality, etc. in the code
- Make style
- Fixing CI and other test cases
Force-pushed 89c3113 to 2418375
Summary
- svd_utils implementing SVD-based orthogonal subspace learning utilities

Testing
- make quality
- make style
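For completeness, a minimal usage sketch based on the helper signatures visible in the diff, using the osf_* names suggested in the review:

```python
import torch.nn as nn

# Function names follow the reviewers' suggested osf_* naming; the import path and
# exact behavior depend on the final state of the PR.
from peft.utils.osf_utils import auto_generate_target_osf_config, wrap_model_with_osf

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
osf_config = auto_generate_target_osf_config(model)  # maps parameter names to top_k values
model = wrap_model_with_osf(model, osf_config)       # decompose targeted weights for OSF training
```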