Add Orthogonal Subspace Fine-Tuning (OSF) Tuner for Parameter-Efficient Continual Learning #2685

NikhilNayak-debug · 2025-07-31T13:47:54Z

Summary

This PR adds a new parameter-efficient fine-tuning method called Orthogonal Subspace Fine-Tuning (OSF) to the PEFT library. OSF enables continual learning in LLMs by freezing the high-rank subspace of weight matrices and fine-tuning only the low-rank directions. This approach constrains updates to be orthogonal to previously important directions, thereby mitigating catastrophic forgetting without increasing parameter count.

Issue for this PR on PEFT repository

Tracked in PEFT Issue #2648

Key Features

Implements a new OSFConfig, OSFModel, and tuner class under src/peft/tuners/osf/ following PEFT's standard API

Integrates seamlessly with the get_peft_model API:

from peft import OSFConfig, get_peft_model
peft_model = get_peft_model(base_model, OSFConfig(target_modules=[...]))

Adds utility functions for:
- Weight matrix decomposition using SVD
- Gradient projection onto the low-rank subspace via backward gradient hooks
Automatically enforces orthogonality constraints during training without requiring optimizer wrapping
Will include tests for saving, loading, and applying the OSF adapter in tests/test_custom_models.py
Exports relevant modules at the package level for easier use with other PEFT components

Notes

The current implementation does not include layerwise importance-based rank estimation (e.g., cosine similarity of inputs and activations), but can be added in future iterations
Merging/unmerging is not supported, as the original weights are decomposed and modified in-place
Compared to LoRA, OSF performs a constrained update over the original weight matrix without introducing new trainable parameters, maintaining exact model architecture post-training

Background

This implementation is based on the method described in our paper:
Sculpting Subspaces: Constrained Full Fine-Tuning in LLMs for Continual Learning
Paper on arXiv · Project Repository

githubnemo

Nice! Thanks for the thorough update, that's a good step forward.
A minor nit: Several files are missing the copyright notice, please make sure to include them in new source files (also make sure that they are not outdated, i.e. include the current year).

I like that you already implemented several (custom) tests, I think that's super helpful. Let's also add some tests to test_decoder_models.py and test_encoder_decoder_models.py similar to the test in test_custom_models.py when you think the implementation can move forward in testing. Let's move the skips for convolutions to testing_common.py, there are already similar exceptions in place.
Two bigger topics:

ModelWithOSF seems to re-invent PEFT functionality inside PEFT, specifically the layer targeting + replacement portion. Let's streamline OSF with other tuners, i.e. have implementations for specific layers and by implementing inject_adapter, _create_new_module and _create_and_replace to make it easier to branch out to other layer types / quantizations. The LoRA implementation maybe helpful, e.g. peft.tuners.lora.layers.LoraLayer contains specific layers for Linear and Conv*d specifics (no need to implement Conv now, of course). I can see that this conflicts with using a dict for specifying the top-k ranks per module. How about using target_modules and a singular value for the topk rank (e.g., config.topk_r) which can default to None (-> uses 50% of min(shape)). Every targeted module gets that topk rank or an automatic 50% one. We could also add something like rank_pattern from LoRA to define exceptions (see lora.model.py -> _create_and_replace). WDYT?
Example config:

OSFConfig(
  target_modules='all-linear',
  topk_r=None,
  rank_pattern={
    'q_proj': 10,
  }
)

It's not possible to use more than one adapter of OSF since the base model is modified and we therefore cannot switch between adapters (could be handy in pipeline scenarios where one model is used at several places with different adapters, for example). I left a comment at decompose_weight_matrix to discuss this.

Once we're done with the general implementation I think it'd be super if we could add an experiment to the MetaMathQA comparison suite so that we can compare OSF directly to other implementations.

examples/orthogonal_subspace_learning/README.md

src/peft/__init__.py

src/peft/utils/osf_utils.py

NikhilNayak-debug · 2025-08-12T20:01:48Z

Once we're done with the general implementation I think it'd be super if we could add an experiment to the MetaMathQA comparison suite so that we can compare OSF directly to other implementations.

Awesome will definitely evaluate our method once the implementation is complete to benchmark OSF against other methods in PEFT.

NikhilNayak-debug · 2025-08-14T16:04:32Z

ModelWithOSF seems to re-invent PEFT functionality inside PEFT, specifically the layer targeting + replacement portion. Let's streamline OSF with other tuners, i.e. have implementations for specific layers and by implementing inject_adapter, _create_new_module and _create_and_replace to make it easier to branch out to other layer types / quantizations. The LoRA implementation maybe helpful, e.g. peft.tuners.lora.layers.LoraLayer contains specific layers for Linear and Conv*d specifics (no need to implement Conv now, of course). I can see that this conflicts with using a dict for specifying the top-k ranks per module. How about using target_modules and a singular value for the topk rank (e.g., config.topk_r) which can default to None (-> uses 50% of min(shape)). Every targeted module gets that topk rank or an automatic 50% one. We could also add something like rank_pattern from LoRA to define exceptions (see lora.model.py -> _create_and_replace). WDYT?
Example config:
OSFConfig(
  target_modules='all-linear',
  topk_r=None,
  rank_pattern={
    'q_proj': 10,
  }
)

@githubnemo great suggestion in response to the first bigger topic raised I have implemented the minimal PEFT integration changes:

What we implemented:

✅ OSF layer classes (OSFLayer, Linear) similar to LoRA's structure
✅ _create_and_replace method for proper layer replacement following PEFT patterns
✅ Updated config to use target_modules and effective_rank (renamed from topk_r)
✅ Added rank_pattern support for per-module rank exceptions, just like LoRA

Scope decisions we made:

Only implemented _create_and_replace (not inject_adapter or _create_new_module) since OSF's use case only requires layer replacement as of now
Kept existing functionality intact - all SVD decomposition, gradient projection, and hook management preserved as is

Key files changed:

src/peft/tuners/osf/layer.py - New OSF layer classes
src/peft/tuners/osf/model.py - Added _create_and_replace method
src/peft/tuners/osf/config.py - Updated config format
src/peft/utils/constants.py - Added TRANSFORMERS_MODELS_TO_OSF_TARGET_MODULES_MAPPING

These changes integrate the OSF method modularly into PEFT.

githubnemo

Thanks for the detailed feedback and your changes.

I think that the re-structuring of OSFModel is almost complete and most of the comments are rather minor. As far as I can see the adhoc ModelWithOSF is replaced by OSFModel and OSFLayer and can be removed - good progress!

I think this is a good time remove outdated code, to merge with main, run make style and run the tests to see if there's still something going horribly wrong.

Let's discuss whether we want to implement the importance score now or leave it up for implementation later. If I'm not mistaken I think that the importance score can technically be added later since it would compute the effective rank of layers based on two new hyper-parameters, so in that sense it is modular. Since it is quite a crucial part of the paper and is touted to improve multi-task learning (arguably one of the big selling points of OSF) I wonder if it should be included from the get-go. What's your opinion on that?

Regardless, I think we can a MetaMathQA experiment rather soon and check if there are major problems with memory consumption or runtime.

docs/source/package_reference/osf.md

githubnemo · 2025-08-22T09:55:01Z

examples/orthogonal_subspace_learning/README.md

+- Complete continual learning scenario with multiple tasks
+- Demonstration of OSF's catastrophic forgetting prevention
+- Configuration examples (target_modules, effective_rank, rank_pattern)
+- Performance comparison with baseline methods


I think the performance comparison with baseline methods - at least for single tasks - is best done in the PEFT method comparison (MetaMathQA). Of course, feel free to provide a comparison with methods for support multi-task learning if it fits into the example without too much effort.

src/peft/tuners/osf/layer.py

src/peft/tuners/osf/model.py

src/peft/tuners/osf/layer.py

docs/source/package_reference/osf.md

githubnemo · 2025-08-25T12:31:55Z

src/peft/utils/osf_utils.py

+    svd = {
+        "U_high": U[:, :k].contiguous().detach().to(device=device_local, dtype=orig_dtype),
+        "S_high": S[:k].contiguous().detach().to(device=device_local, dtype=orig_dtype),
+        "V_high": Vt[:k, :].contiguous().detach().to(device=device_local, dtype=orig_dtype),
+        "U_low": nn.Parameter(U[:, k:].contiguous().detach().to(device=device_local, dtype=orig_dtype)),
+        "S_low": nn.Parameter(S[k:].contiguous().detach().to(device=device_local, dtype=orig_dtype)),
+        "V_low": nn.Parameter(Vt[k:, :].contiguous().detach().to(device=device_local, dtype=orig_dtype)),
+        "rank_high": k,
+    }


Thank you for the detailed explanation!

The sequential dependency of later added adapters to previous adapters removes a lot of the convenience gained by being able to remove individual adapters, I agree.

I'm OK with not implementing this.

NikhilNayak-debug · 2025-09-16T17:38:16Z

@githubnemo added MetaMathQA experiment results. OSF achieves the highest accuracy at 55.72% among all PEFT methods in the benchmark! 😊

Top results for comparison:

osf--llama-3.2-3B-default: 0.5572
lora--llama-3.2-3B-rank64-rslora: 0.5299
bone--llama-3.2-3B-bat: 0.5171
bone--llama-3.2-3B-default: 0.5080
randlora--llama-3.2-3B-default: 0.5072

Memory consumption and runtime look okay thus far as well.

githubnemo · 2025-09-22T13:40:23Z

@NikhilNayak-debug very nice results! :)

Is this ready for review from your side? If so, could you merge main and resolve the merge conflicts? This saves one review cycle.

NikhilNayak-debug · 2025-09-23T23:09:04Z

@githubnemo thank you. I have rebased the branch on top of the latest upstream main and resolved the conflicts. This is ready for review now.

githubnemo

Sorry for the late reply, I was at a conference.

The changes look very good! There was quite a large PR merged in the mean time that refactored a good portion of the BaseTuner infrastructure (#2771) which means that you need a lot less code now - I hope I highlighted all occurrences.

I'm currently in the process of reproducing the MetaMathQA results you posted. One thing I noticed is that there are more layers targeted and the default effective rank (min(shape) // 2) is used which is using way more parameters than other methods. While it is certainly good to see that OSF is better than full fine-tuning it would be a fairer comparison to match the trainable parameter counts of the other methods.

docs/source/package_reference/osf.md

src/peft/tuners/osf/model.py

tests/testing_common.py

src/peft/tuners/osf/model.py

src/peft/tuners/osf/utils.py

src/peft/tuners/osf/model.py

src/peft/tuners/osf/layer.py

NikhilNayak-debug · 2025-10-16T03:13:12Z

Sorry for the late reply, I was at a conference.

The changes look very good! There was quite a large PR merged in the mean time that refactored a good portion of the BaseTuner infrastructure (#2771) which means that you need a lot less code now - I hope I highlighted all occurrences.

I'm currently in the process of reproducing the MetaMathQA results you posted. One thing I noticed is that there are more layers targeted and the default effective rank (min(shape) // 2) is used which is using way more parameters than other methods. While it is certainly good to see that OSF is better than full fine-tuning it would be a fairer comparison to match the trainable parameter counts of the other methods.

@githubnemo no worries thank you so much for the comments. I have rebased and simplified the code as you suggested. I also reran the experiments using rank 128 applied to the MLP and attention matrices, the results are available in method_comparison/MetaMathQA/results/osf--llama-3.2-3B-rank128.json.

We do need a higher rank with OSF because the most important directions are intentionally fixed to reduce forgetting. Since learning occurs in the lower-singular-value subspace, a higher trainable rank is required to capture meaningful updates. The upside is that this approach retains performance on previous tasks, unlike LoRA for example which updates those high importance directions and tends to cause more forgetting.

githubnemo · 2025-10-16T14:55:11Z

Thanks for the updated experiments and the fixes!

I think this can be merged after these few nits are fixed. It would be good to have a runnable example but if you want we can merge this first and do the example in a separate PR, whichever suits you best. It could also be a chance to showcase the continual learning benefits of OSF to contrast the weaker parameter efficiency.

Before merging, let's remove the MetaMathQA results since we're running them on our runner after merging anyway.
Let's also remove the big MetaMathQA experiment since it won't run on our runner and is not comparable to the other experiments and would, therefore, rot far away from others as a lone outlier.

Don't forget to run make style so that we can run the CI and check if there are any open issues from there.

NikhilNayak-debug · 2025-10-16T18:51:15Z

Thanks for the updated experiments and the fixes!

I think this can be merged after these few nits are fixed. It would be good to have a runnable example but if you want we can merge this first and do the example in a separate PR, whichever suits you best. It could also be a chance to showcase the continual learning benefits of OSF to contrast the weaker parameter efficiency.

Before merging, let's remove the MetaMathQA results since we're running them on our runner after merging anyway. Let's also remove the big MetaMathQA experiment since it won't run on our runner and is not comparable to the other experiments and would, therefore, rot far away from others as a lone outlier.

Don't forget to run make style so that we can run the CI and check if there are any open issues from there.

@githubnemo thanks! I will add a small runnable OSF continual-learning example in a follow‑up PR (sequential tasks, increasing effective_rank using model.unload(), and with a Full Finetuning baseline for contrast). For this PR: I removed the MetaMathQA results and the large MetaMathQA experiment as requested, keeping only the rank 128 experiment.

NikhilNayak-debug · 2025-10-16T19:04:43Z

@githubnemo this is ready for review again. Let me know if there are any other changes that need to be made!

githubnemo

Nice :) Only a few comments left.

Please check the CI as there seem to be some errors left.

docs/source/_toctree.yml

src/peft/tuners/osf/config.py

rebasing to make use of simplified basetuner implementation and adding more experiment results fixing style, quality, etc in the code Make style fixing CI and other test cases

NikhilNayak-debug · 2025-10-20T21:05:51Z

@githubnemo this is ready for review. There are 36 test cases failing, but they are also failing in the upstream main branch so they are not related to the OSF changes. All CI related issues should be addressed now.

HuggingFaceDocBuilderDev · 2025-10-21T14:28:32Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

NikhilNayak-debug · 2025-10-21T16:09:50Z

@githubnemo could you start the workflow again, fixed a couple of remaining test case errors!

githubnemo · 2025-10-22T09:55:35Z

Sure thing. You can also run this using pytest tests, you can also select for OSF using pytest -k osf tests

NikhilNayak-debug · 2025-10-22T15:02:30Z

Sure thing. You can also run this using pytest tests, you can also select for OSF using pytest -k osf tests

@githubnemo okay yeah OSF related tests are passing locally and I don't see any remaining failing tests in the CI. The one failing check seems to be unrelated. How do we proceed from here?

githubnemo

Thank you! This looks good to me :)

Let's merge this and work on the example in a new PR. I think this is a nice addition, thanks again for your continued efforts!

NikhilNayak-debug mentioned this pull request Jul 31, 2025

Feature Request: Add Adaptive Singular Value Decomposition based Orthogonal Subspace Fine-Tuning #2648

Open

githubnemo reviewed Aug 7, 2025

View reviewed changes

githubnemo reviewed Aug 25, 2025

View reviewed changes

NikhilNayak-debug force-pushed the orthogonal-subspace-learning branch from 2d435a5 to 372a375 Compare September 23, 2025 23:02

githubnemo reviewed Oct 7, 2025

View reviewed changes

NikhilNayak-debug force-pushed the orthogonal-subspace-learning branch from 372a375 to 00073fe Compare October 15, 2025 15:21

githubnemo reviewed Oct 17, 2025

View reviewed changes

docs/source/_toctree.yml Outdated Show resolved Hide resolved

src/peft/tuners/osf/config.py Outdated Show resolved Hide resolved

Orthogonal Subspace Learning: changes for the OSF method

2418375

rebasing to make use of simplified basetuner implementation and adding more experiment results fixing style, quality, etc in the code Make style fixing CI and other test cases

NikhilNayak-debug force-pushed the orthogonal-subspace-learning branch from 89c3113 to 2418375 Compare October 20, 2025 20:58

fixing couple of test errors

fabbf33

githubnemo approved these changes Oct 23, 2025

View reviewed changes

githubnemo merged commit 546927d into huggingface:main Oct 23, 2025
12 of 13 checks passed

Add Orthogonal Subspace Fine-Tuning (OSF) Tuner for Parameter-Efficient Continual Learning #2685

Add Orthogonal Subspace Fine-Tuning (OSF) Tuner for Parameter-Efficient Continual Learning #2685

Conversation

NikhilNayak-debug commented Jul 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

githubnemo left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

NikhilNayak-debug commented Aug 12, 2025

Uh oh!

NikhilNayak-debug commented Aug 14, 2025

Uh oh!

githubnemo left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

githubnemo Aug 22, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

githubnemo Aug 25, 2025

Choose a reason for hiding this comment

Uh oh!

NikhilNayak-debug commented Sep 16, 2025

Uh oh!

githubnemo commented Sep 22, 2025

Uh oh!

NikhilNayak-debug commented Sep 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

githubnemo left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

NikhilNayak-debug commented Oct 16, 2025

Uh oh!

githubnemo commented Oct 16, 2025

Uh oh!

NikhilNayak-debug commented Oct 16, 2025

Uh oh!

NikhilNayak-debug commented Oct 16, 2025

Uh oh!

githubnemo left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

NikhilNayak-debug commented Oct 20, 2025

Uh oh!

HuggingFaceDocBuilderDev commented Oct 21, 2025

Uh oh!

NikhilNayak-debug commented Oct 21, 2025

Uh oh!

githubnemo commented Oct 22, 2025

NikhilNayak-debug commented Jul 31, 2025 •

edited

Loading

NikhilNayak-debug commented Sep 23, 2025 •

edited

Loading