4 changes: 0 additions & 4 deletions docs/source/_toctree.yml
@@ -108,8 +108,6 @@
title: P-tuning
- local: package_reference/prefix_tuning
title: Prefix tuning
- local: package_reference/cartridges
title: Cartridges
- local: package_reference/prompt_tuning
title: Prompt tuning
- local: package_reference/layernorm_tuning
@@ -155,7 +153,5 @@
title: Hotswapping adapters
- local: package_reference/functional
title: Functions for PEFT integration
- local: package_reference/lora_conversion
title: Converting non-LoRA adapters to LoRA
title: Utilities
title: API reference
2 changes: 0 additions & 2 deletions docs/source/package_reference/hotswap.md
@@ -69,8 +69,6 @@ Hotswapping works with transformers models and diffusers models. However, there
- It only works for the same PEFT method, so no swapping LoRA and LoHa, for example.
- The adapter that is being swapped in must target the same layers as the previous adapter or a subset of those layers. It cannot target new layers. Therefore, if possible, start with the adapter that targets most layers.

## API

[[autodoc]] utils.hotswap.hotswap_adapter
- all

31 changes: 1 addition & 30 deletions docs/source/package_reference/prefix_tuning.md
@@ -18,35 +18,6 @@ rendered properly in your Markdown viewer.

[Prefix tuning](https://hf.co/papers/2101.00190) prefixes a series of task-specific vectors to the input sequence that can be learned while keeping the pretrained model frozen. The prefix parameters are inserted in all of the model layers.

## Initialize from a KV cache prefix

By default, prefix tuning is randomly initialized.

PEFT also provides utilities to initialize a prefix-tuning adapter from an existing KV cache prefix (for example, from
the first `p` tokens of a prompt/corpus). This is only supported when `prefix_projection=False` (the default), because
in that case the learned parameters are the KV prefix itself.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

from peft import PrefixTuningConfig, get_peft_model, initialize_kv_prefix_from_text

base = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")

peft_cfg = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=20, prefix_projection=False)
model = get_peft_model(base, peft_cfg)

initialize_kv_prefix_from_text(
    model,
    tok,
    text="...a long context with at least num_virtual_tokens tokens...",
    use_chat_template=False,
)
```

Make sure the text is long enough to produce at least `num_virtual_tokens` tokens, otherwise initialization will fail.

The abstract from the paper is:

*Fine-tuning is the de facto way to leverage large pretrained language models to perform downstream tasks. However, it modifies all the language model parameters and therefore necessitates storing a full copy for each task. In this paper, we propose prefix-tuning, a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen, but optimizes a small continuous task-specific vector (called the prefix). Prefix-tuning draws inspiration from prompting, allowing subsequent tokens to attend to this prefix as if it were "virtual tokens". We apply prefix-tuning to GPT-2 for table-to-text generation and to BART for summarization. We find that by learning only 0.1\% of the parameters, prefix-tuning obtains comparable performance in the full data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics unseen during training*.
@@ -57,4 +28,4 @@ The abstract from the paper is:

## PrefixEncoder

[[autodoc]] tuners.prefix_tuning.model.PrefixEncoder
[[autodoc]] tuners.prefix_tuning.model.PrefixEncoder
187 changes: 187 additions & 0 deletions examples/adamss_finetuning/README.md
@@ -0,0 +1,187 @@
# AdaMSS Fine-tuning

## Introduction

AdaMSS (Adaptive Multi-Subspace approach) is a parameter-efficient fine-tuning method that uses SVD to decompose weight matrices into low-rank subspaces. It trains only **~0.07%** as many parameters as full fine-tuning (e.g., 59K for ViT-Base vs. 86M) while maintaining competitive performance.

The method optionally supports **ASA** (Adaptive Subspace Allocation) for dynamic subspace selection during training, further improving efficiency and performance.

See the [paper](https://neurips.cc/virtual/2025/poster/119606) for more details.


## Installation & Quick Test

Install from local source:
```bash
cd peft-main && pip install -e .
pip install transformers datasets torch torchvision evaluate accelerate
```

Verify installation:
```bash
python -c "from peft import AdaMSSConfig, ASACallback; print('AdaMSS ready')"
```

## Detailed Code Explanation

**Core AdaMSS Configuration:**
```python
from peft import AdaMSSConfig, get_peft_model, ASACallback

# Configure AdaMSS with ASA
config = AdaMSSConfig(
    r=100,                              # SVD rank (full decomposition rank)
    num_subspaces=10,                   # Number of subspaces (K) - initial capacity
    subspace_rank=3,                    # Rank per subspace (ri) - use 1 for NLU, 3 for Vision
    target_modules=["query", "value"],  # Target attention layers
    use_asa=True,                       # Enable Adaptive Subspace Allocation
    target_kk=5,                        # Target active subspaces (ASA reduces K→5)
    modules_to_save=["classifier"],     # Modules to train without decomposition
)
peft_model = get_peft_model(model, config)
```
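
After wrapping the model, you can check how small the adapter actually is with PEFT's standard helper (assuming AdaMSS reports its parameters like other PEFT methods):

```python
# Print the trainable-parameter breakdown; only the AdaMSS subspace parameters
# and the `modules_to_save` entries (here, the classifier head) should be trainable.
peft_model.print_trainable_parameters()
# Output has the form: trainable params: ... || all params: ... || trainable%: ...
```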

**ASA Callback Setup:**
```python
asa_callback = ASACallback(
    target_kk=5,        # Gradually mask to 5 active subspaces
    init_warmup=50,     # Start ASA after 50 steps (Vision) or 5 epochs (NLU)
    final_warmup=1000,  # Complete masking by step 1000 (Vision) or epoch 95 (NLU)
    mask_interval=100,  # Update mask every 100 steps (Vision) or 10 epochs (NLU)
    verbose=True,       # Print ASA progress
)

# Integrate with Trainer
trainer = Trainer(
    model=peft_model,
    callbacks=[asa_callback],  # Add ASA callback
    # ... other arguments
)
```
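
After training, the adapter can be saved and re-attached on its own. The sketch below assumes AdaMSS follows the standard PEFT save/load flow (`save_pretrained` / `PeftModel.from_pretrained`); the ViT checkpoint is used purely as an illustration:

```python
from transformers import AutoModelForImageClassification

from peft import PeftModel

# Save only the (small) AdaMSS adapter weights, not the full base model.
peft_model.save_pretrained("./adamss_adapter")

# Later: load a fresh copy of the base model, then attach the trained adapter.
base_model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k", num_labels=10
)
restored = PeftModel.from_pretrained(base_model, "./adamss_adapter")
```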

**Key Points:**
- **Parameterization**: Total params = `r × (d_in + d_out)`, split into K subspaces of rank `ri` each
- **ASA Mechanism**: Dynamically selects `target_kk` most important subspaces from initial `num_subspaces`
- **Warmup Schedule**: ASA gradually increases masking strength from `init_warmup` to `final_warmup`
- **Vision vs NLU**: Use `subspace_rank=3` for vision, `subspace_rank=1` for NLU tasks
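
To see how the pieces fit together, here is a minimal end-to-end sketch for a GLUE-style run (CoLA with RoBERTa-base, mirroring the hyperparameters used in the results below). It is only a sketch, not a replacement for the example scripts: dataset handling is kept deliberately simple, and the ASA schedule reuses the step-based values from the callback example above.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

from peft import AdaMSSConfig, ASACallback, get_peft_model

# Base model and tokenizer (CoLA is a binary acceptability task).
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
base_model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# AdaMSS with ASA, using the NLU settings from this README (ri=1, K=10 -> 5).
config = AdaMSSConfig(
    r=100,
    num_subspaces=10,
    subspace_rank=1,
    target_modules=["query", "value"],
    use_asa=True,
    target_kk=5,
    modules_to_save=["classifier"],
)
model = get_peft_model(base_model, config)

# Tokenize CoLA; the Trainer picks up the "label" column automatically.
dataset = load_dataset("glue", "cola").map(
    lambda batch: tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

training_args = TrainingArguments(
    output_dir="./output_cola_asa",
    per_device_train_batch_size=32,
    num_train_epochs=100,
    warmup_ratio=0.06,
    logging_steps=50,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    callbacks=[ASACallback(target_kk=5, init_warmup=50, final_warmup=1000, mask_interval=100, verbose=True)],
)
trainer.train()
```

For vision runs, swap in `AutoModelForImageClassification`, set `subspace_rank=3`, and add image preprocessing and a collate function, as in the example script below.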

## Use the training example scripts

### Vision Tasks (Image Classification)

Run the provided script with your configuration:
```bash
python examples/adamss_finetuning/image_classification_adamss_asa.py \
--model_name_or_path google/vit-base-patch16-224-in21k \
--dataset_name cifar10 \
--adamss_r 100 \
--adamss_k 10 \
--adamss_ri 3 \
--use_asa \
--target_kk 5 \
--output_dir ./output
```

### NLU Tasks (GLUE Benchmark)

Run GLUE tasks (e.g., CoLA) with ASA:
```bash
python examples/adamss_finetuning/glue_adamss_asa_example.py \
--dataset_name cola \
--adamss_r 100 \
--adamss_k 10 \
--adamss_ri 1 \
--use_asa \
--target_kk 5 \
--num_epochs 100 \
--batch_size 32 \
--output_dir ./output_cola_asa
```

Without ASA (fixed K=10):
```bash
python examples/adamss_finetuning/glue_adamss_asa_example.py \
--dataset_name cola \
--adamss_r 100 \
--adamss_k 10 \
--adamss_ri 1 \
--num_epochs 100 \
--batch_size 32 \
--output_dir ./output_cola_no_asa
```

### AdaMSSConfig Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `r` | int | 100 | SVD decomposition rank |
| `num_subspaces` | int | 10 | Number of subspaces (K) |
| `subspace_rank` | int | 3 | Rank per subspace (ri) |
| `target_modules` | list | - | Modules to apply AdaMSS to (e.g., `["query", "value"]`) |
| `use_asa` | bool | False | Enable Adaptive Subspace Allocation |
| `target_kk` | int | None | Target active subspaces when ASA enabled |
| `modules_to_save` | list | None | Modules to train without decomposition |
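
For reference, the "Without ASA" command above corresponds to a configuration that simply omits the ASA-related arguments; a minimal sketch:

```python
from peft import AdaMSSConfig

# Fixed capacity: all K=10 subspaces stay active for the whole run.
config = AdaMSSConfig(
    r=100,
    num_subspaces=10,
    subspace_rank=1,  # 1 for NLU; use 3 for vision tasks
    target_modules=["query", "value"],
    modules_to_save=["classifier"],
)
```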

### ASACallback Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `target_kk` | int | - | Target number of active subspaces |
| `init_warmup` | int | 50 | Steps before starting masking |
| `final_warmup` | int | 1000 | Steps to reach target active subspaces |
| `mask_interval` | int | 100 | Steps between subspace selection updates |
| `beta1` | float | 0.85 | EMA decay for importance tracking |
| `beta2` | float | 0.85 | EMA decay for uncertainty tracking |
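
The two EMA factors control how quickly the importance and uncertainty statistics adapt during training; they default to 0.85 but can be set explicitly, as in this sketch:

```python
from peft import ASACallback

asa_callback = ASACallback(
    target_kk=5,
    init_warmup=50,
    final_warmup=1000,
    mask_interval=100,
    beta1=0.85,  # EMA decay for subspace-importance tracking
    beta2=0.85,  # EMA decay for uncertainty tracking
)
```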


## Experimental Results

### NLU Tasks (GLUE Benchmark)

Results with AdaMSS + ASA (100 epochs, seed=0):

| Task | Model | AdaMSS Params | Metric | Score |
|------|-------|---------------|--------|-------|
| CoLA | RoBERTa-base | 27.0K (ASA K→5) | Matthews | **0.6466** |
| CoLA | RoBERTa-large | 64.8K (ASA K→5) | Matthews | **0.7093** |
| MRPC | RoBERTa-base | 27.2K (ASA K→5) | Accuracy | **0.8824** |
| MRPC | RoBERTa-large | 66.7K (ASA K→5) | Accuracy | **0.9044** |

**Notes:**
- Configuration: r=100, K=10→5 (ASA), ri=1
- Reported parameter counts are the active AdaMSS parameters after ASA selects 5 of the 10 subspaces
- Full AdaMSS capacity: 97K (large) / 42K (base)
- Training: 100 epochs, batch_size=32, warmup_ratio=0.06

### Vision Tasks (Image Classification)

Results with AdaMSS on Stanford Cars (10 epochs, seed=0):

| Model | Method | AdaMSS Params | Test Accuracy |
|-------|--------|---------------|---------------|
| ViT-Base | AdaMSS (no ASA) | 121K (K=10) | **82.15%** |
| ViT-Base | AdaMSS + ASA | 75.0K (K→5) | **80.45%** |

**Notes:**
- Configuration: r=100, K=10, ri=3, 10 epochs, batch_size=32
- ASA dynamically selects 5 out of 10 subspaces (75K active from 121K total)



## Citation

If you use AdaMSS in your research, please cite:

```bibtex
@inproceedings{zheng2025adamss,
title={AdaMSS: Adaptive Multi-Subspace Approach for Parameter-Efficient Fine-Tuning},
author={Zheng, Jingjing and Lu, Wanglong and Dong, Yiming and Ji, Chaojie and Cao, Yankai and Lin, Zhouchen},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
}
```

## Reference

- [AdaMSS Paper](https://neurips.cc/virtual/2025/loc/san-diego/poster/119606)
- [PEFT Documentation](https://huggingface.co/docs/peft)