docs/source/en/trainer.md (131 additions, 0 deletions)

@@ -443,6 +443,137 @@ trainer.train()

Note that layerwise optimization is somewhat experimental and does not support DDP (Distributed Data Parallel), so you can run the training script only on a single GPU. Please see [this section of the GaLore documentation](https://github.com/jiaweizzhao/GaLore?tab=readme-ov-file#train-7b-model-with-a-single-gpu-with-24gb-memory) for more details. Other features such as gradient clipping and DeepSpeed might not be supported out of the box. Please [raise an issue on GitHub](https://github.com/huggingface/transformers/issues) if you encounter such an issue.


### APOLLO

Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO) is a memory-efficient low-rank training strategy that allows full-parameter learning for both pre-training and fine-tuning, while maintaining AdamW-level performance with SGD-like memory cost.

* **Ultra-low rank efficiency** → requires a much lower rank than GaLore; even rank 1 (APOLLO-Mini) suffices.
* **No expensive SVD computations** → unlike GaLore, APOLLO leverages random projection, avoiding training stalls (see the sketch below).
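To see why this is cheap, here is a minimal sketch of one APOLLO-style update step, assuming a rank-`r` random projection and the tensor-wise scaling used by APOLLO-Mini. Bias correction, the channel-wise variant, and the periodic refresh of the projection are omitted, and the function is illustrative rather than the library's actual implementation:

```python
import torch

def apollo_scaled_grad(grad, proj, exp_avg, exp_avg_sq, betas=(0.9, 0.999), eps=1e-8):
    # Compress the raw gradient (n x m) into a low-rank space (r x m)
    low_rank_grad = proj @ grad
    # Adam-style moment updates, kept only in the compressed space
    exp_avg.mul_(betas[0]).add_(low_rank_grad, alpha=1 - betas[0])
    exp_avg_sq.mul_(betas[1]).addcmul_(low_rank_grad, low_rank_grad, value=1 - betas[1])
    adapted = exp_avg / (exp_avg_sq.sqrt() + eps)
    # One scalar per tensor (APOLLO-Mini); APOLLO proper uses per-channel norms
    scale = adapted.norm() / (low_rank_grad.norm() + eps)
    # Scale the raw gradient instead of reconstructing a low-rank update
    return grad * scale

# Toy usage: rank-1 projection of a 64x64 gradient
n, m, rank = 64, 64, 1
proj = torch.randn(rank, n) / rank**0.5
exp_avg, exp_avg_sq = torch.zeros(rank, m), torch.zeros(rank, m)
scaled = apollo_scaled_grad(torch.randn(n, m), proj, exp_avg, exp_avg_sq)
```

Because only the low-rank `r x m` optimizer states (plus the projection) are stored instead of two full-size AdamW moment tensors, the memory footprint approaches SGD while updates remain adaptively scaled.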

First make sure to install APOLLO from its official repository:

```bash
pip install apollo-torch
```

Then add one of the APOLLO optimizers, e.g. `"apollo_adamw"`, as the `optim` argument together with `optim_target_modules`, which can be a list of strings, regexes, or full paths corresponding to the target module names you want to adapt. Below is an end-to-end example script (make sure to `pip install trl datasets`):

```python
import torch
import datasets
import trl

from transformers import TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForCausalLM

train_dataset = datasets.load_dataset('imdb', split='train')

args = TrainingArguments(
    output_dir="./test-apollo",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="apollo_adamw",
    optim_target_modules=[r".*.attn.*", r".*.mlp.*"]
)

model_id = "google/gemma-2b"

config = AutoConfig.from_pretrained(model_id)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_config(config).to(0)

trainer = trl.SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=512,
)

trainer.train()
```

To pass extra arguments supported by APOLLO, such as the projection `rank`, the number of steps between projection refreshes (`update_proj_gap`), and whether scaling is computed per tensor or per channel (`scale_type`), pass them through `optim_args`, for example:

```python
import torch
import datasets
import trl

from transformers import TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForCausalLM

train_dataset = datasets.load_dataset('imdb', split='train')

args = TrainingArguments(
    output_dir="./test-apollo",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="apollo_adamw",
    optim_target_modules=[r".*.attn.*", r".*.mlp.*"],
    optim_args="proj=random,scale_type=tensor,rank=128,update_proj_gap=100,scale=1.0",
)

model_id = "google/gemma-2b"

config = AutoConfig.from_pretrained(model_id)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_config(config).to(0)

trainer = trl.SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=512,
)

trainer.train()
```
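For instance, an APOLLO-Mini-style configuration uses rank 1 with tensor-wise scaling. The `scale` and `update_proj_gap` values below are illustrative and should be tuned for your setup:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./test-apollo-mini",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="apollo_adamw",
    optim_target_modules=[r".*.attn.*", r".*.mlp.*"],
    # rank=1 keeps the optimizer state tiny; a larger `scale` compensates for the
    # aggressive compression (values are illustrative, not tuned defaults)
    optim_args="proj=random,scale_type=tensor,rank=1,update_proj_gap=200,scale=128.0",
)
```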

Currently, only Linear layers are optimized with APOLLO, while the remaining modules are still optimized with AdamW.

You can read more about the method in the [original repository](https://github.com/zhuhanqing/APOLLO) or the [paper](https://arxiv.org/abs/2412.05270).


You can also perform layer-wise APOLLO by simply appending `_layerwise` to the optimizer name, as below. As with layer-wise GaLore above, layer-wise optimization is experimental and is not expected to support DDP, so plan on running the training script on a single GPU:

```python
import torch
import datasets
import trl

from transformers import TrainingArguments, AutoTokenizer, AutoModelForCausalLM

train_dataset = datasets.load_dataset('imdb', split='train')

args = TrainingArguments(
    output_dir="./test-apollo",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="apollo_adamw_layerwise",
    optim_target_modules=[r".*.attn.*", r".*.mlp.*"]
)

model_id = "google/gemma-2b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True).to(0)

trainer = trl.SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=512,
)

trainer.train()
```


### LOMO optimizer

The LOMO optimizers have been introduced in [Full Parameter Fine-Tuning for Large Language Models with Limited Resources](https://hf.co/papers/2306.09782) and [AdaLomo: Low-memory Optimization with Adaptive Learning Rate](https://hf.co/papers/2310.10195).
src/transformers/testing_utils.py (9 additions, 0 deletions)

@@ -62,6 +62,7 @@
    GGUF_MIN_VERSION,
    is_accelerate_available,
    is_apex_available,
    is_apollo_torch_available,
    is_aqlm_available,
    is_auto_awq_available,
    is_auto_gptq_available,
@@ -403,6 +404,14 @@ def require_galore_torch(test_case):
    return unittest.skipUnless(is_galore_torch_available(), "test requires GaLore")(test_case)


def require_apollo_torch(test_case):
    """
    Decorator marking a test that requires APOLLO. These tests are skipped when apollo-torch isn't installed.
    https://github.com/zhuhanqing/APOLLO
    """
    return unittest.skipUnless(is_apollo_torch_available(), "test requires APOLLO")(test_case)


def require_lomo(test_case):
    """
    Decorator marking a test that requires LOMO. These tests are skipped when LOMO-optim isn't installed.
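For reference, a test guarded by the new decorator would look like the following sketch (the test class and body are hypothetical, shown only to illustrate how the decorator is used):

```python
import unittest

from transformers.testing_utils import require_apollo_torch


class ApolloOptimizerTest(unittest.TestCase):
    @require_apollo_torch
    def test_apollo_adamw(self):
        # Runs only when apollo-torch is installed; skipped otherwise.
        ...
```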