
Conversation

@PaulAlbert31
Contributor

Follow-up to issue #2441.

This is an implementation of RandLoRA in PEFT, including some support for quantized training.

Some work is left on proper BnB configurations; the current implementation still works with --load_4bit configurations. I would need some guidance on this.

The 4-bit and 8-bit layer classes are, however, implemented.

Similarly to VeRA, RandLoRA uses random bases that are shared between layers. These bases are stored in BufferDict objects, but I had trouble saving them with safetensors. This may be due to an outdated transformers version (4.46.3) used during testing.

Happy to iterate as per feedback.

@PaulAlbert31 PaulAlbert31 marked this pull request as draft March 28, 2025 06:14
Member

@BenjaminBossan BenjaminBossan left a comment

Thanks for the PR. I've only skimmed it so far, but it already looks quite good; well done (good job using the new method registration).

I think the best next step would be to add RandLoRA to the unit testing harness. This will allow us to easily see what already works and what doesn't. For this, check for instance how VeRA was added here:

########
# VeRA #
########
("Vanilla MLP 1 VeRA", "MLP", VeraConfig, {"target_modules": "lin0"}),
("Vanilla MLP 2 VeRA", "MLP", VeraConfig, {"target_modules": ["lin0"]}),
("Vanilla MLP 3 VeRA", "MLP", VeraConfig, {"target_modules": ["lin1"]}),
("Vanilla MLP 4 VeRA", "MLP", VeraConfig, {"target_modules": ["lin0", "lin1"]}),
(
    "Vanilla MLP 5 VeRA",
    "MLP",
    VeraConfig,
    {"target_modules": ["lin0"], "modules_to_save": ["lin1"]},
),
(
    "Embedding + transformers Conv1D 1 VeRA",
    "EmbConv1D",
    VeraConfig,
    {"target_modules": ["conv1d"]},
),

Let's add similar entries for RandLoRA. If we then find that some tests are failing, I'll guide you through the steps needed to make them pass.
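
For illustration, the RandLoRA entries could mirror the VeRA ones above; this is just a sketch to append to the same list of test cases (with RandLoraConfig imported from peft), and the exact kwargs are open to discussion:

############
# RandLoRA #
############
("Vanilla MLP 1 RandLora", "MLP", RandLoraConfig, {"target_modules": "lin0"}),
("Vanilla MLP 2 RandLora", "MLP", RandLoraConfig, {"target_modules": ["lin0"]}),
("Vanilla MLP 3 RandLora", "MLP", RandLoraConfig, {"target_modules": ["lin1"]}),
("Vanilla MLP 4 RandLora", "MLP", RandLoraConfig, {"target_modules": ["lin0", "lin1"]}),
(
    "Vanilla MLP 5 RandLora",
    "MLP",
    RandLoraConfig,
    {"target_modules": ["lin0"], "modules_to_save": ["lin1"]},
),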

Later, we'll probably also have to add some dedicated RandLoRA tests to ensure that the weight sharing works (see https://github.com/huggingface/peft/blob/main/tests/test_vera.py), maybe also something for the special grad handling. But it's fine to leave this for later.

Some work is left to do on proper BnB configurations, the current implementation still works on --load_4bit configurations. Would need some guidance on this.

I'm not quite sure what you mean by that; could you explain further? But we can also leave the bnb integration for later; it might be better to get the normal RandLoRA fully working first.

but I had trouble using safetensors to save them. This may be due to a deprecated transformers version (4.46.3) during testing.

Could you give a small snippet to reproduce this?

Below, I'll list a few more steps required before merging the PR. This is just to get an overview and can be left for later; for now, let's focus on the points mentioned above.

@BenjaminBossan
Member

@PaulAlbert31 Could you please run make style? Otherwise, the CI won't run the tests. Also, please always ping me once your changes are ready.

@PaulAlbert31
Contributor Author

Hey @BenjaminBossan, apologies for the previous update; I am still learning my way around these tools.
I have updated the draft after running make style and fixing errors.

Regarding the safetensors error I mentioned earlier, it seems to go away after upgrading transformers, so nothing to worry about there. It happens when saving the model after training, so I guess this would be covered by the existing tests. Here is a small snippet I used to test.

import torch
from peft import ( 
    RandLoraConfig,
    get_peft_model
)
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, TextDataset, DataCollatorForLanguageModeling

config = RandLoraConfig(
            r=16,
            randlora_alpha=32,
            randlora_dropout=0.05,
            target_modules=["k_proj"],
            bias="none",
            task_type="CAUSAL_LM",
            projection_prng_key=int(torch.exp(torch.tensor(1))*3.1415*1000),
)

model = AutoModelForCausalLM.from_pretrained(
            "Qwen/Qwen2.5-0.5B",
            torch_dtype=torch.float16,
            trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token # Set pad token if it's missing

model = get_peft_model(model, config)

dummy_text = "This is a dummy text for training.\nAnother line of dummy text.\nAnd a third line."
dummy_data_path = "dummy_data.txt" 
with open(dummy_data_path, "w") as f:
    f.write(dummy_text)

train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path=dummy_data_path,
    block_size=1
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

training_args = TrainingArguments(
    output_dir="./output_trainer",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    optim="adamw_torch", 
    save_steps=1000,
    logging_steps=10, 
    learning_rate=1e-4,
    weight_decay=0.01,
    fp16=True, 
    bf16=False,
    max_steps=-1, 
    logging_dir="./logs", 
    dataloader_num_workers=0, 
    save_total_limit=1, 
    save_safetensors=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()
model.save_pretrained("test_randlora")

Regarding quantized training, I meant that I use the load_in_4bit argument of AutoModelForCausalLM.from_pretrained to quantize the base model in my code. My understanding is that a BitsAndBytesConfig is the updated way of doing it.

Since this config interacts with the base model and not the adapter, I guess this is not important for PEFT?

Paul

Member

@BenjaminBossan BenjaminBossan left a comment

Thanks for the updates, the PR is making good progress.

A lot of the added tests are still failing. You can run them locally with:

pytest tests/test_custom_models.py -k randlora -v

The error most of the time is something like:

RuntimeError: The size of tensor a (20) must match the size of tensor b (10) at non-singleton dimension 1

I'm fairly certain that is due to how the slices are determined in get_scaled_bases but I'm not sure what the correct fix is. Could you please investigate?

Regarding the safetensor error I was mentioning earlier, this seems to go away after upgrading transformers so nothing to worry about there.

Nice. I tested your snippet and it passes for me too.

Regarding quantized training, I meant that I use the load_4_bit argument to the AutoModelForCausalLM.from_pretrained

Ah I see, no worries about that. In the PEFT tests, we pass a proper BitsAndBytesConfig via quantization_config. We will later just copy one of the existing PEFT bnb tests and make adjustments for RandLoRA. But as mentioned above, let's first get the non-quantized code fully running.
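
For reference, a minimal sketch of that path (the model id and target modules are placeholders, and the RandLoraConfig defaults from this PR are assumed):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import RandLoraConfig, get_peft_model, prepare_model_for_kbit_training

# quantize the base model through quantization_config rather than a bare load_in_4bit flag
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B", quantization_config=bnb_config, device_map="auto")
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, RandLoraConfig(target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))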

Comment on lines 139 to 144
# check input size
if randlora_A_param.shape[-1] < self.in_features:
    raise ValueError(error_tmpl.format("randlora_A", randlora_A_param.shape[1], self.in_features))
# check output size
if randlora_B_param.shape[0] < self.out_features:
    raise ValueError(error_tmpl.format("randlora_B", randlora_B_param.shape[0], self.out_features))
Member

I wonder if the logic is correct here. Below, we have:

F.linear(F.linear(dropout(x), update_B), update_A), so it seems like B is applied first, so should be compared to in_features, and A is applied last, so should be compared to out_features. WDYT?
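
A quick standalone shape check of that composition (a sketch with placeholder dimensions; inner_dim stands in for the intermediate rank dimension):

import torch
import torch.nn.functional as F

in_features, out_features, inner_dim = 10, 20, 4
x = torch.randn(2, in_features)
update_B = torch.randn(inner_dim, in_features)   # applied first, so its last dim must match in_features
update_A = torch.randn(out_features, inner_dim)  # applied last, so its first dim must match out_features

out = F.linear(F.linear(x, update_B), update_A)
assert out.shape == (2, out_features)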

Contributor Author

update_A will always be used to match the smallest dimension of the weight matrix, to reduce trainable parameters (randlora_gamma is applied to the randlora_A random base and contains most of the trainable parameters).
The size selection for randlora_A and randlora_B was wrong, so I corrected it and added a comment here.

)
if randlora_A_param.shape[0] < self.r[adapter_name]:
    raise ValueError(error_tmpl.format("randlora_A", randlora_A_param.shape[0], self.r[adapter_name]))
if randlora_B_param.shape[1] < self.r[adapter_name]:
Member

I think this should be:

Suggested change
if randlora_B_param.shape[1] < self.r[adapter_name]:
if randlora_B_param.shape[-1] < self.r[adapter_name]:

Contributor Author

Yes good catch, thanks

self.randlora_dropout.update(nn.ModuleDict({adapter_name: randlora_dropout_layer}))

# Actual trainable parameters
n = min(self.in_features, self.out_features) / r
Member

Can we use a more descriptive name than n please?

Contributor Author

Renamed n -> num_bases.

requires_grad=True,
)

self.scaling[adapter_name] = randlora_alpha / r / math.sqrt(self.n) # * 10#/ math.sqrt(self.n)
Member

What about this comment?

Contributor Author

Sorry, that's a leftover from a hyper-parameter study.

if adapter_name in self.randlora_lambda.keys():
    with torch.no_grad():
        nn.init.zeros_(self.randlora_lambda[adapter_name])
        nn.init.zeros_(self.randlora_gamma[adapter_name]).fill_(
Member

Is the zeros_ required if it is followed by a fill_?

Contributor Author

This is what is done in the VeRA implementation. I agree this is not very efficient.
Do you have another suggestion?

Member

Oh I see. zeros_ + fill_ could be replaced by constant_, right?
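
For illustration, a standalone sketch of that replacement (the fill value is a placeholder):

import torch
import torch.nn as nn

gamma = nn.Parameter(torch.empty(4, 8))

# instead of nn.init.zeros_(gamma).fill_(1e-3), a single call does the same:
nn.init.constant_(gamma, 1e-3)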

Comment on lines 306 to 310

# cast back the weights
# TODO: why?, taken from the VeRA implementation
self.randlora_lambda[adapter].data = self.randlora_lambda[adapter].to(dtype)
self.randlora_gamma[adapter].data = self.randlora_gamma[adapter].to(dtype)
Member

This can indeed be removed.

@BenjaminBossan
Member

We had some issues with the CI that should now be fixed, could you please merge with/rebase on the latest main?

@PaulAlbert31
Contributor Author

Hey @BenjaminBossan, thanks a lot for all your comments and catching some errors.
I have now updated the code according to your comments and all tests are passing.
I have also added tests for the sparse and very sparse bases.

Let me know if more changes are needed, and we can move on to more specific tests or to documentation/example scripts.

Member

@BenjaminBossan BenjaminBossan left a comment

Thanks for the updates. The unit tests are now passing, nice. Overall, this already looks quite good, I mostly have smaller comments. The main job is now to extend the testing and to ensure that those tests also pass. For this, please extend the test matrix here:

},
)
CLASSES_MAPPING = {
"ia3": (IA3Config, CONFIG_TESTING_KWARGS[0]),
"lora": (LoraConfig, CONFIG_TESTING_KWARGS[1]),
"prefix_tuning": (PrefixTuningConfig, CONFIG_TESTING_KWARGS[2]),
"prompt_encoder": (PromptEncoderConfig, CONFIG_TESTING_KWARGS[3]),
"prompt_tuning": (PromptTuningConfig, CONFIG_TESTING_KWARGS[4]),
"adalora": (AdaLoraConfig, CONFIG_TESTING_KWARGS[5]),
"boft": (BOFTConfig, CONFIG_TESTING_KWARGS[6]),
"vera": (VeraConfig, CONFIG_TESTING_KWARGS[7]),
"fourierft": (FourierFTConfig, CONFIG_TESTING_KWARGS[8]),
"hra": (HRAConfig, CONFIG_TESTING_KWARGS[9]),
"vblora": (VBLoRAConfig, CONFIG_TESTING_KWARGS[10]),
"oft": (OFTConfig, CONFIG_TESTING_KWARGS[11]),
"bone": (BoneConfig, CONFIG_TESTING_KWARGS[12]),
"lora+trainable_tokens": (LoraConfig, CONFIG_TESTING_KWARGS[13]),
}
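
A sketch of the corresponding RandLoRA entry (the index is hypothetical; it should point at whatever slot the new RandLoRA kwargs take in CONFIG_TESTING_KWARGS):

    "randlora": (RandLoraConfig, CONFIG_TESTING_KWARGS[14]),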

Moreover, let's add some examples for GPU training with and without BNB, like we have for VeRA here:

@pytest.mark.single_gpu_tests
def test_causal_lm_training_vera(self):
    r"""
    Same as test_causal_lm_training but with VeRA
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        model = AutoModelForCausalLM.from_pretrained(
            self.causal_lm_model_id,
            quantization_config=BitsAndBytesConfig(load_in_8bit=True),
            device_map="auto",
        )
        tokenizer = AutoTokenizer.from_pretrained(self.causal_lm_model_id)
        model = prepare_model_for_kbit_training(model)
        config = VeraConfig(
            r=16,
            target_modules=["q_proj", "v_proj"],
            vera_dropout=0.05,
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, config)
        data = load_dataset_english_quotes()
        data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
        trainer = Trainer(
            model=model,
            train_dataset=data["train"],
            args=TrainingArguments(
                per_device_train_batch_size=4,
                gradient_accumulation_steps=4,
                warmup_steps=2,
                max_steps=3,
                learning_rate=2e-4,
                fp16=True,
                logging_steps=1,
                output_dir=tmp_dir,
            ),
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
        )
        model.config.use_cache = False
        trainer.train()
        model.cpu().save_pretrained(tmp_dir)
        assert "adapter_config.json" in os.listdir(tmp_dir)
        assert SAFETENSORS_WEIGHTS_NAME in os.listdir(tmp_dir)
        # assert loss is not None
        assert trainer.state.log_history[-1]["train_loss"] is not None

@pytest.mark.single_gpu_tests
def test_causal_lm_training_4bit_vera(self):
    r"""
    Same as test_causal_lm_training_4bit but with VeRA
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        model = AutoModelForCausalLM.from_pretrained(
            self.causal_lm_model_id,
            quantization_config=BitsAndBytesConfig(load_in_4bit=True),
            device_map="auto",
        )
        tokenizer = AutoTokenizer.from_pretrained(self.causal_lm_model_id)
        model = prepare_model_for_kbit_training(model)
        config = VeraConfig(
            r=16,
            target_modules=["q_proj", "v_proj"],
            vera_dropout=0.05,
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, config)
        data = load_dataset_english_quotes()
        data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
        trainer = Trainer(
            model=model,
            train_dataset=data["train"],
            args=TrainingArguments(
                per_device_train_batch_size=4,
                gradient_accumulation_steps=4,
                warmup_steps=2,
                max_steps=3,
                learning_rate=2e-4,
                fp16=True,
                logging_steps=1,
                output_dir=tmp_dir,
            ),
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
        )
        model.config.use_cache = False
        trainer.train()
        model.cpu().save_pretrained(tmp_dir)
        assert "adapter_config.json" in os.listdir(tmp_dir)
        assert SAFETENSORS_WEIGHTS_NAME in os.listdir(tmp_dir)
        # assert loss is not None
        assert trainer.state.log_history[-1]["train_loss"] is not None

@pytest.mark.multi_gpu_tests
def test_causal_lm_training_multi_gpu_vera(self):
    r"""
    Same as test_causal_lm_training_multi_gpu but with VeRA
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        model = AutoModelForCausalLM.from_pretrained(
            self.causal_lm_model_id,
            device_map="auto",
            quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        )
        assert set(model.hf_device_map.values()) == set(range(device_count))
        model = prepare_model_for_kbit_training(model)
        setattr(model, "model_parallel", True)
        setattr(model, "is_parallelizable", True)
        config = VeraConfig(
            r=16,
            target_modules=["q_proj", "v_proj"],
            vera_dropout=0.05,
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, config)
        data = load_dataset_english_quotes()
        data = data.map(lambda samples: self.tokenizer(samples["quote"]), batched=True)
        trainer = Trainer(
            model=model,
            train_dataset=data["train"],
            args=TrainingArguments(
                per_device_train_batch_size=4,
                gradient_accumulation_steps=4,
                warmup_steps=2,
                max_steps=3,
                learning_rate=2e-4,
                fp16=True,
                logging_steps=1,
                output_dir=tmp_dir,
            ),
            data_collator=DataCollatorForLanguageModeling(self.tokenizer, mlm=False),
        )
        model.config.use_cache = False
        trainer.train()
        model.cpu().save_pretrained(tmp_dir)
        assert "adapter_config.json" in os.listdir(tmp_dir)
        assert SAFETENSORS_WEIGHTS_NAME in os.listdir(tmp_dir)
        # assert loss is not None
        assert trainer.state.log_history[-1]["train_loss"] is not None

@pytest.mark.multi_gpu_tests
def test_causal_lm_training_multi_gpu_4bit_vera(self):
    r"""
    Same as test_causal_lm_training_multi_gpu_4bit but with VeRA
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        model = AutoModelForCausalLM.from_pretrained(
            self.causal_lm_model_id,
            device_map="auto",
            quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        )
        assert set(model.hf_device_map.values()) == set(range(device_count))
        model = prepare_model_for_kbit_training(model)
        setattr(model, "model_parallel", True)
        setattr(model, "is_parallelizable", True)
        config = VeraConfig(
            r=16,
            target_modules=["q_proj", "v_proj"],
            vera_dropout=0.05,
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, config)
        data = load_dataset_english_quotes()
        data = data.map(lambda samples: self.tokenizer(samples["quote"]), batched=True)
        trainer = Trainer(
            model=model,
            train_dataset=data["train"],
            args=TrainingArguments(
                per_device_train_batch_size=4,
                gradient_accumulation_steps=4,
                warmup_steps=2,
                max_steps=3,
                learning_rate=2e-4,
                fp16=True,
                logging_steps=1,
                output_dir=tmp_dir,
            ),
            data_collator=DataCollatorForLanguageModeling(self.tokenizer, mlm=False),
        )
        model.config.use_cache = False
        trainer.train()
        model.cpu().save_pretrained(tmp_dir)
        assert "adapter_config.json" in os.listdir(tmp_dir)
        assert SAFETENSORS_WEIGHTS_NAME in os.listdir(tmp_dir)
        # assert loss is not None
        assert trainer.state.log_history[-1]["train_loss"] is not None

There are also some common GPU tests for VeRA that could be copied for RandLoRA (not high priority for me, though), starting here:

def test_vera_bnb_8bit_quantization(self):

Also, please ensure to run make style on your changes.

I also did a first run of RandLoRA using the PEFT method comparison suite. Here is the direct comparison to LoRA with rank 32:

| metric | LoRA rank 32 | RandLoRA rank 32 |
|---|---|---|
| cuda memory max | 22.3 GB | 22.8 GB |
| cuda memory reserved avg | 11.9 GB | 12.7 GB |
| cuda memory reserved 99th percentile | 17.7 GB | 18.4 GB |
| train loss | 0.607 | 0.618 |
| test accuracy | 47.8% | 46.8% |
| total time (sec) | 1847 | 2064 |

I used the default config and all other settings were also the defaults from that train script. As you can see, RandLoRA is roughly on par, though with slightly worse performance and runtime. If you have a suggestion for better settings, LMK. But even if not, this is just a single data point and not important for merging the PR.

Comment on lines 44 to 49
sparse (`bool`):
    Whether to use sparse random bases as described in the RandLora paper. The current implementation is a
    proof of concept where the sparseness is not used to improve speed or memory usage. Defaults to `False`.
very_sparse (`bool`):
    Whether to use very sparse random bases. The current implementation is a proof of concept where the
    sparseness is not used to improve speed or memory usage. Defaults to `False`.
Member

Since you've already done the work, I'm fine with keeping this. We should just clarify to the user what they should do with these parameters. If you add the info from this comment to the docstrings, it would help.

The name of the adapter for which the delta weight should be computed.
"""

update_B, update_A, dtype = self.get_scaled_bases(adapter)
Member

dtype is not being used.

if active_adapter not in self.randlora_lambda.keys():
    continue

update_B, update_A, dtype = self.get_scaled_bases(active_adapter)
Member

dtype is not being used

for active_adapter in self.active_adapters:
    if active_adapter not in self.randlora_lambda.keys():
        continue
    update_B, update_A, dtype = self.get_scaled_bases(active_adapter)
Member

dtype is not used
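
In all three spots, the unused return value can simply be discarded; a sketch of what that could look like (same call as in the quoted code):

update_B, update_A, _ = self.get_scaled_bases(active_adapter)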


# coding=utf-8
# Copyright 2023-present the HuggingFace Inc. team.
# Copyright 2025-present the HuggingFace Inc. team.
Member

No need for this, please revert.

from transformers.pytorch_utils import Conv1D


sys.path.append(os.path.join(os.getcwd(), "src"))
Member

Please remove.

**config_kwargs,
)
model = model.to(self.torch_device).eval()

Member

Please revert.

Member

You have changed the permissions on these files from 755 to 644. Could you please revert this?

@BenjaminBossan
Member

@PaulAlbert31 You may have seen the merge conflicts in test_decoder_models.py and test_encoder_decoder_models.py. This is because we refactored the test files to get rid of unittest and only use pytest instead. Overall, this should simplify the test structure. Don't hesitate to ping me if you have questions when resolving the conflict.

@PaulAlbert31
Contributor Author

Hey @BenjaminBossan
This new update runs the GPU tests correctly. I have also updated the default hyper-parameters of the RandLoRA config to be more comparable to a rank 32 LoRA. I haven't run the benchmark yet; I'll try to do that next week to ensure a fair comparison.
I haven't rebased on the current main yet to include the new bfloat16 tests; I am still working on that as well.

@PaulAlbert31
Contributor Author

Hey @BenjaminBossan
I have now merged with main and updated hyper-parameters to have better performance on the PEFT method comparison suite.

Here are the numbers for equal trainable parameters to LoRA:

| Metric | LoRA rank 32 | RandLoRA rank 32 |
|---|---|---|
| cuda memory max | 22.3 GB | 22.8 GB |
| cuda memory reserved avg | 11.9 GB | 12.7 GB |
| cuda memory reserved 99th percentile | 17.7 GB | 18.4 GB |
| train loss | 0.607 | 0.577 |
| test accuracy | 47.8% | 52.3% |

I had to pause the LoRA training halfway, so I do not have an accurate training time for it, but it should be close to the results you reported.

I am surprised that the memory usage for RandLoRA is higher than for LoRA; this goes against what I saw in my paper, but that could also be due to specific matrix sizes. I think the memory savings of RandLoRA are less pronounced on square matrices (q in the attention), but I have to write a small script to test this.
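
A minimal sketch of such a script, assuming the RandLoraConfig from this branch is installed; it only counts trainable parameters rather than measuring actual memory, and the layer shapes and r value are placeholders:

import torch.nn as nn
from peft import LoraConfig, RandLoraConfig, get_peft_model

def trainable_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

for in_features, out_features in [(1024, 1024), (1024, 4096)]:  # square vs rectangular layer
    lora = get_peft_model(nn.Sequential(nn.Linear(in_features, out_features)), LoraConfig(r=32, target_modules=["0"]))
    randlora = get_peft_model(nn.Sequential(nn.Linear(in_features, out_features)), RandLoraConfig(r=32, target_modules=["0"]))
    print((in_features, out_features), "LoRA:", trainable_params(lora), "RandLoRA:", trainable_params(randlora))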

Hopefully I conducted the merge with main the right way; let me know if this is not appropriate.

@BenjaminBossan
Member

Hi @PaulAlbert31 thanks for updating the PR. Unfortunately, it looks like something went wrong during the rebase, maybe the main branch was not up-to-date? As is, the PR contains 42 commits and 95 changed files, which makes it basically impossible to review. Could you please:

  1. Ensure that you fetched the latest main branch on your fork.
  2. Rebase against this latest main again.
  3. If you haven't already, make sure that git rerere is enabled, which could come in handy because of the merge conflicts.

This should hopefully get the PR back into a nice state. If worse comes to worst, there is always the option to create a new PR.

Here are the numbers for equal trainable parameters to LoRA:

Thanks for running this. I will also give it another try when the PR is ready.

I am surprised that the memory usage for RandLoRA is higher than LoRA, this goes against what I saw in my paper

Of course, this experiment is just a single data point, so we shouldn't overgeneralize from this.

@PaulAlbert31
Contributor Author

Hey @BenjaminBossan,
I have reset and redone the rebase properly. Things should be more readable now.

All contributions should still be here, but let me know if I am missing some tests. The bfloat16/float16 tests are passing as well.

Let me know if this is ok and we can move on to the next step 👍

Member

@BenjaminBossan BenjaminBossan left a comment

Thank you very much for rebasing again, otherwise it would have been very hard for me to do another in-depth review.

Overall, this PR is in a very good state; most things look very good. I mostly added some smaller comments, and there is an issue with multi-GPU training. Finally, I think we need some tests along the lines of test_vera.py because of the shared variables in RandLoRA. I think those tests can just be copied with small edits and we should be fine. The rest of the tests look good.
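
For example, a sharing test could look roughly like this (a sketch; the attribute names randlora_A/randlora_B and the shared BufferDict layout are assumptions based on this PR):

import torch.nn as nn
from peft import RandLoraConfig, get_peft_model

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin0 = nn.Linear(64, 64)
        self.lin1 = nn.Linear(64, 64)

model = get_peft_model(MLP(), RandLoraConfig(r=8, target_modules=["lin0", "lin1"]))
layer0 = model.base_model.model.lin0
layer1 = model.base_model.model.lin1
# both layers should reference the same shared random bases, not per-layer copies
assert layer0.randlora_A is layer1.randlora_A
assert layer0.randlora_B is layer1.randlora_B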

I will also re-run the method comparison suite and report the results later.

assert trainer.state.log_history[-1]["train_loss"] is not None

@pytest.mark.multi_gpu_tests
def test_causal_lm_training_multi_gpu_RandLora(self):
Member

Suggested change
def test_causal_lm_training_multi_gpu_RandLora(self):
def test_causal_lm_training_multi_gpu_8bit_randlora(self):

Comment on lines 33 to 35
Out = randlora_lambda[:, :, None] * randlora_A * randlora_gamma[None,]
ctx.save_for_backward(randlora_A, randlora_lambda, randlora_gamma)
return Out
Member

Suggested change
Out = randlora_lambda[:, :, None] * randlora_A * randlora_gamma[None,]
ctx.save_for_backward(randlora_A, randlora_lambda, randlora_gamma)
return Out
out = randlora_lambda[:, :, None] * randlora_A * randlora_gamma[None,]
ctx.save_for_backward(randlora_A, randlora_lambda, randlora_gamma)
return out

Or output.

assert trainer.state.log_history[-1]["train_loss"] is not None

@pytest.mark.single_gpu_tests
def test_causal_lm_training_randlora(self):
Member

Suggested change
def test_causal_lm_training_randlora(self):
def test_causal_lm_training_8bit_randlora(self):

Comment on lines 62 to 66
if self.merged:
    warnings.warn(
        f"Already following adapters were merged {','.join(self.merged_adapters)}. "
        f"You are now additionally merging {','.join(self.active_adapters)}."
    )
Member

Remove this, as check_adapters_to_merge already takes care of it (yes, it's not correct in VeRA and should be removed there too). Same for the other classes.


return largest_shape

def _init_randlora_A_randlora_B_sparse(self, config: RandLoraConfig, adapter_name: str, s: int = 3) -> None:
Member

Let's avoid one letter variable names like s and n. Please provide a more expressive name.

self.check_requires_grad(
    peft_model,
    "base_model.model.lin1.vera_lambda_b.default",
    "nobase_model.model.lin1.vera_lambda_b.default",
Member

This change is probably not supposed to be there?

prepare_layer_inputs_fn = fn
else:
prepare_layer_inputs_fn = {k: fn for k in prepare_layer_inputs_keys}
prepare_layer_inputs_fn = dict.fromkeys(prepare_layer_inputs_keys, fn)
Member

Please revert, this ruff rule is excluded in the pyproject.toml so should not be changed by ruff.

@BenjaminBossan
Member

Here are the new results I got, which are pretty much in line with what you posted above.

| metric | LoRA rank 32 | new RandLoRA rank 32 |
|---|---|---|
| cuda memory max | 22.3 GB | 22.8 GB |
| cuda memory reserved avg | 11.9 GB | 12.7 GB |
| cuda memory reserved 99th percentile | 17.7 GB | 18.4 GB |
| train loss | 0.607 | 0.578 |
| test accuracy | 47.8% | 51.9% |
| total time (sec) | 1847 | 1949 |

@PaulAlbert31
Contributor Author

Hey @BenjaminBossan, great to see you were able to reproduce the results!

I have updated the pull request with your suggested changes and added a test_randlora.py test file.

As I commented above, I am not sure how to proceed with the multi-GPU OOM error; let me know if you find an easy fix.

Member

@BenjaminBossan BenjaminBossan left a comment

Thanks for the latest changes and for adding the new tests. I only skimmed them, as they are mostly copy-pasted, right?

Apart from that, I only discovered a small mistake, the rest looks good.

As I commented above, I am not sure how to proceed with the multi-GPU OOM error, let me know if you find an easy fix.

Let's not worry too much about this, since it's only DP, which very few people still use. My suggestion is to just move to a smaller base model in the GPU tests, e.g. facebook/opt-350m, with a comment explaining that 6.7b results in OOM.

@PaulAlbert31
Contributor Author

Hi @BenjaminBossan,
Great to see we are getting close!

Yes, test_randlora.py is a direct copy/paste of test_vera.py.

I have not been able to reproduce the OOM error with DP as I only have 1 GPU available. Could you let me know where I can add a condition to switch to facebook/opt-350m for RandLoRA's tests?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan
Member

@PaulAlbert31 I tried running the tests from test_gpu_examples.py again and this time they passed without OOM. I'm not sure what's different this time around, but I think we're good now and no further changes are required.

For the CI to run, could you please call make style and fix the remaining linter issue?

@PaulAlbert31
Contributor Author

@BenjaminBossan Apologies, I keep forgetting to run make style.
Great to hear the OOM is gone for now!

@BenjaminBossan
Member

Thanks for the fix @PaulAlbert31. The CI now ran and is mostly green, but as you can see, all the Windows runs failed, all with regard to this test:

tests/test_custom_models.py::PeftCustomModelTester::test_safe_merge_143_Vanilla_MLP_4_RandLora

I'm not sure what's going on and I don't have a Windows machine to test this on. Most likely, it's a matter of lowering precision, but I'm not sure. If you also can't test it, in my book it would be fine to skip this specific test on Windows with a comment to explain, as this feels very much like an edge case.
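
If skipping does turn out to be necessary, a hedged sketch of what it could look like (the marker condition and reason wording are placeholders; the real test lives in the parametrized suite in tests/test_custom_models.py):

import platform

import pytest

@pytest.mark.skipif(
    platform.system() == "Windows",
    reason="safe_merge for RandLoRA is numerically flaky on Windows CI",
)
def test_safe_merge_randlora():
    ...  # existing test body unchanged; only the skip marker is new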

@PaulAlbert31
Contributor Author

PaulAlbert31 commented Apr 24, 2025

@BenjaminBossan
Damn, so close. I don't have access to a Windows system either.

I have had a similar problem before, and reducing the randlora_alpha coefficient helped. I have now reduced it even further. Is it OK to try the Windows tests again?

If the test still fails, I'll add a skip for Windows.

@BenjaminBossan
Member

This seems to have resolved the error, thanks @PaulAlbert31. I assume that lowering alpha helps with numerical stability. Is that something worth documenting? If not, please set the status from draft to ready, I'll do a final quick pass, and then we can hopefully merge the PR.

@PaulAlbert31
Contributor Author

@BenjaminBossan thanks for the suggestion, I have documented the possible instabilities with larger randlora_alpha values and moved the PR status to ready 🎉

@PaulAlbert31 PaulAlbert31 marked this pull request as ready for review April 25, 2025 01:11
Member

@BenjaminBossan BenjaminBossan left a comment

Thanks for the final addition, the PR LGTM, great work and fine addition to PEFT.

Let's not forget to also add documentation and an example for RandLoRA to promote its adoption, but this can be done in a future PR.

@BenjaminBossan BenjaminBossan merged commit d3ff133 into huggingface:main Apr 25, 2025
14 checks passed
@PaulAlbert31 PaulAlbert31 deleted the randlora branch April 25, 2025 22:22
@PaulAlbert31 PaulAlbert31 restored the randlora branch May 6, 2025 02:01