Description
I was able to run the code for all three stages, each of which requires its own virtual environment, but I ran into trouble when training the safety neurons.
I ran Llama3-8B-Instruct with the following requirements:

```
transformers==4.38.2
peft==0.10.0
trl==0.9.6
accelerate==0.43.2
```
Besides replacing /conda/env/path/site-packages/transformers/trainer.py with the transformers/trainer.py provided in this repo, you also need to append the definition of activate_neurons to /conda/env/path/site-packages/transformers/training_args.py at line 2786.
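For context, what I appended was a field of roughly this shape (a standalone sketch only; the field type, default, and help text are my assumptions, not the repo's actual definition, which lives inside `TrainingArguments`):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SafetyNeuronArgs:
    # Hypothetical sketch of the appended argument: a path to the
    # precomputed safety-neuron indices; None disables the masking.
    activate_neurons: Optional[str] = field(
        default=None,
        metadata={"help": "Path to the safety-neuron index file to activate during training."},
    )


args = SafetyNeuronArgs()
print(args.activate_neurons)  # None
```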
After training, I checked the parameters and found that the saved checkpoint was identical to the original model. This could be fixed by modifying the saving logic as follows:

```python
from peft import PeftModel

# Only save from the main process when running distributed training.
is_main_process = True
if hasattr(trainer, "accelerator"):
    is_main_process = trainer.accelerator.is_main_process

if is_main_process:
    try:
        model_to_save = trainer.accelerator.unwrap_model(trainer.model)
    except Exception:
        model_to_save = trainer.model
    if isinstance(model_to_save, PeftModel):
        model_to_save.save_pretrained(output_dir)
    else:
        trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)
```

However, I was not able to reproduce the results in the paper. Here are the training logs of the safety-neuron tuned version and the all-parameter tuned version:
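For reference, this is roughly how I checked whether the saved checkpoint differs from the base model (a sketch; the toy state dicts below stand in for flattened `model.state_dict()` tensors):

```python
def num_changed_params(state_a, state_b, atol=0.0):
    # Count parameter entries whose values differ (or that are missing
    # from the second state dict).
    changed = 0
    for name, tensor_a in state_a.items():
        tensor_b = state_b.get(name)
        if tensor_b is None or any(abs(x - y) > atol for x, y in zip(tensor_a, tensor_b)):
            changed += 1
    return changed


# Toy stand-ins for base vs. tuned weights:
base = {"layer.weight": [0.1, 0.2], "layer.bias": [0.0]}
tuned = {"layer.weight": [0.1, 0.25], "layer.bias": [0.0]}
print(num_changed_params(base, tuned))  # 1
```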
Safety-Neuron tuned version:

All parameters tuned version:
SFT data: 50 samples randomly selected from the training data in the Circuit Breakers repo (https://arxiv.org/pdf/2406.04313):
circuit_breakers_train_sample50.json
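The subsampling was done along these lines (a sketch; the seed, field names, and stand-in data are assumptions, not the actual files):

```python
import json
import random

random.seed(0)  # assumed seed for reproducibility
# Stand-in for the full Circuit Breakers training set:
data = [{"prompt": f"example {i}"} for i in range(200)]

sample = random.sample(data, 50)
with open("circuit_breakers_train_sample50.json", "w") as f:
    json.dump(sample, f, indent=2)

print(len(sample))  # 50
```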
I compared math capability using GSM8K-250 (English) and safety using MultiJail-EN:
The safe/unsafe/invalid tags are assigned with the judging prompt provided in the MultiJail paper (https://openreview.net/pdf?id=vESNKdEMGp).
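The reported rates are then just tallies over the judge's labels, roughly like this (the label list below is a toy placeholder, not my actual results):

```python
from collections import Counter

# Toy stand-in for per-prompt judge labels on MultiJail-EN:
labels = ["safe", "safe", "unsafe", "invalid", "safe"]

counts = Counter(labels)
safety_rate = counts["safe"] / len(labels)
print(counts["safe"], counts["unsafe"], counts["invalid"], safety_rate)  # 3 1 1 0.6
```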
The results were rather weird:
I noticed that the paper contains no comparison against an all-parameter tune on the same SFT data. I expected the all-parameter SFT model to perform worse on math tasks and to be comparably or less safe than the safety-neuron tuned version.
Could you provide more information on the running environment so that I can replicate the experimental results? Alternatively, could you open-source your tuned models for reference?
Thanks