Description
I was able to run the code for all three stages, each of which requires its own virtual environment, but I ran into trouble when training the safety neurons.
I ran Llama3-8B-Instruct with the following requirements:

```
transformers==4.38.2
peft==0.10.0
trl==0.9.6
accelerate==0.43.2
```
Besides replacing /conda/env/path/site-packages/transformers/trainer.py with the transformers/trainer.py provided in this repo, you also need to append the definition of activate_neurons to /conda/env/path/site-packages/transformers/training_args.py at line 2786.
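For context, what I appended was a field of roughly this shape (a standalone sketch only; the field type, default, and help text are my assumptions, not the repo's actual definition, which lives inside `TrainingArguments`):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SafetyNeuronArgs:
    # Hypothetical sketch of the appended argument: a path to the
    # precomputed safety-neuron indices; None disables the masking.
    activate_neurons: Optional[str] = field(
        default=None,
        metadata={"help": "Path to the safety-neuron index file to activate during training."},
    )


args = SafetyNeuronArgs()
print(args.activate_neurons)  # None
```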
After training, I checked the parameters and found that the saved checkpoint was identical to the original model. This could be fixed by modifying the saving logic as follows:

```python
from peft import PeftModel

# Only save from the main process when running distributed training.
is_main_process = True
if hasattr(trainer, "accelerator"):
    is_main_process = trainer.accelerator.is_main_process

if is_main_process:
    try:
        model_to_save = trainer.accelerator.unwrap_model(trainer.model)
    except Exception:
        model_to_save = trainer.model
    if isinstance(model_to_save, PeftModel):
        model_to_save.save_pretrained(output_dir)
    else:
        trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)
```

However, I was not able to reproduce the results in the paper. Here are the training logs of the safety-neuron tuned version and the all-parameter tuned version:
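For reference, this is roughly how I checked whether the saved checkpoint differs from the base model (a sketch; the toy state dicts below stand in for flattened `model.state_dict()` tensors):

```python
def num_changed_params(state_a, state_b, atol=0.0):
    # Count parameter entries whose values differ (or that are missing
    # from the second state dict).
    changed = 0
    for name, tensor_a in state_a.items():
        tensor_b = state_b.get(name)
        if tensor_b is None or any(abs(x - y) > atol for x, y in zip(tensor_a, tensor_b)):
            changed += 1
    return changed


# Toy stand-ins for base vs. tuned weights:
base = {"layer.weight": [0.1, 0.2], "layer.bias": [0.0]}
tuned = {"layer.weight": [0.1, 0.25], "layer.bias": [0.0]}
print(num_changed_params(base, tuned))  # 1
```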
Safety-Neuron tuned version:

All parameters tuned version:
SFT data: 50 samples randomly selected from the training data in the Circuit Breakers repo (https://arxiv.org/pdf/2406.04313):
circuit_breakers_train_sample50.json
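The subsampling was done along these lines (a sketch; the seed, field names, and stand-in data are assumptions, not the actual files):

```python
import json
import random

random.seed(0)  # assumed seed for reproducibility
# Stand-in for the full Circuit Breakers training set:
data = [{"prompt": f"example {i}"} for i in range(200)]

sample = random.sample(data, 50)
with open("circuit_breakers_train_sample50.json", "w") as f:
    json.dump(sample, f, indent=2)

print(len(sample))  # 50
```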
I compared math capability using GSM8K-250 (English) and safety using MultiJail-EN:
The safe/unsafe/invalid tags are assigned with the judging prompt provided in the MultiJail paper (https://openreview.net/pdf?id=vESNKdEMGp).
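The reported rates are then just tallies over the judge's labels, roughly like this (the label list below is a toy placeholder, not my actual results):

```python
from collections import Counter

# Toy stand-in for per-prompt judge labels on MultiJail-EN:
labels = ["safe", "safe", "unsafe", "invalid", "safe"]

counts = Counter(labels)
safety_rate = counts["safe"] / len(labels)
print(counts["safe"], counts["unsafe"], counts["invalid"], safety_rate)  # 3 1 1 0.6
```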
The results were rather weird:
I noticed that the paper contains no comparison against an all-parameter tune on the same SFT data. I expected the all-parameter SFT model to perform worse on math tasks and to be comparably or less safe than the safety-neuron tuned version.
Could you provide more information on the running environment so that I can replicate the experimental results? Alternatively, could you open-source your tuned models for reference?
Thanks