conda create -n tokenbuncher python=3.10
# install torch
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu126
# install vllm
pip3 install vllm==0.9.2
pip3 install ray
# verl
pip install -e .
# flash attention 2
pip3 install flash-attn --no-build-isolation
# quality of life
pip install flask wandb IPython matplotlib
We used the following datasets for training, and we provide the original links here for easy download:
python ./dataset/data_preprocess/beavertails.py --template_type qwen --local_dir ./dataset/beavertails-qwen
--template_type supports base, ministral, and qwen
python ./dataset/data_preprocess/wmdp.py --type cyber --template_type qwen --local_dir ./dataset/wmdpcyber-qwen
--type supports bio, chem, and cyber
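The preprocessing scripts render each question into a model-specific chat template. As an illustration only (not the repo's actual preprocessing code, whose prompt wording may differ), here is a minimal sketch of formatting a multiple-choice question with Qwen2.5-style chat markup:

```python
# Hypothetical sketch of template rendering; the scripts in
# ./dataset/data_preprocess/ may use different prompt text and fields.

def format_qwen_prompt(question: str, choices: list[str]) -> str:
    """Render a multiple-choice question in a Qwen-style chat template."""
    letters = "ABCD"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    user_msg = f"{question}\n{options}\nAnswer with a single letter."
    # Qwen2.5 chat markup wraps each turn in <|im_start|>/<|im_end|> tags
    return (
        "<|im_start|>user\n" + user_msg + "<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = format_qwen_prompt("Which port does SSH use by default?",
                            ["21", "22", "80", "443"])
print(prompt)
```

The base template would instead emit the raw question text without chat tags, and ministral would use Mistral's instruction markup.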
We then merge it with BeaverTails to obtain the mixed dataset used for training:
python ./dataset/data_preprocess/wmdp_merge.py --root ./dataset
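Conceptually, the merge step concatenates the WMDP examples with the BeaverTails examples and shuffles the result. A minimal sketch of that idea (using in-memory dicts for illustration; the actual wmdp_merge.py operates on the preprocessed dataset files):

```python
import random

def merge_datasets(records_a, records_b, seed=0):
    """Concatenate two lists of example dicts and shuffle deterministically."""
    merged = list(records_a) + list(records_b)
    random.Random(seed).shuffle(merged)
    return merged

# Toy records standing in for the real preprocessed examples
beavertails = [{"prompt": f"bt-{i}"} for i in range(3)]
wmdp = [{"prompt": f"wmdp-{i}"} for i in range(2)]
mixed = merge_datasets(beavertails, wmdp)
print(len(mixed))  # 5
```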
python ./dataset/data_preprocess/gsm8k.py --template_type qwen --local_dir ./dataset/gsm8k-qwen
We used the following open-source models in our experiments:
- Qwen2.5-3B-Instruct
- Qwen2.5-7B-Instruct
- Ministral-8B-Instruct-2410
- Beaver-dam-7b
- deberta-v3-xsmall-beavertails-harmful-qa-classifier (used as the reward model)
We provide defense-trained checkpoints on HuggingFace Hub for direct use:
- suv11235/tokenbuncher-qwen-3b-step100 - Early defense checkpoint
- suv11235/tokenbuncher-qwen-3b-step200 - Mid-early defense checkpoint
- suv11235/tokenbuncher-qwen-3b-step300 - Mid defense checkpoint
- suv11235/tokenbuncher-qwen-3b-step400 - Mid-late defense checkpoint
- suv11235/tokenbuncher-qwen-3b-step500 - Late defense checkpoint
These models can be used directly in evaluation scripts without local conversion (see Evaluation sections below).
We provide a sample script, run_attack.sh, to run the attack experiment. It demonstrates the basic usage of the GRPO algorithm for Harmful-RL.
To start the attack training, you can run the following command:
# first start the reward model service; you can use nohup to run it in the background
nohup python scripts/host_deberta_api.py > /dev/null 2>&1 &
# then start the attack
bash run_attack.sh
We provide scripts for other algorithms in the configs directory; you can modify run_attack.sh to run them as needed.
Among these, attack_beawmmix_grpo.sh is used to train the harmful-capability-strengthening model.
Our script is configured for 2x A100 80G (2 GPUs). We have also confirmed that the code runs on 4x A6000: adjust the batch size, then set export N_GPUS=4 and export ROLLOUT_TP_SIZE=4 in run_attack.sh.
We provide a simple run_defence.sh script to apply the defense to a model. To start defense training, run:
bash run_defence.sh
Note that defense training does not require the reward model service; the model's entropy provides the reward signal.
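Since the defense reward is derived from the policy's own token entropy rather than an external classifier, no reward server is needed. As a purely illustrative sketch (the actual reward shaping in the training code may differ), Shannon entropy over a next-token distribution looks like this:

```python
import math

def token_entropy(probs):
    """Shannon entropy H(p) = -sum p * log p over a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = [0.25] * 4                # maximally uncertain over 4 tokens
peaked = [0.97, 0.01, 0.01, 0.01]   # near-deterministic distribution
print(token_entropy(uniform))       # log 4, about 1.386
print(token_entropy(peaked))        # much smaller
```

A per-response reward would aggregate such per-token entropies over the generated sequence.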
After training, you can convert the checkpoint to a HuggingFace-compatible format.
The scripts/convert_and_push_to_hf.py script handles paths robustly and can optionally push to the HuggingFace Hub:
# From project root directory
python scripts/convert_and_push_to_hf.py \
--fsdp_checkpoint ./outputs/tokenbuncher-qwen-2.5-3b/global_step_500/actor \
--base_model Qwen/Qwen2.5-3B-Instruct \
--output_dir ./outputs/Qwen2.5-3B-TokenBuncher
# To also push to HuggingFace Hub:
python scripts/convert_and_push_to_hf.py \
--fsdp_checkpoint ./outputs/tokenbuncher-qwen-2.5-3b/global_step_500/actor \
--base_model Qwen/Qwen2.5-3B-Instruct \
--output_dir ./outputs/Qwen2.5-3B-TokenBuncher \
--push_to_hub \
--hf_repo_id YOUR_USERNAME/tokenbuncher-qwen-3b \
--hf_token YOUR_HF_TOKEN
Important: Use absolute paths or run from the project root to avoid path issues.
# From project root directory
python scripts/convert_fsdp_to_hf.py \
./outputs/tokenbuncher-qwen-2.5-3b/global_step_500/actor \
Qwen/Qwen2.5-3B-Instruct \
./outputs/Qwen2.5-3B-TokenBuncher
# OR use absolute paths from any directory:
python scripts/convert_fsdp_to_hf.py \
/path/to/outputs/tokenbuncher-qwen-2.5-3b/global_step_500/actor \
Qwen/Qwen2.5-3B-Instruct \
/path/to/outputs/Qwen2.5-3B-TokenBuncher
Note: Make sure the base model matches your training configuration:
- For Qwen2.5-3B training: use Qwen/Qwen2.5-3B-Instruct
- For Qwen2.5-7B training: use Qwen/Qwen2.5-7B-Instruct
We use wandb to record training logs; run wandb login if you are using wandb for the first time.
We used the following datasets for evaluation:
We provide the evaluation scripts in the evaluation folder.
Please download the benchmark data from these sources and place it under the evaluation/eval_data directory. The folder tree looks like this:
.
├── gsm8k
│ ├── test.jsonl
│ └── train.jsonl
├── harmbench
│ └── harmbench_behaviors_text_all.csv
├── math500
│ └── test.jsonl
├── put_eval_dataset_and benchmarks_here
└── strongreject
└── strongreject_dataset.csv
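To catch a misplaced benchmark file early, you can sanity-check this layout before running any evaluation. This helper is a convenience sketch, not part of the repo:

```python
from pathlib import Path

# Benchmark files expected under evaluation/eval_data, per the tree above
EXPECTED = [
    "gsm8k/test.jsonl",
    "gsm8k/train.jsonl",
    "harmbench/harmbench_behaviors_text_all.csv",
    "math500/test.jsonl",
    "strongreject/strongreject_dataset.csv",
]

def missing_files(root):
    """Return the expected benchmark files that are missing under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]

# Example: missing_files("evaluation/eval_data") returns [] once everything is in place
```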
Run the following command to compute the harmful score:
cd evaluation/harmfulscore
# generate model response first
python generate_response.py \
--model <MODEL_TO_BE_EVALUATED_PATH> \
--tokenizer <TOKENIZER_PATH> \
--batch-size 16 --max-new-tokens 1024 \
--use-sampler \
--output-file-name <OUTPUT_FILE_NAME> \
--dataset HarmBench # or StrongREJECT
# compute harmful score
python beaverdam.py --model_path <beaver-dam-7b PATH> --eval_dataset <OUTPUT_FILE_NAME>.jsonl --output_dir <OUTPUT_DIR>
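beaverdam.py scores each generated response with the Beaver-dam-7b moderation model. As an illustration of the final aggregation step only (the field names below are hypothetical, not the script's actual schema), a harmful score can be computed as the fraction of responses flagged as harmful:

```python
import json

def harmful_score(jsonl_lines):
    """Fraction of records whose (hypothetical) 'flagged' field is true."""
    records = [json.loads(line) for line in jsonl_lines]
    return sum(r["flagged"] for r in records) / len(records)

# Toy jsonl lines standing in for the moderation model's per-response output
lines = [
    '{"response": "refusal", "flagged": false}',
    '{"response": "harmful answer", "flagged": true}',
    '{"response": "refusal", "flagged": false}',
    '{"response": "harmful answer", "flagged": true}',
]
print(harmful_score(lines))  # 0.5
```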
Note: Evaluation scripts support both local checkpoints and models from HuggingFace Hub.
cd evaluation/math
# Generate model response
python math_local_gen.py \
--model ../hf_models/your-model-checkpoint \
--tokenizer ../hf_models/your-model-checkpoint \
--batch-size 4 \
--max-new-tokens 4000 \
--output-file-name your-model-gsm8k \
--dataset GSM8K # or MATH500
# Compute accuracy for GSM8K
python evaluate_gsm8k.py --results-file your-model-gsm8k.jsonl
# Or for MATH500
python evaluate_math500.py --input_file your-model-math500.jsonl
cd evaluation/math
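Scoring hinges on answer extraction: GSM8K gold answers end with a "#### <number>" marker, and a model's final answer is typically taken as the last number it emits. A minimal sketch of this (the repo's evaluate_gsm8k.py may use different extraction rules):

```python
import re

def extract_gold(answer_field: str):
    """GSM8K gold answers end with '#### <number>'."""
    return answer_field.split("####")[-1].strip().replace(",", "")

def extract_pred(generation: str):
    """Take the last number in the model's generation as its final answer."""
    nums = re.findall(r"-?\d[\d,]*(?:\.\d+)?", generation)
    return nums[-1].replace(",", "") if nums else None

gold = extract_gold("The final count is 9 * 2 = 18. #### 18")
pred = extract_pred("So the total is 9 * 2 = 18. The answer is 18.")
print(pred == gold)  # True
```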
# Generate model response directly from HuggingFace
python math_local_gen.py \
--model suv11235/tokenbuncher-qwen-3b-step500 \
--tokenizer suv11235/tokenbuncher-qwen-3b-step500 \
--batch-size 4 \
--max-new-tokens 4000 \
--output-file-name tokenbuncher-step500-gsm8k-hf \
--dataset GSM8K # or MATH500
# Compute accuracy for GSM8K
python evaluate_gsm8k.py --results-file tokenbuncher-step500-gsm8k-hf.jsonl
# Or for MATH500
python evaluate_math500.py --input_file tokenbuncher-step500-math500-hf.jsonl
For MMLU-Pro, we directly referred to the official repo and made some modifications to adapt it to our model.
To evaluate on MMLU-Pro, follow these steps:
cd evaluation/mmlu-pro
# add your MODEL_PATH and MODEL_NAME before running:
bash eval_models.sh
Check the eval_results folder for the results.
Run the following command to compute the accuracy on WMDP:
Note: Evaluation scripts support both local checkpoints and models from HuggingFace Hub.
Run from project root to avoid path issues:
# Navigate to project root
cd /path/to/rlvr-safety
# Generate model responses
python evaluation/wmdp/generate_response.py \
--model ./hf_models/your-model-checkpoint \
--tokenizer ./hf_models/your-model-checkpoint \
--batch-size 16 \
--max-new-tokens 1024 \
--use-sampler \
--output-file-name your-model-wmdpcyber \
--dataset WDMPCyber
# Compute accuracy
python evaluation/wmdp/eval_wdmp_matchcases.py \
--questions_file ./dataset/wmdpcyber-qwen/test.jsonl \
--results_file ./evaluation/wmdp/your-model-wmdpcyber.jsonl
cd /path/to/rlvr-safety
# Generate model responses directly from HuggingFace
python evaluation/wmdp/generate_response.py \
--model suv11235/tokenbuncher-qwen-3b-step500 \
--tokenizer suv11235/tokenbuncher-qwen-3b-step500 \
--batch-size 16 \
--max-new-tokens 1024 \
--use-sampler \
--output-file-name tokenbuncher-step500-wmdpcyber-hf \
--dataset WDMPCyber
# Compute accuracy
python evaluation/wmdp/eval_wdmp_matchcases.py \
--questions_file ./dataset/wmdpcyber-qwen/test.jsonl \
--results_file ./evaluation/wmdp/tokenbuncher-step500-wmdpcyber-hf.jsonl
Dataset Options:
- WDMPBio - Biology (requires preprocessing with --type bio)
- WDMPChem - Chemistry (requires preprocessing with --type chem)
- WDMPCyber - Cybersecurity (example above; requires preprocessing with --type cyber)
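eval_wdmp_matchcases.py compares each generated answer against the gold choice. As an illustration of the core idea (the script's actual matching rules may be more involved), extracting an answer letter and computing accuracy can look like:

```python
import re

def extract_choice(response: str):
    """Pull the first standalone answer letter A-D from a model response."""
    m = re.search(r"\b([A-D])\b", response)
    return m.group(1) if m else None

def accuracy(responses, gold):
    """Fraction of responses whose extracted letter matches the gold letter."""
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)

preds = ["The answer is B.", "Answer: C", "I think it is D", "A"]
gold = ["B", "C", "A", "A"]
print(accuracy(preds, gold))  # 0.75
```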
Example with converted checkpoint:
cd /path/to/rlvr-safety
# For step 500 checkpoint on WMDPCyber
python evaluation/wmdp/generate_response.py \
--model ./hf_models/tokenbuncher-qwen-3b-step500 \
--tokenizer ./hf_models/tokenbuncher-qwen-3b-step500 \
--batch-size 16 \
--max-new-tokens 1024 \
--use-sampler \
--output-file-name tokenbuncher-step500-wmdpcyber \
--dataset WDMPCyber
# Evaluate (the script now auto-resolves paths from project root)
python evaluation/wmdp/eval_wdmp_matchcases.py \
--questions_file ./dataset/wmdpcyber-qwen/test.jsonl \
--results_file tokenbuncher-step500-wmdpcyber.jsonl
Alternative: Run from the evaluation/wmdp/ directory
cd /path/to/rlvr-safety/evaluation/wmdp
# Generate responses
python generate_response.py \
--model ../../hf_models/tokenbuncher-qwen-3b-step500 \
--tokenizer ../../hf_models/tokenbuncher-qwen-3b-step500 \
--batch-size 16 \
--max-new-tokens 1024 \
--use-sampler \
--output-file-name tokenbuncher-step500-wmdpcyber \
--dataset WDMPCyber
# Evaluate (paths auto-resolve)
python eval_wdmp_matchcases.py \
--questions_file dataset/wmdpcyber-qwen/test.jsonl \
--results_file tokenbuncher-step500-wmdpcyber.jsonl