Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning

Installation

conda create -n tokenbuncher python=3.10
# install torch
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu126
# install vllm
pip3 install vllm==0.9.2
pip3 install ray

# verl
pip install -e .

# flash attention 2
pip3 install flash-attn --no-build-isolation
# quality of life
pip install flask wandb IPython matplotlib

Data Preparation

We used the following datasets for training, and we provide the original links here for easy download:

For BeaverTails, we used the following preprocessing script:

python ./dataset/data_preprocess/beavertails.py --template_type qwen --local_dir ./dataset/beavertails-qwen
  • --template_type supports base, ministral, qwen
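The `--template_type` flag controls how each raw question is wrapped in a chat template before training. As a rough illustration, the `qwen` style corresponds to the Qwen2.5 chat format; this `wrap_qwen` helper is hypothetical, and the exact template emitted by `beavertails.py` may differ in details:

```python
def wrap_qwen(question: str, system: str = "You are a helpful assistant.") -> str:
    """Wrap a raw question in a Qwen2.5-style chat template.

    Illustration only: the template actually emitted by
    dataset/data_preprocess/beavertails.py may differ in details.
    """
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{question}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = wrap_qwen("What is 2 + 2?")
```

The `base` and `ministral` styles would wrap the same question with their respective model-family conventions.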

For WMDP, we used the following preprocessing script:

python ./dataset/data_preprocess/wmdp.py --type cyber --template_type qwen --local_dir ./dataset/wmdpcyber-qwen
  • --type supports bio, chem, cyber

Then merge it with BeaverTails to produce the mixed dataset used for training:

python ./dataset/data_preprocess/wmdp_merge.py --root ./dataset
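Conceptually, the merge step concatenates the two preprocessed training sets and shuffles them. A minimal stdlib sketch of that idea (the real `wmdp_merge.py` operates on the preprocessed files under `--root` and may use a different on-disk format):

```python
import random

def merge_datasets(*datasets, seed: int = 42):
    """Concatenate example lists from several datasets and shuffle
    them deterministically.

    Sketch of the merge concept only; wmdp_merge.py reads the
    preprocessed files under --root and may add bookkeeping fields.
    """
    rows = [row for ds in datasets for row in ds]
    random.Random(seed).shuffle(rows)
    return rows

beavertails = [{"prompt": "p1", "source": "beavertails"}]
wmdp = [{"prompt": "p2", "source": "wmdp"}]
mixed = merge_datasets(beavertails, wmdp)
```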

For GSM8K, we used the following preprocessing script:

python ./dataset/data_preprocess/gsm8k.py --template_type qwen --local_dir ./dataset/gsm8k-qwen

Model Preparation

We used the following open-source models in our experiments:

Pre-trained Token Buncher Models

We provide defense-trained checkpoints on HuggingFace Hub for direct use:

These models can be used directly in evaluation scripts without local conversion (see Evaluation sections below).

Harmful-RL Fine-Tuning

We provide a sample script, run_attack.sh, to run the attack experiment. It demonstrates basic usage of the GRPO algorithm for Harmful-RL fine-tuning.

To start the attack training, you can run the following command:

# first start the reward model service; you can use nohup to run it in the background
nohup python scripts/host_deberta_api.py > /dev/null 2>&1 &

# then start the attack
bash run_attack.sh

We provide scripts for other algorithms in the configs directory; modify run_attack.sh to run them as needed. Among these, attack_beawmmix_grpo.sh trains the harmful-capability-strengthening model.

The script targets 2x A100 80GB GPUs. We have also confirmed that the code runs on 4x A6000 40G; just reduce the batch size and set export N_GPUS=4 and export ROLLOUT_TP_SIZE=4 in run_attack.sh.

Token Buncher

We provide a simple script, run_defence.sh, to apply the defence to a model. To start defence training, run:

bash run_defence.sh

Note that defence training does not require the reward model service; the model's own entropy provides the reward signal.
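The reward here is derived from the per-token entropy of the policy's own next-token distributions. A minimal stdlib sketch of that computation (illustration only: in training, the distributions come from the model's logits inside the verl loop, and the reward's sign and scaling are set by the training config):

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_response_entropy(step_distributions):
    """Average per-token entropy over a generated response.

    Sketch only: during defence training these distributions come
    from the policy's logits; sign/scaling of the actual reward is
    defined by the training configuration.
    """
    ents = [token_entropy(p) for p in step_distributions]
    return sum(ents) / len(ents)

uniform = [0.25] * 4            # maximal entropy over 4 tokens: ln(4)
peaked = [0.97, 0.01, 0.01, 0.01]  # near-deterministic, low entropy
```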

After training, you can convert the checkpoint to a Hugging Face-compatible format.

Option 1: Using convert_and_push_to_hf.py (Recommended)

This script handles paths better and can optionally push to HuggingFace Hub:

# From project root directory
python scripts/convert_and_push_to_hf.py \
    --fsdp_checkpoint ./outputs/tokenbuncher-qwen-2.5-3b/global_step_500/actor \
    --base_model Qwen/Qwen2.5-3B-Instruct \
    --output_dir ./outputs/Qwen2.5-3B-TokenBuncher

# To also push to HuggingFace Hub:
python scripts/convert_and_push_to_hf.py \
    --fsdp_checkpoint ./outputs/tokenbuncher-qwen-2.5-3b/global_step_500/actor \
    --base_model Qwen/Qwen2.5-3B-Instruct \
    --output_dir ./outputs/Qwen2.5-3B-TokenBuncher \
    --push_to_hub \
    --hf_repo_id YOUR_USERNAME/tokenbuncher-qwen-3b \
    --hf_token YOUR_HF_TOKEN

Option 2: Using convert_fsdp_to_hf.py

Important: Use absolute paths or run from project root to avoid path issues.

# From project root directory
python scripts/convert_fsdp_to_hf.py \
    ./outputs/tokenbuncher-qwen-2.5-3b/global_step_500/actor \
    Qwen/Qwen2.5-3B-Instruct \
    ./outputs/Qwen2.5-3B-TokenBuncher

# OR use absolute paths from any directory:
python scripts/convert_fsdp_to_hf.py \
    /path/to/outputs/tokenbuncher-qwen-2.5-3b/global_step_500/actor \
    Qwen/Qwen2.5-3B-Instruct \
    /path/to/outputs/Qwen2.5-3B-TokenBuncher

Note: Make sure the base model matches your training configuration:

  • For Qwen2.5-3B training: use Qwen/Qwen2.5-3B-Instruct
  • For Qwen2.5-7B training: use Qwen/Qwen2.5-7B-Instruct

We use wandb to record training logs; run wandb login if this is your first time using it.

Evaluation

We used the following datasets for evaluation:

We provide the evaluation scripts in the evaluation folder.

Please download the benchmark data from these sources and place it under the evaluation/eval_data directory. The folder tree looks like this:

.
├── gsm8k
│   ├── test.jsonl
│   └── train.jsonl
├── harmbench
│   └── harmbench_behaviors_text_all.csv
├── math500
│   └── test.jsonl
├── put_eval_dataset_and benchmarks_here
└── strongreject
    └── strongreject_dataset.csv
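After downloading, you can sanity-check the layout with a small helper (hypothetical convenience script, not part of the repo; it only checks the files shown in the tree above):

```python
from pathlib import Path

# Files the evaluation scripts expect under evaluation/eval_data,
# per the folder tree in this README.
EXPECTED = [
    "gsm8k/test.jsonl",
    "gsm8k/train.jsonl",
    "harmbench/harmbench_behaviors_text_all.csv",
    "math500/test.jsonl",
    "strongreject/strongreject_dataset.csv",
]

def missing_files(root):
    """Return the expected benchmark files that are absent under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

Running `missing_files("evaluation/eval_data")` should return an empty list once everything is in place.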

Compute Harmful Score on HarmBench and StrongREJECT

Run the following commands to compute the harmful score:

cd evaluation/harmfulscore

# generate model response first
python generate_response.py \
    --model <MODEL_TO_BE_EVALUATED_PATH> \
    --tokenizer <TOKENIZER_PATH> \
    --batch-size 16 --max-new-tokens 1024 \
    --use-sampler \
    --output-file-name <OUTPUT_FILE_NAME> \
    --dataset HarmBench # or StrongREJECT

# compute harmful score
python beaverdam.py --model_path <beaver-dam-7b PATH> --eval_dataset <OUTPUT_FILE_NAME>.jsonl --output_dir <OUTPUT_DIR>
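beaver-dam-7b is a multi-category moderation model, so a harmful score can be aggregated as the fraction of responses flagged in at least one category. A rough stdlib sketch of that aggregation (the exact metric beaverdam.py reports may weight categories differently):

```python
def harmful_rate(moderation_flags):
    """Fraction of responses flagged harmful in at least one category.

    Sketch of the aggregation concept only; beaverdam.py's actual
    metric may differ (e.g. per-category weighting or thresholds).
    """
    flagged = sum(1 for flags in moderation_flags if any(flags.values()))
    return flagged / len(moderation_flags)

# Hypothetical per-response category flags from the moderation model.
results = [
    {"violence": True, "privacy": False},
    {"violence": False, "privacy": False},
]
```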

Evaluation on MATH500 and GSM8K

Note: Evaluation scripts support both local checkpoints and models from HuggingFace Hub.

Using Local Converted Checkpoints
cd evaluation/math

# Generate model response
python math_local_gen.py \
    --model ../hf_models/your-model-checkpoint \
    --tokenizer ../hf_models/your-model-checkpoint \
    --batch-size 4 \
    --max-new-tokens 4000 \
    --output-file-name your-model-gsm8k \
    --dataset GSM8K  # or MATH500

# Compute accuracy for GSM8K
python evaluate_gsm8k.py --results-file your-model-gsm8k.jsonl

# Or for MATH500
python evaluate_math500.py --input_file your-model-math500.jsonl
Using Models from HuggingFace Hub
cd evaluation/math

# Generate model response directly from HuggingFace
python math_local_gen.py \
    --model suv11235/tokenbuncher-qwen-3b-step500 \
    --tokenizer suv11235/tokenbuncher-qwen-3b-step500 \
    --batch-size 4 \
    --max-new-tokens 4000 \
    --output-file-name tokenbuncher-step500-gsm8k-hf \
    --dataset GSM8K  # or MATH500

# Compute accuracy for GSM8K
python evaluate_gsm8k.py --results-file tokenbuncher-step500-gsm8k-hf.jsonl

# Or for MATH500
python evaluate_math500.py --input_file tokenbuncher-step500-math500-hf.jsonl
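GSM8K gold answers follow the `####` marker, so accuracy scoring reduces to extracting the final number from each response and comparing it to the gold answer. A sketch of that extraction under the usual convention (hypothetical helper; evaluate_gsm8k.py may use a stricter rule):

```python
import re

def extract_gsm8k_answer(text):
    """Pull the final numeric answer from a GSM8K-style solution.

    Sketch of the common '#### <number>' convention; falls back to
    the last number in the text. evaluate_gsm8k.py's actual
    extraction logic may differ.
    """
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    if m:
        return m.group(1).replace(",", "")
    # Fall back to the last number mentioned in the response.
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None
```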

Evaluation on MMLU-pro

For MMLU-Pro, we adapted the official evaluation repo with minor modifications for our models.

To evaluate on MMLU-pro, please follow these steps:

cd evaluation/mmlu-pro

# set MODEL_PATH and MODEL_NAME before running:
bash eval_models.sh

Check the eval_results folder for the results.

Evaluation on WMDP

Run the following command to compute the accuracy on WMDP:

Note: Evaluation scripts support both local checkpoints and models from HuggingFace Hub.

Using Local Converted Checkpoints

Run from project root to avoid path issues:

# Navigate to project root
cd /path/to/rlvr-safety

# Generate model responses
python evaluation/wmdp/generate_response.py \
    --model ./hf_models/your-model-checkpoint \
    --tokenizer ./hf_models/your-model-checkpoint \
    --batch-size 16 \
    --max-new-tokens 1024 \
    --use-sampler \
    --output-file-name your-model-wmdpcyber \
    --dataset WDMPCyber

# Compute accuracy
python evaluation/wmdp/eval_wdmp_matchcases.py \
    --questions_file ./dataset/wmdpcyber-qwen/test.jsonl \
    --results_file ./evaluation/wmdp/your-model-wmdpcyber.jsonl
Using Models from HuggingFace Hub
cd /path/to/rlvr-safety

# Generate model responses directly from HuggingFace
python evaluation/wmdp/generate_response.py \
    --model suv11235/tokenbuncher-qwen-3b-step500 \
    --tokenizer suv11235/tokenbuncher-qwen-3b-step500 \
    --batch-size 16 \
    --max-new-tokens 1024 \
    --use-sampler \
    --output-file-name tokenbuncher-step500-wmdpcyber-hf \
    --dataset WDMPCyber

# Compute accuracy
python evaluation/wmdp/eval_wdmp_matchcases.py \
    --questions_file ./dataset/wmdpcyber-qwen/test.jsonl \
    --results_file ./evaluation/wmdp/tokenbuncher-step500-wmdpcyber-hf.jsonl

Dataset Options:

  • WDMPBio - Biology (requires preprocessing with --type bio)
  • WDMPChem - Chemistry (requires preprocessing with --type chem)
  • WDMPCyber - Cybersecurity (example above, requires preprocessing with --type cyber)
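WMDP questions are multiple-choice, so scoring amounts to matching a choice letter (A-D) in each response against the gold answer. A sketch of the matching idea (hypothetical helper; eval_wdmp_matchcases.py's actual patterns may be more thorough):

```python
import re

def extract_choice(response):
    """Extract a multiple-choice letter (A-D) from a model response.

    Sketch only: tries an explicit 'answer is X' phrasing first,
    then a leading letter. eval_wdmp_matchcases.py may match more
    response formats than this.
    """
    m = re.search(
        r"(?:answer|option)\s*(?:is)?\s*:?\s*\(?([A-D])\b",
        response,
        re.IGNORECASE,
    )
    if m:
        return m.group(1).upper()
    m = re.match(r"\s*\(?([A-D])\b", response)
    return m.group(1).upper() if m else None
```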

Example with converted checkpoint:

cd /path/to/rlvr-safety

# For step 500 checkpoint on WMDPCyber
python evaluation/wmdp/generate_response.py \
    --model ./hf_models/tokenbuncher-qwen-3b-step500 \
    --tokenizer ./hf_models/tokenbuncher-qwen-3b-step500 \
    --batch-size 16 \
    --max-new-tokens 1024 \
    --use-sampler \
    --output-file-name tokenbuncher-step500-wmdpcyber \
    --dataset WDMPCyber

# Evaluate (the script now auto-resolves paths from project root)
python evaluation/wmdp/eval_wdmp_matchcases.py \
    --questions_file ./dataset/wmdpcyber-qwen/test.jsonl \
    --results_file tokenbuncher-step500-wmdpcyber.jsonl

Alternative: Run from evaluation/wmdp/ directory

cd /path/to/rlvr-safety/evaluation/wmdp

# Generate responses
python generate_response.py \
    --model ../../hf_models/tokenbuncher-qwen-3b-step500 \
    --tokenizer ../../hf_models/tokenbuncher-qwen-3b-step500 \
    --batch-size 16 \
    --max-new-tokens 1024 \
    --use-sampler \
    --output-file-name tokenbuncher-step500-wmdpcyber \
    --dataset WDMPCyber

# Evaluate (paths auto-resolve)
python eval_wdmp_matchcases.py \
    --questions_file dataset/wmdpcyber-qwen/test.jsonl \
    --results_file tokenbuncher-step500-wmdpcyber.jsonl
