Steer2Edit

This is the official repository for the paper: Steer2Edit: From Activation Steering to Component-Level Editing. A 5-minute overview is available on our project website.


Table of Contents

  • Introduction
  • Setup
  • Safety alignment
  • Truthfulness
  • Efficient reasoning
  • Cite this work


Introduction

In this work, we propose Steer2Edit, a theoretically grounded, training-free framework that transforms steering vectors from inference-time control signals into diagnostic signals for component-level rank-1 weight editing.
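As a toy illustration of the core idea (hypothetical names, not the paper's exact update rule), a rank-1 weight edit adds the outer product of an output direction and an input direction to an existing weight matrix:

```python
import numpy as np

def rank1_edit(W, v, u, alpha):
    """Apply a rank-1 edit: W + alpha * v u^T.

    W:     (d_out, d_in) weight matrix of an attention or MLP component
    v:     (d_out,) output direction, e.g. a steering vector
    u:     (d_in,) input direction selecting when the edit fires
    alpha: scalar edit strength
    """
    return W + alpha * np.outer(v, u)

# Toy example: edit a 3x2 weight matrix along one direction.
W = np.zeros((3, 2))
v = np.array([1.0, 0.0, 0.0])
u = np.array([0.0, 1.0])
W_edited = rank1_edit(W, v, u, alpha=0.5)  # only W_edited[0, 1] changes
```

Because the update is rank-1, it shifts the component's behavior only along the chosen direction and leaves the rest of the weight matrix untouched.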

Steer2Edit overview

This repository contains the full Steer2Edit pipeline with three behavior controls:

  • safety_alignment/
  • truthfulness/
  • efficient_reasoning/

Each behavior control includes the data for probing:

  • Safety Alignment: safety_alignment/data/ (advllm & gcg adversarial prompts, benign prompts from the Alpaca dataset)
  • Truthfulness: truthfulness/data/truthfulqa/
  • Efficient Reasoning: efficient_reasoning/data/

Setup

pip install -r requirements.txt

Safety alignment

Models: llama2-7b, mistral-7b

1) Generate probe responses

First, given harmful and neutral queries, generate responses for each model.

cd safety_alignment
python generate_response.py --model llama2-7b
python generate_response.py --model mistral-7b

2) Extract directions

Extract the steering vectors for refusal behavior after attention and MLP layers. The vectors will be saved in directions/.

python extract_directions.py --model llama2-7b
python extract_directions.py --model mistral-7b
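For intuition, a common recipe for such a steering vector (a sketch; extract_directions.py may differ in detail) is the difference of mean activations between the two behavior classes, collected after a given layer:

```python
import numpy as np

def difference_of_means(pos_acts, neg_acts):
    """Unit-norm steering direction: mean(pos) - mean(neg).

    pos_acts, neg_acts: (n_samples, d_model) activations, e.g. from
    refused vs. complied responses after an attention or MLP layer.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

# Tiny 2-dimensional example with two samples per class.
pos = np.array([[1.0, 0.0], [1.0, 0.2]])
neg = np.array([[0.0, 0.0], [0.0, 0.2]])
d = difference_of_means(pos, neg)
```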

3) Create edits

Run Steer2Edit based on the steering vectors. This creates multiple edits over a hyperparameter grid and saves all edits in edited_models/.

bash edit.sh --model llama2-7b
bash edit.sh --model mistral-7b

4) Attribute evaluation (safety)

Evaluate the refusal rate of each edit. We use vLLM to speed up inference; reduce TENSOR_PARALLEL_SIZE (default 4) if you have fewer than 4 GPUs.

bash evaluate_editing.sh --model llama2-7b
bash evaluate_editing.sh --model mistral-7b

5) Utility evaluation (top-10 attribute configs)

This script selects the top 10 configurations based on the attribute evaluation results and evaluates the utility of the edited model.

python run_top_config_utility.py --model llama2-7b
python run_top_config_utility.py --model mistral-7b
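The selection logic amounts to a top-k over the attribute results; the dictionary fields below are illustrative, not the script's actual output format:

```python
def top_k_configs(results, k=10):
    """Return the k configurations with the highest attribute score."""
    return sorted(results, key=lambda r: r["score"], reverse=True)[:k]

# Hypothetical grid results: (rho_attn, rho_mlp, alpha) -> attribute score.
results = [
    {"rho_attn": 0.1, "rho_mlp": 0.5, "alpha": 0.2, "score": 0.61},
    {"rho_attn": 0.2, "rho_mlp": 0.5, "alpha": 0.8, "score": 0.93},
    {"rho_attn": 0.4, "rho_mlp": 0.6, "alpha": 0.7, "score": 0.88},
]
best = top_k_configs(results, k=2)  # the two highest-scoring configs
```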

6) Activation steering baseline

Run the baseline steering method with different steering strengths.

bash evaluate_steering.sh
bash evaluate_utility_steering.sh

7) Plot trade-off (safety vs utility)

Generate a trade-off plot similar to the one in the paper.

python plot.py --models llama2-7b,mistral-7b

8) Visualize edit strength (heatmap)

Plot the per-layer edit strength (λ) under the best hyperparameter setting to highlight which components drive the target attribute. You can switch to different hyperparameter settings; otherwise, this uses the best setting reported in the paper.

python visualize_edit.py \
  --model1 llama2-7b --rho_attn1 0.18 --rho_mlp1 0.55 --alpha1 0.8 \
  --model2 mistral-7b --rho_attn2 0.42 --rho_mlp2 0.55 --alpha2 0.75

Expected figures:

Safety–Utility Trade-off
Heatmap of Edit Strengths

Truthfulness

Models: gemma2-2b, llama3-8b

1) Generate probe responses (TruthfulQA train split)

cd truthfulness
python generate_response.py --model gemma2-2b --probe_path data/truthfulqa/train
python generate_response.py --model llama3-8b --probe_path data/truthfulqa/train

2) Extract directions

Extract the steering vectors for truthfulness after attention and MLP layers. The vectors will be saved in directions/.

python extract_directions.py --model gemma2-2b
python extract_directions.py --model llama3-8b

3) Create edits

Run Steer2Edit based on the steering vectors. This creates multiple edits over a hyperparameter grid and saves all edits in edited_models/.

bash edit.sh --model gemma2-2b
bash edit.sh --model llama3-8b

4) Attribute evaluation (TruthfulQA)

Evaluate the TruthfulQA attribute metric for each edit.

bash evaluate_editing.sh --model gemma2-2b
bash evaluate_editing.sh --model llama3-8b

5) Utility evaluation (top-10 attribute configs)

This script selects the top 10 configurations based on the attribute evaluation results and evaluates the utility of the edited model.

python run_top_config_utility.py --model gemma2-2b
python run_top_config_utility.py --model llama3-8b

6) Activation steering baseline

Run the baseline steering method with different steering strengths.

bash evaluate_steering.sh
bash evaluate_utility_steering.sh

7) Plot trade-off (truthfulness vs utility)

Generate a trade-off plot similar to the one in the paper.

python plot.py --models gemma2-2b,llama3-8b

8) Visualize edit strength (heatmap)

Plot the per-layer edit strength (λ) under the best hyperparameter setting to highlight which components drive the target attribute. You can switch to different hyperparameter settings; otherwise, this uses the best setting reported in the paper.

python visualize_edit.py \
  --model1 gemma2-2b --rho_attn1 0.3 --rho_mlp1 -1.0 --alpha1 0.9 \
  --model2 llama3-8b --rho_attn2 0.1 --rho_mlp2 0.3 --alpha2 0.4 \
  --output_dir plots --filename edit_distribution_truth

Expected figures:

Truthfulness–Utility Trade-off
Heatmap of Edit Strengths

Efficient reasoning

Reasoning Models: qwen3-4b-thinking, nemotron-7b

1) Generate probe responses (subset of GSM8K train split)

cd efficient_reasoning
python generate_response.py --model qwen3-4b-thinking --probe_path data/gsm8k_probe.jsonl
python generate_response.py --model nemotron-7b --probe_path data/gsm8k_probe.jsonl

2) Extract directions

Extract the steering vectors for reasoning efficiency after attention and MLP layers. The vectors will be saved in directions/.

python extract_directions.py --model qwen3-4b-thinking
python extract_directions.py --model nemotron-7b

3) Create edits

Run Steer2Edit based on the steering vectors. This creates multiple edits over a hyperparameter grid and saves all edits in edited_models/.

bash edit.sh --model qwen3-4b-thinking
bash edit.sh --model nemotron-7b

4) Attribute and Utility evaluation (reasoning length & accuracy)

Evaluate each edit on accuracy and reasoning length.

bash evaluate_editing.sh --model qwen3-4b-thinking
bash evaluate_editing.sh --model nemotron-7b
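For intuition, reasoning length can be measured as the token count of the model's thinking segment (an illustrative metric with assumed tag names; the evaluation script may define it differently):

```python
def reasoning_length(response, open_tag="<think>", close_tag="</think>"):
    """Count whitespace-separated tokens inside the thinking segment.

    Falls back to the full response length if no segment is found.
    """
    start = response.find(open_tag)
    end = response.find(close_tag)
    if start == -1 or end == -1 or end <= start:
        return len(response.split())
    inner = response[start + len(open_tag):end]
    return len(inner.split())

example = "<think>first add the numbers then divide</think> The answer is 4."
n = reasoning_length(example)  # 6 tokens of reasoning
```

An effective edit should lower this count on probe questions while the accuracy metric stays flat.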

5) Activation steering baseline

Run the baseline steering method with different steering strengths.

bash evaluate_steering.sh

6) Plot trade-off (reasoning length vs accuracy)

Generate a trade-off plot similar to the one in the paper.

python plot.py --models qwen3-4b-thinking,nemotron-7b 

7) Visualize edit strength (heatmap)

Plot the per-layer edit strength (λ) under the best hyperparameter setting to highlight which components drive the target attribute. You can switch to different hyperparameter settings; otherwise, this uses the best setting reported in the paper.

python visualize_edit.py \
  --model1 qwen3-4b-thinking --rho_attn1 -1.0 --rho_mlp1 0.8 --alpha1 0.05 \
  --model2 nemotron-7b --rho_attn2 0.3 --rho_mlp2 0.9 --alpha2 0.2 \
  --output_dir plots --filename edit_distribution_efficiency

Expected figures:

Efficiency–Accuracy Trade-off
Heatmap of Edit Strengths

Cite this work

@article{sun2026steer2edit,
  title={Steer2Edit: From Activation Steering to Component-Level Editing},
  author={Sun, Chung-En and Yan, Ge and Wang, Zimo and Weng, Tsui-Wei},
  journal={arXiv preprint arXiv:2602.09870},
  year={2026}
}
