This is the official repository for the paper **Steer2Edit: From Activation Steering to Component-Level Editing**. A 5-minute overview is available on our project website.
In this work, we propose Steer2Edit, a theoretically grounded, training-free framework that transforms steering vectors from inference-time control signals into diagnostic signals for component-level rank-1 weight editing.
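To make the idea concrete, here is a minimal numpy sketch of a rank-1 weight edit driven by a steering direction. Everything in it (shapes, the damping form of the update, the strength λ) is an illustrative assumption; the actual Steer2Edit update rule and component selection are defined in the paper and implemented in the per-behavior editing scripts.

```python
import numpy as np

# Illustrative sketch only: a rank-1 edit to one component's weight matrix,
# guided by a steering direction v. This is NOT the exact Steer2Edit rule.
rng = np.random.default_rng(0)
d_model = 16

W = rng.standard_normal((d_model, d_model))  # a component's weight matrix
v = rng.standard_normal(d_model)             # steering direction (hypothetical)
v = v / np.linalg.norm(v)

lam = 0.5                                    # edit strength ("lambda")
# Dampen the component's output along v: W' = (I - lam * v v^T) W
W_edited = W - lam * np.outer(v, v @ W)

# The change to the weights is exactly rank 1.
print(np.linalg.matrix_rank(W_edited - W))
```

The key property is that the edit touches only a one-dimensional subspace of the component's output, which is what makes the per-component strengths interpretable.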
This repository contains the full Steer2Edit pipeline with three behavior controls:

- `safety_alignment/`
- `truthfulness/`
- `efficient_reasoning/`

Each behavior control includes the data used for probing:

- Safety Alignment: `safety_alignment/data/` (AdvLLM & GCG adversarial prompts, and benign prompts from the Alpaca dataset)
- Truthfulness: `truthfulness/data/truthfulqa/`
- Efficient Reasoning: `efficient_reasoning/data/`
Install the dependencies:

```bash
pip install -r requirements.txt
```

## Safety Alignment

Models: llama2-7b, mistral-7b
First, generate responses from each model for the harmful and neutral queries.
```bash
cd safety_alignment
python generate_response.py --model llama2-7b
python generate_response.py --model mistral-7b
```

Next, extract the steering vectors for refusal behavior after the attention and MLP layers. The vectors are saved in `directions/`.
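For intuition before running the extraction: steering directions of this kind are commonly computed as a difference of mean hidden activations between the two prompt sets (here, harmful vs. benign). The sketch below shows that general recipe with random stand-in activations; it is an assumption-laden illustration, not the actual logic of `extract_directions.py`.

```python
import numpy as np

# Difference-in-means sketch: given hidden states collected at one layer for
# harmful vs. benign prompts, a steering direction is the (normalized)
# difference of the two class means. Shapes and names are illustrative.
rng = np.random.default_rng(0)
d_model = 32

acts_harmful = rng.standard_normal((100, d_model)) + 1.0  # [n_prompts, d_model]
acts_benign = rng.standard_normal((120, d_model))

direction = acts_harmful.mean(axis=0) - acts_benign.mean(axis=0)
direction = direction / np.linalg.norm(direction)  # unit-norm steering vector
print(direction.shape)
```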
```bash
python extract_directions.py --model llama2-7b
python extract_directions.py --model mistral-7b
```

Run Steer2Edit on the extracted steering vectors. This creates multiple edits over a hyperparameter grid and saves all of them in `edited_models/`.
```bash
bash edit.sh --model llama2-7b
bash edit.sh --model mistral-7b
```

Evaluate the refusal rate of each edit. We use vLLM to speed up inference; reduce `TENSOR_PARALLEL_SIZE=4` if you have fewer than 4 GPUs.
```bash
bash evaluate_editing.sh --model llama2-7b
bash evaluate_editing.sh --model mistral-7b
```

The next script selects the top 10 configurations based on the attribute evaluation results and evaluates the utility of each edited model.
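Conceptually, this selection step just ranks the grid of hyperparameter configurations by the attribute metric and keeps the best 10. A hypothetical sketch (the field names and "higher refusal rate is better" assumption are illustrative, not the repository's actual result format):

```python
# Hypothetical result records, one per hyperparameter grid point.
results = [
    {"rho_attn": 0.18, "rho_mlp": 0.55, "alpha": 0.80, "refusal_rate": 0.97},
    {"rho_attn": 0.42, "rho_mlp": 0.55, "alpha": 0.75, "refusal_rate": 0.94},
    {"rho_attn": 0.10, "rho_mlp": 0.30, "alpha": 0.40, "refusal_rate": 0.62},
]

# Keep the 10 best-scoring configurations (here fewer than 10 exist).
top_configs = sorted(results, key=lambda r: r["refusal_rate"], reverse=True)[:10]
print(top_configs[0]["refusal_rate"])
```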
```bash
python run_top_config_utility.py --model llama2-7b
python run_top_config_utility.py --model mistral-7b
```

Run the baseline steering method with different steering strengths.
```bash
bash evaluate_steering.sh
bash evaluate_utility_steering.sh
```

Generate a trade-off plot similar to the one in the paper.
```bash
python plot.py --models llama2-7b,mistral-7b
```

Plot the per-layer edit strength (λ) under the best hyperparameter setting to highlight which components drive the target attribute. You can switch to different hyperparameter settings; by default, this uses the best setting reported in the paper.
```bash
python visualize_edit.py \
    --model1 llama2-7b --rho_attn1 0.18 --rho_mlp1 0.55 --alpha1 0.8 \
    --model2 mistral-7b --rho_attn2 0.42 --rho_mlp2 0.55 --alpha2 0.75
```

Expected figures:
- Safety–Utility Trade-off
- Heatmap of Edit Strengths
## Truthfulness

Models: gemma2-2b, llama3-8b
```bash
cd truthfulness
python generate_response.py --model gemma2-2b --probe_path data/truthfulqa/train
python generate_response.py --model llama3-8b --probe_path data/truthfulqa/train
```

Next, extract the steering vectors for truthfulness after the attention and MLP layers. The vectors are saved in `directions/`.
```bash
python extract_directions.py --model gemma2-2b
python extract_directions.py --model llama3-8b
```

Run Steer2Edit on the extracted steering vectors. This creates multiple edits over a hyperparameter grid and saves all of them in `edited_models/`.
```bash
bash edit.sh --model gemma2-2b
bash edit.sh --model llama3-8b
```

Evaluate the TruthfulQA attribute metric for each edit.
```bash
bash evaluate_editing.sh --model gemma2-2b
bash evaluate_editing.sh --model llama3-8b
```

The next script selects the top 10 configurations based on the attribute evaluation results and evaluates the utility of each edited model.
```bash
python run_top_config_utility.py --model gemma2-2b
python run_top_config_utility.py --model llama3-8b
```

Run the baseline steering method with different steering strengths.
```bash
bash evaluate_steering.sh
bash evaluate_utility_steering.sh
```

Generate a trade-off plot similar to the one in the paper.
```bash
python plot.py --models gemma2-2b,llama3-8b
```

Plot the per-layer edit strength (λ) under the best hyperparameter setting to highlight which components drive the target attribute. You can switch to different hyperparameter settings; by default, this uses the best setting reported in the paper.
```bash
python visualize_edit.py \
    --model1 gemma2-2b --rho_attn1 0.3 --rho_mlp1 -1.0 --alpha1 0.9 \
    --model2 llama3-8b --rho_attn2 0.1 --rho_mlp2 0.3 --alpha2 0.4 \
    --output_dir plots --filename edit_distribution_truth
```

Expected figures:
- Truthfulness–Utility Trade-off
- Heatmap of Edit Strengths
## Efficient Reasoning

Reasoning models: qwen3-4b-thinking, nemotron-7b
```bash
cd efficient_reasoning
python generate_response.py --model qwen3-4b-thinking --probe_path data/gsm8k_probe.jsonl
python generate_response.py --model nemotron-7b --probe_path data/gsm8k_probe.jsonl
```

Next, extract the steering vectors for reasoning efficiency after the attention and MLP layers. The vectors are saved in `directions/`.
```bash
python extract_directions.py --model qwen3-4b-thinking
python extract_directions.py --model nemotron-7b
```

Run Steer2Edit on the extracted steering vectors. This creates multiple edits over a hyperparameter grid and saves all of them in `edited_models/`.
```bash
bash edit.sh --model qwen3-4b-thinking
bash edit.sh --model nemotron-7b
```

Evaluate each edit on accuracy and reasoning length.
```bash
bash evaluate_editing.sh --model qwen3-4b-thinking
bash evaluate_editing.sh --model nemotron-7b
```

Run the baseline steering method with different steering strengths.
```bash
bash evaluate_steering.sh
```

Generate a trade-off plot similar to the one in the paper.
```bash
python plot.py --models qwen3-4b-thinking,nemotron-7b
```

Plot the per-layer edit strength (λ) under the best hyperparameter setting to highlight which components drive the target attribute. You can switch to different hyperparameter settings; by default, this uses the best setting reported in the paper.
```bash
python visualize_edit.py \
    --model1 qwen3-4b-thinking --rho_attn1 -1.0 --rho_mlp1 0.8 --alpha1 0.05 \
    --model2 nemotron-7b --rho_attn2 0.3 --rho_mlp2 0.9 --alpha2 0.2 \
    --output_dir plots --filename edit_distribution_efficiency
```

Expected figures:
- Efficiency–Accuracy Trade-off
- Heatmap of Edit Strengths
## Citation

```bibtex
@article{sun2026steer2edit,
  title={Steer2Edit: From Activation Steering to Component-Level Editing},
  author={Sun, Chung-En and Yan, Ge and Wang, Zimo and Weng, Tsui-Wei},
  journal={arXiv preprint arXiv:2602.09870},
  year={2026}
}
```





