[ICLR 2025] Understanding and Enhancing Safety Mechanisms of LLMs via Safety-Specific Neuron

This repository contains code for the paper "Understanding and Enhancing Safety Mechanisms of LLMs via Safety-Specific Neuron".

Neuron Detection (PLND)

The detection codebase is identical to that of "How do Large Language Models Handle Multilingualism?". We provide code for detecting neurons in Llama, Mistral, and Gemma.

Installation

The package can be installed by running the following command at the root of this repository:

conda create -n Neuron python=3.9
conda activate Neuron
pip install -r requirement.txt

Running

The detection corpus is the harmful behavior dataset from llm-attack. The transformers package must be modified before running detection. When running detection, specify the language and the number of documents to use. Detected neurons are stored in the ./output_neurons folder.

cd neuron_detection
python neuron_detection.py english 1000

Parameters

Number of Top-k neurons in each layer

top_number_attn = 1000
top_number_ffn = 2000
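
As an illustration of what these two settings control, the sketch below keeps the k highest-scoring neurons in each layer. The function name `top_k_neurons` and the toy scores are hypothetical; how neuron_detection.py actually computes its scores is not shown here.

```python
import heapq


def top_k_neurons(scores_per_layer, k):
    """For each layer, return the set of indices of the k highest-scoring neurons."""
    selected = {}
    for layer, scores in scores_per_layer.items():
        # nlargest ranks neuron indices by their score; it returns fewer
        # than k indices if the layer has fewer than k neurons.
        top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
        selected[layer] = set(top)
    return selected


# Toy importance scores (assumed values): layer index -> one score per neuron.
example_scores = {
    0: [0.1, 0.9, 0.3, 0.7],
    1: [0.5, 0.2, 0.8, 0.4],
}
top_neurons = top_k_neurons(example_scores, k=2)
print(top_neurons)
```

In this picture, top_number_attn and top_number_ffn would play the role of `k` for the attention and feed-forward neurons respectively.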

Neuron Deactivation

We provide code for deactivating neurons in Llama, Mistral, and Gemma.

Installation

The package can be installed by running the following command at the root of this repository:

conda create -n Deactivate python=3.9
conda activate Deactivate
pip install -r requirement.txt

Running

The transformers package must be modified before running deactivation.

cd neuron_deactivate
python test_mistral_gsm.py {language} {understanding layer} {generation layer} {attn deact_number} {ffn deact_number} {whether under_attn} {whether reason_attn} {whether gen_attn} {whether under_ffn} {whether reason_ffn} {whether gen_ffn}
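
Deactivation is commonly implemented by zeroing the outputs of the selected neurons. The sketch below shows that idea on a plain list of activations. `deactivate_neurons` and the toy values are hypothetical, and the scripts in neuron_deactivate may implement it differently (for example, inside the modified transformers package).

```python
def deactivate_neurons(activations, neuron_indices):
    """Return a copy of `activations` with the selected neurons set to zero."""
    return [0.0 if i in neuron_indices else a for i, a in enumerate(activations)]


# Toy example (assumed values): four neurons in one layer,
# of which neurons 1 and 3 are deactivated.
layer_activations = [0.2, -1.5, 0.7, 3.0]
masked = deactivate_neurons(layer_activations, neuron_indices={1, 3})
print(masked)
```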

Neuron Specific Enhancement

The neuron-specific tuning code is the same for all models.

Installation

The package can be installed by running the following command at the root of this repository:

conda create -n Enhance python=3.9
conda activate Enhance
pip install -r requirement.txt

Running

The transformers package must be modified before running enhancement.

cd neuron_enhancement
python train_neuron.py

Parameters

Note that the neuron indices in attn_k and attn_v need to be divided by kv_repeat. index_keys must match the model you want to train, and the numbers of understanding layers and generation layers must be changed accordingly.

index_keys = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]         

index_keys_under = [i for i in range(8)]
index_keys_gen = [31-i for i in range(4)]

attn_k = {key: {num//4 for num in value} for key, value in attn_k.items()}
attn_v = {key: {num//4 for num in value} for key, value in attn_v.items()}
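
The snippet above hard-codes 32 layers and kv_repeat = 4. A minimal sketch of the same bookkeeping with those values as parameters is given below. It assumes that kv_repeat is the ratio of attention heads to key/value heads in a grouped-query attention model (Mistral-7B, for instance, has 32 attention heads and 8 key/value heads, giving 4). The helper name `rescale_kv_indices` and the toy neuron sets are hypothetical.

```python
def rescale_kv_indices(neurons_per_layer, kv_repeat):
    """Integer-divide every neuron index by kv_repeat, layer by layer."""
    return {
        layer: {idx // kv_repeat for idx in indices}
        for layer, indices in neurons_per_layer.items()
    }


num_layers = 32  # must match the model being trained
num_under = 8    # number of understanding layers (first layers)
num_gen = 4      # number of generation layers (last layers)

index_keys = list(range(num_layers))
index_keys_under = list(range(num_under))
index_keys_gen = [num_layers - 1 - i for i in range(num_gen)]

# Toy detected neurons (assumed values): layer index -> set of neuron indices.
attn_k = {0: {0, 5, 4095}, 1: {8, 9, 10, 11}}
attn_k_rescaled = rescale_kv_indices(attn_k, kv_repeat=4)
print(index_keys_gen)
print(attn_k_rescaled)
```

Because the indices are stored in sets, several original indices can collapse into one after division, as happens for layer 1 in this toy input.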

Citation

If you find this repository useful, please consider citing:

@inproceedings{
zhao2025understanding,
title={Understanding and Enhancing Safety Mechanisms of {LLM}s via Safety-Specific Neuron},
author={Yiran Zhao and Wenxuan Zhang and Yuxi Xie and Anirudh Goyal and Kenji Kawaguchi and Michael Shieh},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=yR47RmND1m}
}
