Orthogonal Projection Abliteration toolkit featuring Norm-Preservation, Null-Space Constraints, Winsorization, and Adaptive Layer Weighting
Install with `pip install -e .`. Requirements:
- Python 3.10+ with PyTorch
- CUDA (optional) — GPU acceleration for faster processing; falls back to CPU if unavailable
- llama.cpp (optional) — required for GGUF export; install separately from github.com/ggerganov/llama.cpp and ensure `convert_hf_to_gguf.py` and `llama-quantize` are available
Run `abliterate` to start. On first run, a setup wizard walks you through configuration—where your models live, output directories, and default precision. After that, you'll land in the main menu.
The main workflow. Select a model from discovered directories (or enter a path manually), configure your options, and let it run.
Step 1: Select Base Model
The CLI scans your configured directories and shows available models. Already-abliterated models are marked with [A].
Step 2: Output Path
Defaults to ./abliterate/abliterated_models/{model-name}-abliterated. Change it if you like.
Step 3: Configuration
- Number of prompts: How many harmful/harmless pairs to use (default: 30)
- Direction multiplier: Ablation strength—1.0 is full, lower values are gentler
- Norm preservation: Keeps weight magnitudes stable (recommended)
- Filter prompts by refusal: Only uses prompts the model actually refuses (recommended)
- Precision: float16 is fastest; bfloat16 trades some speed for better numerical stability
Step 4: Advanced Options

Optional enhancements for better results:
| Option | What it does | When to use |
|---|---|---|
| Winsorization | Clips outlier activations before computing directions | Gemma models, or when baseline gives weak results |
| Null-space constraints | Preserves model capabilities (math, coding, reasoning) | When you want minimal capability degradation |
| Adaptive layer weighting | Focuses ablation on middle-to-later layers | For targeted, surgical ablation |
Quick sanity checks:
- Quick test: 5 default prompts with refusal detection
- Custom prompt: Enter anything and see how the model responds
- Full evaluation: Statistical analysis (see below)
Load an original and abliterated model side-by-side, enter a prompt, and see both responses. Useful for spot-checking behavior changes.
Runs the model against harmful and harmless prompt sets, computing refusal rates for each. Results are saved as timestamped JSON files to your configured eval directory.
- Harmful refusal rate: Lower = more abliterated
- Harmless refusal rate: Lower = fewer false positives
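The refusal-rate computation can be sketched as a simple phrase-matching heuristic; the marker list below is illustrative, not the toolkit's actual detector:

```python
# Illustrative phrase list -- the toolkit's actual refusal detector may differ
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable", "i won't")

def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses containing a refusal phrase (simple heuristic)."""
    refused = sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)
    return refused / len(responses)
```

Run this over the harmful and harmless response sets separately to get the two rates above.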
Converts abliterated models to GGUF format for llama.cpp, Ollama, or LM Studio. Supports Q4_K_M, Q5_K_M, Q8_0, and F16 quantization types. Vision-language models get automatic mmproj export.
Manage model search directories, eval output location, llama.cpp path, and defaults.
Based on Arditi et al. (2024), refusal behavior is mediated by a single direction in activation space.
- Run the model on harmful prompts, extract hidden states from middle layers
- Run the model on harmless prompts, extract hidden states
- Refusal direction d = mean(harmful) − mean(harmless), normalized
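The steps above can be sketched in PyTorch (the tensor shapes and extraction point are assumptions; the toolkit's actual collection code may differ):

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means refusal direction, normalized to unit length.

    harmful_acts / harmless_acts: (n_prompts, hidden_dim) hidden states
    collected from a middle layer (e.g. at the final token position).
    """
    d = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return d / d.norm()
```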
Following Lai's norm-preserving method, we remove the refusal component from the weight matrices: W′ = W − α (W d) dᵀ. This projects out the component of each weight row that aligns with the refusal direction d (α is the direction multiplier).
Continuing Lai's norm-preserving method, we rescale to maintain activation magnitudes: W″ = W′ · ‖W‖_F / ‖W′‖_F. This keeps the Frobenius norm unchanged, preventing downstream instabilities.
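Both steps together can be sketched as follows (a minimal sketch assuming W's rows read from the residual stream and d is unit-norm; the toolkit's implementation may differ in detail):

```python
import torch

def abliterate_weight(W: torch.Tensor, d: torch.Tensor,
                      alpha: float = 1.0) -> torch.Tensor:
    """Project refusal direction d out of W, then restore the Frobenius norm.

    W: (out_dim, hidden_dim) weight matrix reading from the residual stream;
    d: unit-norm refusal direction of shape (hidden_dim,);
    alpha: direction multiplier (1.0 = full ablation).
    """
    orig_norm = W.norm()  # Frobenius norm of the original weights
    # Remove the component of each row along d: w <- w - alpha * (w . d) d
    W_abl = W - alpha * torch.outer(W @ d, d)
    # Rescale so the overall weight magnitude is unchanged
    return W_abl * (orig_norm / W_abl.norm())
```

With alpha = 1.0 the rows of the result are exactly orthogonal to d, while the Frobenius norm matches the original.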
For models with outlier activations (especially Gemma), we clip extreme values before direction computation:
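A per-dimension quantile clip is one way to implement this; the quantile thresholds below are illustrative defaults, not necessarily the toolkit's:

```python
import torch

def winsorize(acts: torch.Tensor, lower: float = 0.01,
              upper: float = 0.99) -> torch.Tensor:
    """Clip each hidden dimension to its [lower, upper] quantile range.

    acts: (n_prompts, hidden_dim). Quantiles are taken per dimension, so
    a few extreme outlier activations cannot dominate the mean direction.
    """
    lo = acts.quantile(lower, dim=0, keepdim=True)
    hi = acts.quantile(upper, dim=0, keepdim=True)
    return acts.clamp(min=lo, max=hi)
```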
Adapted from AlphaEdit (Fang et al., ICLR 2025). To preserve capabilities, we project the ablation update into the null space of preservation activations:
- Collect activations K from diverse capability prompts (math, coding, reasoning)
- Compute SVD: U, S, V = SVD(K)
- Build null-space projector: P_null = I − VV^T
- Constrain update: ΔW_constrained = ΔW · P_null
This mathematically guarantees the update won't affect outputs for preserved prompts.
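The projector construction above can be sketched with a plain SVD (the rank tolerance is an assumption; AlphaEdit's own implementation details may differ):

```python
import torch

def null_space_projector(K: torch.Tensor, rel_tol: float = 1e-5) -> torch.Tensor:
    """Projector onto the null space of preservation activations K.

    K: (n_prompts, hidden_dim). Its rows span the directions that must be
    preserved; P_null maps any update into their orthogonal complement.
    """
    _, S, Vh = torch.linalg.svd(K, full_matrices=False)
    V = Vh[S > rel_tol * S.max()].T          # right singular vectors of the row space
    return torch.eye(K.shape[1]) - V @ V.T   # P_null = I - V V^T

# Constrain an update: delta_W_constrained = delta_W @ null_space_projector(K)
```

Because every row of K lies in the span of V, multiplying an update by P_null zeroes its effect on those activations.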
Research shows refusal concentrates in middle-to-later layers. We apply Gaussian-weighted strength:
Where μ = 60% of model depth and σ = 20% of layers.
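With μ and σ defined as above, the per-layer weights can be computed as (a minimal sketch of the stated Gaussian schedule):

```python
import math

def layer_weights(n_layers: int, mu_frac: float = 0.6,
                  sigma_frac: float = 0.2) -> list[float]:
    """Gaussian ablation strength per layer, peaking at 60% of model depth."""
    mu = mu_frac * n_layers
    sigma = sigma_frac * n_layers
    return [math.exp(-0.5 * ((i - mu) / sigma) ** 2) for i in range(n_layers)]
```

For a 32-layer model this peaks around layer 19 and tapers toward the first and last layers.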
Core Research
- Refusal in Language Models Is Mediated by a Single Direction — Arditi et al. (2024)
- Representation Engineering — Zou et al. (2023)
Techniques
- Norm-Preserving Biprojected Abliteration — Jim Lai
- AlphaEdit: Null-Space Constrained Knowledge Editing — Fang et al. (ICLR 2025)
