Orthogonal Projection Abliteration toolkit featuring Norm-Preservation, Null-Space Constraints, Winsorization, and Adaptive Layer Weighting
Install with `pip install -e .`. Requirements:
- Python 3.10+ with PyTorch
- CUDA (optional) — GPU acceleration for faster processing; falls back to CPU if unavailable
- llama.cpp (optional) — required for GGUF export; install separately from github.com/ggerganov/llama.cpp and ensure `convert_hf_to_gguf.py` and `llama-quantize` are available
Run `abliterate` to start. On first run, a setup wizard walks you through configuration—where your models live, output directories, and default precision. After that, you'll land in the main menu.
The main workflow. Select a model from discovered directories (or enter a path manually), configure your options, and let it run.
Step 1: Select Base Model
The CLI scans your configured directories and shows available models. Already-abliterated models are marked with [A].
Step 2: Output Path
Defaults to ./abliterate/abliterated_models/{model-name}-abliterated. Change it if you like.
Step 3: Configuration
- Number of prompts: How many harmful/harmless pairs to use (default: 30)
- Direction multiplier: Ablation strength—1.0 is full, lower values are gentler
- Norm preservation: Keeps weight magnitudes stable (recommended)
- Filter prompts by refusal: Only uses prompts the model actually refuses (recommended)
- Precision: float16 is fastest; bfloat16 trades some speed for better numerical stability
Step 4: Advanced Options

Optional enhancements for better results:
| Option | What it does | When to use |
|---|---|---|
| Winsorization | Clips outlier activations before computing directions | Gemma models, or when baseline gives weak results |
| Null-space constraints | Preserves model capabilities (math, coding, reasoning) | When you want minimal capability degradation |
| Adaptive layer weighting | Focuses ablation on middle-to-later layers | For targeted, surgical ablation |
Quick sanity checks:
- Quick test: 5 default prompts with refusal detection
- Custom prompt: Enter anything and see how the model responds
- Full evaluation: Statistical analysis (see below)
Load an original and abliterated model side-by-side, enter a prompt, and see both responses. Useful for spot-checking behavior changes.
Runs the model against harmful and harmless prompt sets, computing refusal rates for each. Results are saved as timestamped JSON files to your configured eval directory.
- Harmful refusal rate: Lower = more abliterated
- Harmless refusal rate: Lower = fewer false positives
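The refusal-rate computation can be sketched as a simple phrase-matching heuristic; the marker list below is illustrative, not the toolkit's actual detector:

```python
# Illustrative phrase list -- the toolkit's actual refusal detector may differ
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable", "i won't")

def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses containing a refusal phrase (simple heuristic)."""
    refused = sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)
    return refused / len(responses)
```

Run this over the harmful and harmless response sets separately to get the two rates above.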
Converts abliterated models to GGUF format for llama.cpp, Ollama, or LM Studio. Supports Q4_K_M, Q5_K_M, Q8_0, and F16 quantization types. Vision-language models get automatic mmproj export.
Manage model search directories, eval output location, llama.cpp path, and defaults.
Based on Arditi et al. (2024), refusal behavior is mediated by a single direction in activation space.
- Run the model on harmful prompts, extract hidden states from middle layers
- Run the model on harmless prompts, extract hidden states
- Refusal direction d = mean(harmful) − mean(harmless), normalized
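The steps above can be sketched in PyTorch (the tensor shapes and extraction point are assumptions; the toolkit's actual collection code may differ):

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means refusal direction, normalized to unit length.

    harmful_acts / harmless_acts: (n_prompts, hidden_dim) hidden states
    collected from a middle layer (e.g. at the final token position).
    """
    d = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return d / d.norm()
```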
Following Lai's norm-preserving method, we remove the refusal component from the weight matrices: W′ = W − α (W d) dᵀ. This projects out the component of each weight row that aligns with the refusal direction d (α is the direction multiplier).
Continuing Lai's norm-preserving method, we rescale to maintain activation magnitudes: W″ = W′ · ‖W‖_F / ‖W′‖_F. This keeps the Frobenius norm unchanged, preventing downstream instabilities.
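Both steps together can be sketched as follows (a minimal sketch assuming W's rows read from the residual stream and d is unit-norm; the toolkit's implementation may differ in detail):

```python
import torch

def abliterate_weight(W: torch.Tensor, d: torch.Tensor,
                      alpha: float = 1.0) -> torch.Tensor:
    """Project refusal direction d out of W, then restore the Frobenius norm.

    W: (out_dim, hidden_dim) weight matrix reading from the residual stream;
    d: unit-norm refusal direction of shape (hidden_dim,);
    alpha: direction multiplier (1.0 = full ablation).
    """
    orig_norm = W.norm()  # Frobenius norm of the original weights
    # Remove the component of each row along d: w <- w - alpha * (w . d) d
    W_abl = W - alpha * torch.outer(W @ d, d)
    # Rescale so the overall weight magnitude is unchanged
    return W_abl * (orig_norm / W_abl.norm())
```

With alpha = 1.0 the rows of the result are exactly orthogonal to d, while the Frobenius norm matches the original.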
For models with outlier activations (especially Gemma), we clip extreme values before direction computation:
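A per-dimension quantile clip is one way to implement this; the quantile thresholds below are illustrative defaults, not necessarily the toolkit's:

```python
import torch

def winsorize(acts: torch.Tensor, lower: float = 0.01,
              upper: float = 0.99) -> torch.Tensor:
    """Clip each hidden dimension to its [lower, upper] quantile range.

    acts: (n_prompts, hidden_dim). Quantiles are taken per dimension, so
    a few extreme outlier activations cannot dominate the mean direction.
    """
    lo = acts.quantile(lower, dim=0, keepdim=True)
    hi = acts.quantile(upper, dim=0, keepdim=True)
    return acts.clamp(min=lo, max=hi)
```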
Adapted from AlphaEdit (Fang et al., ICLR 2025). To preserve capabilities, we project the ablation update into the null space of preservation activations:
- Collect activations K from diverse capability prompts (math, coding, reasoning)
- Compute SVD: U, S, V = SVD(K)
- Build null-space projector: P_null = I − VV^T
- Constrain update: ΔW_constrained = ΔW · P_null
This mathematically guarantees the update won't affect outputs for preserved prompts.
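The projector construction above can be sketched with a plain SVD (the rank tolerance is an assumption; AlphaEdit's own implementation details may differ):

```python
import torch

def null_space_projector(K: torch.Tensor, rel_tol: float = 1e-5) -> torch.Tensor:
    """Projector onto the null space of preservation activations K.

    K: (n_prompts, hidden_dim). Its rows span the directions that must be
    preserved; P_null maps any update into their orthogonal complement.
    """
    _, S, Vh = torch.linalg.svd(K, full_matrices=False)
    V = Vh[S > rel_tol * S.max()].T          # right singular vectors of the row space
    return torch.eye(K.shape[1]) - V @ V.T   # P_null = I - V V^T

# Constrain an update: delta_W_constrained = delta_W @ null_space_projector(K)
```

Because every row of K lies in the span of V, multiplying an update by P_null zeroes its effect on those activations.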
Research shows refusal concentrates in middle-to-later layers. We apply Gaussian-weighted strength:
Where μ = 60% of model depth and σ = 20% of layers.
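With μ and σ defined as above, the per-layer weights can be computed as (a minimal sketch of the stated Gaussian schedule):

```python
import math

def layer_weights(n_layers: int, mu_frac: float = 0.6,
                  sigma_frac: float = 0.2) -> list[float]:
    """Gaussian ablation strength per layer, peaking at 60% of model depth."""
    mu = mu_frac * n_layers
    sigma = sigma_frac * n_layers
    return [math.exp(-0.5 * ((i - mu) / sigma) ** 2) for i in range(n_layers)]
```

For a 32-layer model this peaks around layer 19 and tapers toward the first and last layers.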
Core Research
- Refusal in Language Models Is Mediated by a Single Direction — Arditi et al. (2024)
- Representation Engineering — Zou et al. (2023)
Techniques
- Norm-Preserving Biprojected Abliteration — Jim Lai
- AlphaEdit: Null-Space Constrained Knowledge Editing — Fang et al. (ICLR 2025)
