Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models
- 12/18/2025: 🎊 Model provided 🎉
- 12/17/2025: Source code provided
- 05/30/2025: Initial repo created
This repository contains the source code for the ACL 2025 paper: Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models
The generation of toxic content by large language models (LLMs) remains a critical challenge for the safe deployment of language technology. We propose a novel framework for implicit knowledge editing and controlled text generation by fine-tuning LLMs with a prototype-based contrastive perplexity objective. Central to our method is the construction of hard negatives—toxic outputs that are generated through adversarial paraphrasing to be semantically similar and closely matched in length and model probability to their non-toxic counterparts. By training on these challenging and realistic pairs, our approach ensures robust and stable contrastive optimization. Experimental results in the domain of detoxification demonstrate that our method significantly reduces toxic generation while maintaining strong performance on downstream tasks such as commonsense reasoning and reading comprehension. Our findings highlight the effectiveness of leveraging hard negatives for attribute-aware language model fine-tuning.
The fine-tuned language models for which performance is reported in the paper are available in the Hugging Face Model Repository.
Key features:

- Contrastive Perplexity Framework: A novel training objective that leverages perplexity differences between toxic and non-toxic text pairs
- Hard Negative Mining: Adversarial paraphrasing to generate challenging negative examples (see the illustrative sketch after this list)
- Effective Detoxification: Significantly reduces toxic generation while maintaining strong performance on downstream tasks
- Minimal Performance Degradation: Preserves model capabilities on commonsense reasoning and reading comprehension
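As a concrete illustration of the hard-negative idea, the sketch below filters candidate toxic paraphrases so that they stay close to a non-toxic anchor in length and in model probability (mean NLL), as described in the abstract. It is a minimal sketch under our own assumptions: the function name, thresholds, and word-count length proxy are ours, and the semantic-similarity check and the adversarial paraphrasing step itself are omitted.

```python
import torch

def filter_hard_negatives(model, tokenizer, anchor, candidates,
                          max_len_ratio=0.2, max_nll_gap=0.5):
    """Keep toxic paraphrase candidates that are 'hard': close to the
    non-toxic anchor in length and in model probability (mean NLL).

    Illustrative only; thresholds and length proxy are assumptions.
    """
    def mean_nll(text):
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(input_ids=ids, labels=ids)
        return out.loss.item()  # HF CausalLM loss = mean NLL per token

    anchor_nll = mean_nll(anchor)
    anchor_len = max(len(anchor.split()), 1)
    hard = []
    for cand in candidates:
        len_gap = abs(len(cand.split()) - anchor_len) / anchor_len
        if len_gap <= max_len_ratio and abs(mean_nll(cand) - anchor_nll) <= max_nll_gap:
            hard.append(cand)
    return hard
```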
Requirements:

- Python 3.8+
- CUDA-capable GPU (tested on NVIDIA A100)
- 16GB+ GPU memory (for 4-bit quantization) or 40GB+ (for fp16)
Install required packages:
```bash
pip install -r requirements.txt
```

Main dependencies:
- `transformers >= 4.35.0`
- `torch >= 2.0.0`
- `peft >= 0.6.0`
- `datasets >= 2.14.0`
- `bitsandbytes >= 0.41.0`
- `einops >= 0.7.0`
- `wandb` (optional, for experiment tracking)
```bash
# Clone the repository
git clone https://github.com/SAP-samples/acl2025-contrastive-perplexity.git
cd acl2025-contrastive-perplexity

# Install dependencies
pip install -r requirements.txt

# Prepare data (downloads ToxiGen dataset)
bash scripts/prepare_data.sh
```
⚠️ IMPORTANT: The contrastive perplexity method requires per-token loss computation, which is not supported in standard HuggingFace transformer implementations. We provide a utility to patch models at runtime.
Standard HuggingFace CausalLM models compute the loss with a hardcoded `reduction="mean"`:
```python
loss_fct = CrossEntropyLoss()  # Always averages the loss
```

Our method needs individual token losses to compute perplexity for each example:
```python
loss_fct = CrossEntropyLoss(reduction="none")  # Returns loss per token
```

All training scripts automatically apply the patch. If you're using the models in your own code:
```python
from transformers import MistralForCausalLM
from models.patch_utils import patch_causal_lm_for_loss_reduction

# Load model
model = MistralForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Apply patch (required for contrastive perplexity training)
model = patch_causal_lm_for_loss_reduction(model)

# Now you can use the loss_reduction parameter
output = model(input_ids=..., labels=..., loss_reduction="none")
```

The patch utility is located at `models/patch_utils.py` and works with any HuggingFace CausalLM model (LLaMA, Mistral, GPT, etc.).
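For reference, the per-example perplexity that these per-token losses make possible can also be computed directly from the logits. The following is an illustrative sketch, not repository code; the function name and masking details are our own:

```python
import torch
import torch.nn.functional as F

def sequence_perplexity(model, input_ids, attention_mask):
    """Per-example perplexity from per-token losses (illustrative)."""
    with torch.no_grad():
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    # Shift so that tokens < n predict token n
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_mask = attention_mask[:, 1:].float()
    # Per-token negative log-likelihood, shape (batch, seq_len - 1)
    nll = F.cross_entropy(shift_logits.transpose(1, 2), shift_labels,
                          reduction="none")
    # Average over non-padding tokens, then exponentiate
    mean_nll = (nll * shift_mask).sum(dim=1) / shift_mask.sum(dim=1)
    return torch.exp(mean_nll)
```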
The method requires paired data: neutral prompts and their toxic paraphrases. We use the ToxiGen dataset:
```bash
# Download and prepare ToxiGen data
bash scripts/prepare_data.sh

# Generate paraphrases using an LLM (optional, pre-processed data available)
python scripts/paraphrase.py \
--input_dir data/eval/toxigen \
--output_dir data/paraphrase/toxigen \
--model_name TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ
```

To train with the contrastive perplexity objective:

```bash
python train_mistral.py \
--model_name mistralai/Mistral-7B-v0.1 \
--data_dir data/paraphrase/safeNLP_processed \
--output_dir results/mistral_contrastive \
--num_pos 6 \
--num_neg 6 \
--alpha 100.0 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 4 \
--num_train_epochs 3 \
--learning_rate 2e-5 \
--use_4bit_quantization \
--lora_r 64 \
--lora_target_modules q_proj,v_proj,k_proj,o_proj \
--bf16
```

Key arguments:

- `--num_pos`: Number of positive (non-toxic) paraphrases per batch
- `--num_neg`: Number of negative (toxic) paraphrases per batch
- `--alpha`: Weight for the contrastive loss term (higher = stronger detoxification)
- `--tau`: Temperature parameter for contrastive learning
- `--smoothing`: Label smoothing factor for the contrastive loss
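For intuition about how `--alpha` and `--tau` enter the objective, here is a minimal, InfoNCE-style sketch of a contrastive term over per-sequence mean NLLs. This is our own simplified reading of a contrastive perplexity objective, not the repository's exact loss; see the paper for the actual formulation (label smoothing via `--smoothing` is omitted here):

```python
import torch
import torch.nn.functional as F

def contrastive_perplexity_term(pos_nll, neg_nll, alpha=100.0, tau=1.0):
    """Illustrative contrastive term over per-sequence mean NLLs.

    pos_nll: (num_pos,) mean NLL of non-toxic paraphrases
    neg_nll: (num_neg,) mean NLL of toxic paraphrases
    Lower NLL means higher model probability, so this term pushes the
    model to assign lower perplexity to positives than to negatives.
    """
    # Treat negated NLLs as logits: positives should win the softmax
    logits = torch.cat([-pos_nll, -neg_nll]) / tau
    log_probs = F.log_softmax(logits, dim=0)
    # Cross-entropy against a uniform target over the positive entries
    return alpha * (-log_probs[: pos_nll.numel()].mean())
```

In training, such a term would typically be added to the standard language-modeling loss on the non-toxic examples.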
For even stronger detoxification, use the hard negatives variant:
```bash
python train_mistral_hard_negatives.py \
--model_name mistralai/Mistral-7B-v0.1 \
--num_pos 5 \
--num_neg 5 \
--alpha 100.0 \
--gradient_accumulation_steps 6 \
--num_train_epochs 5 \
--learning_rate 2.2e-5
```

We also provide implementations of baseline methods for comparison:
```bash
# DPO (Direct Preference Optimization)
python train_dpo.py --model_name mistralai/Mistral-7B-v0.1

# PPO (Proximal Policy Optimization)
python train_ppo.py --model_name mistralai/Mistral-7B-v0.1

# SimPO
python run_simpo.py --model_name mistralai/Mistral-7B-v0.1
```

To evaluate, compute perplexity on toxic vs. non-toxic text:
```bash
python scripts/compute_perplexity.py \
--model_path results/mistral_contrastive \
--data_dir data/paraphrase/safeNLP_processed \
--num_toxic_samples 4 \
--num_neutral_samples 0
```

Expected output: Lower perplexity on non-toxic text indicates successful detoxification.
```
.
├── models/
│   ├── patch_utils.py                 # Utility to patch HuggingFace models for loss_reduction
│   └── README.md                      # Documentation of model modifications
├── scripts/
│   ├── prepare_data.sh                # Download ToxiGen dataset
│   ├── paraphrase.py                  # Generate paraphrases using LLMs
│   ├── clean_data.py                  # Data cleaning and toxicity scoring
│   └── compute_perplexity.py          # Evaluation script
├── train_mistral.py                   # Main training script (Mistral)
├── train_mistral_hard_negatives.py    # Hard negatives variant
├── train_mistral_cycle.py             # Cycle consistency variant
├── contrastive-train.py               # LLaMA-2 training
├── train_dpo.py                       # DPO baseline
├── train_ppo.py                       # PPO baseline
├── run_simpo.py                       # SimPO baseline
└── requirements.txt                   # Python dependencies
```
Training hyperparameters can be configured via:

- Command-line arguments (see `--help` for each script)
- YAML config files (in the `configs/` directory)
- Environment variables (for data paths, wandb settings)
Example config file:
```yaml
# configs/contrastive_mistral.yaml
model_name: mistralai/Mistral-7B-v0.1
num_pos: 6
num_neg: 6
alpha: 100.0
learning_rate: 2e-5
num_train_epochs: 3
```

Use with: `python train_mistral.py --config configs/contrastive_mistral.yaml`
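If you are wiring the YAML configs into your own scripts, the sketch below shows one common way a `--config` file can be merged with argparse defaults. It is an assumption made for illustration, not the repository's actual implementation:

```python
import argparse
import yaml  # PyYAML

parser = argparse.ArgumentParser()
parser.add_argument("--config", type=str, default=None)
parser.add_argument("--model_name", type=str, default="mistralai/Mistral-7B-v0.1")
parser.add_argument("--alpha", type=float, default=100.0)
parser.add_argument("--learning_rate", type=float, default=2e-5)
args = parser.parse_args()

if args.config:
    with open(args.config) as f:
        for key, value in yaml.safe_load(f).items():
            setattr(args, key, value)  # config values override the defaults
```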
Our method achieves:
- 67% reduction in toxic generation (measured on RealToxicityPrompts)
- Maintains 95%+ of baseline performance on MMLU, HellaSwag, and other benchmarks
- Lower perplexity on non-toxic text compared to baseline methods
See the paper for full experimental results and analysis.
If you use this code in your research, please cite our paper:
```bibtex
@inproceedings{klein2025contrastive,
  title={Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models},
  author={Klein, Tassilo and Nabi, Moin},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)},
  year={2025}
}
```

Create an issue in this repository if you find a bug or have questions about the content.
For additional support, ask a question in SAP Community.
If you wish to contribute code, or offer fixes or improvements, please send a pull request. For legal reasons, contributors will be asked to accept a DCO when they create their first pull request to this project. This happens in an automated fashion during the submission process. SAP uses the standard DCO text of the Linux Foundation.
Copyright (c) 2025 SAP SE or an SAP affiliate company. All rights reserved. This project is licensed under the Apache Software License, version 2.0 except as noted otherwise in the LICENSE file.