A research project exploring the intersection of AI safety and creative writing through DPO-based training methods.
Team: Shikha, Sidhaarth, and Yash (CMU 11-711 Advanced NLP, Fall 2025)
Large language models are often fine-tuned with Reinforcement Learning (RL) to enhance creative diversity. However, these post-training methods rarely examine their effect on safety. When an RL objective rewards unusual or rare responses, it may also weaken the guardrails learned during supervised instruction tuning.
We study this relationship between diversity and safety and find that prior methods—which optimize diversity or alignment in isolation—are insufficient. Models that maximize diversity experience substantial degradation in safety benchmarks, while safer models exhibit significantly reduced semantic diversity.
To resolve this trade-off, we propose SafeDORPO (Safe Diversity-Oriented Reinforcement Learning Post-Training), a rubric-guided, safety-gated modification of DORPO with three components:
- Rubric Generation: Instance-specific rubrics for fine-grained safety evaluation
- Hard Safety Gate: Removes all unsafe completions from contributing to the gradient
- Joint Diversity-Safety Weighting: Applied directly within the ORPO objective
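A minimal sketch of how the gate and weighting could compose. The threshold `tau`, the renormalization, and the helper names here are illustrative assumptions, not the exact SafeDORPO formulation:

```python
import math

def safedorpo_weights(div_scores, safety_scores, tau=0.5):
    """Hard safety gate + diversity weighting (illustrative).

    Completions whose safety score falls below `tau` get weight 0,
    so they cannot contribute to the gradient; survivors are
    weighted by their diversity score and renormalized.
    """
    gated = [d if s >= tau else 0.0 for d, s in zip(div_scores, safety_scores)]
    z = sum(gated)
    return [g / z for g in gated] if z > 0 else [0.0] * len(gated)

def weighted_orpo_loss(logp_chosen, logp_rejected, weights):
    """ORPO-style odds-ratio penalty with per-sample weights (sketch).

    `logp_*` are average per-token log-probabilities (strictly < 0).
    """
    loss = 0.0
    for lc, lr, w in zip(logp_chosen, logp_rejected, weights):
        # log-odds of a sequence: log(p / (1 - p)) with p = exp(logp)
        odds_c = lc - math.log1p(-math.exp(lc))
        odds_r = lr - math.log1p(-math.exp(lr))
        # -log(sigmoid(odds ratio)): small when chosen beats rejected
        loss += w * -math.log(1.0 / (1.0 + math.exp(-(odds_c - odds_r))))
    return loss
```

With this sketch, an unsafe completion drops out entirely: `safedorpo_weights([0.2, 0.8], [0.9, 0.1])` returns `[1.0, 0.0]`, so the second (unsafe) sample never reaches the objective.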
Across Gemma-2B models, SafeDORPO:
- Recovers ~7 safety points compared to the diversity-optimized DORPO baseline
- Achieves ~22 points higher semantic diversity than supervised fine-tuning
- Demonstrates that safe diversity is achievable when RL exploration is explicitly constrained by rubric-based safety criteria
```
Creativity-Killed-The-Guardrails/
├── src/                          # Core source code
│   ├── classifier_models/        # Safety classifiers (WildGuard, etc.)
│   ├── data_utils.py             # Data processing utilities
│   ├── generation_utils.py       # Generation utilities
│   └── templates/                # Model input templates
├── evaluation/                   # Safety evaluation suite
│   ├── tasks/                    # Evaluation benchmarks
│   │   ├── generation/           # Generative LM evaluation
│   │   └── classification/       # Classifier evaluation
│   ├── eval.py                   # Main evaluation script
│   └── run_all_*.py              # Batch evaluation runners
├── dorpo-diversity/              # Diversity-aware training (submodule)
│   ├── scripts_dpo/              # Standard DPO training
│   ├── scripts_orpo/             # ORPO/DORPO training
│   ├── scripts_eval/             # Creative writing evaluation
│   └── diversity/                # Diversity metrics
├── results/                      # Evaluation results & plots
├── scripts/                      # Utility scripts
│   └── generate_plots.py         # Visualization
├── docs/                         # Documentation
│   ├── A3-project-proposal.pdf   # Initial proposal
│   ├── A3-report.pdf             # Baseline reproduction report
│   ├── A4-report.pdf             # Full project report
│   ├── anlp-poster.pdf           # Conference poster
│   └── reference-papers/         # Key papers
└── README.md
```
```shell
conda create -n safety-eval python=3.11 && conda activate safety-eval
pip install -e .
pip install -r requirements.txt
pip install vllm==0.9.0.1
```

```shell
# Run all generation benchmarks
python evaluation/run_all_generation_benchmarks.py \
    --model_name_or_path allenai/tulu-2-dpo-7b \
    --model_input_template_path_or_name tulu2 \
    --report_output_path ./results/metrics.json
```
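The report path above collects per-task metrics as JSON. A small helper like the following makes results easy to skim; the flat `{task: {metric: value}}` layout is an assumption about the schema, so adjust the key access if the suite nests results differently:

```python
import json
from pathlib import Path

def summarize(report_path):
    """Print and return one 'task: metric=value, ...' line per task.

    Assumes a flat {task: {metric: value}} JSON layout (an
    assumption about the evaluation suite's output schema).
    """
    report = json.loads(Path(report_path).read_text())
    lines = []
    for task, metrics in sorted(report.items()):
        if isinstance(metrics, dict):
            rendered = ", ".join(f"{k}={v}" for k, v in metrics.items())
            lines.append(f"{task}: {rendered}")
    for line in lines:
        print(line)
    return lines
```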
```shell
# Run specific benchmarks
python evaluation/eval.py generators \
    --use_vllm \
    --model_name_or_path allenai/tulu-2-dpo-7b \
    --tasks wildguardtest,harmbench,xstest
```

```shell
cd dorpo-diversity

# Standard DPO
bash cm_dpo.sh

# Diversity DPO (DDPO)
bash cm_ddpo.sh

# Evaluate creative writing
python -m scripts_eval.generation_eval
```

| Method | Diversity | Safety | Description |
|---|---|---|---|
| gemma-2b-it | High | Low | Base instruction-tuned model - creative but unsafe |
| gemma-2b-sft | Medium | Medium | Supervised fine-tuned |
| DORPO | Very High | Low | Diversity-optimized RL - most creative, weakest guardrails |
| SafeDORPO | High | Medium-High | Our method - recovers safety while preserving creativity |
SafeDORPO shifts the diversity-safety frontier upward, achieving higher diversity while recovering safety lost by DORPO.
Comprehensive benchmarks including:
- Safety: WildGuardTest, HarmBench, XSTest, ToxiGen, StrongReject
- Capabilities: AlpacaEval, MT-Bench, GSM8K, MMLU
Modified DPO training that incorporates:
- Semantic diversity scores
- Style diversity metrics
- Deviation-aware preference optimization
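As a concrete reference point, semantic diversity is commonly scored as the mean pairwise cosine distance between completion embeddings. The sketch below assumes that definition; the repo's exact metric and embedding model may differ:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_diversity(embeddings):
    """Mean pairwise cosine distance over completion embeddings.

    0.0 means all completions are semantically identical; larger
    values mean the set is more diverse.
    """
    n = len(embeddings)
    if n < 2:
        return 0.0
    dists = [1.0 - cosine(embeddings[i], embeddings[j])
             for i in range(n) for j in range(i + 1, n)]
    return sum(dists) / len(dists)
```

Two identical embeddings score 0.0; two orthogonal ones score 1.0.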
```bibtex
@misc{chung2025modifyingllm,
  title={Modifying Large Language Model Post-Training for Diverse Creative Writing},
  author={John Joon Young Chung and others},
  year={2025},
  eprint={2503.17126},
  archivePrefix={arXiv}
}

@misc{wildguard2024,
  title={WildGuard: Open One-Stop Moderation Tools for Safety Risks},
  author={Seungju Han and others},
  year={2024},
  eprint={2406.18495},
  archivePrefix={arXiv}
}
```

Developed as part of CMU 11-711 Advanced NLP, Fall 2025:
- Assignment 3: Baseline Reproduction
- Assignment 4: Full Research Project
This project was developed for academic purposes. The safety evaluation framework builds on Ai2's Safety-Eval, and the diversity-aware training builds on DiversityTuning.