A research project exploring the intersection of AI safety and creative writing through DPO-based training methods.
Team: Shikha, Sidhaarth, and Yash (CMU 11-711 Advanced NLP, Fall 2025)
Large language models are often fine-tuned with Reinforcement Learning (RL) to enhance creative diversity. However, these post-training methods rarely examine their effect on safety. When an RL objective rewards unusual or rare responses, it may also weaken the guardrails learned during supervised instruction tuning.
We study this relationship between diversity and safety and find that prior methods—which optimize diversity or alignment in isolation—are insufficient. Models that maximize diversity experience substantial degradation in safety benchmarks, while safer models exhibit significantly reduced semantic diversity.
To resolve this trade-off, we propose SafeDORPO (Safe Diversity-Oriented Reinforcement Learning Post-Training), a rubric-guided, safety-gated modification of DORPO with three components:
- Rubric Generation: Instance-specific rubrics for fine-grained safety evaluation
- Hard Safety Gate: Removes all unsafe completions from contributing to the gradient
- Joint Diversity-Safety Weighting: Applied directly within the ORPO objective
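A minimal sketch of how the gate and weighting could compose. The threshold `tau`, the renormalization, and the helper names here are illustrative assumptions, not the exact SafeDORPO formulation:

```python
import math

def safedorpo_weights(div_scores, safety_scores, tau=0.5):
    """Hard safety gate + diversity weighting (illustrative).

    Completions whose safety score falls below `tau` get weight 0,
    so they cannot contribute to the gradient; survivors are
    weighted by their diversity score and renormalized.
    """
    gated = [d if s >= tau else 0.0 for d, s in zip(div_scores, safety_scores)]
    z = sum(gated)
    return [g / z for g in gated] if z > 0 else [0.0] * len(gated)

def weighted_orpo_loss(logp_chosen, logp_rejected, weights):
    """ORPO-style odds-ratio penalty with per-sample weights (sketch).

    `logp_*` are average per-token log-probabilities (strictly < 0).
    """
    loss = 0.0
    for lc, lr, w in zip(logp_chosen, logp_rejected, weights):
        # log-odds of a sequence: log(p / (1 - p)) with p = exp(logp)
        odds_c = lc - math.log1p(-math.exp(lc))
        odds_r = lr - math.log1p(-math.exp(lr))
        # -log(sigmoid(odds ratio)): small when chosen beats rejected
        loss += w * -math.log(1.0 / (1.0 + math.exp(-(odds_c - odds_r))))
    return loss
```

With this sketch, an unsafe completion drops out entirely: `safedorpo_weights([0.2, 0.8], [0.9, 0.1])` returns `[1.0, 0.0]`, so the second (unsafe) sample never reaches the objective.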
Across Gemma-2B models, SafeDORPO:
- Recovers ~7 safety points compared to the diversity-optimized DORPO baseline
- Achieves ~22 points higher semantic diversity than supervised fine-tuning
- Demonstrates that safe diversity is achievable when RL exploration is explicitly constrained by rubric-based safety criteria
```
Creativity-Killed-The-Guardrails/
├── src/                          # Core source code
│   ├── classifier_models/        # Safety classifiers (WildGuard, etc.)
│   ├── data_utils.py             # Data processing utilities
│   ├── generation_utils.py       # Generation utilities
│   └── templates/                # Model input templates
├── evaluation/                   # Safety evaluation suite
│   ├── tasks/                    # Evaluation benchmarks
│   │   ├── generation/           # Generative LM evaluation
│   │   └── classification/       # Classifier evaluation
│   ├── eval.py                   # Main evaluation script
│   └── run_all_*.py              # Batch evaluation runners
├── dorpo-diversity/              # Diversity-aware training (submodule)
│   ├── scripts_dpo/              # Standard DPO training
│   ├── scripts_orpo/             # ORPO/DORPO training
│   ├── scripts_eval/             # Creative writing evaluation
│   └── diversity/                # Diversity metrics
├── results/                      # Evaluation results & plots
├── scripts/                      # Utility scripts
│   └── generate_plots.py         # Visualization
├── docs/                         # Documentation
│   ├── A3-project-proposal.pdf   # Initial proposal
│   ├── A3-report.pdf             # Baseline reproduction report
│   ├── A4-report.pdf             # Full project report
│   ├── anlp-poster.pdf           # Conference poster
│   └── reference-papers/         # Key papers
└── README.md
```
```shell
conda create -n safety-eval python=3.11 && conda activate safety-eval
pip install -e .
pip install -r requirements.txt
pip install vllm==0.9.0.1
```

```shell
# Run all generation benchmarks
python evaluation/run_all_generation_benchmarks.py \
    --model_name_or_path allenai/tulu-2-dpo-7b \
    --model_input_template_path_or_name tulu2 \
    --report_output_path ./results/metrics.json
```
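The report path above collects per-task metrics as JSON. A small helper like the following makes results easy to skim; the flat `{task: {metric: value}}` layout is an assumption about the schema, so adjust the key access if the suite nests results differently:

```python
import json
from pathlib import Path

def summarize(report_path):
    """Print and return one 'task: metric=value, ...' line per task.

    Assumes a flat {task: {metric: value}} JSON layout (an
    assumption about the evaluation suite's output schema).
    """
    report = json.loads(Path(report_path).read_text())
    lines = []
    for task, metrics in sorted(report.items()):
        if isinstance(metrics, dict):
            rendered = ", ".join(f"{k}={v}" for k, v in metrics.items())
            lines.append(f"{task}: {rendered}")
    for line in lines:
        print(line)
    return lines
```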
```shell
# Run specific benchmarks
python evaluation/eval.py generators \
    --use_vllm \
    --model_name_or_path allenai/tulu-2-dpo-7b \
    --tasks wildguardtest,harmbench,xstest
```

```shell
cd dorpo-diversity

# Standard DPO
bash cm_dpo.sh

# Diversity DPO (DDPO)
bash cm_ddpo.sh

# Evaluate creative writing
python -m scripts_eval.generation_eval
```

| Method | Diversity | Safety | Description |
|---|---|---|---|
| gemma-2b-it | High | Low | Base instruction-tuned model - creative but unsafe |
| gemma-2b-sft | Medium | Medium | Supervised fine-tuned |
| DORPO | Very High | Low | Diversity-optimized RL - most creative, weakest guardrails |
| SafeDORPO | High | Medium-High | Our method - recovers safety while preserving creativity |
SafeDORPO shifts the diversity-safety frontier upward, achieving higher diversity while recovering safety lost by DORPO.
Comprehensive benchmarks including:
- Safety: WildGuardTest, HarmBench, XSTest, ToxiGen, StrongReject
- Capabilities: AlpacaEval, MT-Bench, GSM8K, MMLU
Modified DPO training that incorporates:
- Semantic diversity scores
- Style diversity metrics
- Deviation-aware preference optimization
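As a concrete reference point, semantic diversity is commonly scored as the mean pairwise cosine distance between completion embeddings. The sketch below assumes that definition; the repo's exact metric and embedding model may differ:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_diversity(embeddings):
    """Mean pairwise cosine distance over completion embeddings.

    0.0 means all completions are semantically identical; larger
    values mean the set is more diverse.
    """
    n = len(embeddings)
    if n < 2:
        return 0.0
    dists = [1.0 - cosine(embeddings[i], embeddings[j])
             for i in range(n) for j in range(i + 1, n)]
    return sum(dists) / len(dists)
```

Two identical embeddings score 0.0; two orthogonal ones score 1.0.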
```bibtex
@misc{chung2025modifyingllm,
  title={Modifying Large Language Model Post-Training for Diverse Creative Writing},
  author={John Joon Young Chung and others},
  year={2025},
  eprint={2503.17126},
  archivePrefix={arXiv}
}

@misc{wildguard2024,
  title={WildGuard: Open One-Stop Moderation Tools for Safety Risks},
  author={Seungju Han and others},
  year={2024},
  eprint={2406.18495},
  archivePrefix={arXiv}
}
```

Developed as part of CMU 11-711 Advanced NLP, Fall 2025:
- Assignment 3: Baseline Reproduction
- Assignment 4: Full Research Project
This project was developed for academic purposes. The safety evaluation framework builds on Ai2's Safety-Eval, and the diversity-aware training builds on DiversityTuning.