Creativity Killed The Guardrails

A research project exploring the intersection of AI safety and creative writing through DPO-based training methods.

Team: Shikha, Sidhaarth, and Yash, for CMU's Advanced NLP course (11-711, Fall 2025)

Research Overview

Large language models are often fine-tuned with Reinforcement Learning (RL) to enhance creative diversity. However, these post-training methods rarely examine their effect on safety. When an RL objective rewards unusual or rare responses, it may also weaken the guardrails learned during supervised instruction tuning.

We study this relationship between diversity and safety and find that prior methods—which optimize diversity or alignment in isolation—are insufficient. Models that maximize diversity experience substantial degradation in safety benchmarks, while safer models exhibit significantly reduced semantic diversity.

Our Approach: SafeDORPO

To resolve this trade-off, we propose SafeDORPO (Safe Diversity-Oriented Reinforcement Learning Post-Training), a rubric-guided, safety-gated modification of DORPO that combines:

  1. Rubric generation: instance-specific rubrics enable fine-grained safety evaluation
  2. A hard safety gate: unsafe completions are removed entirely, so they contribute nothing to the gradient
  3. Joint diversity-safety weighting: applied directly within the ORPO objective
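The gate-then-weight idea above can be sketched in a few lines. This is a minimal illustration, not the repo's implementation: the field names (`safety`, `diversity`), the threshold, and the product form of the joint weight are all assumptions for the sake of the example.

```python
def safedorpo_weights(completions, safety_threshold=0.5):
    """Sketch of SafeDORPO-style gating and weighting (hypothetical fields).

    Each completion dict carries a rubric-based safety score in [0, 1]
    and a diversity score in [0, 1]. Completions below the safety
    threshold get weight 0 (hard gate), so they cannot contribute to
    the gradient; survivors are weighted by a joint diversity-safety
    score, then normalized.
    """
    weights = []
    for c in completions:
        if c["safety"] < safety_threshold:
            weights.append(0.0)                       # hard safety gate
        else:
            weights.append(c["diversity"] * c["safety"])  # joint weighting
    total = sum(weights)
    return [w / total if total > 0 else 0.0 for w in weights]
```

In a real training loop these weights would scale the per-completion ORPO loss terms; here they are only normalized scores.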

Key Results

Across Gemma-2B models, SafeDORPO:

  • Recovers ~7 safety points compared to diversity-optimized baseline
  • Achieves ~22 points higher semantic diversity than supervised fine-tuning
  • Demonstrates that safe diversity is achievable when RL exploration is explicitly constrained by rubric-based safety criteria

Repository Structure

Creativity-Killed-The-Guardrails/
├── src/                          # Core source code
│   ├── classifier_models/        # Safety classifiers (WildGuard, etc.)
│   ├── data_utils.py             # Data processing utilities
│   ├── generation_utils.py       # Generation utilities
│   └── templates/                # Model input templates
├── evaluation/                   # Safety evaluation suite
│   ├── tasks/                    # Evaluation benchmarks
│   │   ├── generation/           # Generative LM evaluation
│   │   └── classification/       # Classifier evaluation
│   ├── eval.py                   # Main evaluation script
│   └── run_all_*.py              # Batch evaluation runners
├── dorpo-diversity/              # Diversity-aware training (submodule)
│   ├── scripts_dpo/              # Standard DPO training
│   ├── scripts_orpo/             # ORPO/DORPO training
│   ├── scripts_eval/             # Creative writing evaluation
│   └── diversity/                # Diversity metrics
├── results/                      # Evaluation results & plots
├── scripts/                      # Utility scripts
│   └── generate_plots.py         # Visualization
├── docs/                         # Documentation
│   ├── A3-project-proposal.pdf   # Initial proposal
│   ├── A3-report.pdf             # Baseline reproduction report
│   ├── A4-report.pdf             # Full project report
│   ├── anlp-poster.pdf           # Conference poster
│   └── reference-papers/         # Key papers
└── README.md

Quick Start

Installation

conda create -n safety-eval python=3.11 && conda activate safety-eval
pip install -e .
pip install -r requirements.txt
pip install vllm==0.9.0.1

Safety Evaluation

# Run all generation benchmarks
python evaluation/run_all_generation_benchmarks.py \
    --model_name_or_path allenai/tulu-2-dpo-7b \
    --model_input_template_path_or_name tulu2 \
    --report_output_path ./results/metrics.json

# Run specific benchmarks
python evaluation/eval.py generators \
    --use_vllm \
    --model_name_or_path allenai/tulu-2-dpo-7b \
    --tasks wildguardtest,harmbench,xstest

Diversity-Aware Training

cd dorpo-diversity

# Standard DPO
bash cm_dpo.sh

# Diversity DPO (DDPO)
bash cm_ddpo.sh

# Evaluate creative writing
python -m scripts_eval.generation_eval1_1

Key Findings

Method         Diversity   Safety        Description
gemma-2b-it    High        Low           Base instruction-tuned model: creative but unsafe
gemma-2b-sft   Medium      Medium        Supervised fine-tuned
DORPO          Very High   Low           Diversity-optimized RL: most creative, weakest guardrails
SafeDORPO      High        Medium-High   Our method: recovers safety while preserving creativity

SafeDORPO shifts the diversity-safety frontier upward, achieving higher diversity while recovering safety lost by DORPO.

Components

Safety Evaluation Suite (Ai2 Safety-Eval)

Comprehensive benchmarks including:

  • Safety: WildGuardTest, HarmBench, XSTest, ToxiGen, StrongReject
  • Capabilities: AlpacaEval, MT-Bench, GSM8K, MMLU

Diversity Training (DORPO-Diversity)

Modified DPO training that incorporates:

  • Semantic diversity scores
  • Style diversity metrics
  • Deviation-aware preference optimization
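One common way to turn a set of completions into a semantic diversity score is the mean pairwise cosine distance between their embeddings. The sketch below illustrates that idea only; it is not necessarily the exact metric used in the diversity training code, and the embeddings are assumed to come from an external sentence encoder.

```python
import math

def mean_pairwise_cosine_distance(embeddings):
    """Semantic diversity as mean pairwise cosine distance.

    `embeddings` is a list of equal-length vectors (e.g. sentence
    embeddings of sampled completions). Higher values mean the
    completions are more semantically spread out.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    n = len(embeddings)
    dists = [1.0 - cosine(embeddings[i], embeddings[j])
             for i in range(n) for j in range(i + 1, n)]
    return sum(dists) / len(dists)
```

Identical embeddings give a score of 0; orthogonal embeddings give 1, so the score grows as the sampled completions diverge.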

References

@misc{chung2025modifyingllm,
  title={Modifying Large Language Model Post-Training for Diverse Creative Writing}, 
  author={John Joon Young Chung and others},
  year={2025},
  eprint={2503.17126},
  archivePrefix={arXiv}
}

@misc{wildguard2024,
  title={WildGuard: Open One-Stop Moderation Tools for Safety Risks}, 
  author={Seungju Han and others},
  year={2024},
  eprint={2406.18495},
  archivePrefix={arXiv}
}

Course Information

Developed as part of CMU 11-711 Advanced NLP Fall 2025

  • Assignment 3: Baseline Reproduction
  • Assignment 4: Full Research Project

License

This project was developed for academic purposes. The safety evaluation framework is based on Ai2's Safety-Eval, and the diversity training code on DiversityTuning.
