
Arena RLHF: Complete RLHF Pipeline from Arena-Style Feedback

Transform Arena-style human feedback into production-ready RLHF training with minimal setup.

This repository provides a complete pipeline for taking preference data from Arena-style evaluations and training better models. It includes both reward model training and RLHF training phases, making it easy to go from raw Arena feedback to aligned models.

License: MIT | Python 3.8+

🎯 What This Repository Does

Arena RLHF provides a complete two-phase pipeline:

Phase 1: Reward Model Training (reward_model_training.py)

  1. Takes Arena preference data - Processes chosen/rejected pairs from Arena evaluations
  2. Trains reward models - Creates models that score response quality based on human feedback
  3. Efficient training - Uses LoRA for memory-efficient training on consumer hardware
  4. Ready for RLHF - Outputs reward models compatible with the RLHF training phase (a minimal training sketch follows below)
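
As a rough illustration of what Phase 1 involves, here is a minimal sketch, not the repository's actual reward_model_training.py. It assumes TRL's RewardTrainer with a LoRA adapter and a recent TRL release that accepts raw chosen/rejected text columns (older releases expect pre-tokenized columns and a tokenizer= argument):

# Minimal sketch of Phase 1 (illustrative, not the repository's script):
# train a pairwise reward model on chosen/rejected pairs with TRL + LoRA.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

base_model = "Qwen/Qwen2-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id  # some base models need this set explicitly

# Arena-style preference data with "chosen" / "rejected" columns
dataset = load_dataset("Anthropic/hh-rlhf", split="train[:1000]")

trainer = RewardTrainer(
    model=model,
    args=RewardConfig(output_dir="reward_model_output", per_device_train_batch_size=2),
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer=` in older TRL releases
    peft_config=LoraConfig(task_type="SEQ_CLS", r=16, lora_alpha=32, lora_dropout=0.05),
)
trainer.train()
trainer.save_model("reward_model_output/final_model")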

Phase 2: RLHF Training (rlhf.py)

  1. Uses trained reward models - Leverages Phase 1 outputs or pre-trained reward models
  2. GRPO training - Efficient RLHF training without separate critic models
  3. Produces aligned models - Final models optimized for human preferences
  4. Minimal setup - Simple YAML configuration for both phases (see the GRPO sketch after this list)
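
And a similarly minimal sketch of Phase 2, again illustrative rather than the repository's rlhf.py; it assumes TRL's GRPOTrainer, and the dataset and reward-model path are placeholders:

# Minimal sketch of Phase 2 (illustrative, not the repository's script):
# GRPO training against the Phase 1 reward model with TRL's GRPOTrainer.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GRPO only needs prompts; completions are sampled on the fly during training.
dataset = load_dataset("trl-lib/tldr", split="train[:1000]")  # any dataset with a "prompt" column

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs="reward_model_output/final_model",  # Phase 1 output, or any sequence-classification reward model
    args=GRPOConfig(
        output_dir="grpo_output",
        num_generations=2,          # responses sampled per prompt (the "group")
        max_completion_length=256,
        temperature=0.7,
        per_device_train_batch_size=2,
    ),
    train_dataset=dataset,
)
trainer.train()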

Perfect for:

  • Arena operators who want to train models from their collected feedback
  • Researchers studying human preferences and model alignment
  • Teams building custom models from human evaluations
  • Anyone with preference data wanting to do RLHF training

📁 Repository Structure

arena-rlhf/
├── reward_model/           # Phase 1: Reward Model Training
│   ├── reward_model_training.py
│   ├── reward_config.yaml
│   └── README.md
├── rlhf/                   # Phase 2: RLHF Training
│   ├── rlhf.py
│   ├── config.yaml
│   ├── config_fast.yaml
│   └── README.md
├── evaluate.py             # Model evaluation
├── requirements.txt        # Dependencies
└── README.md               # This file

🏟️ Arena-Style Feedback Support

This implementation works with various types of Arena-style feedback:

Supported Data Formats

  • Pairwise comparisons ("Response A is better than Response B")
  • Ranking data (Response 1 > Response 2 > Response 3)
  • Rating scores (1-5 star ratings, thumbs up/down)
  • Custom preference datasets (any format with preference signals; see the conversion sketch below)
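
Whatever the raw format, the training code ultimately needs chosen/rejected pairs. A small conversion sketch follows; the helper below is hypothetical, not a repository API:

from itertools import combinations

def ratings_to_pairs(prompt, responses, scores):
    """Hypothetical helper: turn per-response ratings or rankings into pairwise preferences."""
    pairs = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(zip(responses, scores), 2):
        if score_a == score_b:
            continue  # ties carry no preference signal
        chosen, rejected = (resp_a, resp_b) if score_a > score_b else (resp_b, resp_a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

# Three rated responses become three pairwise comparisons
pairs = ratings_to_pairs("Explain RLHF.", ["detailed answer", "short answer", "off-topic"], [5, 3, 1])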

Popular Arena Datasets

  • Chatbot Arena conversations and preferences
  • Anthropic HH-RLHF preference data (loaded in the example below)
  • OpenAssistant conversations
  • Custom Arena deployments
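
For example, the Anthropic HH-RLHF data already ships in chosen/rejected form and can be pulled straight from the Hugging Face Hub:

from datasets import load_dataset

hh = load_dataset("Anthropic/hh-rlhf", split="train")
print(hh.column_names)        # ['chosen', 'rejected']
print(hh[0]["chosen"][:200])  # first preferred conversation, truncated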

🚀 Quick Start: Complete Pipeline

1. Install Dependencies

git clone https://github.com/delta-hq/arena-rlhf.git
cd arena-rlhf
pip install -r requirements.txt

2. Option A: Full Pipeline (Recommended)

Step 1: Train Reward Model from Arena Data

cd reward_model
python reward_model_training.py --create-config
python reward_model_training.py --config reward_config.yaml

Step 2: RLHF Training with Trained Reward Model

cd ../rlhf
# The config already points to the reward model output
python rlhf.py --config config.yaml

2. Option B: Quick Start (Fast Testing)

For Reward Model Testing:

cd reward_model
# Fast config with minimal data for quick testing
python reward_model_training.py --config reward_config_fast.yaml

For RLHF Testing:

cd rlhf
# Skip reward model training, use built-in reward functions
python rlhf.py --config config_fast.yaml

3. Evaluate Your Trained Model

python evaluate.py --model_path ./rlhf/grpo_output/final_model --base_model Qwen/Qwen2-0.5B-Instruct

⚙️ How It Works: GRPO for Arena Feedback

GRPO (Group Relative Policy Optimization) is perfect for Arena-style feedback because:

  • Relative comparisons - Naturally handles "A vs B" preference data (see the sketch after this list)
  • Memory efficient - No separate critic model needed (unlike PPO)
  • Stable training - More robust than traditional RLHF approaches
  • Production ready - Used to train models like DeepSeek-R1
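
The core idea fits in a few lines: rewards for a group of responses to the same prompt are normalized against that group, so only relative quality matters. This is an illustrative sketch of the normalization, not the trainer's internals:

import statistics

def group_relative_advantages(rewards):
    """Normalize a group of rewards for one prompt to zero mean, unit scale."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against identical rewards
    return [(r - mean) / std for r in rewards]

# Four responses to one prompt, scored by a reward model
print(group_relative_advantages([0.2, 0.9, 0.4, 0.5]))  # the best response gets the largest advantage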

Complete Arena → RLHF Pipeline

Arena Preferences → Reward Model Training → GRPO Training → Aligned Model
     (Data)           (reward_model/)        (rlhf/)        (Output)
  1. Arena feedback collection - Gather human preferences on model outputs
  2. Reward model training - Train models to score responses based on preferences (reward_model/)
  3. GRPO training - Train model to maximize reward from human preferences (rlhf/)
  4. Evaluation - Test improved model against original

Two-Phase Approach Benefits:

  • Better reward signals - Learned rewards vs heuristic functions
  • Domain adaptation - Reward models trained on your specific Arena data
  • Reusable components - Trained reward models can be used across projects

📊 Configuration for Arena Data

Using Trained Reward Models (Recommended for Arena Data)

# config.yaml
model_name: "Qwen/Qwen2-0.5B-Instruct"
dataset:
  name: "Anthropic/hh-rlhf"  # Arena-style preference dataset
  split: "train"
  max_samples: 1000

reward:
  model: "OpenAssistant/reward-model-deberta-v3-large-v2"  # Trained on human preferences

# GRPO parameters optimized for Arena feedback
num_generations: 2
batch_size: 2
max_new_tokens: 256
temperature: 0.7
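
To sanity-check the reward signal outside of training, a model like the one configured above can be scored directly. This is a usage sketch following the reward model's question/answer pairing, not code from this repository:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
rm_tokenizer = AutoTokenizer.from_pretrained(rm_name)
rm = AutoModelForSequenceClassification.from_pretrained(rm_name)

question = "How do I learn Python?"
answer = "Start with the official tutorial and build small projects every day."
inputs = rm_tokenizer(question, answer, return_tensors="pt")
with torch.no_grad():
    score = rm(**inputs).logits[0].item()  # higher = more preferred
print(score)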

Using Custom Reward Functions

# config_fast.yaml - for quick testing without downloading reward models
reward:
  function: "balanced_length"  # Options: length, balanced_length, format, no_repetition

# Other built-in functions for specific Arena use cases:
# - format: Rewards structured thinking (good for reasoning tasks)
# - no_repetition: Penalizes repetitive responses
# - length: Simple length-based rewards
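
For reference, a rule-based reward like no_repetition could be as simple as the following sketch; the repository's built-in implementations may differ in detail:

def no_repetition_reward(completions, **kwargs):
    """Reward the fraction of unique words in each completion."""
    rewards = []
    for text in completions:
        words = text.split()
        rewards.append(len(set(words)) / len(words) if words else 0.0)
    return rewards

print(no_repetition_reward(["the cat sat on the mat", "one two three four"]))  # [0.833..., 1.0]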

🎯 Arena-Specific Features

Multiple Response Generation

Perfect for Arena-style evaluation where you need multiple responses per prompt:

num_generations: 4  # Generate 4 responses per prompt for comparison

Flexible Reward Systems

  • Learned rewards - Use models trained on human preference data
  • Rule-based rewards - Quick testing with heuristic functions
  • Custom rewards - Easy to add your own reward functions

Evaluation Against Base Models

See how much your Arena-trained model improved:

python evaluate.py --model_path ./trained_model --base_model original_model --num_samples 50

📈 Example Results

Training on Arena-style feedback typically shows:

  • Higher reward scores - Models learn to generate preferred responses
  • Better human ratings - Improved alignment with human preferences
  • Reduced harmful outputs - Better safety through preference learning
  • Domain adaptation - Models adapt to specific Arena feedback patterns

🔧 Advanced Usage

Custom Arena Datasets

from datasets import load_dataset

# Load your custom Arena preference data
dataset = load_dataset("your-org/arena-preferences")
# The trainer automatically handles preference pair formatting

Multi-turn Arena Conversations

dataset:
  name: "your-arena-data"
  conversation_format: true  # Handle multi-turn conversations
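
Under the hood, multi-turn data is typically flattened into a single prompt with the tokenizer's chat template before training. The following is a sketch of that step, an assumption about the preprocessing rather than this repository's exact code:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
conversation = [
    {"role": "user", "content": "What is RLHF?"},
    {"role": "assistant", "content": "Reinforcement learning from human feedback."},
    {"role": "user", "content": "Why use Arena-style comparisons for it?"},
]
prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
print(prompt)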

Custom Reward Functions for Arena Data

def arena_safety_reward(completions, **kwargs):
    """Custom reward function for Arena safety preferences."""
    # Placeholder logic: replace with your own safety scoring
    rewards = [0.0 if "unsafe" in completion.lower() else 1.0 for completion in completions]
    return rewards
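
A callable with this signature can typically be plugged into GRPO-style training directly; TRL's GRPOTrainer, for instance, accepts such callables through its reward_funcs argument. Exactly how this repository registers custom functions is configured through the YAML files, so treat the wiring shown here as an assumption and check the rlhf/ README.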

📚 Background: Why GRPO for Arena Data?

Traditional RLHF with PPO requires:

  • Separate critic model (more memory)
  • Absolute reward scores
  • Complex hyperparameter tuning

GRPO is better for Arena feedback because:

  • Works directly with relative preferences
  • More memory efficient
  • Simpler to tune and more stable
  • Proven in production (DeepSeek-R1, etc.)

🤝 Contributing

We welcome contributions, especially:

  • Support for new Arena dataset formats
  • Additional reward functions for specific domains
  • Evaluation metrics for Arena-trained models
  • Documentation and examples

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

  • Arena community - For pioneering human preference collection
  • TRL team - For excellent RLHF tooling
  • Hugging Face - For model hosting and datasets
  • GRPO researchers - For the efficient training algorithm

📞 Support & Community


Ready to turn your Arena feedback into better models? Start with python rlhf.py --config config.yaml
