Transform Arena-style human feedback into production-ready RLHF training with minimal setup.
This repository provides a complete pipeline for taking preference data from Arena-style evaluations and training better models. It includes both reward model training and RLHF training phases, making it easy to go from raw Arena feedback to aligned models.
Arena RLHF provides a complete two-phase pipeline:

Phase 1 - Reward model training:
- Takes Arena preference data - Processes chosen/rejected pairs from Arena evaluations
- Trains reward models - Creates models that score response quality based on human feedback
- Efficient training - Uses LoRA for memory-efficient training on consumer hardware
- Ready for RLHF - Outputs reward models compatible with the RLHF training phase

Phase 2 - RLHF training:
- Uses trained reward models - Leverages Phase 1 outputs or pre-trained reward models
- GRPO training - Efficient RLHF training without a separate critic model
- Produces aligned models - Final models optimized for human preferences
- Minimal setup - Simple YAML configuration for both phases
Who this is for:
- Arena operators who want to train models from their collected feedback
- Researchers studying human preferences and model alignment
- Teams building custom models from human evaluations
- Anyone with preference data wanting to do RLHF training
arena-rlhf/
├── reward_model/          # Phase 1: Reward Model Training
│   ├── reward_model_training.py
│   ├── reward_config.yaml
│   └── README.md
├── rlhf/                  # Phase 2: RLHF Training
│   ├── rlhf.py
│   ├── config.yaml
│   ├── config_fast.yaml
│   └── README.md
├── evaluate.py            # Model evaluation
├── requirements.txt       # Dependencies
└── README.md              # This file
This implementation works with various types of Arena-style feedback:
- Pairwise comparisons ("Response A is better than Response B")
- Ranking data (Response 1 > Response 2 > Response 3)
- Rating scores (1-5 star ratings, thumbs up/down)
- Custom preference datasets (any format with preference signals)
Compatible data sources include:
- Chatbot Arena conversations and preferences
- Anthropic HH-RLHF preference data
- OpenAssistant conversations
- Custom Arena deployments
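For reference, this is the rough shape of a single pairwise preference record once it has been converted for training. The field names follow the chosen/rejected convention that TRL-style preference trainers expect; the example values are made up.

```python
# Illustrative pairwise preference record (made-up content); "chosen"/"rejected"
# follow the convention preference trainers such as TRL's RewardTrainer expect.
preference_example = {
    "prompt": "Explain what a reward model does.",
    "chosen": "A reward model assigns a score to a response based on how well it matches human preferences.",
    "rejected": "It does reward stuff.",
}
```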
git clone https://github.com/delta-hq/arena-rlhf.git
cd arena-rlhf
pip install -r requirements.txt
Step 1: Train Reward Model from Arena Data
cd reward_model
python reward_model_training.py --create-config
python reward_model_training.py --config reward_config.yaml
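Under the hood, Phase 1 amounts to fitting a sequence-classification head on preference pairs. Below is a minimal sketch of that idea using TRL's RewardTrainer; the model choice, dataset slice, output directory, and hyperparameters are illustrative rather than the repo's defaults, and the tokenizer argument name (processing_class) assumes a recent TRL release.

```python
# Minimal sketch of reward model training with TRL's RewardTrainer.
# All names and numbers below are illustrative, not the repo's defaults.
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

model_name = "Qwen/Qwen2-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id  # required for batched scoring

# Arena-style preference pairs: each row carries a "chosen" and a "rejected" text.
dataset = load_dataset("Anthropic/hh-rlhf", split="train[:1000]")

training_args = RewardConfig(output_dir="reward_model_output", per_device_train_batch_size=2)
trainer = RewardTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,  # named `tokenizer` in older TRL versions
    train_dataset=dataset,
)
trainer.train()
```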
Step 2: RLHF Training with Trained Reward Model
cd ../rlhf
# The config already points to the reward model output
python rlhf.py --config config.yaml
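Conceptually, Phase 2 looks like the sketch below, which uses TRL's GRPOTrainer with the Phase 1 reward model as the scoring function. The prompt dataset, output path, and reward model path are placeholders for whatever your config points at, not the repo's defaults.

```python
# Conceptual sketch of GRPO training against a learned reward model.
# Dataset, paths, and hyperparameters are placeholders, not the repo's defaults.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GRPOTrainer expects a dataset with a "prompt" column to sample completions from.
dataset = load_dataset("trl-lib/tldr", split="train[:1000]")

training_args = GRPOConfig(
    output_dir="grpo_output",
    num_generations=2,            # completions sampled per prompt for the group comparison
    max_completion_length=256,
    per_device_train_batch_size=2,
)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs="reward_model_output",  # placeholder path/id of the Phase 1 reward model
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```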
For Reward Model Testing:
cd reward_model
# Fast config with minimal data for quick testing
python reward_model_training.py --config reward_config_fast.yaml
For RLHF Testing:
cd rlhf
# Skip reward model training, use built-in reward functions
python rlhf.py --config config_fast.yaml
python evaluate.py --model_path ./rlhf/grpo_output/final_model --base_model Qwen/Qwen2-0.5B-Instruct
GRPO (Group Relative Policy Optimization) is well suited to Arena-style feedback (see the sketch after this list) because:
- Relative comparisons - Naturally handles "A vs B" preference data
- Memory efficient - No separate critic model needed (unlike PPO)
- Stable training - More robust than traditional RLHF approaches
- Production ready - Used to train models like DeepSeek-R1
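The core idea fits in a few lines: instead of a learned value baseline, GRPO normalizes each sampled response's reward against the other responses for the same prompt. A rough sketch (function name and epsilon are illustrative):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (num_prompts, num_generations) scores for each sampled response."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-4)  # the baseline comes from the group, not a critic

# Two prompts, two sampled responses each: higher-scoring responses get positive advantages.
print(group_relative_advantages(torch.tensor([[0.2, 0.9], [0.5, 0.4]])))
```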
Arena Preferences → Reward Model Training → GRPO Training → Aligned Model
      (Data)           (reward_model/)         (rlhf/)         (Output)
- Arena feedback collection - Gather human preferences on model outputs
- Reward model training - Train models to score responses based on preferences (reward_model/)
- GRPO training - Train model to maximize reward from human preferences (rlhf/)
- Evaluation - Test improved model against original
Two-Phase Approach Benefits:
- Better reward signals - Learned rewards vs heuristic functions
- Domain adaptation - Reward models trained on your specific Arena data
- Reusable components - Trained reward models can be used across projects
# config.yaml
model_name: "Qwen/Qwen2-0.5B-Instruct"

dataset:
  name: "Anthropic/hh-rlhf"   # Arena-style preference dataset
  split: "train"
  max_samples: 1000

reward:
  model: "OpenAssistant/reward-model-deberta-v3-large-v2"   # Trained on human preferences

# GRPO parameters optimized for Arena feedback
num_generations: 2
batch_size: 2
max_new_tokens: 256
temperature: 0.7
# config_fast.yaml - for quick testing without downloading reward models
reward:
  function: "balanced_length"   # Options: length, balanced_length, format, no_repetition

# Other built-in functions for specific Arena use cases:
# - format: Rewards structured thinking (good for reasoning tasks)
# - no_repetition: Penalizes repetitive responses
# - length: Simple length-based rewards
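To see what a rule-based reward boils down to, here is a hypothetical sketch in the same callable form as the custom-reward example further down (completions plus **kwargs, returning one score per completion). The thresholds are made up, and the repo's built-in balanced_length may use different logic.

```python
def balanced_length_reward(completions, **kwargs):
    """Reward completions that are neither too short nor excessively long (plain-text completions assumed)."""
    target_chars, tolerance = 400, 300  # made-up numbers, not the repo's defaults
    rewards = []
    for completion in completions:
        distance = abs(len(completion) - target_chars)
        rewards.append(max(0.0, 1.0 - distance / tolerance))
    return rewards
```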
Perfect for Arena-style evaluation where you need multiple responses per prompt:
num_generations: 4 # Generate 4 responses per prompt for comparison
- Learned rewards - Use models trained on human preference data
- Rule-based rewards - Quick testing with heuristic functions
- Custom rewards - Easy to add your own reward functions
See how much your Arena-trained model improved:
python evaluate.py --model_path ./trained_model --base_model original_model --num_samples 50
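If you prefer to eyeball outputs directly, a quick side-by-side check can be done with two generation pipelines. The prompt below is made up, and evaluate.py's actual prompts and metrics may differ.

```python
# Quick manual side-by-side comparison of the base and RLHF-trained models.
from transformers import pipeline

prompt = "Explain RLHF in one paragraph."
base = pipeline("text-generation", model="Qwen/Qwen2-0.5B-Instruct")
tuned = pipeline("text-generation", model="./rlhf/grpo_output/final_model")

for name, generator in [("base", base), ("tuned", tuned)]:
    text = generator(prompt, max_new_tokens=128, do_sample=False)[0]["generated_text"]
    print(f"--- {name} ---\n{text}\n")
```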
Training on Arena-style feedback typically shows:
- Higher reward scores - Models learn to generate preferred responses
- Better human ratings - Improved alignment with human preferences
- Reduced harmful outputs - Better safety through preference learning
- Domain adaptation - Models adapt to specific Arena feedback patterns
# Load your custom Arena preference data
from datasets import load_dataset

dataset = load_dataset("your-org/arena-preferences")
# The trainer automatically handles preference pair formatting
dataset:
  name: "your-arena-data"
  conversation_format: true   # Handle multi-turn conversations
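For multi-turn data, each preference side is typically a list of role/content messages rather than a single string; the record below is illustrative (made-up content) and follows the conversational format TRL understands.

```python
# Illustrative conversational preference record (made-up content).
conversational_pair = {
    "chosen": [
        {"role": "user", "content": "How do I collect Arena feedback?"},
        {"role": "assistant", "content": "Show two responses per prompt and log which one the rater prefers."},
    ],
    "rejected": [
        {"role": "user", "content": "How do I collect Arena feedback?"},
        {"role": "assistant", "content": "Just guess."},
    ],
}
```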
def arena_safety_reward(completions, **kwargs):
    """Custom reward function for Arena safety preferences"""
    # Your custom logic here, e.g. penalize unsafe completions
    rewards = [0.0 for _ in completions]  # placeholder: one score per completion
    return rewards
Traditional RLHF with PPO requires:
- Separate critic model (more memory)
- Absolute reward scores
- Complex hyperparameter tuning
GRPO is better for Arena feedback because:
- Works directly with relative preferences
- More memory efficient
- Simpler to tune and more stable
- Proven in production (DeepSeek-R1, etc.)
We welcome contributions, especially:
- Support for new Arena dataset formats
- Additional reward functions for specific domains
- Evaluation metrics for Arena-trained models
- Documentation and examples
MIT License - see LICENSE file for details.
- Arena community - For pioneering human preference collection
- TRL team - For excellent RLHF tooling
- Hugging Face - For model hosting and datasets
- GRPO researchers - For the efficient training algorithm
- Report issues
- Feature requests
- Discussions
- Questions about Arena integration
Ready to turn your Arena feedback into better models? Start with python rlhf.py --config config.yaml