Transform Arena-style human feedback into production-ready RLHF training with minimal setup.
This repository provides a complete pipeline for taking preference data from Arena-style evaluations and training better models. It includes both reward model training and RLHF training phases, making it easy to go from raw Arena feedback to aligned models.
Arena RLHF provides a complete two-phase pipeline:

Phase 1 - Reward model training:
- Takes Arena preference data - Processes chosen/rejected pairs from Arena evaluations
- Trains reward models - Creates models that score response quality based on human feedback
- Efficient training - Uses LoRA for memory-efficient training on consumer hardware
- Ready for RLHF - Outputs reward models compatible with the RLHF training phase

Phase 2 - RLHF training:
- Uses trained reward models - Leverages Phase 1 outputs or pre-trained reward models
- GRPO training - Efficient RLHF training without a separate critic model
- Produces aligned models - Final models optimized for human preferences
- Minimal setup - Simple YAML configuration for both phases
Who this is for:
- Arena operators who want to train models from their collected feedback
- Researchers studying human preferences and model alignment
- Teams building custom models from human evaluations
- Anyone with preference data wanting to do RLHF training
arena-rlhf/
├── reward_model/          # Phase 1: Reward Model Training
│   ├── reward_model_training.py
│   ├── reward_config.yaml
│   └── README.md
├── rlhf/                  # Phase 2: RLHF Training
│   ├── rlhf.py
│   ├── config.yaml
│   ├── config_fast.yaml
│   └── README.md
├── evaluate.py            # Model evaluation
├── requirements.txt       # Dependencies
└── README.md              # This file
This implementation works with various types of Arena-style feedback:
- Pairwise comparisons ("Response A is better than Response B")
- Ranking data (Response 1 > Response 2 > Response 3)
- Rating scores (1-5 star ratings, thumbs up/down)
- Custom preference datasets (any format with preference signals)
Compatible data sources include:
- Chatbot Arena conversations and preferences
- Anthropic HH-RLHF preference data
- OpenAssistant conversations
- Custom Arena deployments
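For reference, this is the rough shape of a single pairwise preference record once it has been converted for training. The field names follow the chosen/rejected convention that TRL-style preference trainers expect; the example values are made up.

```python
# Illustrative pairwise preference record (made-up content); "chosen"/"rejected"
# follow the convention preference trainers such as TRL's RewardTrainer expect.
preference_example = {
    "prompt": "Explain what a reward model does.",
    "chosen": "A reward model assigns a score to a response based on how well it matches human preferences.",
    "rejected": "It does reward stuff.",
}
```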
git clone https://github.com/delta-hq/arena-rlhf.git
cd arena-rlhf
pip install -r requirements.txt
Step 1: Train Reward Model from Arena Data
cd reward_model
python reward_model_training.py --create-config
python reward_model_training.py --config reward_config.yaml
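Under the hood, Phase 1 amounts to fitting a sequence-classification head on preference pairs. Below is a minimal sketch of that idea using TRL's RewardTrainer; the model choice, dataset slice, output directory, and hyperparameters are illustrative rather than the repo's defaults, and the tokenizer argument name (processing_class) assumes a recent TRL release.

```python
# Minimal sketch of reward model training with TRL's RewardTrainer.
# All names and numbers below are illustrative, not the repo's defaults.
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

model_name = "Qwen/Qwen2-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id  # required for batched scoring

# Arena-style preference pairs: each row carries a "chosen" and a "rejected" text.
dataset = load_dataset("Anthropic/hh-rlhf", split="train[:1000]")

training_args = RewardConfig(output_dir="reward_model_output", per_device_train_batch_size=2)
trainer = RewardTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,  # named `tokenizer` in older TRL versions
    train_dataset=dataset,
)
trainer.train()
```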
Step 2: RLHF Training with Trained Reward Model
cd ../rlhf
# The config already points to the reward model output
python rlhf.py --config config.yaml
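Conceptually, Phase 2 looks like the sketch below, which uses TRL's GRPOTrainer with the Phase 1 reward model as the scoring function. The prompt dataset, output path, and reward model path are placeholders for whatever your config points at, not the repo's defaults.

```python
# Conceptual sketch of GRPO training against a learned reward model.
# Dataset, paths, and hyperparameters are placeholders, not the repo's defaults.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GRPOTrainer expects a dataset with a "prompt" column to sample completions from.
dataset = load_dataset("trl-lib/tldr", split="train[:1000]")

training_args = GRPOConfig(
    output_dir="grpo_output",
    num_generations=2,            # completions sampled per prompt for the group comparison
    max_completion_length=256,
    per_device_train_batch_size=2,
)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs="reward_model_output",  # placeholder path/id of the Phase 1 reward model
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```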
For Reward Model Testing:
cd reward_model
# Fast config with minimal data for quick testing
python reward_model_training.py --config reward_config_fast.yaml
For RLHF Testing:
cd rlhf
# Skip reward model training, use built-in reward functions
python rlhf.py --config config_fast.yaml
python evaluate.py --model_path ./rlhf/grpo_output/final_model --base_model Qwen/Qwen2-0.5B-Instruct
GRPO (Group Relative Policy Optimization) is well suited to Arena-style feedback (see the sketch after this list) because:
- Relative comparisons - Naturally handles "A vs B" preference data
- Memory efficient - No separate critic model needed (unlike PPO)
- Stable training - More robust than traditional RLHF approaches
- Production ready - Used to train models like DeepSeek-R1
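The core idea fits in a few lines: instead of a learned value baseline, GRPO normalizes each sampled response's reward against the other responses for the same prompt. A rough sketch (function name and epsilon are illustrative):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (num_prompts, num_generations) scores for each sampled response."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-4)  # the baseline comes from the group, not a critic

# Two prompts, two sampled responses each: higher-scoring responses get positive advantages.
print(group_relative_advantages(torch.tensor([[0.2, 0.9], [0.5, 0.4]])))
```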
Arena Preferences → Reward Model Training → GRPO Training → Aligned Model
      (Data)           (reward_model/)         (rlhf/)         (Output)
- Arena feedback collection - Gather human preferences on model outputs
- Reward model training - Train models to score responses based on preferences (reward_model/)
- GRPO training - Train model to maximize reward from human preferences (rlhf/)
- Evaluation - Test improved model against original
Two-Phase Approach Benefits:
- Better reward signals - Learned rewards vs heuristic functions
- Domain adaptation - Reward models trained on your specific Arena data
- Reusable components - Trained reward models can be used across projects
# config.yaml
model_name: "Qwen/Qwen2-0.5B-Instruct"

dataset:
  name: "Anthropic/hh-rlhf"   # Arena-style preference dataset
  split: "train"
  max_samples: 1000

reward:
  model: "OpenAssistant/reward-model-deberta-v3-large-v2"   # Trained on human preferences

# GRPO parameters optimized for Arena feedback
num_generations: 2
batch_size: 2
max_new_tokens: 256
temperature: 0.7
# config_fast.yaml - for quick testing without downloading reward models
reward:
  function: "balanced_length"   # Options: length, balanced_length, format, no_repetition

# Other built-in functions for specific Arena use cases:
# - format: Rewards structured thinking (good for reasoning tasks)
# - no_repetition: Penalizes repetitive responses
# - length: Simple length-based rewards
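To see what a rule-based reward boils down to, here is a hypothetical sketch in the same callable form as the custom-reward example further down (completions plus **kwargs, returning one score per completion). The thresholds are made up, and the repo's built-in balanced_length may use different logic.

```python
def balanced_length_reward(completions, **kwargs):
    """Reward completions that are neither too short nor excessively long (plain-text completions assumed)."""
    target_chars, tolerance = 400, 300  # made-up numbers, not the repo's defaults
    rewards = []
    for completion in completions:
        distance = abs(len(completion) - target_chars)
        rewards.append(max(0.0, 1.0 - distance / tolerance))
    return rewards
```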
Perfect for Arena-style evaluation where you need multiple responses per prompt:
num_generations: 4 # Generate 4 responses per prompt for comparison
- Learned rewards - Use models trained on human preference data
- Rule-based rewards - Quick testing with heuristic functions
- Custom rewards - Easy to add your own reward functions
See how much your Arena-trained model improved:
python evaluate.py --model_path ./trained_model --base_model original_model --num_samples 50
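If you prefer to eyeball outputs directly, a quick side-by-side check can be done with two generation pipelines. The prompt below is made up, and evaluate.py's actual prompts and metrics may differ.

```python
# Quick manual side-by-side comparison of the base and RLHF-trained models.
from transformers import pipeline

prompt = "Explain RLHF in one paragraph."
base = pipeline("text-generation", model="Qwen/Qwen2-0.5B-Instruct")
tuned = pipeline("text-generation", model="./rlhf/grpo_output/final_model")

for name, generator in [("base", base), ("tuned", tuned)]:
    text = generator(prompt, max_new_tokens=128, do_sample=False)[0]["generated_text"]
    print(f"--- {name} ---\n{text}\n")
```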
Training on Arena-style feedback typically shows:
- Higher reward scores - Models learn to generate preferred responses
- Better human ratings - Improved alignment with human preferences
- Reduced harmful outputs - Better safety through preference learning
- Domain adaptation - Models adapt to specific Arena feedback patterns
# Load your custom Arena preference data
from datasets import load_dataset

dataset = load_dataset("your-org/arena-preferences")
# The trainer automatically handles preference pair formatting
dataset:
  name: "your-arena-data"
  conversation_format: true   # Handle multi-turn conversations
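For multi-turn data, each preference side is typically a list of role/content messages rather than a single string; the record below is illustrative (made-up content) and follows the conversational format TRL understands.

```python
# Illustrative conversational preference record (made-up content).
conversational_pair = {
    "chosen": [
        {"role": "user", "content": "How do I collect Arena feedback?"},
        {"role": "assistant", "content": "Show two responses per prompt and log which one the rater prefers."},
    ],
    "rejected": [
        {"role": "user", "content": "How do I collect Arena feedback?"},
        {"role": "assistant", "content": "Just guess."},
    ],
}
```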
def arena_safety_reward(completions, **kwargs):
    """Custom reward function for Arena safety preferences"""
    # Your custom logic here, e.g. penalize unsafe completions
    rewards = [0.0 for _ in completions]  # placeholder: one score per completion
    return rewards
Traditional RLHF with PPO requires:
- Separate critic model (more memory)
- Absolute reward scores
- Complex hyperparameter tuning
GRPO is better for Arena feedback because:
- Works directly with relative preferences
- More memory efficient
- Simpler to tune and more stable
- Proven in production (DeepSeek-R1, etc.)
We welcome contributions, especially:
- Support for new Arena dataset formats
- Additional reward functions for specific domains
- Evaluation metrics for Arena-trained models
- Documentation and examples
MIT License - see LICENSE file for details.
- Arena community - For pioneering human preference collection
- TRL team - For excellent RLHF tooling
- Hugging Face - For model hosting and datasets
- GRPO researchers - For the efficient training algorithm
- Report issues
- Feature requests
- Discussions
- Questions about Arena integration
Ready to turn your Arena feedback into better models? Start with python rlhf.py --config config.yaml