A package for training language models to generate high-quality text using Reinforcement Learning from Human Feedback (RLHF) with GPT-4o as a reward model.
This package provides tools for fine-tuning language models to generate text that aligns with human preferences using a form of Reinforcement Learning from Human Feedback (RLHF). It uses GPT-4o as a reward model to evaluate generated text and guide the training process.
The implementation uses the TRL (Transformer Reinforcement Learning) library and the PPO (Proximal Policy Optimization) algorithm for policy optimization.
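As an illustration of the reward side, here is a minimal sketch of querying GPT-4o for a scalar score with the openai Python client; the prompt wording, response parsing, error handling, and the function name are illustrative and will differ from the repo's reward_model.py.

```python
# Illustrative sketch only: the packaged reward model may use a different
# prompt, parsing strategy, and retry/error handling.
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")  # placeholder key

def gpt4o_reward(text: str) -> float:
    """Ask GPT-4o to rate a generated completion and return a scalar reward."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Rate the following text from 0 to 10 for quality. "
                        "Reply with a single number."},
            {"role": "user", "content": text},
        ],
        temperature=0.0,
    )
    try:
        return float(response.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # neutral fallback if the reply is not a bare number
```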
Note about Cross-Platform Support: This project supports multiple hardware platforms (a device-selection sketch follows this list):
- NVIDIA GPUs: Automatically detected and used via CUDA with optional optimizations for H100/A100
- Apple Silicon: GPU acceleration via Metal Performance Shaders (MPS)
- CPU: Fallback option for all platforms
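The bullets above imply a simple fallback order; a minimal device-selection sketch (the package's actual detection code may differ):

```python
# Minimal sketch of the CUDA -> MPS -> CPU fallback order described above.
from typing import Optional

import torch

def pick_device(requested: Optional[str] = None) -> torch.device:
    """Prefer CUDA, then Apple Silicon MPS, then CPU, unless a device is requested."""
    if requested is not None:
        return torch.device(requested)
    if torch.cuda.is_available():
        return torch.device("cuda")
    # MPS is only present on Apple Silicon with a recent PyTorch build
    if getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```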
Requirements:
- Python 3.7+
- PyTorch 1.10+
- Transformers 4.25+
- TRL 0.17.0+
- OpenAI API key (for GPT-4o reward model)
# Clone the repository
git clone https://github.com/SamFold/RL-Examples.git
cd RL-Examples
# Install dependencies
pip install -r requirements.txt
For NVIDIA GPUs (H100/A100):
# Edit H100_RUN_SCRIPT.sh to add your OpenAI API key
chmod +x H100_RUN_SCRIPT.sh # Make executable
./H100_RUN_SCRIPT.sh
For Apple Silicon (M1/M2/M3):
# Edit MPS_RUN_SCRIPT.sh to add your OpenAI API key
chmod +x MPS_RUN_SCRIPT.sh # Make executable
./MPS_RUN_SCRIPT.sh
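Both scripts presumably just export the API key and invoke main.py; a hypothetical H100-style script built only from the flags documented below might look like this (the shipped H100_RUN_SCRIPT.sh may set different options):

```bash
#!/usr/bin/env bash
# Hypothetical run script contents; the actual H100_RUN_SCRIPT.sh may differ.
export OPENAI_API_KEY="sk-..."   # add your OpenAI API key here

python main.py \
  --model_name lvwerra/gpt2-imdb \
  --device cuda \
  --optimize_device \
  --mixed_precision \
  --precision_dtype bfloat16 \
  --parallel_reward \
  --batch_size 32 \
  --openai_api_key "$OPENAI_API_KEY"
```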
To train a model using GPT-4o as a reward model:
# Basic training command
python main.py --model_name lvwerra/gpt2-imdb --openai_api_key YOUR_API_KEY
# With device-specific optimizations enabled (for H100/A100 or Apple Silicon)
python main.py --model_name lvwerra/gpt2-imdb --optimize_device --openai_api_key YOUR_API_KEY
To run inference with a trained model:
python main.py --inference --model_path path/to/trained/model --openai_api_key YOUR_API_KEY
Command-line options (a condensed argparse sketch follows this list):
- --model_name: Model to train (default: "lvwerra/gpt2-imdb")
- --max_epochs: Maximum training epochs (default: 200)
- --batch_size: Global batch size for all operations (default: 32)
- --save_freq: Checkpoint saving frequency (default: 50)
- --openai_api_key: OpenAI API key for GPT-4o reward model (required)
- --parallel_reward: Use parallel reward model for faster evaluation
- --output_dir: Directory for outputs (default: "trainer_output")
- --device: Training device ("cuda", "mps", or "cpu")
- --optimize_device: Enable hardware-specific optimizations for CUDA/MPS
- --mixed_precision: Enable mixed precision training for faster performance
- --precision_dtype: Mixed precision type to use ("bfloat16" or "float16")
- --no_exploration: Disable exploration (entropy bonus)
- --inference: Run in inference mode
- --model_path: Path to trained model for inference
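A condensed argparse sketch mirroring the options above (defaults taken from this list; the actual declarations in main.py may differ):

```python
# Condensed sketch of the documented command-line interface; main.py may
# declare these options differently.
import argparse

parser = argparse.ArgumentParser(description="RLHF training with a GPT-4o reward model")
parser.add_argument("--model_name", default="lvwerra/gpt2-imdb")
parser.add_argument("--max_epochs", type=int, default=200)
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--save_freq", type=int, default=50)
parser.add_argument("--openai_api_key", required=True)
parser.add_argument("--parallel_reward", action="store_true")
parser.add_argument("--output_dir", default="trainer_output")
parser.add_argument("--device", choices=["cuda", "mps", "cpu"])
parser.add_argument("--optimize_device", action="store_true")
parser.add_argument("--mixed_precision", action="store_true")
parser.add_argument("--precision_dtype", choices=["bfloat16", "float16"])
parser.add_argument("--no_exploration", action="store_true")
parser.add_argument("--inference", action="store_true")
parser.add_argument("--model_path")

args = parser.parse_args()
```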
Key features:
- GPT-4o Reward Model: Uses OpenAI's GPT-4o as a reward function to evaluate text quality
- Parallel Reward Processing: Supports parallel OpenAI API calls for 10-20x speedup in reward computation
- Mixed Precision Training: Up to 3x speedup with BF16/FP16 on H100 GPUs
- Separate Policy/Value Optimization: Uses distinct optimization steps for policy and value networks
- Reward Normalization: Implements adaptive reward normalization for training stability
- Reference Model Updates: Uses Exponential Moving Average (EMA) for stable reference model updates
- Generalized Advantage Estimation (GAE): Implements GAE for improved policy gradients (reward normalization, EMA updates, and GAE are sketched after this list)
- Entropy-Based Exploration: Optional entropy bonus for exploration during training
- KL Divergence Control: Prevents policy from diverging too far from reference model
- Hardware Optimizations: Platform-specific optimizations for NVIDIA H100/A100 and Apple Silicon
- Language Model Loss: Configurable auxiliary language modeling loss component
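Three of the components above (reward normalization, EMA reference-model updates, and GAE) are easy to sketch in plain PyTorch. These are simplified illustrations, not the exact logic in sentiment_rlhf/training/ppo_trainer.py:

```python
# Simplified sketches of three training components; the real implementations
# in sentiment_rlhf/training/ppo_trainer.py are more involved.
import torch

def normalize_rewards(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Adaptive reward normalization: zero mean, unit variance per batch."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

@torch.no_grad()
def ema_update(ref_model, policy_model, decay: float = 0.99) -> None:
    """Exponential Moving Average update of the frozen reference model."""
    for ref_p, pol_p in zip(ref_model.parameters(), policy_model.parameters()):
        ref_p.mul_(decay).add_(pol_p, alpha=1.0 - decay)

def compute_gae(rewards: torch.Tensor, values: torch.Tensor,
                gamma: float = 0.99, lam: float = 0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards and values are 1-D tensors of length T; the value after the final
    step is treated as zero for simplicity.
    """
    T = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + values
    return advantages, returns
```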
Project structure:
reinforment-learning-with-human-feedback/
├── main.py # Main script for training and inference
├── prepare_data.py # Data preparation utilities
├── H100_RUN_SCRIPT.sh # Convenience script for NVIDIA GPU training
├── MPS_RUN_SCRIPT.sh # Convenience script for Apple Silicon
├── sentiment_rlhf/
│ ├── data/ # Dataset utilities
│ ├── models/ # Model loading utilities
│ │ └── model_loader.py # Model loading implementation
│ ├── training/ # Training components
│ │ ├── ppo_trainer.py # PPO trainer implementation
│ │ ├── reward_model.py # Sequential reward model implementation
│ │ └── parallel_reward_model.py # Parallel reward model implementation
│ └── utils/ # Utility functions
│ └── config.py # Configuration utilities
├── setup.py # Package setup
└── requirements.txt # Dependencies
Examples:
# Basic training with default settings
python main.py --openai_api_key YOUR_API_KEY
NVIDIA GPU with optimizations (best for H100/A100):
python main.py --model_name lvwerra/gpt2-imdb --batch_size 32 \
--max_epochs 20 --device cuda --optimize_device \
--openai_api_key YOUR_API_KEY
Apple Silicon with optimizations:
python main.py --model_name lvwerra/gpt2-imdb --batch_size 32 \
--max_epochs 20 --device mps --optimize_device \
--openai_api_key YOUR_API_KEY
Parallel reward processing for faster GPT-4o evaluation (a sketch of the pattern follows the command):
python main.py --model_name lvwerra/gpt2-imdb --batch_size 32 \
--parallel_reward --openai_api_key YOUR_API_KEY
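Reward calls to the OpenAI API are I/O-bound, so dispatching them concurrently is what buys the speedup. A minimal sketch of the pattern (the repo's parallel_reward_model.py may use a different concurrency strategy):

```python
# Sketch of concurrent reward scoring with a thread pool; illustrative only.
from concurrent.futures import ThreadPoolExecutor

def score_batch(texts, reward_fn, max_workers: int = 8):
    """Score a batch of generated texts concurrently.

    reward_fn is any callable mapping text -> float (e.g. a GPT-4o scorer);
    threads overlap the network latency of the individual API calls.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(reward_fn, texts))
```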
Training without the entropy exploration bonus:
python main.py --model_name lvwerra/gpt2-imdb --no_exploration \
--openai_api_key YOUR_API_KEY
Inference with a trained checkpoint:
python main.py --inference --model_path trainer_output/best-epoch-X \
--openai_api_key YOUR_API_KEY
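Assuming checkpoints are saved in Hugging Face format, a trained model can also be loaded directly with transformers for quick manual checks (illustrative; main.py's inference mode may additionally score outputs with GPT-4o):

```python
# Quick manual generation from a saved checkpoint; prompt and sampling
# settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "trainer_output/best-epoch-X"  # replace X with a real epoch number
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

prompt = "This movie was"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```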
For faster training on cloud platforms:
- NVIDIA GPUs: See NEBIUS_DEPLOYMENT_GUIDE.md for detailed setup on Nebius Cloud
- Google Cloud: See GCP_DEPLOYMENT_GUIDE.md for GCP deployment instructions
This implementation is based on the following resources:
- TRL - Transformer Reinforcement Learning
- Learning to summarize from human feedback
- InstructGPT: Training language models to follow instructions
MIT License