Light, Efficient, Omni-modal & Reward-model Driven Reinforcement Fine-Tuning Framework
English | ็ฎ€ไฝ“ไธญๆ–‡
LightRFT (Light Reinforcement Fine-Tuning) is an advanced reinforcement learning fine-tuning framework designed for Large Language Models (LLMs) and Vision-Language Models (VLMs). The framework provides efficient and scalable RLHF (Reinforcement Learning from Human Feedback) and RLVR (Reinforcement Learning with Verifiable Rewards) training, supporting multiple state-of-the-art algorithms and distributed training strategies.
High-Performance Inference Engines
- Integrated vLLM and SGLang for efficient sampling and inference
- FP8 inference optimization for significantly reduced latency and memory usage
- Flexible engine sleep/wake mechanisms for optimal resource utilization
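As a rough illustration of how such an engine is driven for rollout sampling, here is a minimal vLLM sketch covering batched generation and the sleep/wake mechanism. It assumes a recent vLLM release with sleep-mode support; the model name, memory fraction, and sampling settings are illustrative only, and this is generic vLLM usage rather than LightRFT's internal API.

```python
# Generic vLLM usage sketch: batched sampling plus engine sleep/wake.
# Assumes a recent vLLM release with sleep-mode support; names and
# settings are illustrative, not LightRFT code.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    gpu_memory_utilization=0.4,   # leave headroom for the training model
    enable_sleep_mode=True,
)
params = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=512, n=8)

outputs = llm.generate(["Solve: 12 * 7 = ?"], params)
for request_output in outputs:
    for completion in request_output.outputs:
        print(completion.text)

# Free most of the engine's GPU memory while the trainer runs,
# then restore it before the next rollout phase.
llm.sleep(level=1)
llm.wake_up()
```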
Rich Algorithm Ecosystem
- Policy Optimization: GRPO, GSPO, GMPO, Dr.GRPO
- Advantage Estimation: REINFORCE++, CPGD
- Reward Processing: Reward Norm/Clip
- Sampling Strategy: FIRE Sampling, Token-Level Policy
- Stability Enhancement: DAPO, select_high_entropy_tokens
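As a concrete example of the reward-processing step listed above, the sketch below shows a generic batch-level reward normalization and clipping routine. The function name and defaults are illustrative, not LightRFT's API.

```python
import torch

def normalize_and_clip_rewards(rewards: torch.Tensor,
                               clip_value: float = 5.0,
                               eps: float = 1e-8) -> torch.Tensor:
    """Standardize rewards over the batch, then clip outliers."""
    rewards = (rewards - rewards.mean()) / (rewards.std() + eps)
    return rewards.clamp(-clip_value, clip_value)

# Example: raw reward-model scores for six sampled responses
raw_scores = torch.tensor([0.2, 3.5, -1.0, 0.8, 7.9, 0.1])
print(normalize_and_clip_rewards(raw_scores))
```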
Flexible Training Strategies
- FSDP (Fully Sharded Data Parallel) v2 support
- DeepSpeed ZeRO (Stage 1/2/3) support
- Gradient checkpointing and mixed precision training (BF16/FP16)
- Adam Offload and memory optimization techniques
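The snippet below sketches two of these memory-saving techniques, gradient checkpointing and BF16 mixed precision, using plain Hugging Face/PyTorch calls. In LightRFT these are enabled through CLI flags rather than code like this, and the model name is illustrative.

```python
# Generic sketch of gradient checkpointing + BF16 mixed precision with
# Hugging Face / PyTorch (illustrative model name; not LightRFT code).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", torch_dtype=torch.bfloat16
).cuda()
model.gradient_checkpointing_enable()   # recompute activations to save memory

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def train_step(batch):
    # batch is expected to contain input_ids, attention_mask, and labels
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```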
Innovative Resource Collaboration
- Colocate Anything: Co-locate reward models with training models to maximize GPU utilization
  - Support for multiple reward models running inference in parallel on the same device
  - Dynamic memory management with automatic switching between training and inference phases
  - Reduced cross-device communication overhead for improved end-to-end training efficiency
- Balance Anything (Under Development): Intelligent load balancing system
  - Adaptive task scheduling and resource allocation
  - Automatic load balancing for multi-node training
  - Performance optimization for heterogeneous hardware environments
Comprehensive Multimodal Support
- Native Vision-Language Model (VLM) Training
  - Support for mainstream VLMs like Qwen-VL
  - Parallel processing of multimodal image-text data
  - Efficient multimodal tokenization and batching
- Multimodal Reward Modeling
  - Support for multiple visual reward models working in collaboration
  - Joint optimization of image understanding and text generation
- Complete Vision-Language Alignment Training Pipeline
  - Optimized for multimodal RLVR/RLHF training
  - Built-in support for vision-language model fine-tuning
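For a sense of what the multimodal data path involves, here is a minimal sketch of preparing an image-text batch for a Qwen-VL style model with the Hugging Face processor API. The model name, message structure, and image path are illustrative; this is not LightRFT's API.

```python
# Sketch of multimodal preprocessing for a Qwen-VL style model using the
# Hugging Face processor API (illustrative names and paths; not LightRFT code).
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is the area of the triangle in the figure?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

image = Image.open("geometry_problem.png")
batch = processor(text=[prompt], images=[image], return_tensors="pt", padding=True)
print(batch["input_ids"].shape, batch["pixel_values"].shape)
```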
Complete Experimental Toolkit
- Weights & Biases (W&B) integration
- Math capability benchmarking (GSM8K, Geo3K, etc.)
- Trajectory saving and analysis tools
- Automatic checkpoint management
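For reference, logging training metrics to Weights & Biases generally looks like the following; the project, run, and metric names below are made up for illustration and are unrelated to LightRFT's built-in integration.

```python
# Minimal Weights & Biases logging sketch (names and values are illustrative).
import wandb

run = wandb.init(project="lightrft-demo", name="grpo-gsm8k-qwen2.5-0.5b")
for step in range(3):
    wandb.log(
        {"reward/mean": 0.42 + 0.1 * step, "kl": 0.003, "loss/policy": 1.2 - 0.1 * step},
        step=step,
    )
run.finish()
```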
For detailed algorithm descriptions, implementation details, and usage guides, see the Algorithm Documentation.
| Algorithm | Type | Key Improvement | Paper |
|---|---|---|---|
| GRPO | Policy Optimization | Group normalized advantage estimation | arXiv:2402.03300 |
| GSPO | Policy Optimization | Group sequence policy optimization | arXiv:2507.18071 |
| GMPO (WIP) | Policy Optimization | Geometric-mean policy optimization | arXiv:2507.20673 |
| Dr.GRPO | Policy Optimization | Length bias mitigation | arXiv:2503.20783 |
| DAPO | Policy Optimization | Decoupled clip and dynamic sampling policy optimization | arXiv:2503.14476 |
| REINFORCE++ | Advantage Estimation | Improved baseline estimation | arXiv:2501.03262 |
| CPGD | Advantage Estimation | KL-based drift constraint | arXiv:2505.12504 |
| FIRE Sampling | Sampling Strategy | Filtering and ranking strategies | arXiv:2410.21236 |
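As a quick illustration of the table's first entry, GRPO's group-normalized advantage can be sketched as follows. This is a minimal, self-contained example; the tensor names and shapes are illustrative and do not mirror LightRFT's implementation.

```python
# Group-normalized advantage in the spirit of GRPO (arXiv:2402.03300):
# sample G responses per prompt, score them, and standardize rewards
# within each group to obtain per-response advantages.
import torch

def group_normalized_advantage(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards, one row per prompt."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Two prompts, four sampled responses each (e.g., 1.0 = correct, 0.0 = wrong)
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.0, 0.0, 0.0, 1.0]])
print(group_normalized_advantage(rewards))
```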
- Python >= 3.10
- CUDA >= 12.8
- PyTorch >= 2.5.1
TO BE DONE
Clone and install LightRFT:
```bash
# Clone the repository
git clone https://github.com/opendilab/LightRFT.git
cd LightRFT

# Install dependencies
pip install -r requirements.txt

# Install LightRFT
pip install -e .
```

```bash
# Single node, 8 GPU training example
cd LightRFT

# Run GRPO training (GSM8K math reasoning task)
bash examples/gsm8k_geo3k/run_grpo_gsm8k_qwen2.5_0.5b.sh

# Or run Geo3K geometry problem training (VLM multimodal)
bash examples/gsm8k_geo3k/run_grpo_geo3k_qwen2.5_vl_7b.sh
```
```text
LightRFT/
โ”œโ”€โ”€ lightrft/ # Core library
โ”‚   โ”œโ”€โ”€ strategy/ # Training & inference strategies
โ”‚   โ”‚   โ”œโ”€โ”€ fsdp/ # FSDP implementation
โ”‚   โ”‚   โ”œโ”€โ”€ deepspeed/ # DeepSpeed implementation
โ”‚   โ”‚   โ”œโ”€โ”€ vllm_utils/ # vLLM utilities
โ”‚   โ”‚   โ”œโ”€โ”€ sglang_utils/ # SGLang utilities
โ”‚   โ”‚   โ””โ”€โ”€ utils/ # Strategy utilities
โ”‚   โ”œโ”€โ”€ models/ # Model definitions
โ”‚   โ”‚   โ”œโ”€โ”€ actor_al.py # Audio-language model actor
โ”‚   โ”‚   โ”œโ”€โ”€ actor_language.py # Language model actor
โ”‚   โ”‚   โ”œโ”€โ”€ actor_vl.py # Vision-language model actor
โ”‚   โ”‚   โ”œโ”€โ”€ grm_vl.py # Generative reward model (Vision-Language)
โ”‚   โ”‚   โ”œโ”€โ”€ srm_al.py # Scalar reward model (Audio-Language)
โ”‚   โ”‚   โ”œโ”€โ”€ srm_vl.py # Scalar reward model (Vision-Language)
โ”‚   โ”‚   โ”œโ”€โ”€ loss.py # Loss functions
โ”‚   โ”‚   โ”œโ”€โ”€ monkey_patch/ # Model adaptation patches for distributed training
โ”‚   โ”‚   โ”œโ”€โ”€ tests/ # Model tests
โ”‚   โ”‚   โ””โ”€โ”€ utils.py # Model utilities
โ”‚   โ”œโ”€โ”€ trainer/ # Trainer implementations
โ”‚   โ”‚   โ”œโ”€โ”€ ppo_trainer.py # LLM PPO trainer
โ”‚   โ”‚   โ”œโ”€โ”€ ppo_trainer_vl.py # VLM PPO trainer
โ”‚   โ”‚   โ”œโ”€โ”€ spmd_ppo_trainer.py # SPMD PPO trainer extension (**Core**)
โ”‚   โ”‚   โ”œโ”€โ”€ grm_trainer_vl.py # Generative reward model trainer (Vision-Language)
โ”‚   โ”‚   โ”œโ”€โ”€ srm_trainer_al.py # Scalar reward model trainer (Audio-Language)
โ”‚   โ”‚   โ”œโ”€โ”€ srm_trainer_vl.py # Scalar reward model trainer (Vision-Language)
โ”‚   โ”‚   โ”œโ”€โ”€ fast_exp_maker.py # Fast experience generator (**Core**)
โ”‚   โ”‚   โ”œโ”€โ”€ experience_maker.py # Base experience generator
โ”‚   โ”‚   โ”œโ”€โ”€ experience_maker_vl.py # Base experience generator for VLM
โ”‚   โ”‚   โ”œโ”€โ”€ replay_buffer.py # Replay buffer
โ”‚   โ”‚   โ”œโ”€โ”€ replay_buffer_vl.py # VLM replay buffer
โ”‚   โ”‚   โ”œโ”€โ”€ replay_buffer_utils.py # Replay buffer utilities
โ”‚   โ”‚   โ”œโ”€โ”€ kl_controller.py # KL divergence controller
โ”‚   โ”‚   โ””โ”€โ”€ utils.py # Trainer utilities
โ”‚   โ”œโ”€โ”€ datasets/ # Dataset processing
โ”‚   โ”‚   โ”œโ”€โ”€ audio_alpaca.py # Audio Alpaca dataset
โ”‚   โ”‚   โ”œโ”€โ”€ grm_dataset.py # Generative reward model dataset
โ”‚   โ”‚   โ”œโ”€โ”€ hpdv3.py # HPDv3 reward model dataset
โ”‚   โ”‚   โ”œโ”€โ”€ image_reward_db.py # Image reward database
โ”‚   โ”‚   โ”œโ”€โ”€ imagegen_cot_reward.py # Image generation CoT generative reward
โ”‚   โ”‚   โ”œโ”€โ”€ omnirewardbench.py # OmniRewardBench dataset
โ”‚   โ”‚   โ”œโ”€โ”€ process_reward_dataset.py # Reward dataset processing
โ”‚   โ”‚   โ”œโ”€โ”€ prompts_dataset.py # LLM prompts dataset
โ”‚   โ”‚   โ”œโ”€โ”€ prompts_dataset_vl.py # Vision-language prompts dataset
โ”‚   โ”‚   โ”œโ”€โ”€ rapidata.py # Rapidata reward model dataset
โ”‚   โ”‚   โ”œโ”€โ”€ sft_dataset.py # SFT dataset
โ”‚   โ”‚   โ”œโ”€โ”€ sft_dataset_vl.py # VLM SFT dataset
โ”‚   โ”‚   โ”œโ”€โ”€ srm_dataset.py # Scalar reward model base dataset
โ”‚   โ”‚   โ””โ”€โ”€ utils.py # Dataset utilities
โ”‚   โ””โ”€โ”€ utils/ # Utility functions
โ”‚       โ”œโ”€โ”€ ckpt_scripts/ # Checkpoint processing scripts
โ”‚       โ”œโ”€โ”€ cli_args.py # CLI argument parsing
โ”‚       โ”œโ”€โ”€ distributed_sampler.py # Distributed sampler
โ”‚       โ”œโ”€โ”€ logging_utils.py # Logging utilities
โ”‚       โ”œโ”€โ”€ processor.py # Data processor for HF models
โ”‚       โ”œโ”€โ”€ remote_rm_utils.py # Remote reward model utilities
โ”‚       โ”œโ”€โ”€ timer.py # Timer utilities
โ”‚       โ”œโ”€โ”€ trajectory_saver.py # Trajectory saver
โ”‚       โ””โ”€โ”€ utils.py # General utilities
โ”‚
โ”œโ”€โ”€ examples/ # Usage examples
โ”‚   โ”œโ”€โ”€ gsm8k_geo3k/ # GSM8K/Geo3K math reasoning training examples
โ”‚   โ”œโ”€โ”€ grm_training/ # Generative reward model training examples
โ”‚   โ”œโ”€โ”€ srm_training/ # Scalar reward model training examples
โ”‚   โ””โ”€โ”€ chat/ # Model dialogue examples
โ”‚
โ”œโ”€โ”€ docs/ # Sphinx documentation
โ”‚   โ”œโ”€โ”€ Makefile # Documentation build Makefile
โ”‚   โ”œโ”€โ”€ make.bat # Documentation build batch file
โ”‚   โ””โ”€โ”€ source/ # Documentation source
โ”‚       โ”œโ”€โ”€ _static/ # Static files (CSS, etc.)
โ”‚       โ”œโ”€โ”€ api_doc/ # API documentation
โ”‚       โ”œโ”€โ”€ best_practice/ # Best practices & resources
โ”‚       โ”œโ”€โ”€ installation/ # Installation guides
โ”‚       โ””โ”€โ”€ quick_start/ # Quick start & user guides
โ”‚
โ”œโ”€โ”€ assets/ # Assets
โ”‚   โ””โ”€โ”€ logo.png # Project logo
โ”‚
โ”œโ”€โ”€ CHANGELOG.md # Changelog
โ”œโ”€โ”€ LICENSE # License file
โ”œโ”€โ”€ Makefile # Project Makefile
โ”œโ”€โ”€ README.md # Project documentation (English)
โ”œโ”€โ”€ README_zh.md # Project documentation (Chinese)
โ”œโ”€โ”€ requirements.txt # Python dependencies
โ”œโ”€โ”€ requirements-dev.txt # Development dependencies
โ”œโ”€โ”€ requirements-doc.txt # Documentation dependencies
โ””โ”€โ”€ setup.py # Package setup script
```
- lightrft/: LightRFT core library, providing training strategies, model definitions, and trainer implementations
- examples/: Complete training examples and scripts
  - gsm8k_geo3k/: GSM8K and Geo3K math reasoning training examples
  - grm_training/: Generative reward model training examples
  - srm_training/: Scalar reward model training examples
  - chat/: Model dialogue examples
- docs/: Sphinx documentation with complete user guides and API documentation
```bash
TBS=128 # Training batch size
RBS=128 # Rollout batch size
micro_train_batch_size=1 # Micro batch size per GPU
micro_rollout_batch_size=2 # Rollout micro batch size
```

```bash
--advantage_estimator group_norm # Advantage estimator: group_norm, reinforce, cpgd
--n_samples_per_prompt 8 # Number of samples per prompt
--max_epochs 1 # Training epochs per episode
--num_episodes 3 # Total training episodes
--kl_estimator k3 # KL estimator type
--init_kl_coef 0.001 # KL penalty coefficient
```

```bash
--fsdp # Enable FSDP
--zero_stage 3 # DeepSpeed ZeRO stage
--gradient_checkpointing # Gradient checkpointing
--adam_offload # Adam optimizer offload
--bf16 # BF16 mixed precision
```

```bash
--rm_use_engine # Use inference engine (vLLM/SGLang)
--engine_mem_util 0.4 # Engine memory utilization
--engine_tp_size 1 # Engine tensor parallelism degree
--enable_engine_sleep # Enable engine sleep mechanism
```

See training scripts for detailed parameter validation logic.
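For context on the `--kl_estimator k3` option above, k3 refers to the low-variance KL approximation k3 = exp(log_ratio) - 1 - log_ratio. A minimal sketch follows; the tensor names are illustrative and this is not LightRFT's implementation.

```python
# k3 KL estimator: a non-negative, low-variance per-token estimate of
# KL(pi_theta || pi_ref), computed from token log-probabilities.
import torch

def k3_kl(logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
    log_ratio = ref_logprobs - logprobs             # log pi_ref - log pi_theta
    return torch.exp(log_ratio) - 1.0 - log_ratio   # always >= 0

policy_logprobs = torch.tensor([-1.2, -0.7, -2.3])
reference_logprobs = torch.tensor([-1.0, -0.9, -2.0])
print(k3_kl(policy_logprobs, reference_logprobs))
```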
Out of memory (OOM). Solutions:
- Reduce `micro_train_batch_size` and `micro_rollout_batch_size`
- Enable `--gradient_checkpointing`
- Lower `--engine_mem_util`
- Use ZeRO Stage 3
Training instability. Solutions:
- Enable reward normalization: `--normalize_reward`
- Lower the learning rate
- Use `--advantage_estimator group_norm`
- Try the DAPO algorithm
Quick Start:
- Installation Guide - Docker images, installation methods, and troubleshooting
- Supported Algorithms - Comprehensive algorithm guide with implementation details
- Configuration Reference - Complete parameter documentation
Best Practices:
- Training Strategy Usage - FSDP, DeepSpeed, and inference engine configuration
- FAQ - Frequently asked questions and solutions
- Troubleshooting Guide - Common issues and debugging
- Contributing Guide - How to contribute to LightRFT
Install documentation dependencies:

```bash
pip install -r requirements-doc.txt
```

Generate HTML documentation:

```bash
make docs
# Open docs/build/index.html to view documentation
```

Live documentation preview:

```bash
make docs-live
# Visit http://localhost:8000
```

We welcome and appreciate contributions from the community! To ensure a smooth workflow, please follow these steps:
- Fork the Repository: Click the "Fork" button at the top right to copy this project to your GitHub account.
- Create a Feature Branch: Create a new branch for your changes, preferably based on `main`. Ensure documentation branches are named with the `doc` pattern to enable auto-deployment of the docs site.
  `git checkout -b feature/your-feature-name`
- Commit Your Changes: Please follow the Conventional Commits specification.
  - Format example: `feature(user): short description of the change`
  - Common types: `feature` (new feature), `fix` (bug fix), `polish` (polish or optimize), `docs` (documentation), `style` (formatting), `refactor` (code restructuring)
  - `git commit -m 'feature(user): add an amazing feature'`
- Push to the Branch: Push your changes to your forked repository.
  `git push origin feature/your-feature-name`
- Open a Pull Request: Go to the original repository and create a Pull Request targeting the `main` (or specific development) branch. Please provide a detailed description of your changes.
```bash
# Install development dependencies
pip install -r requirements-dev.txt

# Code formatting (YAPF)
make format

# Code linting (Flake8)
make fcheck
```

If you use this codebase in your research or applications, please cite it as follows:
```bibtex
@misc{lightrft,
    title={LightRFT},
    author={Niu, Yazhe and Pu, Yuan and Shi, Dongxing and Lu, Yudong and Xiong, Yingtong and Ge, Ruijun and Sun, Jiaxuan and Wan, Zunian and Zhang, Shaoang and others},
    publisher={GitHub},
    howpublished={\url{https://github.com/opendilab/LightRFT}},
    year={2025},
}
```

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
LightRFT is developed based on OpenRLHF. We extend our sincere gratitude to the OpenRLHF team for their excellent work. Some files and implementations in this project are adapted and reused from OpenRLHF.
This project is developed in collaboration with colleagues from the System Platform Center and Safe and Trustworthy AI Center at Shanghai AI Laboratory. We sincerely thank them for their contributions and support.
This project builds upon the following outstanding open-source projects (including but not limited to):
- OpenRLHF, verl - Core RL framework foundation (parts of key components adapted and reused)
- vLLM - High-performance inference engine
- SGLang - Structured generation language runtime
- DeepSpeed - Distributed training optimization
- PyTorch FSDP - Fully Sharded Data Parallel
Thanks to all contributors and supporters!
For questions or suggestions, please contact us via:
- Issues: GitHub Issues
- Email: [email protected]
โญ If this project helps you, please give us a star!
Made with โค๏ธ by LightRFT Team