# Zero Bubble Distributed RL Framework for Language Model Fine-Tuning

This folder contains code for the Zero Bubble distributed RL framework. It currently supports **GRPO** and **DAPO**. See the [main README](../README.md) for general installation instructions and usage.

**Note:** This project is under active development; expect changes.

## 🛠 Installation

1. Follow the general installation guide in the [main README](../README.md).
2. Install [pygloo](https://github.com/ray-project/pygloo): build it from source for Ray, following the instructions in its repository README.

## Design idea

We aim to reduce the *“bubble”*: the idle time that occurs between rollouts and training steps (illustrated in Fig. 1).

<div align="center">
   <p align="center">
      <img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/all_sync.png" width=700/>
   </p>
</div>

**Fig. 1** - In an all-sync online RL framework, rollout workers wait for the trainer to finish training and synchronize weights, and the trainer waits for rollouts. This causes large GPU idle time.

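The size of this bubble can be estimated with a simple back-of-the-envelope model (an illustrative calculation, not a measurement from the framework): in each all-sync iteration, rollout workers sit idle while the trainer runs, and vice versa.

```python
def allsync_idle_fraction(t_rollout: float, t_train: float) -> tuple[float, float]:
    """Idle share of rollout workers and of the trainer in one all-sync
    iteration of length t_rollout + t_train (a simplified model that
    ignores weight-synchronization overhead)."""
    total = t_rollout + t_train
    return t_train / total, t_rollout / total

# e.g. 30 s of rollout followed by 10 s of training per iteration:
rollout_idle, trainer_idle = allsync_idle_fraction(30.0, 10.0)
```

Under these example timings, rollout workers idle 25% of the time and the trainer 75%, which is the waste the Zero Bubble pipeline below targets.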
<div align="center">
   <p align="center">
      <img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/zero_bubble.png" width=700/>
   </p>
</div>

**Fig. 2** - Our Zero Bubble pipeline follows a producer–consumer pattern:

* A global **data buffer** temporarily stores rollouts produced by inference workers.
* A **weights distributor** buffers updated model weights and distributes them to inference workers.
* When the data buffer has enough data, the trainer continuously consumes from it and pushes updated weights to the weights distributor.
* After finishing a mini-batch, each inference worker checks the weights distributor and synchronizes to a newer weight version if available.

Under ideal conditions (inference workers produce data at the same rate the trainer consumes it), the pipeline eliminates idle time. We call it *zero bubble* because, with an unlimited data buffer, inference and training can run indefinitely without waiting. In practice, to avoid wasted compute and stale, off-policy data, we set a bounded buffer size, so inference workers will briefly wait when the buffer is full.

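The buffer-and-distributor loop above can be sketched with plain Python threads and a bounded queue. All names and sizes here are illustrative; the actual framework implements these roles as distributed actors rather than threads.

```python
import queue
import threading

NUM_BATCHES = 8
data_buffer = queue.Queue(maxsize=4)   # bounded buffer: put() blocks when full
latest_weights = {"version": 0}        # stands in for the weights distributor
weights_lock = threading.Lock()
consumed = []

def inference_worker():
    local_version = 0
    for i in range(NUM_BATCHES):
        data_buffer.put(f"rollout-{i}")        # waits briefly if buffer is full
        with weights_lock:                     # sync to a newer version if any
            if latest_weights["version"] > local_version:
                local_version = latest_weights["version"]

def trainer():
    for _ in range(NUM_BATCHES):
        batch = data_buffer.get()              # waits until rollouts arrive
        consumed.append(batch)                 # "train" on the batch...
        with weights_lock:                     # ...then publish updated weights
            latest_weights["version"] += 1

threads = [threading.Thread(target=inference_worker),
           threading.Thread(target=trainer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the queue is bounded, the producer stalls only when it runs ahead of the trainer by more than the buffer size, which is exactly the staleness/throughput trade-off controlled by `data_actor_buffer_size_limit` below.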
## Usage

In addition to the general parameters (see the main README), the Zero Bubble pipeline introduces one additional parameter:

* **`data_actor_buffer_size_limit`** - Maximum number of rollout batches the data buffer may hold. Defaults to **twice** the trainer’s mini-batch size. Avoid setting this too large: a very large buffer increases off-policy training. For DAPO, since only effective prompts count toward a training batch, you may need to raise `data_actor_buffer_size_limit` depending on how many samples survive filtering.

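One way to reason about this is a small sizing heuristic. The helper below is hypothetical and not part of the framework's CLI; it simply scales the default (twice the mini-batch size) by an assumed fraction of samples that survive DAPO's filtering.

```python
import math

def suggested_buffer_limit(train_minibatch_size: int,
                           effective_sample_ratio: float = 1.0) -> int:
    """Illustrative heuristic: start from twice the mini-batch size and
    scale up when only a fraction of rollouts yield effective prompts."""
    return math.ceil(2 * train_minibatch_size / effective_sample_ratio)

suggested_buffer_limit(8)        # all samples effective (GRPO-like): 16
suggested_buffer_limit(8, 0.5)   # half the prompts effective: 32
```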
Example: RL training on 8 GPUs with Zero Bubble (zero2)

```bash
python rl_example_zero_bubble.py \
    --dataset /path/to/your/dataset.jsonl \
    --model /path/to/your/model \
    -t 4 -i 4 -b vllm -a DAPO \
    -imbs 8 -ibs 8 -tbs 8 -e 2 -rt boxed \
    -si 25 -s "Please reason step by step, and put your final answer within \\boxed{}." \
    -tMbs 2 -tmbs 2 -p Rebase_Experiments -zero 2 -mpt 512 -mnt 3584
```

## Performance

<div align="center">
   <p align="center">
      <img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/zero_bubble_gpu_util.png" width=700/>
   </p>
</div>

**Fig. 3** - Performance of the Zero Bubble pipeline tested with an unlimited buffer size.