applications/ColossalChat/coati/distributed/README.md
7 additions & 7 deletions
@@ -1,6 +1,6 @@
# Distributed RL Framework for Language Model Fine-Tuning

-This repository implements a distributed Reinforcement Learning (RL) training framework designed to fine-tune large language models using algorithms such as **GRPO** and **DAPO**. It supports multi-node and multi-GPU setups, scalable rollout generation, and policy optimization using libraries like VLLM. Currently, we supports two Reinforcement Learning with Verifiable Reward (RLVR) tasks: solving math problems and code generation.
+This repository implements a distributed Reinforcement Learning (RL) training framework designed to fine-tune large language models using algorithms such as **GRPO** and **DAPO**. It supports multi-node and multi-GPU setups, scalable rollout generation, and policy optimization using libraries like VLLM. Currently, we support two Reinforcement Learning with Verifiable Reward (RLVR) tasks: solving math problems and code generation.

**Please note that we are still under intensive development, stay tuned.**
@@ -70,7 +70,7 @@ Key features for Producer-Consumer Pattern:
## 🧠 Data Format

-Samples in the training or evaluation `.jsonl` file should the same format depends on the type of task, we currently support two RLVR tasks: solving math problems and code generation.
+Samples in the training or evaluation `.jsonl` file should follow the format specific to the type of task. We currently support two RLVR tasks: solving math problems and code generation.

### Math Data Format
```json
@@ -84,7 +84,7 @@ Samples in the training or evaluation `.jsonl` file should the same format depen
```

### Code Data Format
-We support [Prime code dataset format](https://github.com/PRIME-RL/PRIME). Inputs and outputs in test cases should be two lists containing only strings and matching in the number of elements. You prompt must properly instruct the LLM to generate code to read test cases from stdin and output results to stdout.
+We support [Prime code dataset format](https://github.com/PRIME-RL/PRIME). Inputs and outputs in test cases should be two lists containing only strings and matching in the number of elements. Your prompt must properly instruct the LLM to generate code to read test cases from stdin and output results to stdout.

```json
{
"messages": {
@@ -134,7 +134,7 @@ We support [Prime code dataset format](https://github.com/PRIME-RL/PRIME). Input
|`--temperature`| Sampling temperature for generation |`1.0`|
|`--top-k`| Top-K sampling parameter for generation |`None`|
|`--top-p`| Top-P sampling parameter for generation |`1.0`|
-|`--system-prompt`| System prompt, Optional, default to the default system prompt for each reward types. For more information, refer to the [**reward type**](#-constraints-and-notes) section |`Please reason step by step, and put your final answer within \\boxed{}.`|
+|`--system-prompt`| System prompt, optional, default to the default system prompt for each reward types. For more information, refer to the [**reward type**](#-constraints-and-notes) section |`Please reason step by step, and put your final answer within \\boxed{}.`|
|`--max-new-tokens`| Max generation tokens |`3584`|
|`--max-prompt-tokens`| Max prompt tokens |`512`|
@@ -169,7 +169,7 @@ We support [Prime code dataset format](https://github.com/PRIME-RL/PRIME). Input
## ⚙️ GRPO Settings

-In addition to the two default training settings we provided--- original `GRPO` and `DAPO`, users can customize their training by changing the following hyperparameters in `grpo_config` in `rl_example.py`.
+In addition to the two default training settings provided—`GRPO` and `DAPO`—users can customize their training by changing the following hyperparameters in `grpo_config` in `rl_example.py`.
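The `grpo_config` hyperparameters referenced above tune the GRPO/DAPO objective. For intuition, here is a minimal, repository-independent sketch of the group-relative advantage normalization that gives GRPO its name; it does not mirror the actual code in `rl_example.py`.

```python
# Sketch of the group-relative advantage at the heart of GRPO: rewards for the
# rollouts of a single prompt are normalized against that group's mean and
# standard deviation. Independent of this repository's implementation.
from statistics import mean, stdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize per-rollout rewards within one prompt's rollout group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: 4 rollouts for one math prompt with verifiable rewards in {0, 1}
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```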
@@ -272,7 +272,7 @@ We use 10.0.0.3 as master node. First we start a ray cluster on 10.0.0.3:
ray start --head --node-ip-address=10.0.0.3
```

-Then, for each slave node (10.0.0.4/10.0.0.5/10.0.0.6), we add to the ray cluser by following code:
+Then, for each slave node (10.0.0.4/10.0.0.5/10.0.0.6), we add to the ray cluster by following code:

```bash
ray start --address='10.0.0.3:6379'
```
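Once every slave node has run the command above, the cluster can be sanity-checked from the head node. The snippet below is a generic Ray check, not part of this repository's scripts, and assumes the cluster was started with `ray start` as shown.

```python
# Generic Ray sanity check: attach to the running cluster and count live nodes.
import ray

ray.init(address="auto")  # connect to the cluster started with `ray start`
alive = [n for n in ray.nodes() if n["Alive"]]
print(f"{len(alive)} node(s) alive")  # expect 4 for 10.0.0.3-10.0.0.6
ray.shutdown()
```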
@@ -313,4 +313,4 @@ python rl_example.py
```

## Acknowledgement
-Colossal-RL is a distributed version of ColossalChat and inspired by a few awesome open-source projects. We would like to express our gratitude to the Fuyao-ray team and the vllm-ascend team for their support throughout the development of the this project. We also thank the following awesome open-source projects and algorithms: GRPO, DAPO, TRL, Verl, OpenRLHF, StreamRL, Qwen, Logic-RL.
+Colossal-RL is a distributed version of ColossalChat and inspired by a few awesome open-source projects. We would like to express our gratitude to the following awesome open-source projects and algorithms: GRPO, DAPO, TRL, Verl, OpenRLHF, StreamRL, Qwen, Logic-RL.