
Commit dc29c74

update readme
1 parent 2559924 commit dc29c74

File tree

  • applications/ColossalChat/coati/distributed

1 file changed, +7 -7 lines changed

applications/ColossalChat/coati/distributed/README.md

Lines changed: 7 additions & 7 deletions
@@ -1,6 +1,6 @@
 # Distributed RL Framework for Language Model Fine-Tuning

- This repository implements a distributed Reinforcement Learning (RL) training framework designed to fine-tune large language models using algorithms such as **GRPO** and **DAPO**. It supports multi-node and multi-GPU setups, scalable rollout generation, and policy optimization using libraries like VLLM. Currently, we supports two Reinforcement Learning with Verifiable Reward (RLVR) tasks: solving math problems and code generation.
+ This repository implements a distributed Reinforcement Learning (RL) training framework designed to fine-tune large language models using algorithms such as **GRPO** and **DAPO**. It supports multi-node and multi-GPU setups, scalable rollout generation, and policy optimization using libraries like VLLM. Currently, we support two Reinforcement Learning with Verifiable Reward (RLVR) tasks: solving math problems and code generation.

 **Please note that we are still under intensive development, stay tuned.**

@@ -70,7 +70,7 @@ Key features for Producer-Consumer Pattern:

 ## 🧠 Data Format

- Samples in the training or evaluation `.jsonl` file should the same format depends on the type of task, we currently support two RLVR tasks: solving math problems and code generation.
+ Samples in the training or evaluation `.jsonl` file should follow the format specific to the type of task. We currently support two RLVR tasks: solving math problems and code generation.

 ### Math Data Format
 ```json
@@ -84,7 +84,7 @@ Samples in the training or evaluation `.jsonl` file should the same format depen
 ```

 ### Code Data Format
- We support [Prime code dataset format](https://github.com/PRIME-RL/PRIME). Inputs and outputs in test cases should be two lists containing only strings and matching in the number of elements. You prompt must properly instruct the LLM to generate code to read test cases from stdin and output results to stdout.
+ We support [Prime code dataset format](https://github.com/PRIME-RL/PRIME). Inputs and outputs in test cases should be two lists containing only strings and matching in the number of elements. Your prompt must properly instruct the LLM to generate code to read test cases from stdin and output results to stdout.
 ```json
 {
 "messages": {
@@ -134,7 +134,7 @@ We support [Prime code dataset format](https://github.com/PRIME-RL/PRIME). Input
 | `--temperature` | Sampling temperature for generation | `1.0` |
 | `--top-k` | Top-K sampling parameter for generation | `None` |
 | `--top-p` | Top-P sampling parameter for generation | `1.0` |
- | `--system-prompt` | System prompt, Optional, default to the default system prompt for each reward types. For more information, refer to the [**reward type**](#-constraints-and-notes) section | `Please reason step by step, and put your final answer within \\boxed{}.` |
+ | `--system-prompt` | System prompt, optional, default to the default system prompt for each reward types. For more information, refer to the [**reward type**](#-constraints-and-notes) section | `Please reason step by step, and put your final answer within \\boxed{}.` |
 | `--max-new-tokens` | Max generation tokens | `3584` |
 | `--max-prompt-tokens` | Max prompt tokens | `512` |

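To make the generation arguments in the table above concrete, here is a minimal sketch of passing them to `rl_example.py` (whose `python rl_example.py` invocation appears later in this README). This example is not part of the original file: the flag values are arbitrary illustrations, and any required model or dataset arguments are omitted and must be taken from the full argument table.

```bash
# Sketch only: the flags below are those documented in the table above;
# the values are arbitrary examples, and the model/dataset arguments
# (not shown in this hunk) still need to be supplied.
python rl_example.py \
    --temperature 1.0 \
    --top-k 20 \
    --top-p 0.95 \
    --system-prompt "Please reason step by step, and put your final answer within \\boxed{}." \
    --max-new-tokens 3584 \
    --max-prompt-tokens 512
```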
@@ -169,7 +169,7 @@ We support [Prime code dataset format](https://github.com/PRIME-RL/PRIME). Input

 ## ⚙️ GRPO Settings

- In addition to the two default training settings we provided--- original `GRPO` and `DAPO`, users can customize their training by changing the following hyperparameters in `grpo_config` in `rl_example.py`.
+ In addition to the two default training settings provided (`GRPO` and `DAPO`), users can customize their training by changing the following hyperparameters in `grpo_config` in `rl_example.py`.

 | Argument Name | Description | Default |
 | ----------------------------- | ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
@@ -272,7 +272,7 @@ We use 10.0.0.3 as master node. First we start a ray cluster on 10.0.0.3:
 ray start --head --node-ip-address=10.0.0.3
 ```

- Then, for each slave node (10.0.0.4/10.0.0.5/10.0.0.6), we add to the ray cluser by following code:
+ Then, for each slave node (10.0.0.4/10.0.0.5/10.0.0.6), we add to the ray cluster by following code:
 ```bash
 ray start --address='10.0.0.3:6379'
 ```
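One optional addition that is not in the README itself: after the head and worker nodes have been started with the commands above, Ray's built-in status command can confirm that every node has joined the cluster before launching training.

```bash
# Optional sanity check (not part of the original README): run on the head
# node (10.0.0.3) once all `ray start` commands have completed. The output
# should list the head node plus one entry per worker node.
ray status
```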
@@ -313,4 +313,4 @@ python rl_example.py
 ```

 ## Acknowledgement
- Colossal-RL is a distributed version of ColossalChat and inspired by a few awesome open-source projects. We would like to express our gratitude to the Fuyao-ray team and the vllm-ascend team for their support throughout the development of the this project. We also thank the following awesome open-source projects and algorithms: GRPO, DAPO, TRL, Verl, OpenRLHF, StreamRL, Qwen, Logic-RL.
+ Colossal-RL is a distributed version of ColossalChat and inspired by a few awesome open-source projects. We would like to express our gratitude to the following awesome open-source projects and algorithms: GRPO, DAPO, TRL, Verl, OpenRLHF, StreamRL, Qwen, Logic-RL.
