Commit 58f10c9

fix docs
Signed-off-by: Vladimir Suvorov <[email protected]>
1 parent 36acf60 commit 58f10c9

1 file changed: +26 −10 lines

docs/tutorials/how_to_run_colabs.md
@@ -59,7 +59,7 @@ Upload notebooks or mount your GitHub repo
 2. Try:
    - `sft_qwen3_demo.ipynb`
    - `sft_llama3_demo.ipynb`
-   - `grpo_llama3_demo.ipynb`
+   - `rl_llama3_demo.ipynb` (GRPO/GSPO training)


 > **Tip:** If Colab disconnects, re-enable TPU and re-run setup cells. Save checkpoints to GCS or Drive.
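The tip above tells readers to save checkpoints to GCS or Drive without showing how. As an illustration (not part of this commit), mounting Drive from a Colab cell looks roughly like this; `drive.mount` is Colab's standard API, while the output directory is a hypothetical example:

```python
# Illustrative only: persisting checkpoints across Colab disconnects.
# google.colab.drive is Colab's standard Drive API; the directory name
# below is a hypothetical example, not a path used by the demo notebooks.
from google.colab import drive

drive.mount("/content/drive")  # opens an auth prompt on first run

# Point the run's output at Drive so checkpoints survive a disconnect.
OUTPUT_PATH = "/content/drive/MyDrive/maxtext_runs"  # hypothetical location
```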
@@ -125,22 +125,25 @@ Use the link for Jupyter Lab as a link for "Connect to a local runtime" in Colla
 - **`sft_qwen3_demo.ipynb`** → Qwen3-0.6B SFT training and evaluation on [OpenAI's GSM8K dataset](https://huggingface.co/datasets/openai/gsm8k)
 - **`sft_llama3_demo.ipynb`** → Llama3.1-8B SFT training on [Hugging Face ultrachat_200k dataset](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)

-### GRPO Training
+### Reinforcement Learning (GRPO/GSPO) Training

-- **`grpo_llama3_1_8b_demo.ipynb`** → GRPO training on math dataset (Colab/notebook)
+- **`rl_llama3_demo.ipynb`** → GRPO/GSPO training on math dataset (Colab/notebook)

-#### GRPO Colab Usage
+#### GRPO/GSPO Colab Usage

-For interactive GRPO training in Google Colab or Jupyter:
+For interactive GRPO or GSPO training in Google Colab or Jupyter:

-1. **Open** `src/MaxText/examples/grpo_llama3_1_8b_demo.ipynb`
+1. **Open** `src/MaxText/examples/rl_llama3_demo.ipynb`
 2. **Enable TPU runtime** (Runtime → Change runtime type → TPU)
-3. **Run cells** to train Llama3.1-8B with GRPO on GSM8K dataset
+3. **Set `LOSS_ALGO`** to `"grpo"` for GRPO or `"gspo-token"` for GSPO
+4. **Run cells** to train Llama3.1-8B with GRPO or GSPO on GSM8K dataset

-#### GRPO Python Script Usage - local runs
+> **Note:** GRPO (Group Relative Policy Optimization) optimizes each token, while GSPO (Group Sequence Policy Optimization) optimizes the whole sequence. The difference is controlled by the `loss_algo` parameter.
+
+#### GRPO/GSPO Python Script Usage - local runs

 ```bash
-# Llama3.1-8B-Instruct
+# Llama3.1-8B-Instruct with GRPO (default)
 python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml \
   --model_name=llama3.1-8b \
   --tokenizer_path=meta-llama/Llama-3.1-8B-Instruct \
@@ -149,6 +152,16 @@ python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml \
   --base_output_directory=$OUTPUT_PATH \
   --hf_access_token=$HF_TOKEN

+# Llama3.1-8B-Instruct with GSPO
+python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml \
+  --model_name=llama3.1-8b \
+  --tokenizer_path=meta-llama/Llama-3.1-8B-Instruct \
+  --load_parameters_path=gs://path/to/checkpoint/0/items \
+  --run_name=$WORKLOAD \
+  --base_output_directory=$OUTPUT_PATH \
+  --hf_access_token=$HF_TOKEN \
+  --loss_algo=gspo-token
+
 # Qwen2.5-7B
 python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml \
   --model_name=qwen2.5-7b \
@@ -158,7 +171,10 @@ python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml \
   --base_output_directory=$OUTPUT_PATH \
   --hf_access_token=$HF_TOKEN
 ```
-#### GRPO Python Script Usage - cluster runs
+
+> **Note:** To use GSPO instead of GRPO, add `--loss_algo=gspo-token` to the command. GRPO optimizes each token, while GSPO optimizes the whole sequence.
+
+#### GRPO/GSPO Python Script Usage - cluster runs

 For running on clusters, please refer to `maxtext/docs/tutorials/grpo_with_pathways.md`
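The notes added in this diff compress the GRPO/GSPO distinction into one sentence each. As a numerical sketch of what that distinction means, assuming the standard definitions of the two objectives (independent per-token importance ratios for GRPO, one length-normalized sequence-level ratio for GSPO) rather than MaxText's actual `train_rl` internals:

```python
# Sketch of the GRPO vs. GSPO weighting difference (standard definitions,
# not MaxText's implementation). The advantage A is shared by all tokens
# of a sampled response in both cases; only the importance ratio differs.
import numpy as np

logp_new = np.array([-1.2, -0.7, -2.1])  # token log-probs under current policy
logp_old = np.array([-1.0, -0.9, -1.8])  # same tokens under sampling policy
A = 0.5                                  # group-relative advantage (made up)

# GRPO: an independent importance ratio per token.
grpo_ratios = np.exp(logp_new - logp_old)
grpo_objective = np.mean(grpo_ratios * A)

# GSPO: one length-normalized sequence-level ratio shared by every token
# (the behavior a loss_algo such as "gspo-token" keys off).
gspo_ratio = np.exp(np.mean(logp_new - logp_old))
gspo_objective = gspo_ratio * A

print(grpo_objective, gspo_objective)  # differ whenever per-token ratios vary
```

Clipping and group averaging are omitted for brevity; the point is only that the two objectives coincide when every token's ratio is equal and diverge otherwise.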
