docs/tutorials/how_to_run_colabs.md
1. Upload notebooks or mount your GitHub repo
2. Try:
   - `sft_qwen3_demo.ipynb`
   - `sft_llama3_demo.ipynb`
   - `rl_llama3_demo.ipynb` (GRPO/GSPO training)
> ⚡ **Tip:** If Colab disconnects, re-enable TPU and re-run setup cells. Save checkpoints to GCS or Drive.
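The tip above suggests backing checkpoints up off the Colab VM. A minimal sketch of that idea (the paths and helper name are hypothetical, not MaxText code; in Colab, a Drive destination becomes available after mounting with `google.colab.drive.mount`):

```python
import shutil
from pathlib import Path

# Hypothetical paths for illustration: CKPT_DIR is where the notebook
# writes checkpoints; BACKUP_DIR stands in for a mounted Drive folder
# (e.g. /content/drive/MyDrive/checkpoints) or a GCS-synced directory.
CKPT_DIR = Path("/tmp/demo_checkpoints")
BACKUP_DIR = Path("/tmp/drive_backup")

def backup_checkpoints(src: Path, dst: Path) -> Path:
    """Copy the checkpoint directory so it survives a Colab disconnect."""
    dst.mkdir(parents=True, exist_ok=True)
    target = dst / src.name
    if target.exists():
        shutil.rmtree(target)  # replace any stale backup
    shutil.copytree(src, target)
    return target
```

Calling `backup_checkpoints(CKPT_DIR, BACKUP_DIR)` after each training milestone means a runtime reset only costs the work since the last copy.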
Use the link for Jupyter Lab as the link for "Connect to a local runtime" in Colab.
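Colab's "Connect to a local runtime" needs a Jupyter server that accepts requests from the Colab origin. A typical launch, per Google's local-runtime instructions, looks like the following (flag spelling varies across Jupyter versions; older releases use `--NotebookApp.*` instead of `--ServerApp.*`, and the port is your choice):

```shell
# Start Jupyter Lab so Colab may connect, then copy the printed
# http://localhost:8888/?token=... URL into Colab's
# "Connect to a local runtime" dialog.
jupyter lab \
  --ServerApp.allow_origin='https://colab.research.google.com' \
  --port=8888 \
  --ServerApp.port_retries=0
```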
- **`sft_qwen3_demo.ipynb`** → Qwen3-0.6B SFT training and evaluation on [OpenAI's GSM8K dataset](https://huggingface.co/datasets/openai/gsm8k)
- **`sft_llama3_demo.ipynb`** → Llama3.1-8B SFT training on [Hugging Face ultrachat_200k dataset](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)
### Reinforcement Learning (GRPO/GSPO) Training

- **`rl_llama3_demo.ipynb`** → GRPO/GSPO training on math dataset (Colab/notebook)
#### GRPO/GSPO Colab Usage

For interactive GRPO or GSPO training in Google Colab or Jupyter:

2. **Enable TPU runtime** (Runtime → Change runtime type → TPU)
3. **Set `LOSS_ALGO`** to `"grpo"` for GRPO or `"gspo-token"` for GSPO
4. **Run cells** to train Llama3.1-8B with GRPO or GSPO on GSM8K dataset
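The `LOSS_ALGO` step above admits exactly two values. A minimal sketch of validating the setting before training (only the `LOSS_ALGO` name and its two values come from this tutorial; the helper is hypothetical):

```python
# Accepted values per the tutorial; anything else should fail fast
# rather than surface as a confusing error mid-training.
VALID_LOSS_ALGOS = ("grpo", "gspo-token")

def check_loss_algo(loss_algo: str) -> str:
    """Return loss_algo unchanged if valid, else raise with a clear message."""
    if loss_algo not in VALID_LOSS_ALGOS:
        raise ValueError(
            f"loss_algo must be one of {VALID_LOSS_ALGOS}, got {loss_algo!r}"
        )
    return loss_algo

LOSS_ALGO = check_loss_algo("gspo-token")  # switch to "grpo" for GRPO
```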
#### GRPO/GSPO Python Script Usage - local runs

> **Note:** GRPO (Group Relative Policy Optimization) optimizes each token, while GSPO (Group Sequence Policy Optimization) optimizes the whole sequence; the difference is controlled by the `loss_algo` parameter. To use GSPO instead of GRPO, add `--loss_algo=gspo-token` to the command.
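As a toy numeric illustration of the token-vs-sequence distinction (made-up log-probabilities, not MaxText code): GRPO forms one importance ratio per token, while GSPO forms a single length-normalized ratio for the whole response.

```python
import math

# Illustrative per-token log-probs for one sampled response under the
# current and old policies (values are made up).
logp_new = [-1.2, -0.7, -2.0]
logp_old = [-1.0, -0.9, -1.8]

# GRPO: one importance ratio per token.
token_ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

# GSPO: a single sequence-level ratio, the geometric mean of the
# per-token ratios (equivalently, exp of the length-normalized
# difference of sequence log-probs).
seq_ratio = math.exp((sum(logp_new) - sum(logp_old)) / len(logp_new))
```

Clipping then acts per token under GRPO but on the one `seq_ratio` under GSPO, which is why the choice of `loss_algo` changes training behavior.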
#### GRPO/GSPO Python Script Usage - cluster runs
For running on clusters, please refer to `maxtext/docs/tutorials/grpo_with_pathways.md`.