
Commit c45c6c5

Merge pull request #2693 from AI-Hypercomputer:gspo_and_fixes
PiperOrigin-RevId: 837267129
2 parents ed517cf + 58f10c9 commit c45c6c5

7 files changed: +382 additions, −346 deletions


README.md

Lines changed: 2 additions & 2 deletions
@@ -22,7 +22,7 @@
MaxText is a high performance, highly scalable, open-source LLM library and reference implementation written in pure Python/[JAX](https://docs.jax.dev/en/latest/jax-101.html) and targeting Google Cloud TPUs and GPUs for training.

- MaxText provides a library of high performance models to choose from, including Gemma, Llama, DeepSeek, Qwen, and Mistral. For each of these models, MaxText supports pre-training (up to tens of thousands of chips) and scalable post-training, with popular techniques like Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO, a type of Reinforcement Learning).
+ MaxText provides a library of high performance models to choose from, including Gemma, Llama, DeepSeek, Qwen, and Mistral. For each of these models, MaxText supports pre-training (up to tens of thousands of chips) and scalable post-training, with popular techniques like Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), and Group Sequence Policy Optimization (GSPO), the latter two being types of Reinforcement Learning.

MaxText achieves high Model FLOPs Utilization (MFU) and tokens/second from single host to very large clusters while staying simple and largely "optimization-free" thanks to the power of JAX and the XLA compiler.

@@ -70,7 +70,7 @@ Our goal is to provide a variety of models (dimension “a”) and techniques (d
Check out these getting started guides:

* [SFT](https://github.com/AI-Hypercomputer/maxtext/blob/main/end_to_end/tpu/llama3.1/8b/run_sft.sh) (Supervised Fine Tuning)
- * [GRPO](https://maxtext.readthedocs.io/en/latest/tutorials/grpo.html) (Group Relative Policy Optimization)
+ * [GRPO / GSPO](https://maxtext.readthedocs.io/en/latest/tutorials/grpo.html) (Group Relative & Group Sequence Policy Optimization – pass `loss_algo=gspo-token` to run GSPO)

### Model library

docs/tutorials/grpo.md

Lines changed: 20 additions & 1 deletion
@@ -20,7 +20,7 @@ This tutorial demonstrates step-by-step instructions for setting up the environm
GRPO is an RL algorithm designed to enhance the reasoning abilities of LLMs. It is a variant of Proximal Policy Optimization (PPO) that reduces memory usage by eliminating the need for a separate value function model. GRPO works by generating multiple responses for a given prompt, evaluating these responses using a reward model, and then calculating a relative advantage based on the group's performance to update the policy.

- We use Tunix as the library for GRPO.
+ We use Tunix as the library for GRPO/GSPO.
And we use vLLM as the library for efficient model inference and generation.

In this tutorial we use a single host TPUVM such as `v6e-8/v5p-8`. Let's get started!
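
As an aside, the group-relative advantage described in the GRPO paragraph above can be sketched in a few lines of Python. This is an illustrative sketch only, not code from this commit or from Tunix/MaxText; the function and variable names are hypothetical, and it simply normalizes each reward against the mean and standard deviation of its prompt group.

```python
import numpy as np

def group_relative_advantage(rewards: np.ndarray, group_size: int, eps: float = 1e-6) -> np.ndarray:
    """Normalize each reward against the other responses sampled for the same prompt.

    `rewards` is a flat array of length num_prompts * group_size, with the
    `group_size` completions for each prompt stored contiguously. All names
    here are hypothetical; this is not the Tunix/MaxText implementation.
    """
    grouped = rewards.reshape(-1, group_size)        # (num_prompts, group_size)
    mean = grouped.mean(axis=1, keepdims=True)       # per-group mean reward
    std = grouped.std(axis=1, keepdims=True)         # per-group spread
    advantages = (grouped - mean) / (std + eps)      # relative advantage within the group
    return advantages.reshape(-1)

# Example: 2 prompts, 4 sampled responses each.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.2, 0.4, 0.6, 0.8])
print(group_relative_advantage(rewards, group_size=4))
```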
@@ -112,3 +112,22 @@ The overview of what this run will do is as follows:
2. Evaluate the policy model's performance on GSM8K math reasoning benchmark.
3. Train the policy model using GRPO.
4. Evaluate the policy model's performance on GSM8K math reasoning benchmark after the post-training with GRPO.

+ ## GSPO (Group Sequence Policy Optimization)
+
+ MaxText can also run the GSPO variant by setting `loss_algo=gspo-token` when invoking `train_rl.py` (or when constructing the pyconfig argv list).
+
+ ## Run GSPO
+
+ Finally, run the following command:
+
+ ```bash
+ python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml \
+ model_name=llama3.1-8b \
+ tokenizer_path=meta-llama/Llama-3.1-8B-Instruct \
+ load_parameters_path=gs://path/to/checkpoint/0/items \
+ run_name=$WORKLOAD \
+ base_output_directory=$OUTPUT_PATH \
+ hf_access_token=$HF_TOKEN \
+ loss_algo=gspo-token
+ ```
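
For the "pyconfig argv list" route mentioned above, a rough sketch is shown below. It is not part of this commit: it assumes `pyconfig.initialize` accepts an argv-style list (program name, config path, then `key=value` overrides) as in other MaxText entry points, and the import path may differ depending on how the repository is installed.

```python
# Hypothetical sketch of the "pyconfig argv list" route; not code from this commit.
# Assumes pyconfig.initialize accepts an argv-style list (program name, config
# path, then key=value overrides); adjust the import (MaxText vs. src.MaxText)
# to match how the repository is installed.
from MaxText import pyconfig

argv = [
    "train_rl",                                        # placeholder program name
    "src/MaxText/configs/rl.yml",                      # RL base config
    "model_name=llama3.1-8b",
    "tokenizer_path=meta-llama/Llama-3.1-8B-Instruct",
    "load_parameters_path=gs://path/to/checkpoint/0/items",
    "loss_algo=gspo-token",                            # select the GSPO loss instead of GRPO
]
config = pyconfig.initialize(argv)                     # downstream RL code reads loss_algo from config
```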

docs/tutorials/grpo_with_pathways.md

Lines changed: 8 additions & 0 deletions
@@ -20,6 +20,14 @@ This tutorial demonstrates step-by-step instructions for setting up the environm
GRPO is an RL algorithm designed to enhance the reasoning abilities of LLMs. It is a variant of Proximal Policy Optimization (PPO) that reduces memory usage by eliminating the need for a separate value function model. GRPO works by generating multiple responses for a given prompt, evaluating these responses using a reward model, and then calculating a relative advantage based on the group's performance to update the policy.

+ **GSPO support**
+ Some workloads prefer Group Sequence Policy Optimization (GSPO), which uses the same infrastructure but a different loss.
+ To switch from GRPO to GSPO, add the following override when invoking `train_rl.py` (or when building the `pyconfig` argv list):
+ ```
+ loss_algo=gspo-token
+ ```
+ No other changes are required; the rest of this tutorial applies equally to GSPO runs.

We use Tunix as the library for GRPO.
And we use vLLM as the library for efficient model inference and generation.

docs/tutorials/how_to_run_colabs.md

Lines changed: 26 additions & 10 deletions
@@ -59,7 +59,7 @@ Upload notebooks or mount your GitHub repo
2. Try:
- `sft_qwen3_demo.ipynb`
- `sft_llama3_demo.ipynb`
- - `grpo_llama3_demo.ipynb`
+ - `rl_llama3_demo.ipynb` (GRPO/GSPO training)

> **Tip:** If Colab disconnects, re-enable TPU and re-run setup cells. Save checkpoints to GCS or Drive.
@@ -125,22 +125,25 @@ Use the link for Jupyter Lab as a link for "Connect to a local runtime" in Colab
- **`sft_qwen3_demo.ipynb`** → Qwen3-0.6B SFT training and evaluation on [OpenAI's GSM8K dataset](https://huggingface.co/datasets/openai/gsm8k)
- **`sft_llama3_demo.ipynb`** → Llama3.1-8B SFT training on [Hugging Face ultrachat_200k dataset](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)

- ### GRPO Training
+ ### Reinforcement Learning (GRPO/GSPO) Training

- - **`grpo_llama3_1_8b_demo.ipynb`** → GRPO training on math dataset (Colab/notebook)
+ - **`rl_llama3_demo.ipynb`** → GRPO/GSPO training on math dataset (Colab/notebook)

- #### GRPO Colab Usage
+ #### GRPO/GSPO Colab Usage

- For interactive GRPO training in Google Colab or Jupyter:
+ For interactive GRPO or GSPO training in Google Colab or Jupyter:

- 1. **Open** `src/MaxText/examples/grpo_llama3_1_8b_demo.ipynb`
+ 1. **Open** `src/MaxText/examples/rl_llama3_demo.ipynb`
2. **Enable TPU runtime** (Runtime → Change runtime type → TPU)
- 3. **Run cells** to train Llama3.1-8B with GRPO on GSM8K dataset
+ 3. **Set `LOSS_ALGO`** to `"grpo"` for GRPO or `"gspo-token"` for GSPO
+ 4. **Run cells** to train Llama3.1-8B with GRPO or GSPO on GSM8K dataset

- #### GRPO Python Script Usage - local runs
+ > **Note:** GRPO (Group Relative Policy Optimization) optimizes each token, while GSPO (Group Sequence Policy Optimization) optimizes the whole sequence. The difference is controlled by the `loss_algo` parameter.
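
The token-versus-sequence distinction in the note above can be illustrated with a small sketch. This is not the Tunix or MaxText loss code; it omits PPO-style clipping and the per-token correction used by `gspo-token`, and the numbers are made up.

```python
import numpy as np

# Illustrative only; not the Tunix/MaxText loss code. Clipping and the
# gspo-token per-token correction are omitted, and the values are made up.
# logp_new / logp_old: per-token log-probabilities of one sampled response
# under the current policy and the policy that generated it.
logp_new = np.array([-1.2, -0.7, -2.1, -0.9])
logp_old = np.array([-1.0, -0.8, -2.0, -1.1])
advantage = 0.5                                      # group-relative advantage for this response

# GRPO-style (token level): one importance ratio per token.
token_ratios = np.exp(logp_new - logp_old)           # shape (T,)
grpo_objective = np.mean(token_ratios * advantage)

# GSPO-style (sequence level): a single length-normalized ratio for the
# whole response, shared by every token position.
seq_ratio = np.exp(np.mean(logp_new - logp_old))     # geometric mean of the token ratios
gspo_objective = seq_ratio * advantage

print(grpo_objective, gspo_objective)
```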

+ #### GRPO/GSPO Python Script Usage - local runs

```bash
- # Llama3.1-8B-Instruct
+ # Llama3.1-8B-Instruct with GRPO (default)
python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml \
--model_name=llama3.1-8b \
--tokenizer_path=meta-llama/Llama-3.1-8B-Instruct \
@@ -149,6 +152,16 @@ python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml \
--base_output_directory=$OUTPUT_PATH \
--hf_access_token=$HF_TOKEN

+ # Llama3.1-8B-Instruct with GSPO
+ python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml \
+ --model_name=llama3.1-8b \
+ --tokenizer_path=meta-llama/Llama-3.1-8B-Instruct \
+ --load_parameters_path=gs://path/to/checkpoint/0/items \
+ --run_name=$WORKLOAD \
+ --base_output_directory=$OUTPUT_PATH \
+ --hf_access_token=$HF_TOKEN \
+ --loss_algo=gspo-token

# Qwen2.5-7B
python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml \
--model_name=qwen2.5-7b \
@@ -158,7 +171,10 @@ python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml \
--base_output_directory=$OUTPUT_PATH \
--hf_access_token=$HF_TOKEN
```
- #### GRPO Python Script Usage - cluster runs

+ > **Note:** To use GSPO instead of GRPO, add `--loss_algo=gspo-token` to the command. GRPO optimizes each token, while GSPO optimizes the whole sequence.

+ #### GRPO/GSPO Python Script Usage - cluster runs

For running on clusters, please refer to `maxtext/docs/tutorials/grpo_with_pathways.md`

src/MaxText/examples/grpo_llama3_1_8b_demo.ipynb

Lines changed: 0 additions & 218 deletions
This file was deleted.
