docs/tutorials/rl.md (26 additions, 19 deletions)
@@ -14,19 +14,22 @@
 limitations under the License.
 -->
-# Try GRPO
+# Reinforcement Learning on Single-Host TPUs
-This tutorial demonstrates step-by-step instructions for setting up the environment and then training the Llama3.1 8B-IT model on the GSM8K math reasoning benchmark using Group Relative Policy Optimization (GRPO). GRPO can enhance your model's problem-solving skills on mathematical word problems, coding problems, etc.
+This tutorial provides step-by-step instructions for setting up the environment and then training the Llama3.1 8B-IT model on the GSM8K math reasoning dataset using a single-host TPU VM such as `v6e-8` or `v5p-8`.
-GRPO is an RL algorithm designed to enhance the reasoning abilities of LLMs. It is a variant of Proximal Policy Optimization (PPO) that reduces memory usage by eliminating the need for a separate value function model. GRPO works by generating multiple responses for a given prompt, evaluating these responses using a reward model, and then calculating a relative advantage based on the group's performance to update the policy.
+We use two RL algorithms, implemented via the Tunix library, to enhance the model's reasoning capabilities:
-We use Tunix as the library for GRPO/GSPO.
-And we use vLLM as the library for efficient model inference and generation.
+* **Group Relative Policy Optimization (GRPO)**: GRPO is an RL algorithm designed to enhance the reasoning abilities of LLMs. It is a variant of Proximal Policy Optimization (PPO) that reduces memory usage by eliminating the need for a separate value function model. GRPO works by generating multiple responses for a given prompt, evaluating these responses using a reward model, and then calculating a relative advantage based on the group's performance to update the policy.
-In this tutorial we use a single host TPUVM such as `v6e-8/v5p-8`. Let's get started!
+* **Group Sequence Policy Optimization (GSPO)**: GSPO is an RL algorithm that improves the training efficiency and performance of LLMs by using sequence-level importance ratios and operations. GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization.
+
+For efficient model inference and response generation during this process, we rely on the vLLM library.
+
+Let's get started!
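As a quick reference for the two algorithms above, the standard formulations (summarized here; this is not MaxText-specific notation) are: GRPO scores each of the `G` sampled responses relative to its group, while GSPO replaces token-level importance ratios with a length-normalized, sequence-level ratio.

```math
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)},
\qquad
s_i(\theta) = \left(\frac{\pi_{\theta}(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)}\right)^{1/|y_i|}
```

Here `r_i` is the reward assigned to response `y_i` for prompt `x`, `A_i` is the group-relative advantage that GRPO plugs into its PPO-style clipped objective, and `s_i(θ)` is the sequence-level ratio that GSPO clips and optimizes in place of per-token ratios.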
 ## Create virtual environment and Install MaxText dependencies
-If you have already completed the [MaxText installation](https://github.com/AI-Hypercomputer/maxtext/blob/main/docs/guides/install_maxtext.md), you can skip to the next section for vLLM and tpu-inference installations. Otherwise, please install MaxText using the following commands before proceeding.
+If you have already completed the [MaxText installation](https://github.com/AI-Hypercomputer/maxtext/blob/main/docs/guides/install_maxtext.md), you can skip to the next section for installing the post-training dependencies. Otherwise, please install `MaxText` using the following commands before proceeding.
@@ -58,11 +61,11 @@ Primarily, it installs `vllm-tpu` which is [vllm](https://github.com/vllm-projec
 ### From GitHub
-You can also locally git clone [tunix](https://github.com/google/tunix) and install using the instructions [here](https://github.com/google/tunix?tab=readme-ov-file#installation). Similarly install [vllm](https://github.com/vllm-project/vllm) and [tpu-inference](https://github.com/vllm-project/tpu-inference) from source following the instructions [here](https://docs.vllm.ai/projects/tpu/en/latest/getting_started/installation/#install-from-source)
+You can also locally git clone [tunix](https://github.com/google/tunix) and install it using the instructions [here](https://github.com/google/tunix?tab=readme-ov-file#installation). Similarly, install [vllm](https://github.com/vllm-project/vllm) and [tpu-inference](https://github.com/vllm-project/tpu-inference) from source following the instructions [here](https://docs.vllm.ai/projects/tpu/en/latest/getting_started/installation/#install-from-source).
-## Setup the following environment variables before running GRPO
+## Setup environment variables
-Setup following environment variables before running GRPO
+Set up the following environment variables before running GRPO/GSPO:

 You can convert a Hugging Face checkpoint to MaxText format using the `src/MaxText/utils/ckpt_conversion/to_maxtext.py` script. This is useful if you have a pre-trained model from Hugging Face that you want to use with MaxText.
-First, ensure you have the necessary dependencies installed. Then, run the conversion script on a CPU machine. For large models, it is recommended to use the --lazy_load_tensors flag to reduce memory usage during conversion. This command will download the Hugging Face model and convert it to the MaxText format, saving it to the specified GCS bucket.
+First, ensure you have the necessary dependencies installed. Then, run the conversion script on a CPU machine. For large models, it is recommended to use the `--lazy_load_tensors` flag to reduce memory usage during conversion. This command will download the Hugging Face model and convert it to the MaxText format, saving it to the specified GCS bucket.
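For illustration, a conversion run might look like the sketch below. Only the script path and the `--lazy_load_tensors` flag come from the text above; the remaining argument names and values are placeholders, so check the script's help output for the exact interface.

```bash
# Hypothetical sketch: argument names other than --lazy_load_tensors are placeholders.
# Run on a CPU machine; the converted checkpoint is written to the GCS bucket you specify.
python3 src/MaxText/utils/ckpt_conversion/to_maxtext.py \
  --model_name=<hf-model-id> \
  --hf_access_token=<your-hf-token> \
  --base_output_directory=gs://<your-bucket>/<output-path> \
  --lazy_load_tensors
```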
docs/tutorials/rl_on_multi_host.md (31 additions, 17 deletions)
@@ -14,33 +14,28 @@
 limitations under the License.
 -->
-# Try GRPO with Pathways!
+# Reinforcement Learning on Multi-Host TPUs
-This tutorial demonstrates step-by-step instructions for setting up the environment and then training the Llama3.1 70B-IT model on the GSM8K math reasoning benchmark using Group Relative Policy Optimization (GRPO). GRPO can enhance your model's problem-solving skills on mathematical word problems, coding problems, etc.
+This tutorial provides step-by-step instructions for setting up the environment and then training the Llama3.1 70B-IT model on the GSM8K math reasoning dataset using [Pathways for orchestration](https://cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/pathways-intro) on multi-host TPU VMs such as `v5p-128`.
-GRPO is an RL algorithm designed to enhance the reasoning abilities of LLMs. It is a variant of Proximal Policy Optimization (PPO) that reduces memory usage by eliminating the need for a separate value function model. GRPO works by generating multiple responses for a given prompt, evaluating these responses using a reward model, and then calculating a relative advantage based on the group's performance to update the policy.
+We use two RL algorithms, implemented via the Tunix library, to enhance the model's reasoning capabilities:
-GSPO support
-Some workloads prefer Group Sequence Policy Optimization (GSPO), which uses the same infrastructure but a different loss.
-To switch from GRPO to GSPO, add the following override when invoking `train_rl.py` (or when building the `pyconfig` argv list):
-```
-loss_algo=gspo-token
-```
-No other changes are required—the rest of this tutorial applies equally to GSPO runs.
+* **Group Relative Policy Optimization (GRPO)**: GRPO is an RL algorithm designed to enhance the reasoning abilities of LLMs. It is a variant of Proximal Policy Optimization (PPO) that reduces memory usage by eliminating the need for a separate value function model. GRPO works by generating multiple responses for a given prompt, evaluating these responses using a reward model, and then calculating a relative advantage based on the group's performance to update the policy.
+
+* **Group Sequence Policy Optimization (GSPO)**: GSPO is an RL algorithm that improves the training efficiency and performance of LLMs by using sequence-level importance ratios and operations. GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization.
-We use Tunix as the library for GRPO.
-And we use vLLM as the library for efficient model inference and generation.
+For efficient model inference and response generation during this process, we rely on the vLLM library.
-Furthermore, we use Pathways for [orchestration](https://cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/pathways-intro). Using Pathways, you can also run GRPO in a disaggregated mode where the trainer and the samplers are running on separate mesh. Try out the following recipe `v5p-64`. You can submit jobs to a Pathways enabled GKE cluster.
+Let's get started!
 ## Create virtual environment and Install MaxText dependencies
 Follow instructions in [Install MaxText](https://github.com/AI-Hypercomputer/maxtext/blob/main/docs/guides/install_maxtext.md), but we recommend creating the virtual environment outside the `maxtext` directory.
-## Setup the following environment variables before running GRPO
+## Setup environment variables
-Setup following environment variables
+Set up the following environment variables:
 ```bash
 # -- Model configuration --
@@ -118,9 +113,11 @@ bash dependencies/scripts/docker_build_dependency_image.sh MODE=post-training PO
-Please create a pathways ready GKE cluster as described [here](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/create-gke-cluster), and you can submit the `train_rl.py` script via [XPK](https://github.com/AI-Hypercomputer/xpk)
+Please create a Pathways-ready GKE cluster as described [here](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/create-gke-cluster), and then submit the `train_rl.py` script via [XPK](https://github.com/AI-Hypercomputer/xpk).
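For orientation, a submission sketched with XPK's generic `workload create` interface might look like the block below. Every value is a placeholder (cluster, project, zone, image, workload name), the `train_rl.py` path and its overrides are left unspecified, and Pathways-enabled clusters may require a Pathways-specific XPK subcommand, so confirm the exact invocation against the XPK documentation.

```bash
# Hypothetical sketch only: all values below are placeholders for your own setup.
# The TPU type mirrors the v5p-128 topology used in this tutorial.
xpk workload create \
  --cluster=<your-pathways-gke-cluster> \
  --project=<your-gcp-project> \
  --zone=<your-zone> \
  --docker-image=<your-post-training-image> \
  --tpu-type=v5p-128 \
  --num-slices=1 \
  --workload=rl-llama3-70b \
  --command="python3 <path-to>/train_rl.py <config-and-overrides>"
```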