Commit e1158b3

Add GSPO and small fixes
Signed-off-by: Vladimir Suvorov <[email protected]>
1 parent 238a410 commit e1158b3

6 files changed: +209 −157 lines

README.md

Lines changed: 2 additions & 2 deletions

@@ -22,7 +22,7 @@

 MaxText is a high performance, highly scalable, open-source LLM library and reference implementation written in pure Python/[JAX](https://docs.jax.dev/en/latest/jax-101.html) and targeting Google Cloud TPUs and GPUs for training.

-MaxText provides a library of high performance models to choose from, including Gemma, Llama, DeepSeek, Qwen, and Mistral. For each of these models, MaxText supports pre-training (up to tens of thousands of chips) and scalable post-training, with popular techniques like Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO, a type of Reinforcement Learning).
+MaxText provides a library of high performance models to choose from, including Gemma, Llama, DeepSeek, Qwen, and Mistral. For each of these models, MaxText supports pre-training (up to tens of thousands of chips) and scalable post-training, with popular techniques like Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), and Group Sequence Policy Optimization (GSPO), the latter two being types of Reinforcement Learning.

 MaxText achieves high Model FLOPs Utilization (MFU) and tokens/second from single host to very large clusters while staying simple and largely "optimization-free" thanks to the power of JAX and the XLA compiler.

@@ -70,7 +70,7 @@ Our goal is to provide a variety of models (dimension “a”) and techniques (d
 Check out these getting started guides:

 * [SFT](https://github.com/AI-Hypercomputer/maxtext/blob/main/end_to_end/tpu/llama3.1/8b/run_sft.sh) (Supervised Fine Tuning)
-* [GRPO](https://maxtext.readthedocs.io/en/latest/tutorials/grpo.html) (Group Relative Policy Optimization)
+* [GRPO / GSPO](https://maxtext.readthedocs.io/en/latest/tutorials/grpo.html) (Group Relative & Group Sequence Policy Optimization – pass `loss_algo=gspo-token` to run GSPO)

 ### Model library

docs/tutorials/grpo.md

Lines changed: 20 additions & 1 deletion

@@ -20,7 +20,7 @@ This tutorial demonstrates step-by-step instructions for setting up the environm

 GRPO is an RL algorithm designed to enhance the reasoning abilities of LLMs. It is a variant of Proximal Policy Optimization (PPO) that reduces memory usage by eliminating the need for a separate value function model. GRPO works by generating multiple responses for a given prompt, evaluating these responses using a reward model, and then calculating a relative advantage based on the group's performance to update the policy.

-We use Tunix as the library for GRPO.
+We use Tunix as the library for GRPO/GSPO.
 And we use vLLM as the library for efficient model inference and generation.

 In this tutorial we use a single host TPUVM such as `v6e-8/v5p-8`. Let's get started!
@@ -66,3 +66,22 @@ The overview of what this run will do is as follows:
 2. Evaluate the policy model's performance on GSM8K math reasoning benchmark.
 3. Train the policy model using GRPO.
 4. Evaluate the policy model's performance on GSM8K math reasoning benchmark after the post-training with GRPO.
+
+## GSPO (Group Sequence Policy Optimization)
+MaxText can also run the GSPO variant by setting `loss_algo=gspo-token` when invoking `train_rl.py` (or when constructing the pyconfig argv list).
+
+## Run GSPO
+
+Finally, run the command:
+
+```
+python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml \
+model_name=llama3.1-8b \
+tokenizer_path=meta-llama/Llama-3.1-8B-Instruct \
+load_parameters_path=gs://path/to/checkpoint/0/items \
+run_name=$WORKLOAD \
+base_output_directory=$OUTPUT_PATH \
+hf_access_token=$HF_TOKEN \
+loss_algo=gspo-token
+```

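For intuition about what the `loss_algo=gspo-token` override changes: GRPO clips a separate importance ratio for every generated token, while GSPO uses one length-normalized ratio per sampled sequence. The sketch below is illustrative only, written in plain NumPy with made-up array names and shapes; it is not the Tunix or MaxText implementation.

```python
# Illustrative contrast between GRPO's per-token and GSPO's per-sequence
# importance ratios. Not the Tunix/MaxText code; shapes and names are assumptions.
import numpy as np

def grpo_token_ratios(logp_new, logp_old):
  """GRPO-style: one ratio pi_new / pi_old per token, shape [group, tokens]."""
  return np.exp(logp_new - logp_old)

def gspo_sequence_ratios(logp_new, logp_old, mask):
  """GSPO-style: a single length-normalized ratio per sequence,
  (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|), broadcast over token positions."""
  lengths = mask.sum(axis=-1, keepdims=True)                      # |y| per response
  seq_log_ratio = ((logp_new - logp_old) * mask).sum(axis=-1, keepdims=True) / lengths
  return np.exp(seq_log_ratio) * np.ones_like(mask)

# Toy group of 2 responses with 4 valid tokens each.
rng = np.random.default_rng(0)
logp_old = rng.normal(size=(2, 4))
logp_new = logp_old + 0.1 * rng.normal(size=(2, 4))
mask = np.ones((2, 4))
print(grpo_token_ratios(logp_new, logp_old))            # varies token by token
print(gspo_sequence_ratios(logp_new, logp_old, mask))   # constant within each response
```

In both cases the clipped ratio is multiplied by the group-relative advantage; roughly speaking, the `gspo-token` variant applies the sequence-level ratio at every token position.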
docs/tutorials/grpo_with_pathways.md

Lines changed: 8 additions & 0 deletions

@@ -20,6 +20,14 @@ This tutorial demonstrates step-by-step instructions for setting up the environm

 GRPO is an RL algorithm designed to enhance the reasoning abilities of LLMs. It is a variant of Proximal Policy Optimization (PPO) that reduces memory usage by eliminating the need for a separate value function model. GRPO works by generating multiple responses for a given prompt, evaluating these responses using a reward model, and then calculating a relative advantage based on the group's performance to update the policy.

+## GSPO support
+Some workloads prefer Group Sequence Policy Optimization (GSPO), which uses the same infrastructure but a different loss.
+To switch from GRPO to GSPO, add the following override when invoking `train_rl.py` (or when building the `pyconfig` argv list):
+```
+loss_algo=gspo-token
+```
+No other changes are required; the rest of this tutorial applies equally to GSPO runs.

 We use Tunix as the library for GRPO.
 And we use vLLM as the library for efficient model inference and generation.

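The tutorial above describes "calculating a relative advantage based on the group's performance". As a rough illustration (made-up rewards, not MaxText's reward code), each response's advantage is its reward standardized against the other responses sampled for the same prompt:

```python
# Rough illustration of group-relative advantages; not the MaxText/Tunix implementation.
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
  """A_i = (r_i - mean(r)) / (std(r) + eps), computed within one prompt's group."""
  rewards = np.asarray(rewards, dtype=np.float32)
  return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 responses sampled for one prompt, scored 1.0 when the final answer is correct.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # correct answers get positive advantage
```

Because the baseline is simply the group mean, no separate value model is needed, which is the memory saving mentioned above.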
src/MaxText/examples/grpo_llama3_1_8b_demo.ipynb

Lines changed: 143 additions & 74 deletions
@@ -4,19 +4,18 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# GRPO Llama3.1-8B Demo: Direct Function Call\n",
+"# GRPO/GSPO Llama3.1-8B Demo\n",
 "\n",
-"This notebook demonstrates GRPO training by directly calling the `rl_train` function from `rl_trainer.py`.\n",
+"This notebook demonstrates GRPO (Group Relative Policy Optimization) or GSPO (Group Sequence Policy Optimization) training using the unified `rl_train` function; the only difference between the two is the loss function, which is passed as a parameter.\n",
 "\n",
-"## What is GRPO?\n",
+"## What is GRPO/GSPO?\n",
 "\n",
-"GRPO (Group Relative Policy Optimization) is an RL algorithm that enhances reasoning abilities of LLMs by:\n",
+"GRPO and GSPO are RL algorithms that enhance the reasoning abilities of LLMs by:\n",
 "1. Generating multiple responses for each prompt\n",
 "2. Evaluating responses using reward models \n",
 "3. Calculating relative advantages to update the policy\n",
 "\n",
-"\n",
-"This notebook imports and calls the `rl_train` function \n",
+"The difference is in the loss function: it either optimizes each token (GRPO) or the whole sequence (GSPO).\n",
 "\n",
 "## Hardware Requirements\n",
 "\n",
@@ -28,9 +27,24 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Setup\n",
+"### Get Your Hugging Face Token\n",
+"\n",
+"To access model checkpoints from the Hugging Face Hub, you need to authenticate with a personal access token.\n",
+"\n",
+"**Follow these steps to get your token:**\n",
+"\n",
+"1. **Navigate to the Access Tokens page** in your Hugging Face account settings. You can go there directly by visiting this URL:\n",
+" * [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)\n",
+"\n",
+"2. **Create a new token** by clicking the **\"+ Create new token\"** button.\n",
 "\n",
-"Install dependencies and set up the environment:"
+"3. **Give your token a name** and assign it a **`read` role**. The `read` role is sufficient for downloading models.\n",
+"\n",
+"4. **Copy the generated token**. You will need to paste it in the next step.\n",
+"\n",
+"**Store your token:**\n",
+"\n",
+"Paste your token into the code cell below."
 ]
 },
 {
@@ -39,30 +53,17 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"# Clone MaxText repository\n",
-"!git clone https://github.com/AI-Hypercomputer/maxtext.git\n",
-"%cd maxtext"
+"HF_TOKEN = \"\" # Paste your Hugging Face token here\n"
 ]
 },
 {
-"cell_type": "code",
-"execution_count": null,
+"cell_type": "markdown",
 "metadata": {},
-"outputs": [],
 "source": [
-"# Install dependencies\n",
-"!chmod +x setup.sh\n",
-"!./setup.sh\n",
-"\n",
-"# Install GRPO-specific dependencies\n",
-"!./src/MaxText/examples/install_tunix_vllm_requirement.sh\n",
-"\n",
-"# Install additional requirements\n",
-"%pip install --force-reinstall numpy==2.1.2\n",
-"%pip install nest_asyncio\n",
+"## Setup\n",
 "\n",
-"import nest_asyncio\n",
-"nest_asyncio.apply() # Fix for Colab event loop"
+"Install dependencies and set up the environment:\n",
+"https://maxtext.readthedocs.io/latest/tutorials/grpo.html#from-github"
 ]
 },
 {
@@ -71,9 +72,23 @@
 "source": [
 "## Configuration\n",
 "\n",
-"Set up the training parameters:"
+"Set up the training parameters. We do not use Pathways and we run on a single host. Defaults are hardcoded for Llama3.1-8B:"
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"%cd ~/maxtext/src/ # make sure we are in the right directory"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": []
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -82,20 +97,48 @@
 "source": [
 "# Configuration for GRPO training\n",
 "import os\n",
+"from huggingface_hub import login # used below to authenticate with Hugging Face\n",
+"import MaxText\n",
+"\n",
+"# Set up paths (adjust if needed)\n",
+"MAXTEXT_REPO_ROOT = os.path.dirname(MaxText.__file__)\n",
+"RUN_NAME=\"grpo_test\"\n",
+"# Hardcoded defaults for Llama3.1-8B\n",
+"MODEL_NAME = \"llama3.1-8b\"\n",
+"HF_REPO_ID = \"meta-llama/Llama-3.1-8B-Instruct\"\n",
+"CHAT_TEMPLATE_PATH = f\"{MAXTEXT_REPO_ROOT}/examples/chat_templates/gsm8k_rl.json\"\n",
+"LOSS_ALGO=\"grpo\" # or \"gspo-token\" if you want to use GSPO\n",
+"\n",
+"# Required: Set these before running\n",
+"MODEL_CHECKPOINT_PATH = \"\" # Update this!\n",
+"if not MODEL_CHECKPOINT_PATH:\n",
+" raise RuntimeError(\"MODEL_CHECKPOINT_PATH is not set\")\n",
+"\n",
+"OUTPUT_DIRECTORY = \"/tmp/grpo_output\" # Update this!\n",
+"os.environ[\"HF_TOKEN\"] = HF_TOKEN\n",
 "\n",
-"# Set up paths\n",
-"MAXTEXT_REPO_ROOT = os.path.expanduser(\"~\") + \"/maxtext\"\n",
-"print(f\"MaxText Home directory: {MAXTEXT_REPO_ROOT}\")\n",
+"if HF_TOKEN:\n",
+" login(token=HF_TOKEN)\n",
+" print(\"Authenticated with Hugging Face\")\n",
+"else:\n",
+" print(\"Authentication failed: Hugging Face token not set\")\n",
 "\n",
-"# Training configuration\n",
-"MODEL_CHECKPOINT_PATH = \"gs://maxtext-model-checkpoints/llama3.1-8b/2025-01-23-19-04/scanned/0/items\"\n",
-"OUTPUT_DIRECTORY = \"/tmp/grpo_output\"\n",
+"# Optional: Override training parameters\n",
 "STEPS = 10 # Reduced for demo purposes\n",
-"HF_TOKEN = os.environ.get(\"HF_TOKEN\", \"your_hf_token_here\")\n",
+"PER_DEVICE_BATCH_SIZE = 1\n",
+"LEARNING_RATE = 3e-6\n",
+"NUM_GENERATIONS = 2\n",
+"GRPO_BETA = 0.08\n",
+"GRPO_EPSILON = 0.2\n",
+"CHIPS_PER_VM = 1\n",
 "\n",
-"print(f\"Model checkpoint: {MODEL_CHECKPOINT_PATH}\")\n",
-"print(f\"Output directory: {OUTPUT_DIRECTORY}\")\n",
-"print(f\"Training steps: {STEPS}\")"
+"print(f\"📁 MaxText Home: {MAXTEXT_REPO_ROOT}\")\n",
+"print(f\"🤖 Model: {MODEL_NAME}\")\n",
+"print(f\"📦 Checkpoint: {MODEL_CHECKPOINT_PATH}\")\n",
+"print(f\"💾 Output: {OUTPUT_DIRECTORY}\")\n",
+"print(f\"🔑 HF Token: {'✅ Set' if HF_TOKEN else '❌ Missing - set HF_TOKEN env var'}\")\n",
+"print(f\"📊 Steps: {STEPS}\")\n",
+"print(f\"Loss Algorithm: {LOSS_ALGO}\")"
 ]
 },
 {
@@ -104,24 +147,25 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"# Import GRPO training function directly\n",
-"import sys\n",
+"# Import required modules\n",
 "import os\n",
+"import sys\n",
 "from pathlib import Path\n",
 "\n",
 "# Add MaxText to Python path\n",
-"maxtext_path = Path(MAXTEXT_REPO_ROOT) / \"src\" / \"MaxText\"\n",
+"maxtext_path = Path(MAXTEXT_REPO_ROOT)\n",
 "sys.path.insert(0, str(maxtext_path))\n",
 "\n",
-"# Import required modules\n",
-"from MaxText import pyconfig\n",
-"from MaxText.train_rl import rl_train\n",
+"from MaxText import pyconfig, max_utils\n",
+"from MaxText.rl.train_rl import rl_train, setup_configs_and_devices\n",
+"import jax\n",
 "\n",
-"print(\"✅ Successfully imported GRPO training function\")\n",
-"print(f\"📁 MaxText path: {maxtext_path}\")\n",
-"print(\"\\n\" + \"=\"*80)\n",
-"print(\"Starting GRPO Training...\")\n",
-"print(\"=\"*80)"
+"# Environment settings for JAX and vLLM startup\n",
+"os.environ[\"TF_CPP_MIN_LOG_LEVEL\"] = \"0\"\n",
+"os.environ[\"SKIP_JAX_PRECOMPILE\"] = \"1\" # Faster startup for vLLM\n",
+"\n",
+"print(\"✅ Successfully imported modules\")\n",
+"print(f\"📁 MaxText path: {maxtext_path}\")"
 ]
 },
 {
@@ -131,28 +175,40 @@
 "outputs": [],
 "source": [
 "# Build configuration for GRPO training\n",
+"config_file = os.path.join(MAXTEXT_REPO_ROOT, \"configs/rl.yml\")\n",
+"\n",
+"# Verify chat template exists\n",
+"if not os.path.exists(CHAT_TEMPLATE_PATH):\n",
+" raise FileNotFoundError(f\"Chat template not found: {CHAT_TEMPLATE_PATH}\")\n",
+"\n",
+"# Build argv list for pyconfig.initialize()\n",
 "config_argv = [\n",
-" \"\", # Placeholder for argv[0]\n",
-" \"src/MaxText/configs/grpo.yml\", # Base config\n",
-" f\"model_name=llama3.1-8b\",\n",
-" f\"tokenizer_path=meta-llama/Llama-3.1-8B-Instruct\",\n",
+" \"\", # argv[0] placeholder\n",
+" config_file,\n",
+" f\"model_name={MODEL_NAME}\",\n",
+" f\"tokenizer_path={HF_REPO_ID}\",\n",
+" f\"run_name={RUN_NAME}\",\n",
+" f\"chat_template_path={CHAT_TEMPLATE_PATH}\",\n",
 " f\"load_parameters_path={MODEL_CHECKPOINT_PATH}\",\n",
 " f\"base_output_directory={OUTPUT_DIRECTORY}\",\n",
 " f\"hf_access_token={HF_TOKEN}\",\n",
 " f\"steps={STEPS}\",\n",
-" \"per_device_batch_size=1\",\n",
-" \"learning_rate=3e-6\",\n",
-" \"num_generations=2\",\n",
-" \"grpo_beta=0.08\",\n",
-" \"grpo_epsilon=0.2\",\n",
-" \"chips_per_vm=4\"\n",
+" f\"per_device_batch_size={PER_DEVICE_BATCH_SIZE}\",\n",
+" f\"learning_rate={LEARNING_RATE}\",\n",
+" f\"num_generations={NUM_GENERATIONS}\",\n",
+" f\"grpo_beta={GRPO_BETA}\",\n",
+" f\"grpo_epsilon={GRPO_EPSILON}\",\n",
+" f\"chips_per_vm={CHIPS_PER_VM}\",\n",
+" f\"loss_algo={LOSS_ALGO}\",\n",
+" \"use_pathways=False\"\n",
 "]\n",
 "\n",
-"# Create configuration object\n",
-"config = pyconfig.Config()\n",
-"config.parse_flags(config_argv)\n",
+"# Initialize configuration\n",
+"print(f\"🔧 Initializing configuration from: {config_file}\")\n",
+"config = pyconfig.initialize(config_argv)\n",
+"max_utils.print_system_information()\n",
 "\n",
-"print(\"✅ Configuration created successfully\")\n",
+"print(\"\\n✅ Configuration initialized successfully\")\n",
 "print(f\"📊 Training steps: {config.steps}\")\n",
 "print(f\"📁 Output directory: {config.base_output_directory}\")\n",
 "print(f\"🤖 Model: {config.model_name}\")"
@@ -164,33 +220,46 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"# Execute GRPO training directly\n",
+"# Execute GRPO/GSPO training\n",
+"print(\"\\n\" + \"=\"*80)\n",
+"print(\"🚀 Starting Training...\")\n",
+"print(\"=\"*80)\n",
 "try:\n",
-" # Call the rl_train function\n",
-" grpo_trainer, rl_cluster = rl_train(config)\n",
+" # Call the rl_train function (it handles everything internally)\n",
+" rl_train(config)\n",
 " \n",
 " print(\"\\n\" + \"=\"*80)\n",
-" print(\"GRPO Training Completed Successfully!\")\n",
+" print(\"✅ Training Completed Successfully!\")\n",
 " print(\"=\"*80)\n",
-" print(f\"📁 Checkpoints and logs saved to: {config.base_output_directory}\")\n",
-" print(f\"🎯 Final model ready for inference!\")\n",
+" print(f\"📁 Checkpoints saved to: {config.checkpoint_dir}\")\n",
+" print(f\"📊 TensorBoard logs: {config.tensorboard_dir}\")\n",
+" print(f\"🎯 Model ready for inference!\")\n",
 " \n",
 "except Exception as e:\n",
 " print(\"\\n\" + \"=\"*80)\n",
-" print(\" GRPO Training Failed!\")\n",
+" print(\"❌ Training Failed!\")\n",
 " print(\"=\"*80)\n",
 " print(f\"Error: {str(e)}\")\n",
-" print(\"\\nPlease check the error message and try again.\")"
+" import traceback\n",
+" traceback.print_exc()\n",
+" print(\"\\n💡 Common issues:\")\n",
+" print(\" - Check that MODEL_CHECKPOINT_PATH points to a valid checkpoint\")\n",
+" print(\" - Ensure HF_TOKEN environment variable is set\")\n",
+" print(\" - Verify OUTPUT_DIRECTORY is writable\")\n",
+" print(\" - Check hardware requirements (TPU/GPU availability)\")"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### 📚 **Learn More**\n",
-"- See `src/MaxText/examples/grpo_runner.py` for CLI usage\n",
-"- Check `src/MaxText/configs/grpo.yml` for configuration options\n",
-"- Read `src/MaxText/examples/README.md` for more examples"
+"## 📚 Learn More\n",
+"\n",
+"- **CLI Usage**: Run `python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml model_name=llama3.1-8b ...`\n",
+"- **Configuration**: See `src/MaxText/configs/rl.yml` for all available options\n",
+"- **Documentation**: Check `src/MaxText/rl/train_rl.py` for the `rl_train` function implementation\n",
+"- **Examples**: See other examples in `src/MaxText/examples/`"
 ]
 }
 ],

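A closing note on the notebook above: switching it from GRPO to GSPO is a one-variable change in its configuration cell, since `config_argv` already forwards `loss_algo` to MaxText. A sketch of that change; nothing else in the notebook needs to be touched:

```python
# Sketch: re-run the notebook's configuration cell with the GSPO loss selected.
LOSS_ALGO = "gspo-token"  # the notebook's default is "grpo"

# config_argv already contains f"loss_algo={LOSS_ALGO}", so rebuilding it and
# re-running pyconfig.initialize(config_argv) followed by rl_train(config)
# launches a GSPO run instead of a GRPO run.
print(f"loss_algo={LOSS_ALGO}")  # -> loss_algo=gspo-token, the same override used by the CLI examples
```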