# GRPO

We provide a reference GRPO configuration for math benchmarks using the [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) dataset.

You can read about the details of the GRPO implementation in the [GRPO guide](../../guides/grpo.md).

## GRPO Single Node

To run GRPO on a single GPU for `Qwen/Qwen2.5-1.5B`:

```sh
# Run the GRPO math example using the default ~1B model (Qwen/Qwen2.5-1.5B)
uv run python examples/run_grpo_math.py
```

By default, this uses the configuration in `examples/configs/grpo_math_1B.yaml`. You can customize parameters with command-line overrides. For example, to run on 8 GPUs:

```sh
# Run the GRPO math example on 8 GPUs
uv run python examples/run_grpo_math.py \
    cluster.gpus_per_node=8
```

You can override any of the parameters listed in the YAML configuration file. For example:

```sh
uv run python examples/run_grpo_math.py \
    policy.model_name="meta-llama/Llama-3.2-1B-Instruct" \
    checkpointing.checkpoint_dir="results/llama1b_math" \
    logger.wandb_enabled=True \
    logger.wandb.name="grpo-llama1b_math" \
    logger.num_val_samples_to_print=10
```
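
Any key present in the YAML file can be overridden this way. To see the full set of available keys, inspect the config file directly:

```sh
# Print the default single-node config to see every overridable key
cat examples/configs/grpo_math_1B.yaml
```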

The default configuration uses the DTensor training backend. We also provide `examples/configs/grpo_math_1B_megatron.yaml`, which is set up to use the Megatron backend out of the box.

To train using this config on a single GPU:

```sh
# Run the GRPO math example on 1 GPU using the Megatron backend
uv run python examples/run_grpo_math.py \
    --config examples/configs/grpo_math_1B_megatron.yaml
```

For additional details on supported backends and how to configure the training backend to suit your setup, refer to the [Training Backends documentation](../../design-docs/training-backends.md).
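
The `--config` flag composes with the same dotted overrides shown above. As a sketch that combines only options already shown on this page, you could scale the Megatron config to all 8 GPUs of a node:

```sh
# Megatron backend config, scaled to 8 GPUs via a command-line override
uv run python examples/run_grpo_math.py \
    --config examples/configs/grpo_math_1B_megatron.yaml \
    cluster.gpus_per_node=8
```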

## GRPO Multi-Node

```sh
# Run from the root of the NeMo RL repo
NUM_ACTOR_NODES=2

# grpo_math_8B uses the Llama-3.1-8B-Instruct model
COMMAND="uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml cluster.num_nodes=${NUM_ACTOR_NODES} checkpointing.checkpoint_dir='results/llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='grpo-llama8b_math'" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
  --nodes=${NUM_ACTOR_NODES} \
  --account=YOUR_ACCOUNT \
  --job-name=YOUR_JOBNAME \
  --partition=YOUR_PARTITION \
  --time=4:0:0 \
  --gres=gpu:8 \
  ray.sub
```

The required `CONTAINER` can be built by following the instructions in the [Docker documentation](../../docker.md).
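
As a rough sketch only (the authoritative build steps, targets, and Dockerfile location are in that documentation; the image tag below is a placeholder):

```sh
# Build the image from the repo root, then point the submission script at it
docker build -t nemo-rl:latest .
CONTAINER=nemo-rl:latest
```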

## GRPO Qwen2.5-32B

This section outlines how to run GRPO for Qwen2.5-32B with a 16k sequence length.

```sh
# Run from the root of the NeMo RL repo
NUM_ACTOR_NODES=32

# Download Qwen2.5-32B before the job starts to avoid spending time downloading during the training loop
HF_HOME=/path/to/hf_home huggingface-cli download Qwen/Qwen2.5-32B

# Ensure HF_HOME is included in your MOUNTS
HF_HOME=/path/to/hf_home \
COMMAND="uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml policy.model_name='Qwen/Qwen2.5-32B' policy.generation.vllm_cfg.tensor_parallel_size=4 policy.max_total_sequence_length=16384 cluster.num_nodes=${NUM_ACTOR_NODES} policy.dtensor_cfg.enabled=True policy.dtensor_cfg.tensor_parallel_size=8 policy.dtensor_cfg.sequence_parallel=True policy.dtensor_cfg.activation_checkpointing=True checkpointing.checkpoint_dir='results/qwen2.5-32b' logger.wandb_enabled=True logger.wandb.name='qwen2.5-32b'" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
  --nodes=${NUM_ACTOR_NODES} \
  --account=YOUR_ACCOUNT \
  --job-name=YOUR_JOBNAME \
  --partition=YOUR_PARTITION \
  --time=4:0:0 \
  --gres=gpu:8 \
  ray.sub
```
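
Before submitting, you can confirm the weights actually landed in the shared cache with the standard Hugging Face CLI:

```sh
# Scan the cache to verify Qwen/Qwen2.5-32B is fully downloaded
HF_HOME=/path/to/hf_home huggingface-cli scan-cache
```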

## GRPO Multi-Turn

We also support multi-turn generation and training (tool use, games, and so on). As a reference example, the following trains a model to play a sliding puzzle game:

```sh
uv run python examples/run_grpo_sliding_puzzle.py
```
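
This script is driven by a YAML config like the math example, so the same style of dotted overrides should apply. A sketch reusing only override keys shown earlier on this page:

```sh
# Scale the sliding puzzle run to 8 GPUs and enable Weights & Biases logging
uv run python examples/run_grpo_sliding_puzzle.py \
    cluster.gpus_per_node=8 \
    logger.wandb_enabled=True
```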