# An In-Depth Walkthrough of DAPO in NeMo RL

This guide covers the [Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO)](https://arxiv.org/pdf/2503.14476) implementation in NeMo RL.

DAPO introduces four key improvements over Group Relative Policy Optimization (GRPO):
1. **Clip-Higher**, which promotes the diversity of the system and avoids entropy collapse
2. **Dynamic Sampling**, which improves training efficiency and stability
3. **Token-Level Policy Gradient Loss**, which is critical in long-CoT RL scenarios
4. **Overlong Reward Shaping**, which reduces reward noise and stabilizes training

This document focuses on DAPO-specific features: Dynamic Sampling and Overlong Reward Shaping. For foundational concepts on GRPO, including data handling, policy training, generation, and loss functions, see the [NeMo RL GRPO Guide](grpo.md).

## Quickstart: Launch a DAPO Run

To get started quickly, use the example configuration [examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml](../../examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml). You can launch this using the same script as GRPO:

```bash
uv run examples/run_grpo_math.py --config examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml {overrides}
```
**Reminder**: Don't forget to set `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You will also need to run `huggingface-cli login` for Llama models.

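
For example, a typical environment setup might look like the following; the paths and the API key are placeholders for your own values:

```bash
# Placeholder paths/keys -- replace with your own values.
export HF_HOME=/path/to/hf_cache
export WANDB_API_KEY=<your_wandb_api_key>
export HF_DATASETS_CACHE=/path/to/hf_datasets_cache  # optional
huggingface-cli login  # needed for gated checkpoints such as Llama
```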

## Dynamic Sampling

Standard GRPO trains on all generated responses, even when every response within a prompt group receives an identical reward and therefore provides zero gradient signal. Dynamic sampling keeps only the groups with diverse rewards (`std > 0`) and accumulates them across batches until the target batch size is reached. Enable it by setting `use_dynamic_sampling=True` in your configuration. For implementation details, see the [`dynamic_sampling`](../../nemo_rl/algorithms/grpo.py) function.

**Algorithm**: For each training step (a minimal code sketch follows this list):
| 29 | + |
| 30 | +1. Sample `batch_multiplier × num_prompts_per_step` prompts from the dataset. The default value of `batch_multiplier` is 1. |
| 31 | +2. Generate `num_generations_per_prompt` responses per prompt and compute rewards. |
| 32 | +3. Compute the baseline and standard deviation for each prompt group. |
| 33 | +4. Filter prompt groups where `std > 0`. |
| 34 | +5. Store these prompts in a cache until reaching the target training batch size of `num_prompts_per_step × num_generations_per_prompt` samples. |
| 35 | +6. Samples are accumulated until the maximum number of allowed batches (`dynamic_sampling_max_gen_batches`) is reached. If the cache still does not meet the target rollout batch size at that point, an error is raised. To resolve this, consider adjusting parameters such as `num_prompts_per_step` or `num_generations_per_prompt` to increase sample diversity, or revisit the complexity of your data. |
| 36 | +7. Perform training on the collected samples with nonzero standard deviation |
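
The loop below is a minimal, self-contained sketch of these steps. It is for illustration only: the real logic lives in the [`dynamic_sampling`](../../nemo_rl/algorithms/grpo.py) function, and the `sample_prompts` / `generate_and_score` helpers here are hypothetical stand-ins for NeMo RL's dataloader and rollout machinery.

```python
import numpy as np

def dynamic_sampling_step(sample_prompts, generate_and_score, cfg):
    """Collect a rollout batch containing only prompt groups with reward std > 0."""
    target = cfg["num_prompts_per_step"] * cfg["num_generations_per_prompt"]
    kept, num_gen_batches = [], 0

    while len(kept) < target:
        num_gen_batches += 1
        if num_gen_batches > cfg["dynamic_sampling_max_gen_batches"]:
            raise RuntimeError(
                "Could not fill the rollout batch with nonzero-std prompt groups; "
                "adjust num_prompts_per_step / num_generations_per_prompt or revisit the data."
            )
        # Step 1: sample batch_multiplier x num_prompts_per_step prompts.
        prompts = sample_prompts(int(cfg["batch_multiplier"] * cfg["num_prompts_per_step"]))
        for prompt in prompts:
            # Step 2: generate num_generations_per_prompt responses and score them.
            group = generate_and_score(prompt, cfg["num_generations_per_prompt"])
            # Steps 3-5: cache the group only if its rewards are not all identical.
            if np.std([sample["reward"] for sample in group]) > 0:
                kept.extend(group)

    # Valid samples beyond the target are discarded (and reported in the logs).
    metrics = {
        "dynamic_sampling_num_gen_batches": num_gen_batches,
        "dynamic_sampling_num_discarded_valid_samples": len(kept) - target,
    }
    # Steps 6-7: train on exactly `target` samples with nonzero group std.
    return kept[:target], metrics
```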

### About batch_multiplier

`batch_multiplier` (a float ≥ 1.0) controls the initial prompt pool size by sampling `batch_multiplier × num_prompts_per_step` prompts before dynamic sampling. Higher values increase memory and compute requirements, while very low values (e.g., 1.0) may slow the cache accumulation of prompt groups with nonzero standard deviation. The optimal value depends on the dataset, model capacity, and overall training setup. When **dynamic sampling** is enabled, we also log two additional metrics:

 * `dynamic_sampling_num_gen_batches`: The number of generation rounds required to produce `num_prompts_per_step * num_generations_per_prompt` samples with a nonzero standard deviation. If this number remains consistently high across iterations, try increasing the `batch_multiplier`. The maximum allowed value for this metric is determined by `dynamic_sampling_max_gen_batches`.
 * `dynamic_sampling_num_discarded_valid_samples`: The number of samples with a nonzero standard deviation that are discarded because the total exceeds `num_prompts_per_step * num_generations_per_prompt`. If this value is frequently high (e.g., above `0.5 * num_prompts_per_step * num_generations_per_prompt`) and `dynamic_sampling_num_gen_batches` is consistently 1, it suggests that a large fraction of the dataset is being discarded unnecessarily. To improve data efficiency, consider decreasing the `batch_multiplier` (see the sketch after this list).

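
As a rough illustration of this guidance, the hypothetical helper below turns the two logged metrics into a tuning suggestion. The metric names match what NeMo RL logs, but the helper itself is not part of the library.

```python
def suggest_batch_multiplier_change(metrics_history, num_prompts_per_step,
                                    num_generations_per_prompt):
    """metrics_history: list of per-step dicts containing the two dynamic sampling metrics."""
    target = num_prompts_per_step * num_generations_per_prompt
    gen_batches = [m["dynamic_sampling_num_gen_batches"] for m in metrics_history]
    discarded = [m["dynamic_sampling_num_discarded_valid_samples"] for m in metrics_history]

    if all(g > 1 for g in gen_batches):
        return "increase batch_multiplier: filling the batch takes several generation rounds"
    if all(g == 1 for g in gen_batches) and sum(discarded) / len(discarded) > 0.5 * target:
        return "decrease batch_multiplier: many valid samples are being discarded"
    return "keep batch_multiplier as is"
```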

## Reward Shaping
DAPO introduces an overlong reward shaping mechanism to reduce reward noise and stabilize training. This approach penalizes responses that exceed a specified length threshold, helping to prevent the model from generating excessively long outputs while maintaining solution quality.

For a detailed explanation of the overlong reward shaping mechanism, please refer to Section 3.4 of the [DAPO paper](https://arxiv.org/pdf/2503.14476). For implementation details, see the [`apply_reward_shaping`](../../nemo_rl/algorithms/reward_functions.py) function.
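
The function below is a minimal sketch of the paper's soft overlong punishment, written with this guide's configuration names. It illustrates the idea only; it is not a copy of NeMo RL's `apply_reward_shaping`, whose details may differ.

```python
def overlong_penalty(response_length: int,
                     max_response_length: int = 20480,
                     overlong_buffer_length: int = 4096,
                     overlong_buffer_penalty: float = 1.0) -> float:
    """Return a non-positive adjustment that is added to the response's reward."""
    expected_length = max_response_length - overlong_buffer_length
    if response_length <= expected_length:
        return 0.0  # within the expected length: no penalty
    # Inside the buffer, the penalty grows linearly with the excess length and is
    # capped at -overlong_buffer_penalty once the response reaches max_response_length.
    excess = response_length - expected_length
    return -min(excess / overlong_buffer_length, 1.0) * overlong_buffer_penalty

# Example: with the defaults above, an 18,432-token response is penalized by -0.5.
```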

## Configuration

```yaml
grpo:
  use_dynamic_sampling: true # Enable DAPO dynamic sampling
  num_prompts_per_step: 512 # Target number of prompts per training step
  num_generations_per_prompt: 16 # Generations per prompt
  batch_multiplier: 3 # Dataloader batch size = batch_multiplier × num_prompts_per_step
  dynamic_sampling_max_gen_batches: 10 # Maximum number of batches used for accumulating nonzero-std prompts
  reward_scaling:
    enabled: true
    source_min: 0.0
    source_max: 1.0
    target_min: -1.0
    target_max: 1.0

  reward_shaping:
    enabled: true
    overlong_buffer_length: 4096 # Threshold before penalties apply (paper uses 4096)
    overlong_buffer_penalty: 1.0 # Penalty per excess token
    max_response_length: 20480 # Hard maximum generation length
```

**Key Parameters:**
- **`use_dynamic_sampling`**: When enabled, activates DAPO's dynamic sampling algorithm to filter and accumulate prompt groups with nonzero standard deviation.
- **`batch_multiplier`**: Factor that scales the initial prompt pool size for sampling.
- **`dynamic_sampling_max_gen_batches`**: Maximum number of batches used for accumulating prompts with nonzero standard deviation.
- **`reward_scaling`**: When enabled, clamps each reward in the batch to [source_min, source_max] and linearly rescales it to [target_min, target_max] (see the sketch after this list). Defaults: source_min=0.0, source_max=1.0, target_min=0.0, target_max=1.0.
- **`reward_shaping`**: When enabled, applies the overlong penalty mechanism described in the Reward Shaping section above. Responses exceeding `max_response_length - overlong_buffer_length` receive penalties proportional to their excess length, helping to reduce reward noise and stabilize training.

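
For concreteness, here is a minimal sketch of the `reward_scaling` transform as described above (clamp to the source range, then rescale linearly to the target range). The parameter names mirror the config keys; the actual NeMo RL implementation may differ in detail.

```python
def scale_reward(reward: float, source_min: float, source_max: float,
                 target_min: float, target_max: float) -> float:
    """Clamp `reward` to [source_min, source_max], then map it linearly to [target_min, target_max]."""
    clamped = min(max(reward, source_min), source_max)
    fraction = (clamped - source_min) / (source_max - source_min)
    return target_min + fraction * (target_max - target_min)

# With the example config above, a raw reward of 0.75 maps to 0.5:
# scale_reward(0.75, 0.0, 1.0, -1.0, 1.0) == 0.5
```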

> [!NOTE]
> When dynamic sampling is enabled, monitor the `filtered_reward` metric to track the average reward of the prompts with std > 0.

> [!NOTE]
> **Clip-Higher** and **Token-Level Policy Gradient Loss** are already supported in NeMo RL and can be configured through the `loss_fn` section of your experiment config (a config sketch follows this note):
> - Set `ratio_clip_max` to enable Clip-Higher (e.g., `ratio_clip_max: 0.28`)
> - Set `token_level_loss: true` to enable Token-Level Policy Gradient Loss
>
> See the full [DAPO example config](../../examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml) for reference.

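For example, the relevant `loss_fn` entries might look like the following; the values come from the note above, and any other `loss_fn` fields your config requires are omitted here:

```yaml
loss_fn:
  ratio_clip_max: 0.28    # Clip-Higher: decoupled upper clipping ratio
  token_level_loss: true  # Token-Level Policy Gradient Loss
```
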
## Example Training Results
Using the [DAPO example config](../../examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml), you can expect to see intermediate plots such as the training reward curve and validation accuracy on AIME24 for Qwen/Qwen2.5-Math-7B. These plots serve as reference outputs to help verify reproducibility. They are not intended to reflect the best accuracy that can be achieved using DAPO for this model.

## References

- **DAPO Paper**: [Decoupled Clip and Dynamic Sampling Policy Optimization](https://arxiv.org/pdf/2503.14476)
- **GRPO Paper**: [Group Relative Policy Optimization](https://arxiv.org/abs/2402.03300)
- **[NeMo RL GRPO Guide](grpo.md)**