Guru #158

170 changes: 170 additions & 0 deletions examples/Guru/README.md
@@ -0,0 +1,170 @@
# Instructions on Using Guru Reward Functions in VeRL

This directory provides an example of launching [DAPO](https://arxiv.org/abs/2503.14476) training with Guru reward functions, using the `reward_fn_import.sh` script. The script demonstrates how to integrate a modular reward library (located in `llm-reasoners/reasoners/reward/`) with the VeRL framework for large language models (LLMs).

---

## 1. Overview

- **Goal:**
Run DAPO training for LLMs with a custom reward function.
- **Approach:**
- Import a reward function from the local reward library at runtime.
- Pass its path and function name to VeRL via the `custom_reward_function.path` and `custom_reward_function.name` settings in the launch command.
- Launch DAPO training using the `reward_fn_import.sh` script.

---

## 2. Environment Setup

### 2.1 Create a Conda Environment

We recommend using Conda to manage dependencies:

```bash
conda create -n verl python=3.10 -y
conda activate verl
```

### 2.2 Install VeRL Dependencies

After activating the `verl` environment, install VeRL and its core dependencies:

1. **Run the install script** (choose either Megatron or FSDP backend):

* **With Megatron (default):**

```bash
bash scripts/install_vllm_sglang_mcore.sh
```

* **With FSDP (no Megatron):**

```bash
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
```

> If you encounter errors, inspect the script and manually install any missing packages.

2. **Clone and install VeRL from source:**

```bash
git clone https://github.com/volcengine/verl.git
cd verl
pip install --no-deps -e .
```

3. **Install additional Python packages** (a quick import check is sketched after this list):

```bash
pip install vllm==0.8.5
pip install tensordict==0.6.2
pip install datasets transformers numpy polars pandas rich tqdm matplotlib sympy pylatexenc requests
```
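
To confirm the pinned packages actually resolved inside the `verl` environment, the short check below (a minimal sketch, assuming only the packages installed above) prints the installed versions:

```python
# Minimal sanity check: import the key packages and print their versions.
# Expected pins per the commands above: vllm 0.8.5, tensordict 0.6.2.
import datasets
import tensordict
import transformers
import vllm

print("vllm:", vllm.__version__)
print("tensordict:", tensordict.__version__)
print("transformers:", transformers.__version__)
print("datasets:", datasets.__version__)
```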

---

## 3. How the Reward Library Is Used

VeRL accepts a custom reward function by specifying two arguments:

* `custom_reward_function.path`
Path to the Python file containing the reward logic.

```bash
custom_reward_function.path=llm-reasoners/reasoners/reward/__init__.py
```

* `custom_reward_function.name`
Name of the function to invoke as the reward entry point.

```bash
custom_reward_function.name=_default_compute_score
```

You can modify these values in `reward_fn_import.sh` to point to any reward function you prefer. For example, to use a function defined in `naive_dapo.py`:

```bash
custom_reward_function.path=llm-reasoners/reasoners/reward/naive_dapo.py
custom_reward_function.name=compute_score
```

> **Tip:** Ensure your custom function’s signature matches VeRL’s expected interface (see `_default_compute_score` for reference).
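
For reference, a minimal custom reward module might look like the sketch below. It is a hypothetical `my_reward.py`, not part of the library, and it assumes the `(data_source, solution_str, ground_truth, extra_info)` calling convention that `_default_compute_score` follows; double-check that module for the exact interface in your VeRL version.

```python
# my_reward.py -- hypothetical example module.
# Point custom_reward_function.path at this file and
# custom_reward_function.name at "compute_score".
# The signature assumes the (data_source, solution_str, ground_truth, extra_info)
# convention used by _default_compute_score; verify against your VeRL version.

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Toy reward: 1.0 if the model's final answer matches the ground truth exactly."""
    # Illustrative parsing: keep only the text after the last "Answer:" marker, if any.
    answer = solution_str.split("Answer:")[-1].strip()
    return 1.0 if answer == str(ground_truth).strip() else 0.0
```

Exact-match scoring is only an illustration; any function with a compatible signature that returns a numeric score can be plugged in the same way.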

---

## 4. How to Run

1. **Prepare Your Data**

* Download the training dataset (Parquet format) from [Hugging Face](https://huggingface.co/datasets/LLM360/guru_RL_verl).
* Place all `.parquet` files under `data/parquet/` (or update `DATA_DIR` in the script). A quick way to sanity-check the files is sketched at the end of this section.

2. **Set the Base Model**

* By default, `BASE_MODEL` is set to `Qwen/Qwen2.5-7B-Instruct`.
* Open `reward_fn_import.sh` and change `BASE_MODEL` if you want to use a different model.

3. **Configure Additional Parameters**

* Inside `reward_fn_import.sh`, adjust any training hyperparameters (batch size, maximum prompt/response length, etc.) as needed.
* Specify your custom reward function via `custom_reward_function.path` and `custom_reward_function.name`.

4. **Launch Training**
Run the script:

```bash
bash reward_fn_import.sh
```

The script will:

* Set up environment variables
* Collect all Parquet files in `DATA_DIR`
* Import the specified reward function
* Launch DAPO training with VeRL
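
Before launching, you can sanity-check the Parquet files from step 1 with the short script below (a minimal sketch; it only assumes the files are readable with pandas and makes no assumption about their schema):

```python
# Print row counts and column names for every Parquet file under DATA_DIR.
# Purely a sanity check; no particular schema is assumed.
from pathlib import Path

import pandas as pd

DATA_DIR = Path("data/parquet")  # keep in sync with DATA_DIR in reward_fn_import.sh

for parquet_file in sorted(DATA_DIR.glob("*.parquet")):
    df = pd.read_parquet(parquet_file)
    print(f"{parquet_file.name}: {len(df)} rows, columns = {list(df.columns)}")
```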

---

## 5. Example Command

Simply run:

```bash
bash reward_fn_import.sh
```

This will start DAPO training using:

* `Qwen/Qwen2.5-7B-Instruct` as the base model
* The reward function defined in `__init__.py` (default entry point `_default_compute_score`)
* All training data in `data/parquet/`

---

## 6. Customization Tips

* **Switch Reward Functions**
Edit these lines in `reward_fn_import.sh`:

```bash
custom_reward_function.path=llm-reasoners/reasoners/reward/your_module.py
custom_reward_function.name=your_compute_function
```

* **Change Base Model**

```bash
BASE_MODEL="your_org/your_model-name"
```

* **Adjust Training Parameters**
Modify variables like `BATCH_SIZE`, `MAX_PROMPT_LENGTH`, or `LEARNING_RATE` in the script.

* **Use a Different Data Directory**
Set:

```bash
DATA_DIR="path/to/your/parquets"
```

114 changes: 114 additions & 0 deletions examples/Guru/reward_fn_import.sh
@@ -0,0 +1,114 @@
#!/bin/bash
export DATA_DIR="data/parquet/"

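# Build a Hydra-style list of every Parquet file under DATA_DIR,
# e.g. ['data/parquet/a.parquet','data/parquet/b.parquet']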
export train_files="[$(\
ls "${DATA_DIR}"/*.parquet \
| xargs -n1 basename \
| sed "s|^|'${DATA_DIR}|;s|$|'|" \
| paste -sd, -
)]"
echo "train_files = $train_files"
export test_dir="data/test/"
export test_files=$train_files
export BASE_MODEL="Qwen/Qwen2.5-7B-Instruct"

export WANDB_PROJECT="Reward-fn-import"
export WANDB_EXPERIMENT_NAME="local-${BASE_MODEL##*/}-$(date +%s)"

export VLLM_ATTENTION_BACKEND="XFORMERS"
export HYDRA_FULL_ERROR=1

# Clean up any existing Ray state
echo "Stopping any existing Ray cluster…"
ray stop || true
rm -rf /tmp/ray/ray_current_cluster

project_name='DAPO'
exp_name='DAPO-Qwen2.5-7B'

adv_estimator=grpo

use_kl_in_reward=False
kl_coef=0.0
use_kl_loss=False
kl_loss_coef=0.0

clip_ratio_low=0.2
clip_ratio_high=0.28

max_prompt_length=1100
max_response_length=6000
enable_overlong_buffer=True
overlong_buffer_len=512
overlong_penalty_factor=1.0

loss_agg_mode="token-mean"

enable_filter_groups=False
max_num_gen_batches=10
train_prompt_bsz=16
gen_prompt_bsz=$((train_prompt_bsz * 3))
n_resp_per_prompt=2
train_prompt_mini_bsz=32

# Paths
MODEL_PATH=${BASE_MODEL}

# Algorithm
temperature=1.0
top_p=1.0
top_k=-1 # 0 for HF rollout, -1 for vLLM rollout

# Performance Related Parameter
sp_size=8
use_dynamic_bsz=True
actor_ppo_max_token_len=$((max_prompt_length + max_response_length))
infer_ppo_max_token_len=$((max_prompt_length + max_response_length))
offload=True
gen_tp=4

# Custom reward function: module path and entry-point name (referenced in the command below)
custom_reward_fn_path="llm-reasoners/reasoners/reward/__init__.py"
custom_reward_fn_name="_default_compute_score"

# Launch training script
echo "Launching training…"
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$train_files \
data.val_files=$train_files \
data.train_batch_size=128 \
data.max_prompt_length=4096 \
data.max_response_length=4096 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=${MODEL_PATH} \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=128 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=40 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=40 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=40 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name='qwen2_7b_function_rm' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1 \
trainer.save_freq=20 \
trainer.test_freq=5 \
custom_reward_function.path=${custom_reward_fn_path} \
custom_reward_function.name=${custom_reward_fn_name} \
trainer.total_epochs=15 "$@"