Guru #158

170 changes: 170 additions & 0 deletions examples/Guru/README.md
@@ -0,0 +1,170 @@
# Instructions on Using Guru Reward Functions in VeRL

This directory provides an example of launching [DAPO](https://arxiv.org/abs/2503.14476) training with Guru reward functions, using the `reward_fn_import.sh` script. The script demonstrates how to integrate a modular reward library (located in `llm-reasoners/reasoners/reward/`) with the VeRL framework for large language models (LLMs).

---

## 1. Overview

- **Goal:**
Run DAPO training for LLMs with a custom reward function.
- **Approach:**
- Import a reward function from the local reward library at runtime.
- Pass its path and function name to VeRL via the `custom_reward_function.path` and `custom_reward_function.name` settings in the launch command.
- Launch DAPO training using the `reward_fn_import.sh` script.

---

## 2. Environment Setup

### 2.1 Create a Conda Environment

We recommend using Conda to manage dependencies:

```bash
conda create -n verl python=3.10 -y
conda activate verl
```

### 2.2 Install VeRL Dependencies

After activating the `verl` environment, install VeRL and its core dependencies:

1. **Run the install script** (choose either Megatron or FSDP backend):

* **With Megatron (default):**

```bash
bash scripts/install_vllm_sglang_mcore.sh
```

* **With FSDP (no Megatron):**

```bash
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
```

> If you encounter errors, inspect the script and manually install any missing packages.

2. **Clone and install VeRL from source:**

```bash
git clone https://github.com/volcengine/verl.git
cd verl
pip install --no-deps -e .
```

3. **Install additional Python packages** (a quick import check is sketched after this list):

```bash
pip install vllm==0.8.5
pip install tensordict==0.6.2
pip install datasets transformers numpy polars pandas rich tqdm matplotlib sympy pylatexenc requests
```
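
To confirm the pinned packages actually resolved inside the `verl` environment, the short check below (a minimal sketch, assuming only the packages installed above) prints the installed versions:

```python
# Minimal sanity check: import the key packages and print their versions.
# Expected pins per the commands above: vllm 0.8.5, tensordict 0.6.2.
import datasets
import tensordict
import transformers
import vllm

print("vllm:", vllm.__version__)
print("tensordict:", tensordict.__version__)
print("transformers:", transformers.__version__)
print("datasets:", datasets.__version__)
```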

---

## 3. How the Reward Library Is Used

VeRL accepts a custom reward function by specifying two arguments:

* `custom_reward_function.path`
Path to the Python file containing the reward logic.

```bash
custom_reward_function.path=llm-reasoners/reasoners/reward/__init__.py
```

* `custom_reward_function.name`
Name of the function to invoke as the reward entry point.

```bash
custom_reward_function.name=_default_compute_score
```

You can modify these values in `reward_fn_import.sh` to point to any reward function you prefer. For example, to use a function defined in `naive_dapo.py`:

```bash
custom_reward_function.path=llm-reasoners/reasoners/reward/naive_dapo.py
custom_reward_function.name=compute_score
```

> **Tip:** Ensure your custom function’s signature matches VeRL’s expected interface (see `_default_compute_score` for reference).
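
For reference, a minimal custom reward module might look like the sketch below. It is a hypothetical `my_reward.py`, not part of the library, and it assumes the `(data_source, solution_str, ground_truth, extra_info)` calling convention that `_default_compute_score` follows; double-check that module for the exact interface in your VeRL version.

```python
# my_reward.py -- hypothetical example module.
# Point custom_reward_function.path at this file and
# custom_reward_function.name at "compute_score".
# The signature assumes the (data_source, solution_str, ground_truth, extra_info)
# convention used by _default_compute_score; verify against your VeRL version.

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Toy reward: 1.0 if the model's final answer matches the ground truth exactly."""
    # Illustrative parsing: keep only the text after the last "Answer:" marker, if any.
    answer = solution_str.split("Answer:")[-1].strip()
    return 1.0 if answer == str(ground_truth).strip() else 0.0
```

Exact-match scoring is only an illustration; any function with a compatible signature that returns a numeric score can be plugged in the same way.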

---

## 4. How to Run

1. **Prepare Your Data**

* Download the training dataset (Parquet format) from [Hugging Face](https://huggingface.co/datasets/LLM360/guru_RL_verl).
* Place all `.parquet` files under `data/parquet/` (or update `DATA_DIR` in the script). A quick way to sanity-check the files is sketched at the end of this section.

2. **Set the Base Model**

* By default, `BASE_MODEL` is set to `Qwen/Qwen2.5-7B-Instruct`.
* Open `reward_fn_import.sh` and change `BASE_MODEL` if you want to use a different model.

3. **Configure Additional Parameters**

* Inside `reward_fn_import.sh`, adjust any training hyperparameters (batch size, maximum prompt/response length, etc.) as needed.
* Specify your custom reward function via `custom_reward_function.path` and `custom_reward_function.name`.

4. **Launch Training**
Run the script:

```bash
bash reward_fn_import.sh
```

The script will:

* Set up environment variables
* Collect all Parquet files in `DATA_DIR`
* Import the specified reward function
* Launch DAPO training with VeRL
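
Before launching, you can sanity-check the Parquet files from step 1 with the short script below (a minimal sketch; it only assumes the files are readable with pandas and makes no assumption about their schema):

```python
# Print row counts and column names for every Parquet file under DATA_DIR.
# Purely a sanity check; no particular schema is assumed.
from pathlib import Path

import pandas as pd

DATA_DIR = Path("data/parquet")  # keep in sync with DATA_DIR in reward_fn_import.sh

for parquet_file in sorted(DATA_DIR.glob("*.parquet")):
    df = pd.read_parquet(parquet_file)
    print(f"{parquet_file.name}: {len(df)} rows, columns = {list(df.columns)}")
```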

---

## 5. Example Command

Simply run:

```bash
bash reward_fn_import.sh
```

This will start DAPO training using:

* `Qwen/Qwen2.5-7B-Instruct` as the base model
* The reward function defined in `__init__.py` (default entry point `_default_compute_score`)
* All training data in `data/parquet/`

---

## 6. Customization Tips

* **Switch Reward Functions**
Edit these lines in `reward_fn_import.sh`:

```bash
custom_reward_function.path=llm-reasoners/reasoners/reward/your_module.py
custom_reward_function.name=your_compute_function
```

* **Change Base Model**

```bash
BASE_MODEL="your_org/your_model-name"
```

* **Adjust Training Parameters**
Modify variables like `BATCH_SIZE`, `MAX_PROMPT_LENGTH`, or `LEARNING_RATE` in the script.

* **Use a Different Data Directory**
Set:

```bash
DATA_DIR="path/to/your/parquets"
```

114 changes: 114 additions & 0 deletions examples/Guru/reward_fn_import.sh
@@ -0,0 +1,114 @@
#!/bin/bash
export DATA_DIR="data/parquet/"

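# Build a Hydra-style list of every Parquet file under DATA_DIR,
# e.g. ['data/parquet/a.parquet','data/parquet/b.parquet']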
export train_files="[$(\
ls "${DATA_DIR}"/*.parquet \
| xargs -n1 basename \
| sed "s|^|'${DATA_DIR}|;s|$|'|" \
| paste -sd, -
)]"
echo "train_files = $train_files"
export test_dir="data/test/"
export test_files=$train_files
export BASE_MODEL="Qwen/Qwen2.5-7B-Instruct"

export WANDB_PROJECT="Reward-fn-import"
export WANDB_EXPERIMENT_NAME="local-${BASE_MODEL##*/}-$(date +%s)"

export VLLM_ATTENTION_BACKEND="XFORMERS"
export HYDRA_FULL_ERROR=1

# Clean up any existing Ray state
echo "Stopping any existing Ray cluster…"
ray stop || true
rm -rf /tmp/ray/ray_current_cluster

project_name='DAPO'
exp_name='DAPO-Qwen2.5-7B'

adv_estimator=grpo

use_kl_in_reward=False
kl_coef=0.0
use_kl_loss=False
kl_loss_coef=0.0

clip_ratio_low=0.2
clip_ratio_high=0.28

max_prompt_length=1100
max_response_length=6000
enable_overlong_buffer=True
overlong_buffer_len=512
overlong_penalty_factor=1.0

loss_agg_mode="token-mean"

enable_filter_groups=False
max_num_gen_batches=10
train_prompt_bsz=16
gen_prompt_bsz=$((train_prompt_bsz * 3))
n_resp_per_prompt=2
train_prompt_mini_bsz=32

# Paths
MODEL_PATH=${BASE_MODEL}

# Algorithm
temperature=1.0
top_p=1.0
top_k=-1 # 0 for HF rollout, -1 for vLLM rollout

# Performance Related Parameter
sp_size=8
use_dynamic_bsz=True
actor_ppo_max_token_len=$((max_prompt_length + max_response_length))
infer_ppo_max_token_len=$((max_prompt_length + max_response_length))
offload=True
gen_tp=4

# Custom reward function: module path and entry-point name (referenced in the command below)
custom_reward_fn_path="llm-reasoners/reasoners/reward/__init__.py"
custom_reward_fn_name="_default_compute_score"

# Launch training script
echo "Launching training…"
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$train_files \
data.val_files=$train_files \
data.train_batch_size=128 \
data.max_prompt_length=4096 \
data.max_response_length=4096 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=${MODEL_PATH} \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=128 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=40 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=40 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=40 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name='qwen2_7b_function_rm' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1 \
trainer.save_freq=20 \
trainer.test_freq=5 \
custom_reward_function.path=${custom_reward_fn_path} \
custom_reward_function.name=${custom_reward_fn_name} \
trainer.total_epochs=15 "$@"