
Commit e159707 (1 parent: 154bd71)

Add Guru-Math report. (#432)

File tree

3 files changed: +31 -0 lines changed

benchmark/reports/guru_math.md

Lines changed: 31 additions & 0 deletions
# Guru-Math Benchmark Results
## 1. Task Introduction
Guru-Math is the mathematics task derived from the [Guru](https://huggingface.co/datasets/LLM360/guru-RL-92k) dataset, comprising 54.4k samples that focus primarily on competition-level problems and symbolic reasoning.
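
For illustration, the math subset can presumably be obtained from the Hugging Face Hub along the following lines. This is a minimal sketch: the split name and the `domain` filter field are assumptions, so check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Load the Guru RL dataset from the Hugging Face Hub.
# NOTE: the split name and the "domain" field are assumptions,
# not confirmed against the dataset card.
ds = load_dataset("LLM360/guru-RL-92k", split="train")

# Keep only the math samples (~54.4k of the ~92k total).
math_ds = ds.filter(lambda ex: ex.get("domain") == "math")
print(len(math_ds))
```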
## 2. Experimental Settings
We evaluate the following methods with Trinity-RFT version [0.3.3](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.3.3) (verl==0.5.0, vllm==0.10.2). For the veRL baseline, we ported the relevant code from [Reasoning360](https://github.com/LLM360/Reasoning360) and made it compatible with verl==0.5.0.

In both Trinity-RFT and veRL, we train on this task with the GRPO algorithm, starting from the base `Qwen2.5-7B` model, which has not undergone any prior fine-tuning.
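
For reference, GRPO (in its standard formulation, not anything specific to either framework) estimates advantages by normalizing rewards within each group of $G$ responses sampled for the same prompt:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}, \qquad i = 1, \dots, G.
$$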

To ensure a fair comparison, Trinity-RFT employs the following training hyperparameters: `batch_size=60`, `sync_interval=8`, and `lr_warmup_steps=80`, aligned with veRL's `train_batch_size=480`, `ppo_mini_batch_size=60`, and `lr_warmup_steps=10` (each Trinity-RFT step consumes one mini-batch of 60 samples, so 8 steps per synchronization cover one veRL batch of 480, and the same 8x step ratio maps veRL's 10 warmup steps to 80). The remaining settings are `lr=1e-6`, `kl_coef=0.0`, `weight_decay=0.1`, and `total_epochs=1`. Additionally, to assess the impact of one-step offset training in Trinity-RFT, we conduct an extra experiment with `sync_offset=1` under the same configuration; the pairing is summarized in the sketch below.
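
The sketch below restates these paired settings as plain Python dictionaries. It is illustrative only: the field names are hypothetical and do not follow Trinity-RFT's or veRL's actual configuration schemas.

```python
# Illustrative summary of the settings above; field names are
# hypothetical and do not match either framework's real config schema.
trinity_rft = {
    "model": "Qwen2.5-7B",   # base model, no prior fine-tuning
    "algorithm": "GRPO",
    "batch_size": 60,        # samples consumed per training step
    "sync_interval": 8,      # 60 * 8 = 480 samples per explorer/trainer sync
    "sync_offset": 0,        # set to 1 for the one-step-offset run
    "lr": 1e-6,
    "lr_warmup_steps": 80,   # 8x veRL's 10, matching the step-count ratio
    "kl_coef": 0.0,
    "weight_decay": 0.1,
    "total_epochs": 1,
}

verl = {
    "train_batch_size": 480,
    "ppo_mini_batch_size": 60,
    "lr_warmup_steps": 10,
}
```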
## 3. Results and Analysis
The table below compares the total training time of veRL and the two Trinity-RFT runs on the Guru-Math dataset.

| Method | Training Time (seconds) | Relative Time (% of veRL) |
|--------|-------------------------|---------------------------|
| veRL | 47,123 | 100.00 |
| Trinity-RFT | 54,045 | 114.69 |
| Trinity-RFT (one-step offset) | 45,053 | 95.61 |
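
Relative time is each run's wall-clock training time divided by veRL's baseline, e.g. 54,045 / 47,123 ≈ 114.69% for synchronous Trinity-RFT and 45,053 / 47,123 ≈ 95.61% for the one-step-offset run. A plausible way to picture why the offset run is fastest: with `sync_offset=1`, rollout generation for the next step can overlap with training on the current one. The toy sketch below illustrates the pipelining idea only; it is not Trinity-RFT's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def explore(step: int) -> list:
    """Generate one batch of rollouts with the current policy (placeholder)."""
    return [f"rollout-{step}-{i}" for i in range(60)]

def train(batch: list) -> None:
    """Apply one optimizer update using the given batch (placeholder)."""
    pass

total_steps = 8
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(explore, 0)         # explorer runs one step ahead
    for step in range(1, total_steps + 1):
        batch = pending.result()              # rollouts from the previous step
        pending = pool.submit(explore, step)  # generate the next batch...
        train(batch)                          # ...while training on this one
```

The trade-off is that each update is computed from rollouts produced by a one-step-stale policy, which can slightly change the learning dynamics.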

The figure below shows the reward curve on the Guru-Math dataset.

![](../../docs/sphinx_doc/assets/guru_math_reward.png)

To evaluate training effectiveness, we test the trained models on the `AMC23`, `AIME2024`, `AIME2025`, `MATH500`, and `Minerva` benchmarks. The figure below compares the performance of checkpoints obtained from Trinity-RFT and veRL after training on the Guru-Math dataset. Each point corresponds to a checkpoint, with the x-axis showing its cumulative training time (in seconds) and the y-axis its accuracy on the respective benchmark.

![](../../docs/sphinx_doc/assets/guru_math_eval.png)

Two binary image assets added under `docs/sphinx_doc/assets/` (156 KB and 45.1 KB).
