diff --git a/benchmark/reports/guru_math.md b/benchmark/reports/guru_math.md
new file mode 100644
index 0000000000..2eb251c8ec
--- /dev/null
+++ b/benchmark/reports/guru_math.md
@@ -0,0 +1,31 @@
+# Guru-Math Benchmark Results
+
+## 1. Task Introduction
+
+Guru-Math is the mathematics task derived from the [Guru](https://huggingface.co/datasets/LLM360/guru-RL-92k) dataset and comprises 54.4k samples. The data focuses primarily on competition-level problems and symbolic reasoning.
+
+## 2. Experimental Settings
+
+We evaluate the following methods with version [0.3.3](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.3.3) of the Trinity-RFT framework (verl==0.5.0, vllm==0.10.2). For comparison, we ported the relevant code from [Reasoning360](https://github.com/LLM360/Reasoning360) and made it compatible with verl==0.5.0.
+
+In both Trinity-RFT and veRL, we train on this task with the GRPO algorithm, starting from the `Qwen2.5-7B` base model, which has undergone no prior fine-tuning.
+
+To ensure a fair comparison, Trinity-RFT uses the following training hyperparameters: `batch_size=60`, `sync_interval=8`, `lr_warmup_steps=80` (aligned with veRL's `train_batch_size=480`, `ppo_mini_batch_size=60`, and `lr_warmup_steps=10`, since 60 × 8 = 480 and 10 × 8 = 80), `lr=1e-6`, `kl_coef=0.0`, `weight_decay=0.1`, and `total_epochs=1`. Additionally, to assess the impact of one-step offset training in Trinity-RFT, we run an extra experiment with `sync_offset=1` under the same configuration.
+
+## 3. Results and Analysis
+
+The table below compares the training time of Trinity-RFT and veRL on the Guru-Math dataset.
+
+| Method | Training Time (seconds) | Relative Time (%) |
+|--------|-------------------------|-------------------|
+| veRL | 47,123 | 100.00 |
+| Trinity-RFT | 54,045 | 114.69 |
+| Trinity-RFT (one-step offset) | 45,053 | 95.61 |
+
+The figure below shows the reward curves on the Guru-Math dataset.
+
+![](../../docs/sphinx_doc/assets/guru_math_reward.png)
+
+To evaluate training effectiveness, we test the resulting models on the `AMC23`, `AIME2024`, `AIME2025`, `MATH500`, and `Minerva` benchmarks. The figure below compares checkpoints obtained from Trinity-RFT and veRL after training on the Guru-Math dataset: each point corresponds to a checkpoint, with the x-axis giving its cumulative training time (in seconds) and the y-axis its accuracy on the respective evaluation benchmark.
+
+![](../../docs/sphinx_doc/assets/guru_math_eval.png)
diff --git a/docs/sphinx_doc/assets/guru_math_eval.png b/docs/sphinx_doc/assets/guru_math_eval.png
new file mode 100644
index 0000000000..2f97663bd9
Binary files /dev/null and b/docs/sphinx_doc/assets/guru_math_eval.png differ
diff --git a/docs/sphinx_doc/assets/guru_math_reward.png b/docs/sphinx_doc/assets/guru_math_reward.png
new file mode 100644
index 0000000000..65aad0a96f
Binary files /dev/null and b/docs/sphinx_doc/assets/guru_math_reward.png differ
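
As a reference for the settings reported in the patch, the group-relative advantage at the core of GRPO can be sketched in a few lines. This is a minimal illustration of the published algorithm, not the implementation inside Trinity-RFT or veRL; the function name and the example verifier rewards are hypothetical:

```python
import statistics

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Score each sampled response against its own prompt group:
    advantage_i = (r_i - mean(group)) / (std(group) + eps)."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Hypothetical example: 8 responses sampled for one math prompt,
# each scored 1 by a rule-based verifier if its final answer is correct.
print(grpo_advantages([1, 0, 0, 1, 1, 0, 0, 0]))
```

Because advantages are normalized within each prompt's own response group, GRPO needs no learned value model, which is what makes it a natural fit for verifiable tasks such as Guru-Math.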
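The `sync_offset=1` row in the table reflects overlapped rollout generation and training. The toy schedule below sketches one plausible reading of one-step offset training, namely that the trainer consumes batches generated with weights that are one synchronization behind; it is purely illustrative, and the function and its printout are hypothetical rather than Trinity-RFT's actual API:

```python
def run(num_syncs: int, sync_offset: int = 1) -> None:
    """Toy timeline of offset training (illustrative only): with
    sync_offset=1 the trainer updates on a rollout batch generated
    with an earlier weight version, so generation and training can
    overlap; with sync_offset=0 the two strictly alternate."""
    pending, version = [], 0
    for step in range(num_syncs):
        pending.append(version)  # explorer generates a batch with its current weights
        line = f"sync {step}: generate with w{version}"
        if len(pending) > sync_offset:  # trainer lags the explorer by `sync_offset` batches
            line += f" | train on rollouts from w{pending.pop(0)} -> w{version + 1}"
            version += 1  # trainer publishes new weights for the next sync
        print(line)

run(num_syncs=4, sync_offset=1)
```

Under this reading, the explorer need not sit idle while the trainer updates, which would explain why the one-step offset run finishes in less wall-clock time than both the synchronous Trinity-RFT run and veRL in the table above.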