.. docs/source-en/rst_source/start/llm-eval.rst

Evaluation 2: Reasoner Scenario
===============================

Introduction
------------

We provide an integrated evaluation toolkit for long chain-of-thought (CoT) mathematical reasoning tasks.
The `toolkit <https://github.com/RLinf/LLMEvalKit>`_ includes both code and datasets,
making it convenient for researchers to evaluate trained large language models on mathematical reasoning.

**Acknowledgements:** This evaluation toolkit is adapted from the `Qwen2.5-Math <https://github.com/QwenLM/Qwen2.5-Math>`_ project.

Environment Setup
-----------------

First, clone the repository:

.. code-block:: bash

   git clone https://github.com/RLinf/LLMEvalKit.git

Install dependencies:

.. code-block:: bash

   ...

If you are using our Docker image, you only need to additionally install:

.. code-block:: bash

   pip install timeout-decorator

Quick Start
-----------------

Model Conversion
^^^^^^^^^^^^^^^^^^^^^^^^^^^

During training, models are saved in Megatron format. You can use the conversion scripts located at ``RLinf/toolkits/ckpt_convertor/`` to convert them to Huggingface format.

You have two ways to use the scripts:

**Method 1: Edit the script files**

Manually open ``mg2hf_7b.sh`` or ``mg2hf_1.5b.sh``, and set the following variables to your desired paths.
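As an illustrative sketch only (the variable names below are hypothetical; consult the actual scripts for the exact ones), the edit typically amounts to pointing an input path at the Megatron checkpoint and an output path at the desired Huggingface location:

.. code-block:: bash

   # Hypothetical variable names -- check mg2hf_7b.sh / mg2hf_1.5b.sh for the real ones.
   MEGATRON_CKPT_DIR=/path/to/megatron/checkpoint   # Megatron-format input checkpoint
   HF_SAVE_DIR=/path/to/output/hf_model             # Huggingface-format output directory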

.. code-block:: bash

   # ...
   # for aime24 and aime25, use PROMPT_TYPE="r1-distilled-qwen";
   # ...

To evaluate the model on a single dataset, use the following command:

.. code-block:: bash

   ... \
       --use_vllm \
       --save_outputs

For **batch evaluation**, you can run the ``main_eval.sh`` script. This script will sequentially evaluate the model on the AIME24, AIME25, and GPQA-diamond datasets.
You can specify ``CUDA_VISIBLE_DEVICES`` in the script for more flexible GPU management.
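For example, to restrict the batch run to the first four GPUs, you could set the variable before invoking the script (a sketch; the invocation assumes you are in the directory containing ``main_eval.sh``):

.. code-block:: bash

   # Make only GPUs 0-3 visible to the evaluation processes:
   export CUDA_VISIBLE_DEVICES=0,1,2,3
   # bash main_eval.sh   # then launch the batch evaluation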

Evaluation Results
------------------------------

Results will be printed in the terminal and saved in ``OUTPUT_DIR``. Batch evaluation defaults to saving in the ``LLMEvalKit/evaluation/outputs`` directory.

Stored outputs include:

2. Complete model outputs (``xx.jsonl``): includes complete reasoning process and prediction results

Metadata example:

.. code-block:: javascript

   {
     ...
     "time_use_in_minite":"62:06"
   }

The field ``acc`` represents the **average accuracy across all sampled responses**, which is the main evaluation metric.

Model output example:

.. code-block:: javascript

   {
     ...
     "score": [true] // whether the extracted answers are correct
   }
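The relationship between the per-sample ``score`` lists and the aggregate ``acc`` metric can be sketched in a few lines of Python (the aggregation shown is an assumption for illustration, not the toolkit's exact code):

.. code-block:: python

   import json

   def average_accuracy(jsonl_lines):
       """Mean correctness over all sampled responses across all problems."""
       scores = []
       for line in jsonl_lines:
           record = json.loads(line)
           # each record's "score" holds one boolean per sampled response
           scores.extend(bool(s) for s in record["score"])
       return sum(scores) / len(scores) if scores else 0.0

   # two problems, two sampled responses each: 3 of 4 correct
   records = [
       json.dumps({"score": [True, True]}),
       json.dumps({"score": [True, False]}),
   ]
   print(average_accuracy(records))  # 0.75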

Supported Datasets
------------------------------

The toolkit currently supports the following evaluation datasets:

.. list-table:: Supported Datasets
   :header-rows: 1

   * - Dataset
     - Description
   * - ``aime24``
     - Problems from **AIME 2024** (American Invitational Mathematics Examination), focusing on high-school Olympiad-level mathematical reasoning.
   * - ``aime25``
     - Problems from **AIME 2025**, same format as AIME24 but with a different test set.
   * - ``gpqa_diamond``
     - The most challenging subset (Diamond split) of **GPQA (Graduate-level Google-Proof Q&A)**,
       containing cross-disciplinary problems (e.g., mathematics, physics, computer science) that require deep reasoning capabilities rather than memorization.

You can set it to **0-1**, **0-3** or **0-7** to use 2/4/8 GPUs depending on your available resources.
Refer to :doc:`../tutorials/user/yaml` for a more detailed explanation of the placement configuration.

Before running the script, please modify the ``./examples/reasoning/config/math/qwen2.5-1.5b-single-gpu.yaml`` file
according to your model and dataset download paths.

Specifically, set the model configuration to the path where the ``DeepSeek-R1-Distill-Qwen-1.5B`` checkpoint is located, and set the data configuration to the path where the ``AReaL-boba-106k.jsonl`` dataset is located:

- ``rollout.model.model_path``
- ``data.train_data_paths``
- ``data.val_data_paths``
- ``actor.tokenizer.tokenizer_model``

**Step 3: Launch training**

After completing the above modifications, run the following script to launch training: