
This document explains how to use an evaluation script for assessing model capabilities.

## Prepare for Evaluation

To prepare for evaluation, first ensure your model is in the correct format, which may involve an optional conversion of PyTorch DCP checkpoints to the Hugging Face format. Then prepare the evaluation configuration, which includes defining prompt templates and any custom settings required for the run.

### Convert DCP to HF (Optional)
If you have trained a model and saved the checkpoint in the PyTorch DCP format, you first need to convert it to the Hugging Face format before running evaluation.

Use the `examples/convert_dcp_to_hf.py` script. You'll need the path to the training configuration file (`config.yaml`), the DCP checkpoint directory, and an output path for the HF-format model.

```sh
# Example for a GRPO checkpoint at step 170
uv run python examples/convert_dcp_to_hf.py \
    --config results/grpo/step_170/config.yaml \
    --dcp-ckpt-path results/grpo/step_170/policy/weights/ \
    --hf-ckpt-path results/grpo/hf
```
> **Note:** Adjust the paths according to your training output directory structure.

Once the conversion is complete, you can override `generation.model_name` to point to the directory containing the converted HF model, as shown in [this section](#run-the-evaluation-script).

### Prepare the Evaluation Configuration
**Override with Custom Settings**

To run the evaluation, you can use the [default configuration file](../../examples/configs/eval.yaml). Alternatively, you can specify a custom one or override specific settings via the command line.

The default configuration employs greedy sampling to evaluate Qwen2.5-Math-1.5B-Instruct on AIME-2024.
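
For reference, the greedy-sampling defaults can also be spelled out as explicit command-line overrides. The sketch below is illustrative rather than a copy of `eval.yaml`: the sampling values mirror the example output at the end of this document, and the model name is only an example of a Hugging Face Hub identifier.

```sh
# Illustrative sketch of the default greedy-sampling setup expressed as overrides.
# The authoritative values live in examples/configs/eval.yaml; the model name
# below is an assumed example, not necessarily the configured default.
uv run python examples/run_eval.py \
    generation.model_name="Qwen/Qwen2.5-Math-1.5B-Instruct" \
    generation.temperature=0.0 \
    generation.top_p=1.0 \
    eval.num_tests_per_prompt=1
```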

**Prompt Template Configuration**

Always remember to use the same prompt and `chat_template` that were used during training.

For open-source models, we recommend setting `tokenizer.chat_template=default`, `data.prompt_file=null`, and `data.system_prompt_file=null` to allow them to use their native chat templates, as in the sketch below.
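
As a concrete illustration of that recommendation, the command below evaluates a stock Hugging Face model with its native chat template. The model name is only an example; substitute the checkpoint you actually want to evaluate.

```sh
# Example: let an off-the-shelf model (the name shown here is illustrative)
# fall back to its native chat template instead of a training-time prompt file.
uv run python examples/run_eval.py \
    generation.model_name="Qwen/Qwen2.5-Math-7B-Instruct" \
    tokenizer.chat_template=default \
    data.prompt_file=null \
    data.system_prompt_file=null
```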

## Run the Evaluation Script

Use the `run_eval.py` script to run an evaluation with a model taken directly from the Hugging Face Hub or from a local path that is already in Hugging Face format.

Note that the evaluation script only supports models in the Hugging Face format. If you haven't converted your DCP-format model yet, go back to [Convert DCP to HF](#convert-dcp-to-hf-optional) and follow the steps there to convert it.

```sh
# Run evaluation script with default config (examples/configs/eval.yaml)
uv run python examples/run_eval.py

# Run evaluation script with converted model
uv run python examples/run_eval.py generation.model_name=$PWD/results/grpo/hf

# Run evaluation script with custom config file
uv run python examples/run_eval.py --config path/to/custom_config.yaml

# Override specific config values via command line
# Example: Evaluation of DeepScaleR-1.5B-Preview on MATH-500 using 8 GPUs
# Pass@1 accuracy averaged over 16 samples for each problem
uv run python examples/run_eval.py \
    generation.model_name=agentica-org/DeepScaleR-1.5B-Preview \
    generation.temperature=0.6 \
    generation.top_p=0.95 \
    generation.vllm_cfg.max_model_len=32768 \
    data.dataset_name=HuggingFaceH4/MATH-500 \
    data.dataset_key=test \
    eval.num_tests_per_prompt=16 \
    cluster.gpus_per_node=8
```
> **Note:** Evaluation results may vary slightly due to various factors, such as sampling parameters, random seed, inference engine version, and inference engine settings.

## Example Evaluation Output

When the evaluation completes, you will see a summary similar to the following.

```
============================================================
model_name='Qwen2.5-Math-1.5B-Instruct' dataset_name='aime_2024'
max_new_tokens=2048 temperature=0.0 top_p=1.0 top_k=-1

metric='pass@1' num_tests_per_prompt=1

score=0.1000 (3.0/30)
============================================================
```