Commit 790888f

feat: improve eval (#325)

Authored by yuki-97 and terrykong.

Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>

1 parent bc8cb65 · commit 790888f

File tree: 8 files changed, +182 −54 lines changed

README.md — 50 additions, 1 deletion

@@ -14,6 +14,9 @@
 - [DPO](#dpo)
   - [DPO Single Node](#dpo-single-node)
   - [DPO Multi-node](#dpo-multi-node)
+- [Evaluation](#evaluation)
+  - [Convert Model Format (Optional)](#convert-model-format-optional)
+  - [Run Evaluation](#run-evaluation)
 - [Set Up Clusters](#set-up-clusters)
 - [Citation](#citation)
 - [Contributing](#contributing)

@@ -241,7 +244,7 @@ uv run python examples/run_dpo.py \
     logger.wandb.name="llama-dpo-sft"
 ```

-Refer to [dpo.yaml](../examples/configs/dpo.yaml) for a full list of parameters that can be overridden. For an in-depth explanation of how to add your own DPO dataset, refer to the [DPO documentation](docs/guides/dpo.md).
+Refer to `examples/configs/dpo.yaml` for a full list of parameters that can be overridden. For an in-depth explanation of how to add your own DPO dataset, refer to the [DPO documentation](docs/guides/dpo.md).

 ### DPO Multi-node

@@ -266,6 +269,52 @@ sbatch \
     ray.sub
 ```

+## Evaluation
+
+We provide evaluation tools to assess model capabilities.
+
+### Convert Model Format (Optional)
+
+If you have trained a model and saved the checkpoint in the PyTorch DCP format, you first need to convert it to the Hugging Face format before running evaluation:
+
+```sh
+# Example for a GRPO checkpoint at step 170
+uv run python examples/convert_dcp_to_hf.py \
+    --config results/grpo/step_170/config.yaml \
+    --dcp-ckpt-path results/grpo/step_170/policy/weights/ \
+    --hf-ckpt-path results/grpo/hf
+```
+> **Note:** Adjust the paths according to your training output directory structure.
+
+For an in-depth explanation of checkpointing, refer to the [Checkpointing documentation](docs/design-docs/checkpointing.md).
+
+### Run Evaluation
+
+Run the evaluation script with the converted model:
+
+```sh
+uv run python examples/run_eval.py generation.model_name=$PWD/results/grpo/hf
+```
+
+Run the evaluation script with custom settings:
+
+```sh
+# Example: Evaluation of DeepScaleR-1.5B-Preview on MATH-500 using 8 GPUs
+# Pass@1 accuracy averaged over 16 samples for each problem
+uv run python examples/run_eval.py \
+    generation.model_name=agentica-org/DeepScaleR-1.5B-Preview \
+    generation.temperature=0.6 \
+    generation.top_p=0.95 \
+    generation.vllm_cfg.max_model_len=32768 \
+    data.dataset_name=HuggingFaceH4/MATH-500 \
+    data.dataset_key=test \
+    eval.num_tests_per_prompt=16 \
+    cluster.gpus_per_node=8
+```
+> **Note:** Evaluation results may vary slightly due to factors such as sampling parameters, random seed, inference engine version, and inference engine settings.
+
+Refer to `examples/configs/eval.yaml` for a full list of parameters that can be overridden. For an in-depth explanation of evaluation, refer to the [Evaluation documentation](docs/guides/eval.md).
+
 ## Set Up Clusters

 For detailed instructions on how to set up and launch NeMo RL on Slurm or Kubernetes clusters, please refer to the dedicated [Cluster Start](docs/cluster.md) documentation.

docs/guides/dpo.md — 4 additions, 0 deletions

@@ -167,3 +167,7 @@ The DPO implementation in NeMo RL supports several key parameters that can be ad
 - `dpo.sft_average_log_probs`: Whether to average log probabilities over tokens in the SFT loss term

 These parameters can be adjusted in the config file or via command-line overrides to optimize training for your specific use case.
+
+## Evaluate the Trained Model
+
+Upon completion of the training process, you can refer to our [evaluation guide](eval.md) to assess model capabilities.

docs/guides/eval.md — 57 additions, 42 deletions

@@ -2,66 +2,81 @@

 This document explains how to use an evaluation script for assessing model capabilities.

-## Start Evaluation
+## Prepare for Evaluation

-To run the evaluation, you can use the default configuration file or specify a custom one.
+To prepare for evaluation, first ensure your model is in the correct format, which may involve an optional conversion of PyTorch DCP checkpoints to the Hugging Face format. After that, prepare the evaluation configuration, including the prompt templates and any custom settings required to run the evaluation.

-### Start Script
+### Convert DCP to HF (Optional)
+If you have trained a model and saved the checkpoint in the PyTorch DCP format, you first need to convert it to the Hugging Face format before running evaluation.

-**Evaluate Standard Models:**
-
-To run evaluation using a model directly from Hugging Face Hub or a local path already in HF format, use the `run_eval.py` script.
+Use the `examples/convert_dcp_to_hf.py` script. You'll need the path to the training configuration file (`config.yaml`) and the DCP checkpoint directory, and you must specify an output path for the HF-format model.

 ```sh
-# To run the evaluation with default config (examples/configs/eval.yaml)
-uv run python examples/run_eval.py
+# Example for a GRPO checkpoint at step 170
+uv run python examples/convert_dcp_to_hf.py \
+    --config results/grpo/step_170/config.yaml \
+    --dcp-ckpt-path results/grpo/step_170/policy/weights/ \
+    --hf-ckpt-path results/grpo/hf
+```
+> **Note:** Adjust the paths according to your training output directory structure.

-# Specify a custom config file
-uv run python examples/run_eval.py --config path/to/custom_config.yaml
+Once the conversion is complete, you can override `generation.model_name` to point to the directory containing the converted HF model, as described in [this section](#run-the-evaluation-script).

-# Override specific config values via command line (e.g., model name)
-uv run python examples/run_eval.py generation.model_name="Qwen/Qwen2.5-Math-7B-Instruct"
-```
+### Prepare the Evaluation Configuration
+**Override with Custom Settings**

-**Evaluate Models Trained with DCP Checkpoints (GRPO/SFT):**
+To run the evaluation, you can use the [default configuration file](../../examples/configs/eval.yaml). Alternatively, you can specify a custom one or override individual settings via the command line.

-If you have trained a model using GRPO or SFT and saved the checkpoint in the Pytorch DCP format, you first need to convert it to the Hugging Face format before running evaluation.
+The default configuration employs greedy sampling to evaluate Qwen2.5-Math-1.5B-Instruct on AIME-2024.

-1. **Convert DCP to HF:**
-Use the `examples/convert_dcp_to_hf.py` script. You'll need the path to the training configuration file (`config.yaml`), the DCP checkpoint directory, and specify an output path for the HF format model.
+**Prompt Template Configuration**

-```sh
-# Example for a GRPO checkpoint at step 170
-uv run python examples/convert_dcp_to_hf.py \
-    --config results/grpo/step_170/config.yaml \
-    --dcp-ckpt-path results/grpo/step_170/policy/weights/ \
-    --hf-ckpt-path results/grpo/hf
-```
-*Note: Adjust the paths according to your training output directory structure.*
+Always remember to use the same prompt and chat_template that were used during training.

-2. **Run Evaluation on Converted Model:**
-Once the conversion is complete, run the evaluation script, overriding the `generation.model_name` to point to the directory containing the converted HF model.
+For open-source models, we recommend setting `tokenizer.chat_template=default`, `data.prompt_file=null`, and `data.system_prompt_file=null` to allow them to use their native chat templates.

-```sh
-# Example using the converted HF model from the previous step
-uv run python examples/run_eval.py generation.model_name=$PWD/results/grpo/hf
-```
+## Run the Evaluation Script

-### Example Output
+Use the `run_eval.py` script to run an evaluation with a model directly from the Hugging Face Hub or from a local path that is already in Hugging Face format.

+Note that the evaluation script only supports models in the Hugging Face format. If you haven't converted your DCP-format model yet, go back to [Convert DCP to HF](#convert-dcp-to-hf-optional) and follow the guide to convert it.
+
+```sh
+# Run evaluation script with default config (examples/configs/eval.yaml)
+uv run python examples/run_eval.py
+
+# Run evaluation script with converted model
+uv run python examples/run_eval.py generation.model_name=$PWD/results/grpo/hf
+
+# Run evaluation script with custom config file
+uv run python examples/run_eval.py --config path/to/custom_config.yaml
+
+# Override specific config values via command line
+# Example: Evaluation of DeepScaleR-1.5B-Preview on MATH-500 using 8 GPUs
+# Pass@1 accuracy averaged over 16 samples for each problem
+uv run python examples/run_eval.py \
+    generation.model_name=agentica-org/DeepScaleR-1.5B-Preview \
+    generation.temperature=0.6 \
+    generation.top_p=0.95 \
+    generation.vllm_cfg.max_model_len=32768 \
+    data.dataset_name=HuggingFaceH4/MATH-500 \
+    data.dataset_key=test \
+    eval.num_tests_per_prompt=16 \
+    cluster.gpus_per_node=8
 ```
-============================================================
-model_name='Qwen2.5-Math-1.5B-Instruct' dataset_name='aime_2024'
-score=0.10 (3.0/30)
-============================================================
-```
+> **Note:** Evaluation results may vary slightly due to factors such as sampling parameters, random seed, inference engine version, and inference engine settings.

-## Example Configuration File
+## Example Evaluation Output

-You can find an example evaluation configuration file [here](../../examples/configs/eval.yaml).
+When the evaluation completes, you will see a summary similar to the following.

-### Prompt Template Configuration
+```
+============================================================
+model_name='Qwen2.5-Math-1.5B-Instruct' dataset_name='aime_2024'
+max_new_tokens=2048 temperature=0.0 top_p=1.0 top_k=-1

-Always remember to use the same `prompt_file` and `system_prompt_file` that were used during training.
+metric='pass@1' num_tests_per_prompt=1

-For open-source models, we recommend setting `prompt_file=null` and `system_prompt_file=null` to allow them to use their native chat templates.
+score=0.1000 (3.0/30)
+============================================================
+```
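The prompt-template advice above is easy to verify in isolation. The following sketch uses the standard Hugging Face `transformers` API (not repo-specific code) to show how a tokenizer's chat template turns a message list into the final prompt string, which is why the eval-time template must match the training-time one:

```python
# Minimal sketch using the standard transformers API (not repo-specific code).
# The same message list renders to different prompt strings under different
# chat templates, so evaluation must reuse the training-time template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-1.5B-Instruct")
messages = [{"role": "user", "content": "What is 2 + 2?"}]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # the exact string the model is evaluated on
```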

docs/guides/grpo.md — 4 additions, 0 deletions

@@ -181,3 +181,7 @@ $$
 By multiplying the first term of the loss function by the importance weights $\frac{\pi_\text{training}(x)}{\pi_\text{inference}(x)}$, we can correct for the distribution mismatch between $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ while still sampling from $\pi_{\text{inference}}$.

 To enable the importance sampling correction, set the config `use_importance_sampling_correction=True` in the `ClippedPGLossConfig`. By default, we set this config to False to align with standard GRPO.
+
+## Evaluate the Trained Model
+
+Upon completion of the training process, you can refer to our [evaluation guide](eval.md) to assess model capabilities.
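To make the correction above concrete, here is a small illustrative sketch of reweighting the first loss term by $\frac{\pi_\text{training}(x)}{\pi_\text{inference}(x)}$ computed from log-probabilities. This is not NeMo RL's actual `ClippedPGLoss` implementation, and all tensor values are made up:

```python
# Illustrative sketch of the importance sampling correction; not the repo's
# ClippedPGLoss code. All values are hypothetical.
import torch

logp_train = torch.tensor([-1.2, -0.8, -2.0])  # log pi_training(x) per sample
logp_infer = torch.tensor([-1.0, -0.9, -1.8])  # log pi_inference(x) per sample
advantages = torch.tensor([0.5, -0.2, 1.0])

# Importance weights pi_training(x) / pi_inference(x), computed in log space.
weights = torch.exp(logp_train - logp_infer)

# First term of the policy-gradient loss, reweighted to correct the
# training/inference mismatch while still sampling from pi_inference.
corrected = weights * advantages
print(corrected)
```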

docs/guides/sft.md — 5 additions, 1 deletion

@@ -75,4 +75,8 @@ NeMo RL SFT uses Hugging Face chat templates to format the individual examples.
 By default, NeMo RL has support for `Squad` and `OpenAssistant` datasets. Both of these datasets are downloaded from Hugging Face and preprocessed on-the-fly, so there's no need to provide a path to any datasets on disk.

 Adding a new dataset is a straightforward process.
-As long as your custom dataset has the `formatted_ds` and `task_spec` attributes described above, it can serve as a drop-in replacement for Squad and OpenAssistant.
+As long as your custom dataset has the `formatted_ds` and `task_spec` attributes described above, it can serve as a drop-in replacement for Squad and OpenAssistant.
+
+## Evaluate the Trained Model
+
+Upon completion of the training process, you can refer to our [evaluation guide](eval.md) to assess model capabilities.

examples/configs/eval.yaml — 6 additions, 1 deletion

@@ -1,10 +1,15 @@
 # Evaluation Configuration
+eval:
+  metric: "pass@1" # only pass@1 is supported now
+  num_tests_per_prompt: 1 # each prompt is tested num_tests_per_prompt times; the average score is used as the final score
+  seed: 42
+
 generation:
   backend: "vllm" # only vllm is supported for evaluation
   max_new_tokens: ${generation.vllm_cfg.max_model_len}
   temperature: 0.0
   top_p: 1.0
-  top_k: -1 # disable
+  top_k: -1 # -1 means disable
   num_prompts_per_step: -1 # -1 means pass all prompts at once
   model_name: "Qwen/Qwen2.5-Math-1.5B-Instruct"
   stop_token_ids: null
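The `${generation.vllm_cfg.max_model_len}` value above is a config interpolation: `max_new_tokens` tracks whatever `max_model_len` is set to. A small sketch of how such references resolve, assuming OmegaConf-style interpolation (an assumption based on the `${...}` syntax; the repo's actual config loader may differ):

```python
# Sketch of ${...} interpolation, assuming OmegaConf semantics (an assumption
# about this repo's config system, inferred from the syntax above).
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    """
    generation:
      vllm_cfg:
        max_model_len: 32768
      max_new_tokens: ${generation.vllm_cfg.max_model_len}
    """
)
print(cfg.generation.max_new_tokens)  # 32768: overriding max_model_len moves it too
```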

nemo_rl/evals/eval.py — 51 additions, 4 deletions

@@ -19,6 +19,7 @@
 from torch.utils.data import DataLoader
 from transformers import AutoTokenizer

+from nemo_rl.algorithms.utils import set_seed
 from nemo_rl.data import MathDataConfig
 from nemo_rl.data.datasets import AllTaskProcessedDataset, eval_collate_fn
 from nemo_rl.data.llm_message_utils import get_keys_from_message_log

@@ -33,7 +34,14 @@
 # ===============================================================================


+class EvalConfig(TypedDict):
+    metric: str
+    num_tests_per_prompt: int
+    seed: int
+
+
 class MasterConfig(TypedDict):
+    eval: EvalConfig
     generate: GenerationConfig
     data: MathDataConfig
     env: MathEnvConfig

@@ -66,9 +74,25 @@ def setup(
         VLLM model, data loader, and config.
     """
     # Extract individual configs for easier access
+    eval_config = master_config["eval"]
     generation_config = master_config["generation"]
     cluster_config = master_config["cluster"]

+    # Set seed for reproducibility
+    set_seed(eval_config["seed"])
+
+    # Check settings
+    metric = eval_config["metric"]
+    num_tests_per_prompt = eval_config["num_tests_per_prompt"]
+    temperature = generation_config["temperature"]
+    top_k = generation_config["top_k"]
+    # TODO @yukih: support pass@k and cons@k
+    assert metric in ["pass@1"], f"Invalid metric: {metric}"
+    if num_tests_per_prompt > 1:
+        assert temperature > 0 and top_k != 1, (
+            "temperature > 0 and top_k != 1 are required for multiple samples"
+        )
+
     # ==========================
     # Data
     # ==========================

@@ -137,15 +161,29 @@ def run_env_eval(vllm_generation, dataloader, env, master_config):
         env: Environment that scores responses.
         master_config: Configuration settings.
     """
+    # Extract for easier access
+    generation_config = master_config["generation"]
+    eval_config = master_config["eval"]
+    metric = eval_config["metric"]
+    num_tests_per_prompt = eval_config["num_tests_per_prompt"]
+
     # Run evaluation loop
     score, count = 0.0, 0
     for batch in dataloader:
+        # update stats
+        count += batch.size * num_tests_per_prompt
+
+        # measure multiple samples
+        if num_tests_per_prompt > 1:
+            batch = batch.repeat_interleave(num_tests_per_prompt)
+
         # get input prompt from message_log
         prompts = []
         for message_log in batch["message_log"]:
             content = [message["content"] for message in message_log]
             content = "\n".join(content)
             prompts.append(content)
+
         # generate by vllm
         inputs = BatchedDataDict({"prompts": prompts})
         outputs = vllm_generation.generate_text(inputs)["texts"]

@@ -166,19 +204,28 @@ def run_env_eval(vllm_generation, dataloader, env, master_config):
         ]
         env_return = ray.get(env.step.remote(to_env, batch["extra_env_info"]))

-        score += env_return.rewards.sum().item()
-        count += len(env_return.rewards)
+        # update stats
+        if metric == "pass@1":
+            score += env_return.rewards.sum().item()
+        else:
+            raise ValueError(f"Invalid metric: {metric}")

     # Cleanup before printing results
     ray.get(env.shutdown.remote())
     vllm_generation.shutdown()

     # Print results
     dataset_name = os.path.basename(master_config["data"]["dataset_name"])
-    model_name = os.path.basename(master_config["generation"]["model_name"])
+    model_name = os.path.basename(generation_config["model_name"])
+    max_new_tokens = generation_config["vllm_cfg"]["max_model_len"]
+    temperature = generation_config["temperature"]
+    top_p = generation_config["top_p"]
+    top_k = generation_config["top_k"]
     average_score = score / count

     print("\n" + "=" * 60)
     print(f"{model_name=} {dataset_name=}")
-    print(f"score={average_score:.2f} ({score}/{count})")
+    print(f"{max_new_tokens=} {temperature=} {top_p=} {top_k=}\n")
+    print(f"{metric=} {num_tests_per_prompt=}\n")
+    print(f"score={average_score:.4f} ({score}/{count})")
     print("=" * 60 + "\n")

tests/unit/models/generation/test_vllm_generation.py — 5 additions, 5 deletions

@@ -800,14 +800,14 @@ def test_vllm_weight_update_memory(cluster, tokenizer, enable_dtensor):
     # Check memory stats
     assert current_allocated == 0.0, "Memory should be 0 after refit completed"
     assert current_reserved == 0.0, "Memory should be 0 after refit completed"
-    # memory threshold: memory during non-streaming weight update on 1B model on 2 GPUs
+    # memory threshold: memory during non-streaming weight update on 0.6B model on 2 GPUs
     # memory during streaming weight update should be less than this baseline threshold
     if enable_dtensor:
-        assert peak_allocated < 8074, "Peak allocated memory should < 8074 MB"
-        assert peak_reserved < 8088, "Peak reserved memory should < 8088 MB"
+        assert peak_allocated < 4005, "Peak allocated memory should < 4005 MB"
+        assert peak_reserved < 4016, "Peak reserved memory should < 4016 MB"
     else:
-        assert peak_allocated < 11286, "Peak allocated memory should < 11286 MB"
-        assert peak_reserved < 11298, "Peak reserved memory should < 11298 MB"
+        assert peak_allocated < 5736, "Peak allocated memory should < 5736 MB"
+        assert peak_reserved < 5748, "Peak reserved memory should < 5748 MB"

     # Clean up
     vllm_policy.shutdown()
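For context, peak-memory assertions like these are usually driven by PyTorch's CUDA memory statistics. A hedged sketch of that measurement pattern follows, using only the standard `torch.cuda` API; how the actual test collects its stats may differ:

```python
# Hedged sketch of the usual peak-memory measurement pattern with the
# standard torch.cuda API; the actual test's stat collection may differ.
import torch

torch.cuda.reset_peak_memory_stats()

# ... run the weight update being measured ...

peak_allocated = torch.cuda.max_memory_allocated() / 1024**2  # MB
peak_reserved = torch.cuda.max_memory_reserved() / 1024**2    # MB
assert peak_allocated < 4005, "Peak allocated memory should < 4005 MB"
assert peak_reserved < 4016, "Peak reserved memory should < 4016 MB"
```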
