Commit 6dea3bf

athitten authored and marta-sd committed
Add more details to the eval docs (#64)
Signed-off-by: Abhishree <abhishreetm@gmail.com>
1 parent 9e9ef95 commit 6dea3bf

File tree: 2 files changed, +18 −8 lines changed

docs/evaluation-doc.md

Lines changed: 17 additions & 7 deletions
@@ -66,12 +66,20 @@ It also provides predefined configurations for evaluating the chat endpoint, suc
 - `wikilingua`
 
 
-When specifying the task, you can either use the task name from the list above or prepend it with the harness name. For example:
+When specifying the task in the `EvaluationConfig` (see the detailed code examples in the [Evaluate Models Locally on Your Workstation](#evaluate-models-locally-on-your-workstation) section), you can either use the task name from the list above or prepend it with the harness name. For example:
 
 ```python
-task = "mmlu"
-task = "lm-evaluation-harness.mmlu"
-task = "lm_evaluation_harness.mmlu"
+eval_config = EvaluationConfig(type="mmlu")
+eval_config = EvaluationConfig(type="lm-evaluation-harness.mmlu")
+eval_config = EvaluationConfig(type="lm_evaluation_harness.mmlu")
+```
+
+A subtask of a benchmark (for example, `mmlu_str_high_school_european_history` under `mmlu`) can also be specified in the same way:
+
+```python
+eval_config = EvaluationConfig(type="mmlu_str_high_school_european_history")
+eval_config = EvaluationConfig(type="lm-evaluation-harness.mmlu_str_high_school_european_history")
+eval_config = EvaluationConfig(type="lm_evaluation_harness.mmlu_str_high_school_european_history")
 ```
 
 To enable additional evaluation harnesses, like `simple-evals`, `BFCL`, `garak`, `BigCode`, or `safety-harness`, you need to install them. For example:
@@ -84,8 +92,8 @@ For more information on enabling additional evaluation harnesses, see ["Add On-D
 If multiple harnesses are installed in your environment and they define a task with the same name, you must use the `<harness>.<task>` format to avoid ambiguity. For example:
 
 ```python
-task = "lm-evaluation-harness.mmlu"
-task = "simple-evals.mmlu"
+eval_config = EvaluationConfig(type="lm-evaluation-harness.mmlu")
+eval_config = EvaluationConfig(type="simple-evals.mmlu")
 ```
 
 To evaluate your model on a task without a pre-defined config, see ["Run Evaluation Using Task Without Pre-Defined Config"](custom-task.md).
@@ -94,7 +102,7 @@ To evaluate your model on a task without a pre-defined config, see ["Run Evaluat
 
 This section outlines the steps to deploy and evaluate a checkpoint trained by the NeMo Framework directly using Python commands. This method is quick and easy, making it ideal for evaluation on a local workstation with GPUs, as it facilitates easier debugging. However, for running evaluations on clusters, it is recommended to use NeMo Run for its ease of use (see the next section).
 
-The entry point for deployment is the `deploy` method defined in `nemo/collections/llm/api.py`. Below is an example command for deployment. It uses a Hugging Face LLaMA 3 8B checkpoint that has been converted to NeMo format. To evaluate a checkpoint saved during [pretraining](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/quickstart.html#pretraining) or [fine-tuning](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/quickstart.html#fine-tuning), provide the path to the saved checkpoint using the `nemo_checkpoint` argument in the `deploy` command below.
+The entry point for deployment is the `deploy` method defined in `nemo_eval/api.py`. Below is an example command for deployment. It uses a Hugging Face LLaMA 3 8B checkpoint that has been converted to NeMo format. To evaluate a checkpoint saved during [pretraining](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/quickstart.html#pretraining) or [fine-tuning](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/quickstart.html#fine-tuning), provide the path to the saved checkpoint using the `nemo_checkpoint` argument in the `deploy` command below.
 
 ```python
 from nemo_eval.api import deploy
@@ -131,6 +139,8 @@ if __name__ == "__main__":
 
 > **Note:** Please refer to the `deploy` and `evaluate` methods in `nemo_eval/api.py` to review all available argument options, as the provided commands are only examples and do not include all arguments or their default values. For more detailed information on the arguments used in the `ApiEndpoint` and `ConfigParams` classes for evaluation, see the source code at [nemo_eval/utils/api.py](https://github.com/NVIDIA-NeMo/Eval/blob/main/src/nemo_eval/utils/api.py).
 
+> **Tip:** If you encounter a `TimeoutError` on the eval client side, increase the `request_timeout` parameter in the `ConfigParams` class to a larger value such as `1000` or `1200` seconds (the default is `300`).
+
 ## Run Evaluations with NeMo Run
 
 This section explains how to run evaluations with NeMo Run. For detailed information about [NeMo Run](https://github.com/NVIDIA/NeMo-Run), please refer to its documentation. Below is a concise guide focused on using NeMo Run to perform evaluations in NeMo.
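The `request_timeout` tip added in the hunk above amounts to overriding a single field on the evaluation parameters. As a minimal stand-in sketch (the `ConfigParams` dataclass here is a hypothetical mirror for illustration, not the real `nemo_eval` import; only the field name `request_timeout` and its default of 300 seconds come from the docs):

```python
from dataclasses import dataclass


@dataclass
class ConfigParams:
    """Hypothetical stand-in for nemo_eval's ConfigParams (illustration only)."""

    request_timeout: int = 300  # seconds; the default stated in the Tip


# Raise the client-side timeout to avoid TimeoutError on slow generations.
params = ConfigParams(request_timeout=1200)
print(params.request_timeout)  # → 1200
```

For the real field set accepted by `ConfigParams`, consult the source at `nemo_eval/utils/api.py` linked in the Note.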

docs/evaluation-with-ray.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ This guide explains how to deploy and evaluate NeMo Framework models, trained wi
 
 Deployment with Ray Serve provides support for multiple replicas of your model across available GPUs, enabling higher throughput and better resource utilization during evaluation. This approach is particularly beneficial for evaluation scenarios where you need to process large datasets efficiently and would like to accelerate evaluation.
 
-> **Note:** Multi-instance evaluation with Ray is currently supported only on single-node with model parallelism. Support for multi-node will be added in upcoming releases.
+> **Note:** Multi-instance evaluation with Ray is currently supported only on a single node with model parallelism; multi-node support will be added in upcoming releases. Ray support is also currently limited to generation benchmarks; support for logprob benchmarks will be added in upcoming releases. For more details on generation versus logprob benchmarks, refer to the ["Evaluate Checkpoints Trained by NeMo Framework"](evaluation-doc.md) section.
 
 ### Key Benefits of Ray Deployment
 