# Validation and Evaluation

`torchtitan` provides both direct and indirect support for validation to serve users' training goals. Direct support is provided by the `Validator` class, which hooks directly into the training loop; indirect support is provided through [HuggingFace checkpoint conversion](https://github.com/pytorch/torchtitan/blob/main/docs/checkpoint.md#huggingface) for users who want to evaluate with external tools such as EleutherAI's `lm_eval`.

## Validation
For users who want to run validation directly inside the training loop, we provide the `Validator` class, which can be conveniently overridden through the `TrainSpec` or configured via `JobConfig`. The validator has access to, and reuses, much of the trainer's machinery, such as its parallelization, including pipelining.
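
For illustration, below is a minimal sketch of what such an override could look like. The module paths, the `build_validator_fn` field name, and the `validate()` signature are assumptions based on the description above, not verified torchtitan API; check the source for the current interface.

```python
# A hedged sketch of swapping in a custom validator through a TrainSpec.
# Module paths, build_validator_fn, and the validate() signature are assumed.
from dataclasses import replace

from torchtitan.components.validate import Validator  # assumed location
from torchtitan.protocols.train_spec import get_train_spec  # assumed location


class VerboseValidator(Validator):
    """Example override: add extra logging around the stock validation loop."""

    def validate(self, model_parts, step):
        print(f"[validate] running at train step {step}")
        return super().validate(model_parts, step)


# Swap the validator builder on an existing spec (field name assumed).
spec = replace(get_train_spec("llama3"), build_validator_fn=VerboseValidator)
```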

Below is an example validation config:

```toml
[validation]
enabled = true
dataset = "c4_validation"
freq = 500  # run validation every 500 training steps
steps = -1  # -1 consumes the entire validation set
```

## Third-Party Evaluation
With `./scripts/checkpoint_conversion/convert_to_hf.py`, `torchtitan` supports converting checkpoints from DCP to safetensors format. With this script, users can run efficient evaluation separately from training, using external libraries that support HuggingFace checkpoints, e.g. `lm_eval` with the `vllm` backend.
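
To give a sense of what the conversion involves, here is a rough sketch using public PyTorch and `safetensors` APIs. It assumes a model-weights-only DCP checkpoint with a flat state dict, and it omits details the real script handles, such as remapping parameter names to the HF layout.

```python
# Rough sketch only: consolidate a DCP checkpoint and re-save as safetensors.
# Assumes a flat, model-weights-only state dict; convert_to_hf.py does more.
import torch
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save
from safetensors.torch import save_file

# Consolidate the sharded DCP checkpoint into a single torch.save file.
dcp_to_torch_save("./outputs/checkpoint/step-1000", "/tmp/consolidated.pt")

state_dict = torch.load("/tmp/consolidated.pt", map_location="cpu")
tensors = {k: v.contiguous() for k, v in state_dict.items()
           if isinstance(v, torch.Tensor)}
save_file(tensors, "/tmp/model.safetensors")
```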

### Example usage of `lm_eval` with `vllm`
To use this specific setup, make sure the checkpoint directory includes a HuggingFace `config.json` file, which is not produced by the conversion script or the `last_save_in_hf` option. The HF config file can be downloaded by running `python ./scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets config`.
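
Once the converted weights and `config.json` sit in the same directory, one way to sanity-check the result (assuming the `transformers` library is installed) is to load it as a regular HF model:

```python
# Sanity check: a directory with converted safetensors weights plus a
# config.json should load like any other HuggingFace model directory.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "./outputs/checkpoint/step-1000", torch_dtype="auto"
)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```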

Note that pip-installing `lm-eval` may break the `torchtitan` dev environment, so we recommend creating a separate environment.
```bash
pip install "lm-eval[vllm]"
lm_eval --model vllm \
    --model_args pretrained=./outputs/checkpoint/step-1000,tensor_parallel_size=8,dtype=auto,gpu_memory_utilization=0.8 \
    --tasks mmlu \
    --batch_size auto
```

| Groups            |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|-------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu               |      2|none  |      |acc   |↑  |0.6209|±  |0.0038|
| - humanities      |      2|none  |      |acc   |↑  |0.5481|±  |0.0066|
| - other           |      2|none  |      |acc   |↑  |0.7045|±  |0.0078|
| - social sciences |      2|none  |      |acc   |↑  |0.7351|±  |0.0078|
| - stem            |      2|none  |      |acc   |↑  |0.5357|±  |0.0085|