Commit b409d95: Update README (#19)
Co-authored-by: Nathan Habib <[email protected]>
Parent: 8aaf51c


README.md

Lines changed: 84 additions & 26 deletions
@@ -1,41 +1,97 @@
# LightEval 🌤️
A lightweight LLM evaluation suite

## Context
LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library [datatrove](https://github.com/huggingface/datatrove) and LLM training library [nanotron](https://github.com/huggingface/nanotron).

We're releasing it with the community in the spirit of building in the open.

Note that it is still at an early stage, so don't expect 100% stability ^^'
In case of problems or questions, feel free to open an issue!

## News
- **Feb 08, 2024**: Release of `lighteval`

## Deep thanks
`lighteval` was originally built on top of the great [Eleuther AI Harness](https://github.com/EleutherAI/lm-evaluation-harness) (which is powering the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)). We also took a lot of inspiration from the amazing [HELM](https://crfm.stanford.edu/helm/latest/), notably for metrics.

As we added more and more logging functionalities, made the code compatible with increasingly varied workflows and model codebases (including 3D parallelism), and allowed custom evaluation experiments, metrics and benchmarks, we ended up changing it so deeply that `lighteval` became the small standalone library it is now.

However, we are very grateful to the Harness and HELM teams for their continued work on better evaluations.

## How to navigate this project
`lighteval` is meant to be used as a standalone evaluation library.
- To run the evaluations, you can use `run_evals_accelerate.py` or `run_evals_nanotron.py`.
- [src/lighteval](https://github.com/huggingface/lighteval/tree/main/src/lighteval) contains the core of the lib itself
  - [lighteval](https://github.com/huggingface/lighteval/tree/main/src/lighteval) contains the core of the library, divided into the following sections
    - [main_accelerate.py](https://github.com/huggingface/lighteval/blob/main/src/main_accelerate.py) and [main_nanotron.py](https://github.com/huggingface/lighteval/blob/main/src/main_nanotron.py) are our entry points to run evaluation
    - [logging](https://github.com/huggingface/lighteval/tree/main/src/lighteval/logging): Our loggers, to display experiment information and push it to the hub after a run
    - [metrics](https://github.com/huggingface/lighteval/tree/main/src/lighteval/metrics): All the available metrics you can use. They are divided between sample metrics (applied at the sample level, such as a prediction accuracy) and corpus metrics (applied over the whole corpus). You'll also find the available normalisation functions.
    - [models](https://github.com/huggingface/lighteval/tree/main/src/lighteval/models): Possible models to use. We cover transformers (base_model), with adapter or delta weights, as well as locally deployed TGI models (it's likely the code here is out of date though), and brrr/nanotron models.
    - [tasks](https://github.com/huggingface/lighteval/tree/main/src/lighteval/tasks): Available tasks. The complete list is in `tasks_table.jsonl`, and you'll find all the prompts in `tasks_prompt_formatting.py`.
- [tasks_examples](https://github.com/huggingface/lighteval/tree/main/tasks_examples) contains a list of available tasks you can launch. We advise using tasks in the `recommended_set`, as it's possible that some of the other tasks need double checking.
- [tests](https://github.com/huggingface/lighteval/tree/main/tests) contains our test suite, which we run at each PR to prevent regressions in metrics/prompts/tasks, for a subset of important tasks.

## How to install and use

Note:
- Use the Eleuther AI Harness (`lm_eval`) to share comparable numbers with everyone (e.g. on the Open LLM Leaderboard).
- Use `lighteval` during training with the nanotron/datatrove LLM training stack and/or for quick eval/benchmark experiments.

### Installation
Create your virtual environment using virtualenv or conda depending on your preferences. We require Python 3.10 or above.
```bash
conda create -n lighteval python==3.10
```

Clone the repository:
```bash
git clone https://github.com/huggingface/lighteval.git
cd lighteval
```

Install the dependencies. For the default installation, you just need:
```bash
pip install -e .
```

If you want to run your models using accelerate, tgi or optimum, do quantization, or use adapter weights, you will need to specify the optional dependencies group fitting your use case (`accelerate`,`tgi`,`optimum`,`quantization`,`adapters`,`nanotron`) at install time:
```bash
pip install -e .[optional1,optional2]
```

The setup we tested most is:
```bash
pip install -e .[accelerate,quantization,adapters]
```

If you want to push your results to the hub, don't forget to add your user token to the environment variable `HUGGING_FACE_HUB_TOKEN`.

Lastly, if you intend to push to the code base, you'll need to install the pre-commit hooks for the styling checks:
```bash
pip install pre-commit
pre-commit install
```

Optional steps:
- To load and push big models/datasets, your machine likely needs Git LFS. You can install it with `sudo apt-get install git-lfs`.
- If you want to run bigbench evaluations, install bigbench with `pip install "bigbench@https://storage.googleapis.com/public_research_data/bigbench/bigbench-0.0.1.tar.gz"`.

### Testing that everything was installed correctly
If you want to test your install, you can run your first evaluation on GPUs (8 GPUs, single node), using:
```bash
mkdir tmp
python -m accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py --model_args "pretrained=gpt2" --tasks tasks_examples/open_llm_leaderboard_tasks.txt --override_batch_size 1 --save_details --output_dir="tmp/"
```

### Usage
- Launching on CPU
  - `python run_evals_accelerate.py --model_args="pretrained=<path to your model on the hub>" <task parameters> --output_dir output_dir`
- Using data parallelism on several GPUs (recommended)
  - If you want to use data parallelism, first configure accelerate (`accelerate config`).
  - `accelerate launch <accelerate parameters> run_evals_accelerate.py --model_args="pretrained=<path to your model on the hub>" <task parameters> --output_dir=<your output dir>`
    For instance: `python -m accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py --model_args "pretrained=gpt2" --tasks tasks_examples/open_llm_leaderboard_tasks.txt --override_batch_size 1 --save_details --output_dir=tmp/`
  - Note: if you use model_parallel, accelerate will use 2 processes for model parallelism and num_processes for data parallelism.

The task parameters indicate which tasks you want to launch. You can select:
@@ -44,14 +100,19 @@ The task parameters indicate which tasks you want to launch. You can select:

Example: if you want to compare hellaswag from HELM and the Harness on GPT-J-6B, you can run
`python run_evals_accelerate.py --model hf_causal --model_args="pretrained=EleutherAI/gpt-j-6b" --tasks "helm|hellaswag|0|0,lighteval|hellaswag|0|0" --output_dir output_dir`

## Customisation
### Adding a new metric
If you want to add a new metric, first check if you can use one of the parametrized functions in `src.lighteval.metrics.metrics_corpus` or `src.lighteval.metrics.metrics_sample`. If not, add it to either of these files depending on the level at which it is applied.
Then, follow the example in `src.lighteval.metrics.metrics` to register your metric.
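The snippet below is a minimal, illustrative sketch of the sample-level versus corpus-level split described above; it is not lighteval's actual metric or registration API, and the function names are hypothetical.

```python
from statistics import mean


def _normalize(text: str) -> str:
    """Light normalisation so trivial formatting differences are not counted as errors."""
    return text.strip().lower()


def exact_match_sample(prediction: str, golds: list[str]) -> float:
    """Sample-level metric: 1.0 if the prediction exactly matches any gold answer, else 0.0."""
    return float(_normalize(prediction) in [_normalize(gold) for gold in golds])


def exact_match_corpus(sample_scores: list[float]) -> float:
    """Corpus-level aggregation: average the per-sample scores over the whole corpus."""
    return mean(sample_scores) if sample_scores else 0.0


# Toy usage: score two samples, then aggregate over the "corpus".
scores = [
    exact_match_sample("Paris", ["Paris"]),
    exact_match_sample("Lyon", ["Paris"]),
]
print(exact_match_corpus(scores))  # 0.5
```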

### Adding a new task
To add a new task, first **add its dataset** on the hub.

Then, **find a suitable prompt function** or **create a new prompt function** in `src/lighteval/tasks/tasks_prompt_formatting.py`. This function must output a `Doc` object, which should contain `query`, your prompt, and either `gold`, the gold output, or `choices` and `gold_index`, the list of choices and index or indices of correct answers. If your query contains an instruction which should not be repeated in a few shot setup, add it to an `instruction` field.
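As a rough sketch only (the field names `query`, `choices`, `gold_index` and `instruction` come from the description above, but the import path, the exact `Doc` constructor signature, the dataset column names and the function name `my_task_prompt` are assumptions rather than the library's confirmed API), such a prompt function could look like:

```python
from lighteval.tasks.requests import Doc  # assumed import path for the Doc object


def my_task_prompt(line: dict, task_name: str | None = None) -> Doc:
    """Turn one raw dataset line into a Doc consumed by the evaluation loop (illustrative only)."""
    # Hypothetical dataset columns: question, option_a..option_d, label.
    choices = [line["option_a"], line["option_b"], line["option_c"], line["option_d"]]
    return Doc(
        task_name=task_name,
        query=f"Question: {line['question']}\nAnswer:",  # the prompt sent to the model
        choices=choices,                                  # the list of possible answers
        gold_index=int(line["label"]),                    # index (or indices) of the correct answer(s)
        instruction="",  # only set this if the instruction must not be repeated in a few-shot setup
    )
```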
Lastly, create a **line summary** of your evaluation, in `src/lighteval/tasks/tasks_table.jsonl`. This summary should contain the following fields:
- `name` (str), your evaluation name
- `suite` (list), the suite(s) to which your evaluation should belong. This field allows us to compare different task implementations and is used as a task selection to differentiate the versions to launch. At the moment, you'll find the keywords ["helm", "bigbench", "original", "lighteval"]; you can also add new ones (for testing, we recommend using "custom").
- `prompt_function` (str), the name of the prompt function you defined in the step above
@@ -146,9 +207,6 @@ To keep compatibility with the Harness for some specific tasks, we ported their
These metrics need both the generation and its logprob. They are not working at the moment, as this function is not available in the AI Harness.
- `prediction_perplexity` (HELM): Measure of the logprob of a given input.

## Examples of scripts to launch lighteval on the cluster
### Evaluate a whole suite on one node, 8 GPUs
1) Create a config file for accelerate
@@ -197,7 +255,7 @@ source <path_to_your_venv>/activate #or conda activate yourenv
cd <path_to_your_lighteval>/lighteval

export CUDA_LAUNCH_BLOCKING=1
srun accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py --model_args "pretrained=your model name" --tasks tasks_examples/open_llm_leaderboard_tasks.txt --override_batch_size 1 --save_details --output_dir=your output dir
```

## Releases
