# LightEval 🌤️
A lightweight LLM evaluation suite

## Context
LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library [datatrove](https://github.com/huggingface/datatrove) and LLM training library [nanotron](https://github.com/huggingface/nanotron).

We're releasing it with the community in the spirit of building in the open.

Note that it is still very early days, so don't expect 100% stability ^^'
In case of problems or questions, feel free to open an issue!
## News
- **Feb 08, 2024**: Release of `lighteval`
## Deep thanks
`lighteval` was originally built on top of the great [Eleuther AI Harness](https://github.com/EleutherAI/lm-evaluation-harness) (which is powering the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)). We also took a lot of inspiration from the amazing [HELM](https://crfm.stanford.edu/helm/latest/), notably for metrics.

As we added more and more logging functionalities, made it compatible with increasingly different workflows and model codebases (including 3D parallelism), and allowed custom evaluation experiments, metrics and benchmarks, we ended up changing the code deeply enough that `lighteval` became the small standalone library it is now.

However, we are very grateful to the Harness and HELM teams for their continued work on better evaluations.
## How to navigate this project
`lighteval` is supposed to be used as a standalone evaluation library.
- To run the evaluations, you can use `run_evals_accelerate.py` or `run_evals_nanotron.py`.
- [src/lighteval](https://github.com/huggingface/lighteval/tree/main/src/lighteval) contains the core of the lib itself
    - [lighteval](https://github.com/huggingface/lighteval/tree/main/src/lighteval) contains the core of the library, divided into the following sections
        - [main_accelerate.py](https://github.com/huggingface/lighteval/blob/main/src/main_accelerate.py) and [main_nanotron.py](https://github.com/huggingface/lighteval/blob/main/src/main_nanotron.py) are our entry points to run evaluation
        - [logging](https://github.com/huggingface/lighteval/tree/main/src/lighteval/logging): Our loggers, to display experiment information and push it to the hub after a run
        - [metrics](https://github.com/huggingface/lighteval/tree/main/src/lighteval/metrics): All the available metrics you can use. They are divided between sample metrics (applied at the sample level, such as a prediction accuracy) and corpus metrics (applied over the whole corpus). You'll also find the available normalisation functions there.
        - [models](https://github.com/huggingface/lighteval/tree/main/src/lighteval/models): Possible models to use. We cover transformers (base_model), with adapter or delta weights, as well as TGI models locally deployed (it's likely the code here is out of date though), and brrr/nanotron models.
        - [tasks](https://github.com/huggingface/lighteval/tree/main/src/lighteval/tasks): Available tasks. The complete list is in `tasks_table.jsonl`, and you'll find all the prompts in `tasks_prompt_formatting.py`.
- [tasks_examples](https://github.com/huggingface/lighteval/tree/main/tasks_examples) contains a list of available tasks you can launch. We advise using tasks in the `recommended_set`, as it's possible that some of the other tasks need double checking.
- [tests](https://github.com/huggingface/lighteval/tree/main/tests) contains our test suite, which we run at each PR to prevent regressions in metrics/prompts/tasks, for a subset of important tasks.

## How to install and use
Note:
- Use the Eleuther AI Harness (`lm_eval`) to share comparable numbers with everyone (e.g. on the Open LLM Leaderboard).
- Use `lighteval` during training with the nanotron/datatrove LLM training stack and/or for quick eval/benchmark experiments.
### Installation
Create your virtual environment using virtualenv or conda depending on your preferences. We require Python 3.10 or above.
```bash
conda create -n lighteval python==3.10
```
Clone the package
```bash
git clone https://github.com/huggingface/lighteval.git
cd lighteval
```
Install the dependencies. For the default installation, you just need:
```bash
pip install -e .
```
If you want to run your models using accelerate, tgi or optimum, do quantization, or use adapter weights, you will need to specify the optional dependency groups fitting your use case (`accelerate`, `tgi`, `optimum`, `quantization`, `adapters`, `nanotron`) at install time:
```bash
pip install -e .[optional1,optional2]
```
If you want to push your results to the hub, don't forget to add your user token to the environment variable `HUGGING_FACE_HUB_TOKEN`.

Lastly, if you intend to push to the codebase, you'll need to install the pre-commit hook for styling checks:
```bash
pip install pre-commit
pre-commit install
pre-commit run --config .pre-commit-config.yaml --all-files
```
Optional steps:
- to load and push big models/datasets, your machine likely needs Git LFS. You can install it with `sudo apt-get install git-lfs`
- If you want to run bigbench evaluations, install bigbench with `pip install "bigbench@https://storage.googleapis.com/public_research_data/bigbench/bigbench-0.0.1.tar.gz"`
### Testing that everything was installed correctly
If you want to test your installation, you can run your first evaluation on GPUs (8 GPUs, single node), using:
- `python run_evals_accelerate.py --model_args="pretrained=<path to your model on the hub>" <task parameters> --output_dir output_dir`
- Using data parallelism on several GPUs (recommended):
    - If you want to use data parallelism, first configure accelerate (`accelerate config`).
    - `accelerate launch <accelerate parameters> run_evals_accelerate.py --model_args="pretrained=<path to your model on the hub>" <task parameters> --output_dir=<your output dir>`

### Adding a new metric

If you want to add a new metric, first check if you can use one of the parametrized functions in `src.lighteval.metrics.metrics_corpus` or `src.lighteval.metrics.metrics_sample`. If not, add it to either of these files depending on the level at which it is applied.
Then, follow the example in `src.lighteval.metrics.metrics` to register your metric.
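
To visualise the sample-level vs corpus-level split, here is a minimal, purely illustrative sketch. The function names and signatures below are hypothetical, not lighteval's actual API; for the real registration conventions, copy an existing entry in `src.lighteval.metrics.metrics`.

```python
# Hypothetical sketch only: real lighteval metrics follow the conventions in
# src/lighteval/metrics/metrics.py, not this simplified interface.

def exact_match_sample(prediction: str, gold: str) -> int:
    """Sample-level score: 1 if the stripped prediction equals the gold answer, else 0."""
    return int(prediction.strip() == gold.strip())


def exact_match_corpus(sample_scores: list[int]) -> float:
    """Corpus-level aggregation: average the per-sample scores over the whole corpus."""
    return sum(sample_scores) / len(sample_scores) if sample_scores else 0.0
```
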
### Adding a new task
To add a new task, first **add its dataset** on the hub.

Then, **find a suitable prompt function** or **create a new prompt function** in `src.lighteval.tasks.tasks_prompt_formatting.py`. This function must output a `Doc` object, which should contain `query`, your prompt, and either `gold`, the gold output, or `choices` and `gold_index`, the list of choices and index or indices of correct answers. If your query contains an instruction which should not be repeated in a few-shot setup, add it to an `instruction` field.
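
As a purely illustrative sketch of the shape such a function takes: the import path, function signature and dataset column names below are assumptions, so mirror an existing function in `tasks_prompt_formatting.py` for the exact conventions.

```python
# Hypothetical example - the import location of Doc and the dataset columns
# ("question", "choices", "answer_index") depend on your lighteval version and dataset.
from lighteval.tasks.requests import Doc


def my_task_prompt(line):
    """Turn one dataset row (`line`) into a Doc with a query, choices and a gold index."""
    return Doc(
        query=f"Question: {line['question']}\nAnswer:",        # the prompt sent to the model
        choices=[f" {choice}" for choice in line["choices"]],   # candidate answers
        gold_index=line["answer_index"],                        # index of the correct answer in `choices`
        instruction="",  # only set this if the query starts with an instruction that should not be repeated few-shot
    )
```
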
Lastly, create a **line summary** of your evaluation, in `src/lighteval/tasks/tasks_table.jsonl`. This summary should contain the following fields:
- `name` (str), your evaluation name
- `suite` (list), the suite(s) to which your evaluation should belong. This field allows us to compare different task implementations and is used as a task selection to differentiate the versions to launch. At the moment, you'll find the keywords ["helm", "bigbench", "original", "lighteval"]; you can also add new ones (for testing, we recommend using "custom").
- `prompt_function` (str), the name of the prompt function you defined in the step above
These metrics need both the generation and its logprob. They are not working at the moment, as this function is not available in the AI Harness.
- `prediction_perplexity` (HELM): Measure of the logprob of a given input.
## Examples of scripts to launch lighteval on the cluster