Merged

67 commits
2696a49
use inspect-ai to evaluate aime25 and gsm8k
NathanHB Oct 7, 2025
578d530
revert file
NathanHB Oct 7, 2025
21fa870
working for 3 tasks
NathanHB Oct 7, 2025
27b2af1
parallel evals of tasks
NathanHB Oct 7, 2025
b9a610d
adds gpqa diamond to inspect
NathanHB Oct 8, 2025
25c1128
move tasks to individual files
NathanHB Oct 13, 2025
0d42edf
move tasks to individual files
NathanHB Oct 13, 2025
6cc3c04
enable extended tasks as well
NathanHB Oct 13, 2025
4c38951
run precomit hook
NathanHB Oct 13, 2025
d2fd5e1
fix mkqa
NathanHB Oct 13, 2025
2ddb0f9
chaange extended suite to lighteval
NathanHB Oct 13, 2025
ee97122
chaange extended suite to lighteval
NathanHB Oct 14, 2025
e2c8e22
add metdata to tasks
NathanHB Oct 14, 2025
c980ddb
add metdata to tasks
NathanHB Oct 14, 2025
57fe390
remove license notice and put docstring on top of file
NathanHB Oct 14, 2025
ee081f2
homogenize tags
NathanHB Oct 14, 2025
1ed1602
add docstring for all multilingual tasks
NathanHB Oct 14, 2025
f4b0e27
add docstring for all multilingual tasks
NathanHB Oct 14, 2025
81d9e4e
add name and dataset to metadata
NathanHB Oct 15, 2025
b734532
use TASKS_TABLE for multilingual tasks
NathanHB Oct 15, 2025
c3911fc
use TASKS_TABLE for default tasks
NathanHB Oct 15, 2025
e439f70
use TASKS_TABLE for default tasks
NathanHB Oct 15, 2025
6447ee7
loads all tasks correclty
NathanHB Oct 15, 2025
88754bf
move community tasks to default tasks and update doc
NathanHB Oct 16, 2025
5445f5c
move community tasks to default tasks and update doc
NathanHB Oct 16, 2025
f53bd76
Merge remote-tracking branch 'origin/main' into nathan-reorg-tasks
NathanHB Oct 16, 2025
6a0c615
revert uneeded changes
NathanHB Oct 16, 2025
1435e38
fix doc build
NathanHB Oct 16, 2025
15f41f2
fix doc build
NathanHB Oct 16, 2025
74e5c0f
remove custom tasks and let user decide if loading multilingual tasks
NathanHB Oct 16, 2025
aad136c
load-tasks multilingual fix
NathanHB Oct 16, 2025
242bc43
update doc
NathanHB Oct 16, 2025
6806bf8
remove uneeded file
NathanHB Oct 16, 2025
e94fa59
update readme
NathanHB Oct 16, 2025
8800d1a
update readme
NathanHB Oct 16, 2025
970f33b
update readme
NathanHB Oct 16, 2025
b8c26dc
fix test
NathanHB Oct 16, 2025
764de72
add back the custom tasks
NathanHB Oct 17, 2025
a326ea8
add back the custom tasks
NathanHB Oct 17, 2025
81081cd
fix tasks
NathanHB Oct 17, 2025
74b40f6
fix tasks
NathanHB Oct 17, 2025
083fb1b
fix tasks
NathanHB Oct 17, 2025
2dab2bf
fix tests
NathanHB Oct 17, 2025
57ca0e5
fix tests
NathanHB Oct 17, 2025
480e40a
add inspect-ai
NathanHB Oct 20, 2025
ade2900
add tasks
NathanHB Oct 29, 2025
079ceaf
add gpqa
NathanHB Oct 29, 2025
8d00799
make model config work
NathanHB Oct 29, 2025
cea5e99
Update src/lighteval/metrics/metrics.py
NathanHB Oct 29, 2025
fb47bb7
init
NathanHB Oct 30, 2025
2736bc9
Merge branch 'nathan-move-to-inspectai' of github.com:huggingface/lig…
NathanHB Oct 30, 2025
d5e6c9f
Merge branch 'main' into nathan-move-to-inspectai
NathanHB Oct 30, 2025
e55a9af
fix tests
NathanHB Oct 30, 2025
ba41f1c
Merge branch 'nathan-move-to-inspectai' of github.com:huggingface/lig…
NathanHB Oct 30, 2025
59c5dcc
fix tests
NathanHB Oct 30, 2025
40254db
fix tests
NathanHB Oct 30, 2025
53275fe
fix tests
NathanHB Oct 30, 2025
72e5c2b
add correct system prompt for hle
NathanHB Oct 30, 2025
7fc1753
add correct system prompt for hle
NathanHB Oct 30, 2025
260d744
review suggestions
NathanHB Nov 3, 2025
835b799
add doc
NathanHB Nov 3, 2025
c216a27
change buttons
NathanHB Nov 3, 2025
21e6020
change buttons
NathanHB Nov 3, 2025
7e65400
change buttons
NathanHB Nov 3, 2025
0a4f6be
move benchmark finder to openeval org
NathanHB Nov 3, 2025
b661d0d
better help for eval
NathanHB Nov 3, 2025
f142b39
better help for eval
NathanHB Nov 3, 2025
9 changes: 4 additions & 5 deletions README.md
@@ -25,7 +25,7 @@
<a href="https://huggingface.co/docs/lighteval/main/en/index" target="_blank">
<img alt="Documentation" src="https://img.shields.io/badge/Documentation-4F4F4F?style=for-the-badge&logo=readthedocs&logoColor=white" />
</a>
<a href="https://huggingface.co/spaces/SaylorTwift/benchmark_finder" target="_blank">
<a href="https://huggingface.co/spaces/OpenEvals/open_benchmark_index" target="_blank">
<img alt="Open Benchmark Index" src="https://img.shields.io/badge/Open%20Benchmark%20Index-4F4F4F?style=for-the-badge&logo=huggingface&logoColor=white" />
</a>
</p>
@@ -44,7 +44,7 @@ sample-by-sample results* to debug and see how your models stack-up.

Lighteval supports **1000+ evaluation tasks** across multiple domains and
languages. Use [this
space](https://huggingface.co/spaces/SaylorTwift/benchmark_finder) to find what
space](https://huggingface.co/spaces/OpenEvals/open_benchmark_index) to find what
you need, or, here's an overview of some *popular benchmarks*:


@@ -107,6 +107,7 @@ huggingface-cli login

Lighteval offers the following entry points for model evaluation:

- `lighteval eval`: Evaluate models using [inspect-ai](https://inspect.aisi.org.uk/) as a backend (preferred).
- `lighteval accelerate`: Evaluate models on CPU or one or more GPUs using [🤗
Accelerate](https://github.com/huggingface/accelerate)
- `lighteval nanotron`: Evaluate models in distributed settings using [⚡️
@@ -126,9 +127,7 @@ Did not find what you need ? You can always make your custom model API by follow
Here's a **quick command** to evaluate a model:

```shell
lighteval accelerate \
"model_name=gpt2" \
"leaderboard|truthfulqa:mc|0"
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" "lighteval|gpqa:diamond|0"
```

Or use the **Python API** to run a model *already loaded in memory*!
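
For reference, here is a minimal sketch of what that can look like with the `Pipeline` API (built here from a model config rather than a pre-loaded model; the class names, module paths, and arguments below are assumptions drawn from lighteval's Python API docs and may differ between versions):

```python
# Sketch only: class names, module paths, and arguments are assumptions based on
# lighteval's documented Python API and may differ between versions.
from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.transformers.transformers_model import TransformersModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters

# Track results locally, keeping per-sample details for debugging.
evaluation_tracker = EvaluationTracker(output_dir="./results", save_details=True)
pipeline_params = PipelineParameters(launcher_type=ParallelismManager.ACCELERATE)
model_config = TransformersModelConfig(model_name="openai-community/gpt2")

pipeline = Pipeline(
    tasks="lighteval|gpqa:diamond|0",
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    model_config=model_config,
)
pipeline.evaluate()
pipeline.show_results()
```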
2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
@@ -7,6 +7,8 @@
title: Quicktour
title: Getting started
- sections:
- local: inspect-ai
title: Examples using Inspect-AI
- local: saving-and-reading-results
title: Save and read results
- local: caching
12 changes: 7 additions & 5 deletions docs/source/available-tasks.mdx
@@ -1,28 +1,30 @@
# Available tasks

Browse and inspect tasks available in LightEval.
<iframe
src="https://saylortwift-benchmark-finder.hf.space"
src="https://openevals-benchmark-finder.hf.space"
frameborder="0"
width="850"
height="450"
></iframe>



You can get a list of all available tasks by running:
List all tasks:

```bash
lighteval tasks list
```

### Inspect Specific Tasks
### Inspect specific tasks

You can inspect a specific task to see its configuration, metrics, and requirements by running:
Inspect a task to view its config, metrics, and requirements:

```bash
lighteval tasks inspect <task_name>
```

For example:
Example:
```bash
lighteval tasks inspect "lighteval|truthfulqa:mc|0"
```
42 changes: 23 additions & 19 deletions docs/source/index.mdx
@@ -9,6 +9,7 @@ and see how your models stack up.

### 🚀 **Multi-Backend Support**
Evaluate your models using the most popular and efficient inference backends:
- `eval`: Use [inspect-ai](https://inspect.aisi.org.uk/) as a backend to evaluate and inspect your models! (preferred way)
- `transformers`: Evaluate models on CPU or one or more GPUs using [🤗
Accelerate](https://github.com/huggingface/transformers)
- `nanotron`: Evaluate models in distributed settings using [⚡️
@@ -45,26 +46,29 @@ pip install lighteval

### Basic Usage

```bash
# Evaluate a model using Transformers backend
lighteval accelerate \
"model_name=openai-community/gpt2" \
"leaderboard|truthfulqa:mc|0"
```
#### Find a task

<iframe
src="https://openevals-open-benchmark-index.hf.space"
frameborder="0"
width="850"
height="450"
></iframe>

### Save Results
#### Run your benchmark and push details to the Hub

```bash
# Save locally
lighteval accelerate \
"model_name=openai-community/gpt2" \
"leaderboard|truthfulqa:mc|0" \
--output-dir ./results

# Push to Hugging Face Hub
lighteval accelerate \
"model_name=openai-community/gpt2" \
"leaderboard|truthfulqa:mc|0" \
--push-to-hub \
--results-org your-username
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" \
"lighteval|gpqa:diamond|0" \
--bundle-dir gpt-oss-bundle \
--repo-id OpenEvals/evals
```

Resulting Space:

<iframe
src="https://openevals-evals.static.hf.space"
frameborder="0"
width="850"
height="450"
></iframe>
120 changes: 120 additions & 0 deletions docs/source/inspect-ai.mdx
@@ -0,0 +1,120 @@
# Evaluate your model with Inspect-AI

Pick the right benchmarks with our benchmark finder:
Search by language, task type, dataset name, or keywords.

> [!WARNING]
> Not all tasks are compatible with inspect-ai's API yet; we are working on converting all of them!


<iframe
src="https://openevals-open-benchmark-index.hf.space"
frameborder="0"
width="850"
height="450"
></iframe>

Once you've chosen a benchmark, run it with `lighteval eval`. Below are examples for common setups.

### Examples

1. Evaluate a model via Hugging Face Inference Providers.

```bash
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" "lighteval|gpqa:diamond|0"
```

2. Run multiple evals at the same time.

```bash
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" "lighteval|gpqa:diamond|0,lighteval|aime25|0"
```

3. Compare providers for the same model.

```bash
lighteval eval \
hf-inference-providers/openai/gpt-oss-20b:fireworks-ai \
hf-inference-providers/openai/gpt-oss-20b:together \
hf-inference-providers/openai/gpt-oss-20b:nebius \
"lighteval|gpqa:diamond|0"
```

4. Evaluate a vLLM or SGLang model.

```bash
lighteval eval vllm/HuggingFaceTB/SmolLM-135M-Instruct "lighteval|gpqa:diamond|0"
```

5. See the impact of few-shot examples on your model.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|gsm8k|0,lighteval|gsm8k|5"
```

6. Optimize custom server connections.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|gsm8k|0" \
--max-connections 50 \
--timeout 30 \
--retry-on-error 1 \
--max-retries 1 \
--max-samples 10
```

7. Use multiple epochs for more reliable results.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|aime25|0" --epochs 16 --epochs-reducer "pass_at_4"
```

8. Push to the Hub to share results.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|hle|0" \
--bundle-dir gpt-oss-bundle \
--repo-id OpenEvals/evals \
--max-samples 100
```

Resulting Space:

<iframe
src="https://openevals-evals.static.hf.space"
frameborder="0"
width="850"
height="450"
></iframe>

9. Change model behaviour.

You can use any argument defined in inspect-ai's API.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|aime25|0" --temperature 0.1
```

10. Use `--model-args` to pass any inference-provider-specific argument.

```bash
lighteval eval google/gemini-2.5-pro "lighteval|aime25|0" --model-args location=us-east5
```

```bash
lighteval eval openai/gpt-4o "lighteval|gpqa:diamond|0" --model-args service_tier=flex,client_timeout=1200
```


LightEval prints a per-model results table:

```
Completed all tasks in 'lighteval-logs' successfully

| Model |gpqa|gpqa:diamond|
|---------------------------------------|---:|-----------:|
|vllm/HuggingFaceTB/SmolLM-135M-Instruct|0.01| 0.01|

results saved to lighteval-logs
run "inspect view --log-dir lighteval-logs" to view the results
```
2 changes: 1 addition & 1 deletion docs/source/quicktour.mdx
@@ -11,7 +11,7 @@ Lighteval can be used with several different commands, each optimized for differ
## Find your benchmark

<iframe
src="https://saylortwift-benchmark-finder.hf.space"
src="https://openevals-open-benchmark-index.hf.space"
frameborder="0"
width="850"
height="450"
1 change: 1 addition & 0 deletions pyproject.toml
@@ -57,6 +57,7 @@ keywords = ["evaluation", "nlp", "llm"]
dependencies = [
# Base dependencies
"transformers>=4.54.0",
"inspect-ai",
"accelerate",
"huggingface_hub[hf_xet]>=0.30.2",
"torch>=2.0,<3.0",
2 changes: 2 additions & 0 deletions src/lighteval/__main__.py
@@ -29,6 +29,7 @@
import lighteval.main_baseline
import lighteval.main_custom
import lighteval.main_endpoint
import lighteval.main_inspect
import lighteval.main_nanotron
import lighteval.main_sglang
import lighteval.main_tasks
@@ -69,6 +70,7 @@
app.command(rich_help_panel="Evaluation Backends")(lighteval.main_vllm.vllm)
app.command(rich_help_panel="Evaluation Backends")(lighteval.main_custom.custom)
app.command(rich_help_panel="Evaluation Backends")(lighteval.main_sglang.sglang)
app.command(rich_help_panel="Evaluation Backends")(lighteval.main_inspect.eval)
app.add_typer(
lighteval.main_endpoint.app,
name="endpoint",