Merged

67 commits
2696a49
use inspect-ai to evaluate aime25 and gsm8k
NathanHB Oct 7, 2025
578d530
revert file
NathanHB Oct 7, 2025
21fa870
working for 3 tasks
NathanHB Oct 7, 2025
27b2af1
parallel evals of tasks
NathanHB Oct 7, 2025
b9a610d
adds gpqa diamond to inspect
NathanHB Oct 8, 2025
25c1128
move tasks to individual files
NathanHB Oct 13, 2025
0d42edf
move tasks to individual files
NathanHB Oct 13, 2025
6cc3c04
enable extended tasks as well
NathanHB Oct 13, 2025
4c38951
run precomit hook
NathanHB Oct 13, 2025
d2fd5e1
fix mkqa
NathanHB Oct 13, 2025
2ddb0f9
chaange extended suite to lighteval
NathanHB Oct 13, 2025
ee97122
chaange extended suite to lighteval
NathanHB Oct 14, 2025
e2c8e22
add metdata to tasks
NathanHB Oct 14, 2025
c980ddb
add metdata to tasks
NathanHB Oct 14, 2025
57fe390
remove license notice and put docstring on top of file
NathanHB Oct 14, 2025
ee081f2
homogenize tags
NathanHB Oct 14, 2025
1ed1602
add docstring for all multilingual tasks
NathanHB Oct 14, 2025
f4b0e27
add docstring for all multilingual tasks
NathanHB Oct 14, 2025
81d9e4e
add name and dataset to metadata
NathanHB Oct 15, 2025
b734532
use TASKS_TABLE for multilingual tasks
NathanHB Oct 15, 2025
c3911fc
use TASKS_TABLE for default tasks
NathanHB Oct 15, 2025
e439f70
use TASKS_TABLE for default tasks
NathanHB Oct 15, 2025
6447ee7
loads all tasks correclty
NathanHB Oct 15, 2025
88754bf
move community tasks to default tasks and update doc
NathanHB Oct 16, 2025
5445f5c
move community tasks to default tasks and update doc
NathanHB Oct 16, 2025
f53bd76
Merge remote-tracking branch 'origin/main' into nathan-reorg-tasks
NathanHB Oct 16, 2025
6a0c615
revert uneeded changes
NathanHB Oct 16, 2025
1435e38
fix doc build
NathanHB Oct 16, 2025
15f41f2
fix doc build
NathanHB Oct 16, 2025
74e5c0f
remove custom tasks and let user decide if loading multilingual tasks
NathanHB Oct 16, 2025
aad136c
load-tasks multilingual fix
NathanHB Oct 16, 2025
242bc43
update doc
NathanHB Oct 16, 2025
6806bf8
remove uneeded file
NathanHB Oct 16, 2025
e94fa59
update readme
NathanHB Oct 16, 2025
8800d1a
update readme
NathanHB Oct 16, 2025
970f33b
update readme
NathanHB Oct 16, 2025
b8c26dc
fix test
NathanHB Oct 16, 2025
764de72
add back the custom tasks
NathanHB Oct 17, 2025
a326ea8
add back the custom tasks
NathanHB Oct 17, 2025
81081cd
fix tasks
NathanHB Oct 17, 2025
74b40f6
fix tasks
NathanHB Oct 17, 2025
083fb1b
fix tasks
NathanHB Oct 17, 2025
2dab2bf
fix tests
NathanHB Oct 17, 2025
57ca0e5
fix tests
NathanHB Oct 17, 2025
480e40a
add inspect-ai
NathanHB Oct 20, 2025
ade2900
add tasks
NathanHB Oct 29, 2025
079ceaf
add gpqa
NathanHB Oct 29, 2025
8d00799
make model config work
NathanHB Oct 29, 2025
cea5e99
Update src/lighteval/metrics/metrics.py
NathanHB Oct 29, 2025
fb47bb7
init
NathanHB Oct 30, 2025
2736bc9
Merge branch 'nathan-move-to-inspectai' of github.com:huggingface/lig…
NathanHB Oct 30, 2025
d5e6c9f
Merge branch 'main' into nathan-move-to-inspectai
NathanHB Oct 30, 2025
e55a9af
fix tests
NathanHB Oct 30, 2025
ba41f1c
Merge branch 'nathan-move-to-inspectai' of github.com:huggingface/lig…
NathanHB Oct 30, 2025
59c5dcc
fix tests
NathanHB Oct 30, 2025
40254db
fix tests
NathanHB Oct 30, 2025
53275fe
fix tests
NathanHB Oct 30, 2025
72e5c2b
add correct system prompt for hle
NathanHB Oct 30, 2025
7fc1753
add correct system prompt for hle
NathanHB Oct 30, 2025
260d744
review suggestions
NathanHB Nov 3, 2025
835b799
add doc
NathanHB Nov 3, 2025
c216a27
change buttons
NathanHB Nov 3, 2025
21e6020
change buttons
NathanHB Nov 3, 2025
7e65400
change buttons
NathanHB Nov 3, 2025
0a4f6be
move benchmark finder to openeval org
NathanHB Nov 3, 2025
b661d0d
better help for eval
NathanHB Nov 3, 2025
f142b39
better help for eval
NathanHB Nov 3, 2025
9 changes: 4 additions & 5 deletions README.md
@@ -25,7 +25,7 @@
<a href="https://huggingface.co/docs/lighteval/main/en/index" target="_blank">
<img alt="Documentation" src="https://img.shields.io/badge/Documentation-4F4F4F?style=for-the-badge&logo=readthedocs&logoColor=white" />
</a>
<a href="https://huggingface.co/spaces/SaylorTwift/benchmark_finder" target="_blank">
<a href="https://huggingface.co/spaces/OpenEvals/open_benchmark_index" target="_blank">
<img alt="Open Benchmark Index" src="https://img.shields.io/badge/Open%20Benchmark%20Index-4F4F4F?style=for-the-badge&logo=huggingface&logoColor=white" />
</a>
</p>
@@ -44,7 +44,7 @@ sample-by-sample results* to debug and see how your models stack-up.

Lighteval supports **1000+ evaluation tasks** across multiple domains and
languages. Use [this
space](https://huggingface.co/spaces/SaylorTwift/benchmark_finder) to find what
space](https://huggingface.co/spaces/OpenEvals/open_benchmark_index) to find what
you need, or, here's an overview of some *popular benchmarks*:


@@ -107,6 +107,7 @@ huggingface-cli login

Lighteval offers the following entry points for model evaluation:

- `lighteval eval`: Evaluate models using [inspect-ai](https://inspect.aisi.org.uk/) as a backend (preferred).
- `lighteval accelerate`: Evaluate models on CPU or one or more GPUs using [🤗
Accelerate](https://github.com/huggingface/accelerate)
- `lighteval nanotron`: Evaluate models in distributed settings using [⚡️
@@ -126,9 +127,7 @@ Did not find what you need ? You can always make your custom model API by follow
Here's a **quick command** to evaluate a model:

```shell
lighteval accelerate \
"model_name=gpt2" \
"leaderboard|truthfulqa:mc|0"
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" "lighteval|gpqa:diamond|0"
```

Or use the **Python API** to run a model *already loaded in memory*!
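
For reference, here is a minimal sketch of what that can look like with the `Pipeline` API (built here from a model config rather than a pre-loaded model; the class names, module paths, and arguments below are assumptions drawn from lighteval's Python API docs and may differ between versions):

```python
# Sketch only: class names, module paths, and arguments are assumptions based on
# lighteval's documented Python API and may differ between versions.
from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.transformers.transformers_model import TransformersModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters

# Track results locally, keeping per-sample details for debugging.
evaluation_tracker = EvaluationTracker(output_dir="./results", save_details=True)
pipeline_params = PipelineParameters(launcher_type=ParallelismManager.ACCELERATE)
model_config = TransformersModelConfig(model_name="openai-community/gpt2")

pipeline = Pipeline(
    tasks="lighteval|gpqa:diamond|0",
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    model_config=model_config,
)
pipeline.evaluate()
pipeline.show_results()
```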
2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
@@ -7,6 +7,8 @@
title: Quicktour
title: Getting started
- sections:
- local: inspect-ai
title: Examples using Inspect-AI
- local: saving-and-reading-results
title: Save and read results
- local: caching
12 changes: 7 additions & 5 deletions docs/source/available-tasks.mdx
@@ -1,28 +1,30 @@
# Available tasks

Browse and inspect tasks available in LightEval.
<iframe
src="https://saylortwift-benchmark-finder.hf.space"
src="https://openevals-benchmark-finder.hf.space"
frameborder="0"
width="850"
height="450"
></iframe>



You can get a list of all available tasks by running:
List all tasks:

```bash
lighteval tasks list
```

### Inspect Specific Tasks
### Inspect specific tasks

You can inspect a specific task to see its configuration, metrics, and requirements by running:
Inspect a task to view its config, metrics, and requirements:

```bash
lighteval tasks inspect <task_name>
```

For example:
Example:
```bash
lighteval tasks inspect "lighteval|truthfulqa:mc|0"
```
42 changes: 23 additions & 19 deletions docs/source/index.mdx
@@ -9,6 +9,7 @@ and see how your models stack up.

### 🚀 **Multi-Backend Support**
Evaluate your models using the most popular and efficient inference backends:
- `eval`: Use [inspect-ai](https://inspect.aisi.org.uk/) as a backend to evaluate and inspect your models! (preferred way)
- `transformers`: Evaluate models on CPU or one or more GPUs using [🤗
Accelerate](https://github.com/huggingface/transformers)
- `nanotron`: Evaluate models in distributed settings using [⚡️
@@ -45,26 +46,29 @@ pip install lighteval

### Basic Usage

```bash
# Evaluate a model using Transformers backend
lighteval accelerate \
"model_name=openai-community/gpt2" \
"leaderboard|truthfulqa:mc|0"
```
#### Find a task

<iframe
src="https://openevals-open-benchmark-index.hf.space"
frameborder="0"
width="850"
height="450"
></iframe>

### Save Results
#### Run your benchmark and push details to the Hub

```bash
# Save locally
lighteval accelerate \
"model_name=openai-community/gpt2" \
"leaderboard|truthfulqa:mc|0" \
--output-dir ./results

# Push to Hugging Face Hub
lighteval accelerate \
"model_name=openai-community/gpt2" \
"leaderboard|truthfulqa:mc|0" \
--push-to-hub \
--results-org your-username
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" \
"lighteval|gpqa:diamond|0" \
--bundle-dir gpt-oss-bundle \
--repo-id OpenEvals/evals
```

Resulting Space:

<iframe
src="https://openevals-evals.static.hf.space"
frameborder="0"
width="850"
height="450"
></iframe>
120 changes: 120 additions & 0 deletions docs/source/inspect-ai.mdx
@@ -0,0 +1,120 @@
# Evaluate your model with Inspect-AI

Pick the right benchmarks with our benchmark finder:
Search by language, task type, dataset name, or keywords.

> [!WARNING]
> Not all tasks are compatible with inspect-ai's API yet; we are working on converting all of them!


<iframe
src="https://openevals-open-benchmark-index.hf.space"
frameborder="0"
width="850"
height="450"
></iframe>

Once you've chosen a benchmark, run it with `lighteval eval`. Below are examples for common setups.

### Examples

1. Evaluate a model via Hugging Face Inference Providers.

```bash
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" "lighteval|gpqa:diamond|0"
```

2. Run multiple evals at the same time.

```bash
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" "lighteval|gpqa:diamond|0,lighteval|aime25|0"
```

3. Compare providers for the same model.

```bash
lighteval eval \
hf-inference-providers/openai/gpt-oss-20b:fireworks-ai \
hf-inference-providers/openai/gpt-oss-20b:together \
hf-inference-providers/openai/gpt-oss-20b:nebius \
"lighteval|gpqa:diamond|0"
```

4. Evaluate a vLLM or SGLang model.

```bash
lighteval eval vllm/HuggingFaceTB/SmolLM-135M-Instruct "lighteval|gpqa:diamond|0"
```

5. See the impact of few-shot examples on your model.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|gsm8k|0,lighteval|gsm8k|5"
```

6. Optimize custom server connections.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|gsm8k|0" \
--max-connections 50 \
--timeout 30 \
--retry-on-error 1 \
--max-retries 1 \
--max-samples 10
```

7. Use multiple epochs for more reliable results.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|aime25|0" --epochs 16 --epochs-reducer "pass_at_4"
```

8. Push to the Hub to share results.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|hle|0" \
--bundle-dir gpt-oss-bundle \
--repo-id OpenEvals/evals \
--max-samples 100
```

Resulting Space:

<iframe
src="https://openevals-evals.static.hf.space"
frameborder="0"
width="850"
height="450"
></iframe>

9. Change model behaviour.

You can use any argument defined in inspect-ai's API.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|aime25|0" --temperature 0.1
```

10. Use `--model-args` to pass any inference-provider-specific argument.

```bash
lighteval eval google/gemini-2.5-pro "lighteval|aime25|0" --model-args location=us-east5
```

```bash
lighteval eval openai/gpt-4o "lighteval|gpqa:diamond|0" --model-args service_tier=flex,client_timeout=1200
```


LightEval prints a per-model results table:

```
Completed all tasks in 'lighteval-logs' successfully

| Model |gpqa|gpqa:diamond|
|---------------------------------------|---:|-----------:|
|vllm/HuggingFaceTB/SmolLM-135M-Instruct|0.01| 0.01|

results saved to lighteval-logs
run "inspect view --log-dir lighteval-logs" to view the results
```
2 changes: 1 addition & 1 deletion docs/source/quicktour.mdx
@@ -11,7 +11,7 @@ Lighteval can be used with several different commands, each optimized for differ
## Find your benchmark

<iframe
src="https://saylortwift-benchmark-finder.hf.space"
src="https://openevals-open-benchmark-index.hf.space"
frameborder="0"
width="850"
height="450"
1 change: 1 addition & 0 deletions pyproject.toml
@@ -57,6 +57,7 @@ keywords = ["evaluation", "nlp", "llm"]
dependencies = [
# Base dependencies
"transformers>=4.54.0",
"inspect-ai",
"accelerate",
"huggingface_hub[hf_xet]>=0.30.2",
"torch>=2.0,<3.0",
2 changes: 2 additions & 0 deletions src/lighteval/__main__.py
@@ -29,6 +29,7 @@
import lighteval.main_baseline
import lighteval.main_custom
import lighteval.main_endpoint
import lighteval.main_inspect
import lighteval.main_nanotron
import lighteval.main_sglang
import lighteval.main_tasks
@@ -69,6 +70,7 @@
app.command(rich_help_panel="Evaluation Backends")(lighteval.main_vllm.vllm)
app.command(rich_help_panel="Evaluation Backends")(lighteval.main_custom.custom)
app.command(rich_help_panel="Evaluation Backends")(lighteval.main_sglang.sglang)
app.command(rich_help_panel="Evaluation Backends")(lighteval.main_inspect.eval)
app.add_typer(
lighteval.main_endpoint.app,
name="endpoint",