
Commit 95b20f9

Merge pull request #3428 from EleutherAI/registry_
refactor: lazy registry; lightweight core
2 parents f83f960 + 69741cf commit 95b20f9

File tree

20 files changed (+1515, −397 lines)


.github/workflows/new_tasks.yml

Lines changed: 10 additions & 14 deletions
```diff
@@ -16,7 +16,7 @@ jobs:
     name: Scan for changed tasks
     steps:
       - name: checkout
-        uses: actions/checkout@v4
+        uses: actions/checkout@v6
         with:
           fetch-depth: 2 # OR "2" -> To retrieve the preceding commit.

@@ -25,7 +25,7 @@ jobs:
       # and prepends the filter name to the standard output names.
       - name: Check task folders
         id: changed-tasks
-        uses: tj-actions/changed-files@v46.0.5
+        uses: tj-actions/changed-files@24d32ffd492484c1d75e0c0b894501ddb9d30d62
         with:
           # tasks checks the tasks folder and api checks the api folder for changes
           files_yaml: |
@@ -44,28 +44,24 @@ jobs:
           echo "One or more test file(s) has changed."
           echo "List of all the files that have changed: ${{ steps.changed-tasks.outputs.tasks_all_modified_files }}"

-      - name: Set up Python 3.10
+      - name: Install uv
         if: steps.changed-tasks.outputs.tasks_any_modified == 'true' || steps.changed-tasks.outputs.api_any_modified == 'true'
-        uses: actions/setup-python@v5
+        uses: astral-sh/setup-uv@v7
         with:
-          python-version: '3.10'
-          cache: 'pip'
-          cache-dependency-path: pyproject.toml
+          enable-cache: true
+          python-version: "3.10"
+          activate-environment: true
       - name: Install dependencies
         if: steps.changed-tasks.outputs.tasks_any_modified == 'true' || steps.changed-tasks.outputs.api_any_modified == 'true'
         run: |
-          python -m pip install --upgrade pip
-          pip install -e '.[dev,ifeval,unitxt,math,longbench]' --extra-index-url https://download.pytorch.org/whl/cpu
-          # Install optional git dependencies
-          # pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt
-          # if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
+          uv pip install -e '.[dev,ifeval,unitxt,math,longbench,hf]' --extra-index-url https://download.pytorch.org/whl/cpu
       - name: Test with pytest
         # if new tasks are added, run tests on them
         if: steps.changed-tasks.outputs.tasks_any_modified == 'true'
-        run: python -m pytest tests/test_tasks.py -s -vv
+        run: pytest -x -s -vv tests/test_tasks.py
       # if api is modified, run tests on it
       - name: Test more tasks with pytest
         env:
           API: true
         if: steps.changed-tasks.outputs.api_any_modified == 'true'
-        run: python -m pytest tests/test_tasks.py -s -vv
+        run: pytest -x -s -vv -n=auto tests/test_tasks.py
```
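The workflow's install-and-test steps can be reproduced locally. A rough sketch, assuming `uv` is already installed and the commands are run from a clone of the repository:

```shell
# Create a venv pinned to the same Python the workflow uses
uv venv --python 3.10
source .venv/bin/activate

# Install the harness with the same extras as CI, pulling
# CPU-only torch wheels from the PyTorch index
uv pip install -e '.[dev,ifeval,unitxt,math,longbench,hf]' \
  --extra-index-url https://download.pytorch.org/whl/cpu

# Run the task tests the way the workflow now does
pytest -x -s -vv tests/test_tasks.py
```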

.github/workflows/unit_tests.yml

Lines changed: 22 additions & 17 deletions
```diff
@@ -21,17 +21,23 @@ jobs:

     steps:
       - name: Checkout Code
-        uses: actions/checkout@v4
-      - name: Set up Python 3.10
-        uses: actions/setup-python@v5
+        uses: actions/checkout@v6
         with:
-          python-version: '3.10'
-          cache: pip
-          cache-dependency-path: pyproject.toml
+          fetch-depth: 0
+      - name: Install uv
+        uses: astral-sh/setup-uv@v7
+        with:
+          enable-cache: true
+          python-version: "3.10"
+          activate-environment: true
+      - name: Install pip
+        run: uv pip install pip
       - name: Pre-Commit
         env:
           SKIP: "no-commit-to-branch,mypy"
         uses: pre-commit/[email protected]
+        with:
+          extra_args: --from-ref ${{ github.event.pull_request.base.sha || 'HEAD~1' }} --to-ref HEAD
   # Job 2
   testcpu:
     name: CPU Tests
@@ -43,13 +49,13 @@ jobs:
     timeout-minutes: 30
     steps:
       - name: Checkout Code
-        uses: actions/checkout@v4
-      - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v5
+        uses: actions/checkout@v6
+      - name: Install uv
+        uses: astral-sh/setup-uv@v7
         with:
+          enable-cache: true
           python-version: ${{ matrix.python-version }}
-          cache: pip
-          cache-dependency-path: pyproject.toml
+          activate-environment: true

       # Cache HuggingFace cache directory for CPU tests
       - name: Cache HuggingFace cache (CPU tests)
@@ -63,17 +69,16 @@ jobs:

       - name: Install dependencies
         run: |
-          python -m pip install --upgrade pip
-          pip install -e '.[dev,unitxt]' --extra-index-url https://download.pytorch.org/whl/cpu
-          pip install hf_xet
+          uv pip install -e '.[dev,unitxt,hf]' --extra-index-url https://download.pytorch.org/whl/cpu
+          uv pip install hf_xet

       - name: Test with pytest
-        run: python -m pytest --showlocals -s -vv -n=auto --ignore=tests/models/test_openvino.py --ignore=tests/models/test_hf_steered.py
-        continue-on-error: true # Continue workflow even if tests fail
+        run: pytest -x --showlocals -s -vv -n=auto --ignore=tests/models/test_openvino.py --ignore=tests/models/test_hf_steered.py --ignore=tests/scripts/test_zeno_visualize.py

       # Save test artifacts
       - name: Archive test artifacts
-        uses: actions/upload-artifact@v4
+        if: always() # Upload artifacts even if tests fail
+        uses: actions/upload-artifact@v5
         with:
           name: output_testcpu${{ matrix.python-version }}
           path: |
```
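The `extra_args` added to the Pre-Commit step limit hook runs to files changed on the branch, instead of the whole tree. The same invocation works locally; a sketch, assuming `pre-commit` is installed and `origin/main` is the base branch of your work:

```shell
# Run hooks only on files changed since the base ref, mirroring the
# workflow's --from-ref/--to-ref arguments; SKIP disables the same
# hooks the workflow skips
SKIP="no-commit-to-branch,mypy" pre-commit run \
  --from-ref origin/main --to-ref HEAD
```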

.gitignore

Lines changed: 3 additions & 0 deletions
```diff
@@ -45,3 +45,6 @@ examples/wandb/

 # PyInstaller
 *.spec
+
+#uv
+uv.lock
```

.pre-commit-config.yaml

Lines changed: 2 additions & 2 deletions
```diff
@@ -27,7 +27,7 @@ repos:
       - id: mixed-line-ending
         args: [ --fix=lf ]
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.13.2
+    rev: v0.14.6
     hooks:
       # Run the linter.
       - id: ruff-check
@@ -46,7 +46,7 @@ repos:

         args: [ --check-filenames, --check-hidden, --ignore-words=ignore.txt ]
   - repo: https://github.com/jackdewinter/pymarkdown
-    rev: v0.9.32
+    rev: v0.9.33
     hooks:
       - id: pymarkdown
         exclude: ^(lm_eval/tasks/.*|docs/footguns\.md)$
```

README.md

Lines changed: 83 additions & 23 deletions
`````diff
@@ -63,7 +63,35 @@ cd lm-evaluation-harness
 pip install -e .
 ```

-We also provide a number of optional dependencies for extended functionality. A detailed table is available at the end of this document.
+### Installing Model Backends
+
+The base installation provides the core evaluation framework. **Model backends must be installed separately** using optional extras:
+
+For HuggingFace transformers models:
+
+```bash
+pip install "lm_eval[hf]"
+```
+
+For vLLM inference:
+
+```bash
+pip install "lm_eval[vllm]"
+```
+
+For API-based models (OpenAI, Anthropic, etc.):
+
+```bash
+pip install "lm_eval[api]"
+```
+
+Multiple backends can be installed together:
+
+```bash
+pip install "lm_eval[hf,vllm,api]"
+```
+
+A detailed table of all optional extras is available at the end of this document.

@@ -75,6 +103,9 @@ A list of supported tasks (or groupings of tasks) can be viewed with `lm-eval --

 ### Hugging Face `transformers`

+> [!Important]
+> To use the HuggingFace backend, first install: `pip install "lm_eval[hf]"`
+
 To evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/models) (e.g. GPT-J-6B) on `hellaswag` you can use the following command (this assumes you are using a CUDA-compatible GPU):

 ```bash
@@ -307,9 +338,9 @@ lm_eval --model vllm \
     --batch_size auto
 ```

-To use vllm, do `pip install lm_eval[vllm]`. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation.
+To use vllm, do `pip install "lm_eval[vllm]"`. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation.

-vLLM occasionally differs in output from Huggingface. We treat Huggingface as the reference implementation, and provide a [script](./scripts/model_comparator.py) for checking the validity of vllm results against HF.
+vLLM occasionally differs in output from Huggingface. We treat Huggingface as the reference implementation and provide a [script](./scripts/model_comparator.py) for checking the validity of vllm results against HF.

 > [!Tip]
 > For fastest performance, we recommend using `--batch_size auto` for vLLM whenever possible, to leverage its continuous batching functionality!
@@ -336,14 +367,17 @@ lm_eval --model sglang \
 ```

 > [!Tip]
-> When encountering out of memory (OOM) errors (especially for multiple-choice tasks), try these solutions:
+> When encountering out-of-memory (OOM) errors (especially for multiple-choice tasks), try these solutions:
 >
 > 1. Use a manual `batch_size`, rather than `auto`.
 > 2. Lower KV cache pool memory usage by adjusting `mem_fraction_static` - Add to your model arguments for example `--model_args pretrained=...,mem_fraction_static=0.7`.
 > 3. Increase tensor parallel size `tp_size` (if using multiple GPUs).

 ### Model APIs and Inference Servers

+> [!Important]
+> To use API-based models, first install: `pip install "lm_eval[api]"`
+
 Our library also supports the evaluation of models served via several commercial APIs, and we hope to implement support for the most commonly used performant local/self-hosted inference servers.

 To call a hosted model, use:
@@ -581,7 +615,7 @@ To get started with development, first clone the repository and install the dev
 ```bash
 git clone https://github.com/EleutherAI/lm-evaluation-harness
 cd lm-evaluation-harness
-pip install -e ".[dev]"
+pip install -e ".[dev,hf]"
 ```

 ### Implementing new tasks
@@ -607,24 +641,50 @@ The best way to get support is to open an issue on this repo or join the [Eleuth

 Extras dependencies can be installed via `pip install -e ".[NAME]"`

-| NAME                 | Description                    | NAME           | Description                           |
-|----------------------|--------------------------------|----------------|---------------------------------------|
-| tasks                | All task-specific dependencies | api            | API models (Anthropic, OpenAI, local) |
-| acpbench             | ACP Bench tasks                | audiolm_qwen   | Qwen2 audio models                    |
-| ifeval               | IFEval task                    |                |                                       |
-| japanese_leaderboard | Japanese LLM tasks             | gptq           | AutoGPTQ models                       |
-| longbench            | LongBench tasks                | gptqmodel      | GPTQModel models                      |
-| math                 | Math answer checking           | hf_transfer    | Speed up HF downloads                 |
-| multilingual         | Multilingual tokenizers        | ibm_watsonx_ai | IBM watsonx.ai models                 |
-| ruler                | RULER tasks                    | ipex           | Intel IPEX backend                    |
-|                      |                                |                |                                       |
-| dev                  | Linting & contributions        | mamba          | Mamba SSM models                      |
-| promptsource         | PromptSource prompts           | neuronx        | AWS inf2 instances                    |
-| sentencepiece        | Sentencepiece tokenizer        | optimum        | Intel OpenVINO models                 |
-| testing              | Run test suite                 | sae_lens       | SAELens model steering                |
-| unitxt               | Run unitxt tasks               |                |                                       |
-| wandb                | Weights & Biases               | sparsify       | Sparsify model steering               |
-| zeno                 | Result visualization           | vllm           | vLLM models                           |
+### Model Backends
+
+These extras install dependencies required to run specific model backends:
+
+| NAME           | Description                                                      |
+|----------------|------------------------------------------------------------------|
+| hf             | HuggingFace Transformers (torch, transformers, accelerate, peft) |
+| vllm           | vLLM fast inference                                              |
+| api            | API models (OpenAI, Anthropic, local servers)                    |
+| gptq           | AutoGPTQ quantized models                                        |
+| gptqmodel      | GPTQModel quantized models                                       |
+| ibm_watsonx_ai | IBM watsonx.ai models                                            |
+| ipex           | Intel IPEX backend                                               |
+| optimum        | Intel OpenVINO models                                            |
+| neuronx        | AWS Inferentia2 instances                                        |
+| sparsify       | Sparsify model steering                                          |
+| sae_lens       | SAELens model steering                                           |
+
+### Task Dependencies
+
+These extras install dependencies required for specific evaluation tasks:
+
+| NAME                 | Description                    |
+|----------------------|--------------------------------|
+| tasks                | All task-specific dependencies |
+| acpbench             | ACP Bench tasks                |
+| audiolm_qwen         | Qwen2 audio models             |
+| ifeval               | IFEval task                    |
+| japanese_leaderboard | Japanese LLM tasks             |
+| longbench            | LongBench tasks                |
+| math                 | Math answer checking           |
+| multilingual         | Multilingual tokenizers        |
+| ruler                | RULER tasks                    |
+
+### Development & Utilities
+
+| NAME          | Description               |
+|---------------|---------------------------|
+| dev           | Linting & contributions   |
+| hf_transfer   | Speed up HF downloads     |
+| sentencepiece | Sentencepiece tokenizer   |
+| unitxt        | Unitxt tasks              |
+| wandb         | Weights & Biases logging  |
+| zeno          | Zeno result visualization |

 ## Cite as
`````

lm_eval/__init__.py

Lines changed: 9 additions & 0 deletions
```diff
@@ -4,6 +4,15 @@

 __version__ = "0.4.9.2"

+# Enable hf_transfer if available
+try:
+    import hf_transfer  # type: ignore
+    import huggingface_hub.constants  # type: ignore
+
+    huggingface_hub.constants.HF_HUB_ENABLE_HF_TRANSFER = True
+except ImportError:
+    pass
+

 # Lazy-load .evaluator module to improve CLI startup
 def __getattr__(name):
```
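The `__getattr__` hook at the end of this diff is the PEP 562 module-level lazy-import pattern: importing `lm_eval` stays cheap, and the heavy submodule loads only when first accessed. A minimal self-contained sketch of the mechanism, using a hypothetical `evaluator` stand-in rather than the harness's real module:

```python
import sys
import types

calls = []  # records each time the lazy loader fires


def __getattr__(name):
    # PEP 562: invoked for attribute lookups on this module that miss
    # its namespace -- the expensive import would happen only here.
    if name == "evaluator":
        calls.append(name)
        mod = types.SimpleNamespace(simple_evaluate=lambda: "evaluated")
        globals()[name] = mod  # cache so the loader runs at most once
        return mod
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")


this = sys.modules[__name__]  # access attributes the way an importer would

assert calls == []                       # nothing loaded yet
print(this.evaluator.simple_evaluate())  # first access triggers the load
assert calls == ["evaluator"]
this.evaluator                           # cached: loader not called again
assert calls == ["evaluator"]
```

The `globals()[name] = mod` caching step is what keeps repeated attribute access free after the first load.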
