
Commit 69741cf

add docstyle rule; update readme; linting
1 parent 33fedbf commit 69741cf

File tree

11 files changed (+305, -188 lines)


.github/workflows/unit_tests.yml

Lines changed: 4 additions & 0 deletions
@@ -22,6 +22,8 @@ jobs:
       steps:
         - name: Checkout Code
           uses: actions/checkout@v6
+          with:
+            fetch-depth: 0
         - name: Install uv
           uses: astral-sh/setup-uv@v7
           with:
@@ -34,6 +36,8 @@ jobs:
         env:
           SKIP: "no-commit-to-branch,mypy"
         uses: pre-commit/[email protected]
+        with:
+          extra_args: --from-ref ${{ github.event.pull_request.base.sha || 'HEAD~1' }} --to-ref HEAD
   # Job 2
   testcpu:
     name: CPU Tests
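
The two additions work together to lint only the commits in the pull request: `fetch-depth: 0` checks out full history so the base SHA is resolvable, and `extra_args` restricts pre-commit to the base-to-HEAD range (falling back to `HEAD~1` for pushes). A rough local equivalent, sketched under the assumption that the PR base branch is `main` (the branch name is not part of this diff):

```bash
# Fetch the base branch so both range endpoints exist locally
git fetch origin main
# Run hooks only against the commits between the base and HEAD
pre-commit run --from-ref origin/main --to-ref HEAD
```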

README.md

Lines changed: 83 additions & 23 deletions
@@ -63,7 +63,35 @@ cd lm-evaluation-harness
 pip install -e .
 ```

-We also provide a number of optional dependencies for extended functionality. A detailed table is available at the end of this document.
+### Installing Model Backends
+
+The base installation provides the core evaluation framework. **Model backends must be installed separately** using optional extras:
+
+For HuggingFace transformers models:
+
+```bash
+pip install "lm_eval[hf]"
+```
+
+For vLLM inference:
+
+```bash
+pip install "lm_eval[vllm]"
+```
+
+For API-based models (OpenAI, Anthropic, etc.):
+
+```bash
+pip install "lm_eval[api]"
+```
+
+Multiple backends can be installed together:
+
+```bash
+pip install "lm_eval[hf,vllm,api]"
+```
+
+A detailed table of all optional extras is available at the end of this document.

 ## Basic Usage

@@ -75,6 +103,9 @@ A list of supported tasks (or groupings of tasks) can be viewed with `lm-eval --

 ### Hugging Face `transformers`

+> [!Important]
+> To use the HuggingFace backend, first install: `pip install "lm_eval[hf]"`
+
 To evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/models) (e.g. GPT-J-6B) on `hellaswag` you can use the following command (this assumes you are using a CUDA-compatible GPU):

 ```bash
@@ -307,9 +338,9 @@ lm_eval --model vllm \
     --batch_size auto
 ```

-To use vllm, do `pip install lm_eval[vllm]`. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation.
+To use vllm, do `pip install "lm_eval[vllm]"`. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation.

-vLLM occasionally differs in output from Huggingface. We treat Huggingface as the reference implementation, and provide a [script](./scripts/model_comparator.py) for checking the validity of vllm results against HF.
+vLLM occasionally differs in output from Huggingface. We treat Huggingface as the reference implementation and provide a [script](./scripts/model_comparator.py) for checking the validity of vllm results against HF.

 > [!Tip]
 > For fastest performance, we recommend using `--batch_size auto` for vLLM whenever possible, to leverage its continuous batching functionality!
@@ -336,14 +367,17 @@ lm_eval --model sglang \
 ```

 > [!Tip]
-> When encountering out of memory (OOM) errors (especially for multiple-choice tasks), try these solutions:
+> When encountering out-of-memory (OOM) errors (especially for multiple-choice tasks), try these solutions:
 >
 > 1. Use a manual `batch_size`, rather than `auto`.
 > 2. Lower KV cache pool memory usage by adjusting `mem_fraction_static` - Add to your model arguments for example `--model_args pretrained=...,mem_fraction_static=0.7`.
 > 3. Increase tensor parallel size `tp_size` (if using multiple GPUs).

 ### Model APIs and Inference Servers

+> [!Important]
+> To use API-based models, first install: `pip install "lm_eval[api]"`
+
 Our library also supports the evaluation of models served via several commercial APIs, and we hope to implement support for the most commonly used performant local/self-hosted inference servers.

 To call a hosted model, use:
@@ -581,7 +615,7 @@ To get started with development, first clone the repository and install the dev
 ```bash
 git clone https://github.com/EleutherAI/lm-evaluation-harness
 cd lm-evaluation-harness
-pip install -e ".[dev]"
+pip install -e ".[dev,hf]"
 ````

 ### Implementing new tasks
@@ -607,24 +641,50 @@ The best way to get support is to open an issue on this repo or join the [Eleuth

 Extras dependencies can be installed via `pip install -e ".[NAME]"`

-| NAME                 | Description                    | NAME           | Description                           |
-|----------------------|--------------------------------|----------------|---------------------------------------|
-| tasks                | All task-specific dependencies | api            | API models (Anthropic, OpenAI, local) |
-| acpbench             | ACP Bench tasks                | audiolm_qwen   | Qwen2 audio models                    |
-| ifeval               | IFEval task                    |                |                                       |
-| japanese_leaderboard | Japanese LLM tasks             | gptq           | AutoGPTQ models                       |
-| longbench            | LongBench tasks                | gptqmodel      | GPTQModel models                      |
-| math                 | Math answer checking           | hf_transfer    | Speed up HF downloads                 |
-| multilingual         | Multilingual tokenizers        | ibm_watsonx_ai | IBM watsonx.ai models                 |
-| ruler                | RULER tasks                    | ipex           | Intel IPEX backend                    |
-|                      |                                |                |                                       |
-| dev                  | Linting & contributions        | mamba          | Mamba SSM models                      |
-| promptsource         | PromptSource prompts           | neuronx        | AWS inf2 instances                    |
-| sentencepiece        | Sentencepiece tokenizer        | optimum        | Intel OpenVINO models                 |
-| testing              | Run test suite                 | sae_lens       | SAELens model steering                |
-| unitxt               | Run unitxt tasks               |                |                                       |
-| wandb                | Weights & Biases               | sparsify       | Sparsify model steering               |
-| zeno                 | Result visualization           | vllm           | vLLM models                           |
+### Model Backends
+
+These extras install dependencies required to run specific model backends:
+
+| NAME           | Description                                                       |
+|----------------|-------------------------------------------------------------------|
+| hf             | HuggingFace Transformers (torch, transformers, accelerate, peft)  |
+| vllm           | vLLM fast inference                                               |
+| api            | API models (OpenAI, Anthropic, local servers)                     |
+| gptq           | AutoGPTQ quantized models                                         |
+| gptqmodel      | GPTQModel quantized models                                        |
+| ibm_watsonx_ai | IBM watsonx.ai models                                             |
+| ipex           | Intel IPEX backend                                                |
+| optimum        | Intel OpenVINO models                                             |
+| neuronx        | AWS Inferentia2 instances                                         |
+| sparsify       | Sparsify model steering                                           |
+| sae_lens       | SAELens model steering                                            |
+
+### Task Dependencies
+
+These extras install dependencies required for specific evaluation tasks:
+
+| NAME                 | Description                    |
+|----------------------|--------------------------------|
+| tasks                | All task-specific dependencies |
+| acpbench             | ACP Bench tasks                |
+| audiolm_qwen         | Qwen2 audio models             |
+| ifeval               | IFEval task                    |
+| japanese_leaderboard | Japanese LLM tasks             |
+| longbench            | LongBench tasks                |
+| math                 | Math answer checking           |
+| multilingual         | Multilingual tokenizers        |
+| ruler                | RULER tasks                    |
+
+### Development & Utilities
+
+| NAME          | Description               |
+|---------------|---------------------------|
+| dev           | Linting & contributions   |
+| hf_transfer   | Speed up HF downloads     |
+| sentencepiece | Sentencepiece tokenizer   |
+| unitxt        | Unitxt tasks              |
+| wandb         | Weights & Biases logging  |
+| zeno          | Zeno result visualization |

 ## Cite as
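
Taken together with the new install instructions, a minimal end-to-end sketch of the split install flow (the model name, device, and batch size below are illustrative, not part of this commit):

```bash
# Install the core framework plus the HuggingFace backend extra
pip install "lm_eval[hf]"

# Evaluate a HuggingFace Hub model on hellaswag (assumes a CUDA GPU)
lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device cuda:0 \
    --batch_size 8
```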

lm_eval/__init__.py

Lines changed: 9 additions & 0 deletions
@@ -4,6 +4,15 @@

 __version__ = "0.4.9.2"

+# Enable hf_transfer if available
+try:
+    import hf_transfer  # type: ignore
+    import huggingface_hub.constants  # type: ignore
+
+    huggingface_hub.constants.HF_HUB_ENABLE_HF_TRANSFER = True
+except ImportError:
+    pass
+

 # Lazy-load .evaluator module to improve CLI startup
 def __getattr__(name):
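
The new block is a silent opt-in: when the `hf_transfer` package is importable, it flips `HF_HUB_ENABLE_HF_TRANSFER` on for `huggingface_hub`; otherwise the `ImportError` is swallowed and nothing changes. A minimal sketch of opting in via the existing extra from the README table (no extra configuration appears to be needed, since the flag is set when `lm_eval` is imported):

```bash
pip install "lm_eval[hf_transfer]"
```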

lm_eval/api/registry.py

Lines changed: 11 additions & 9 deletions
@@ -19,7 +19,7 @@ def __init__(self, **kwargs):
 ### Registering with Lazy Loading
 ```python
 # Register without importing the actual implementation
-model_registry.register("lazy-model", target="my_package.models:LazyModel")
+model_registry.register("lazy-model", target="my_package.models: LazyModel")
 ```

 ### Looking up Components
@@ -39,9 +39,10 @@ def __init__(self, **kwargs):
 import inspect
 import logging
 import threading
+from collections.abc import Callable
 from functools import lru_cache
 from types import MappingProxyType
-from typing import TYPE_CHECKING, Any, Generic, TypeVar, Union, cast, overload
+from typing import TYPE_CHECKING, Any, Generic, TypeVar, cast, overload


 eval_logger = logging.getLogger(__name__)
@@ -53,6 +54,7 @@ def __init__(self, **kwargs):
     from lm_eval.api.filter import Filter
     from lm_eval.api.model import LM

+
 __all__ = [
     # Core registry class
     "Registry",
@@ -201,7 +203,7 @@ def register(
         >>> class MyModel(LM):
         ...     pass
         >>>
-        >>> # Direct registration with lazy placeholder
+        >>> # Direct registration with a lazy placeholder
         >>> model_registry.register("lazy-name", target="mymodule:MyModel")

         Raises:
@@ -401,7 +403,7 @@ def _clear(self): # pragma: no cover
         """Erase registry (for isolated tests).

         Clears both the registry contents and the materialization cache.
-        Only use this in test code to ensure clean state between tests.
+        Only use this in test code to ensure a clean state between tests.
         """
         if isinstance(self._objs, MappingProxyType):
             self._objs = dict(self._objs)  # type: ignore[assignment]
@@ -520,7 +522,7 @@ def get_model(model_name: str):
 # =============================================================================


-def register_filter(name):
+def register_filter(name: str):
     """Decorator to register a filter class.

     Args:
@@ -550,12 +552,12 @@ def get_filter(filter_name: str | Callable) -> Callable:
         The filter class/function

     Raises:
-        KeyError: If filter name is not found and is not callable
+        KeyError: If a filter name is not found and is not callable
     """
     if callable(filter_name):
         return filter_name
     try:
-        return filter_registry.get(filter_name)
+        return filter_registry.get(cast("str", filter_name))
     except KeyError as e:
         eval_logger.warning(f"filter `{filter_name}` is not registered!")
         raise e
@@ -574,7 +576,7 @@ def register_metric(**args):
     """Decorator to register a metric function.

     Args:
-        **args: Keyword arguments including:
+        **args: Keyword arguments including
             - metric: Name to register the metric under (required)
             - higher_is_better: Whether higher scores are better
             - aggregation: Name of aggregation function to use
@@ -609,7 +611,7 @@ def get_metric(name: str, hf_evaluate_metric: bool = False) -> Callable | None:

     Args:
         name: The metric name
-        hf_evaluate_metric: If True, skip local registry and use HF evaluate
+        hf_evaluate_metric: If True, skip the local registry and use HF evaluate

     Returns:
         The metric compute function, or None if not found