
Commit 69741cf

add docstyle rule; update readme; linting
1 parent 33fedbf commit 69741cf

File tree

11 files changed (+305, -188 lines)


.github/workflows/unit_tests.yml

Lines changed: 4 additions & 0 deletions
@@ -22,6 +22,8 @@ jobs:
       steps:
         - name: Checkout Code
           uses: actions/checkout@v6
+          with:
+            fetch-depth: 0
         - name: Install uv
           uses: astral-sh/setup-uv@v7
           with:
@@ -34,6 +36,8 @@ jobs:
         env:
           SKIP: "no-commit-to-branch,mypy"
         uses: pre-commit/[email protected]
+        with:
+          extra_args: --from-ref ${{ github.event.pull_request.base.sha || 'HEAD~1' }} --to-ref HEAD
   # Job 2
   testcpu:
     name: CPU Tests
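
The two additions work together to lint only the commits in the pull request: `fetch-depth: 0` checks out full history so the base SHA is resolvable, and `extra_args` restricts pre-commit to the base-to-HEAD range (falling back to `HEAD~1` for pushes). A rough local equivalent, sketched under the assumption that the PR base branch is `main` (the branch name is not part of this diff):

```bash
# Fetch the base branch so both range endpoints exist locally
git fetch origin main
# Run hooks only against the commits between the base and HEAD
pre-commit run --from-ref origin/main --to-ref HEAD
```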

README.md

Lines changed: 83 additions & 23 deletions
@@ -63,7 +63,35 @@ cd lm-evaluation-harness
 pip install -e .
 ```

-We also provide a number of optional dependencies for extended functionality. A detailed table is available at the end of this document.
+### Installing Model Backends
+
+The base installation provides the core evaluation framework. **Model backends must be installed separately** using optional extras:
+
+For HuggingFace transformers models:
+
+```bash
+pip install "lm_eval[hf]"
+```
+
+For vLLM inference:
+
+```bash
+pip install "lm_eval[vllm]"
+```
+
+For API-based models (OpenAI, Anthropic, etc.):
+
+```bash
+pip install "lm_eval[api]"
+```
+
+Multiple backends can be installed together:
+
+```bash
+pip install "lm_eval[hf,vllm,api]"
+```
+
+A detailed table of all optional extras is available at the end of this document.

 ## Basic Usage

@@ -75,6 +103,9 @@ A list of supported tasks (or groupings of tasks) can be viewed with `lm-eval --

 ### Hugging Face `transformers`

+> [!Important]
+> To use the HuggingFace backend, first install: `pip install "lm_eval[hf]"`
+
 To evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/models) (e.g. GPT-J-6B) on `hellaswag` you can use the following command (this assumes you are using a CUDA-compatible GPU):

 ```bash
@@ -307,9 +338,9 @@ lm_eval --model vllm \
     --batch_size auto
 ```

-To use vllm, do `pip install lm_eval[vllm]`. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation.
+To use vllm, do `pip install "lm_eval[vllm]"`. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation.

-vLLM occasionally differs in output from Huggingface. We treat Huggingface as the reference implementation, and provide a [script](./scripts/model_comparator.py) for checking the validity of vllm results against HF.
+vLLM occasionally differs in output from Huggingface. We treat Huggingface as the reference implementation and provide a [script](./scripts/model_comparator.py) for checking the validity of vllm results against HF.

 > [!Tip]
 > For fastest performance, we recommend using `--batch_size auto` for vLLM whenever possible, to leverage its continuous batching functionality!
@@ -336,14 +367,17 @@ lm_eval --model sglang \
 ```

 > [!Tip]
-> When encountering out of memory (OOM) errors (especially for multiple-choice tasks), try these solutions:
+> When encountering out-of-memory (OOM) errors (especially for multiple-choice tasks), try these solutions:
 >
 > 1. Use a manual `batch_size`, rather than `auto`.
 > 2. Lower KV cache pool memory usage by adjusting `mem_fraction_static` - Add to your model arguments for example `--model_args pretrained=...,mem_fraction_static=0.7`.
 > 3. Increase tensor parallel size `tp_size` (if using multiple GPUs).

 ### Model APIs and Inference Servers

+> [!Important]
+> To use API-based models, first install: `pip install "lm_eval[api]"`
+
 Our library also supports the evaluation of models served via several commercial APIs, and we hope to implement support for the most commonly used performant local/self-hosted inference servers.

 To call a hosted model, use:
@@ -581,7 +615,7 @@ To get started with development, first clone the repository and install the dev
 ```bash
 git clone https://github.com/EleutherAI/lm-evaluation-harness
 cd lm-evaluation-harness
-pip install -e ".[dev]"
+pip install -e ".[dev,hf]"
 ````

 ### Implementing new tasks
@@ -607,24 +641,50 @@ The best way to get support is to open an issue on this repo or join the [Eleuth

 Extras dependencies can be installed via `pip install -e ".[NAME]"`

-| NAME                 | Description                    | NAME           | Description                           |
-|----------------------|--------------------------------|----------------|---------------------------------------|
-| tasks                | All task-specific dependencies | api            | API models (Anthropic, OpenAI, local) |
-| acpbench             | ACP Bench tasks                | audiolm_qwen   | Qwen2 audio models                    |
-| ifeval               | IFEval task                    |                |                                       |
-| japanese_leaderboard | Japanese LLM tasks             | gptq           | AutoGPTQ models                       |
-| longbench            | LongBench tasks                | gptqmodel      | GPTQModel models                      |
-| math                 | Math answer checking           | hf_transfer    | Speed up HF downloads                 |
-| multilingual         | Multilingual tokenizers        | ibm_watsonx_ai | IBM watsonx.ai models                 |
-| ruler                | RULER tasks                    | ipex           | Intel IPEX backend                    |
-|                      |                                |                |                                       |
-| dev                  | Linting & contributions        | mamba          | Mamba SSM models                      |
-| promptsource         | PromptSource prompts           | neuronx        | AWS inf2 instances                    |
-| sentencepiece        | Sentencepiece tokenizer        | optimum        | Intel OpenVINO models                 |
-| testing              | Run test suite                 | sae_lens       | SAELens model steering                |
-| unitxt               | Run unitxt tasks               |                |                                       |
-| wandb                | Weights & Biases               | sparsify       | Sparsify model steering               |
-| zeno                 | Result visualization           | vllm           | vLLM models                           |
+### Model Backends
+
+These extras install dependencies required to run specific model backends:
+
+| NAME           | Description                                                       |
+|----------------|-------------------------------------------------------------------|
+| hf             | HuggingFace Transformers (torch, transformers, accelerate, peft)  |
+| vllm           | vLLM fast inference                                               |
+| api            | API models (OpenAI, Anthropic, local servers)                     |
+| gptq           | AutoGPTQ quantized models                                         |
+| gptqmodel      | GPTQModel quantized models                                        |
+| ibm_watsonx_ai | IBM watsonx.ai models                                             |
+| ipex           | Intel IPEX backend                                                |
+| optimum        | Intel OpenVINO models                                             |
+| neuronx        | AWS Inferentia2 instances                                         |
+| sparsify       | Sparsify model steering                                           |
+| sae_lens       | SAELens model steering                                            |
+
+### Task Dependencies
+
+These extras install dependencies required for specific evaluation tasks:
+
+| NAME                 | Description                    |
+|----------------------|--------------------------------|
+| tasks                | All task-specific dependencies |
+| acpbench             | ACP Bench tasks                |
+| audiolm_qwen         | Qwen2 audio models             |
+| ifeval               | IFEval task                    |
+| japanese_leaderboard | Japanese LLM tasks             |
+| longbench            | LongBench tasks                |
+| math                 | Math answer checking           |
+| multilingual         | Multilingual tokenizers        |
+| ruler                | RULER tasks                    |
+
+### Development & Utilities
+
+| NAME          | Description               |
+|---------------|---------------------------|
+| dev           | Linting & contributions   |
+| hf_transfer   | Speed up HF downloads     |
+| sentencepiece | Sentencepiece tokenizer   |
+| unitxt        | Unitxt tasks              |
+| wandb         | Weights & Biases logging  |
+| zeno          | Zeno result visualization |

 ## Cite as
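
Taken together with the new install instructions, a minimal end-to-end sketch of the split install flow (the model name, device, and batch size below are illustrative, not part of this commit):

```bash
# Install the core framework plus the HuggingFace backend extra
pip install "lm_eval[hf]"

# Evaluate a HuggingFace Hub model on hellaswag (assumes a CUDA GPU)
lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device cuda:0 \
    --batch_size 8
```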

lm_eval/__init__.py

Lines changed: 9 additions & 0 deletions
@@ -4,6 +4,15 @@

 __version__ = "0.4.9.2"

+# Enable hf_transfer if available
+try:
+    import hf_transfer  # type: ignore
+    import huggingface_hub.constants  # type: ignore
+
+    huggingface_hub.constants.HF_HUB_ENABLE_HF_TRANSFER = True
+except ImportError:
+    pass
+

 # Lazy-load .evaluator module to improve CLI startup
 def __getattr__(name):
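
The new block is a silent opt-in: when the `hf_transfer` package is importable, it flips `HF_HUB_ENABLE_HF_TRANSFER` on for `huggingface_hub`; otherwise the `ImportError` is swallowed and nothing changes. A minimal sketch of opting in via the existing extra from the README table (no extra configuration appears to be needed, since the flag is set when `lm_eval` is imported):

```bash
pip install "lm_eval[hf_transfer]"
```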

lm_eval/api/registry.py

Lines changed: 11 additions & 9 deletions
@@ -19,7 +19,7 @@ def __init__(self, **kwargs):
 ### Registering with Lazy Loading
 ```python
 # Register without importing the actual implementation
-model_registry.register("lazy-model", target="my_package.models:LazyModel")
+model_registry.register("lazy-model", target="my_package.models: LazyModel")
 ```

 ### Looking up Components
@@ -39,9 +39,10 @@ def __init__(self, **kwargs):
 import inspect
 import logging
 import threading
+from collections.abc import Callable
 from functools import lru_cache
 from types import MappingProxyType
-from typing import TYPE_CHECKING, Any, Generic, TypeVar, Union, cast, overload
+from typing import TYPE_CHECKING, Any, Generic, TypeVar, cast, overload


 eval_logger = logging.getLogger(__name__)
@@ -53,6 +54,7 @@ def __init__(self, **kwargs):
     from lm_eval.api.filter import Filter
     from lm_eval.api.model import LM

+
 __all__ = [
     # Core registry class
     "Registry",
@@ -201,7 +203,7 @@ def register(
         >>> class MyModel(LM):
         ...     pass
         >>>
-        >>> # Direct registration with lazy placeholder
+        >>> # Direct registration with a lazy placeholder
         >>> model_registry.register("lazy-name", target="mymodule:MyModel")

         Raises:
@@ -401,7 +403,7 @@ def _clear(self): # pragma: no cover
         """Erase registry (for isolated tests).

         Clears both the registry contents and the materialization cache.
-        Only use this in test code to ensure clean state between tests.
+        Only use this in test code to ensure a clean state between tests.
         """
         if isinstance(self._objs, MappingProxyType):
             self._objs = dict(self._objs)  # type: ignore[assignment]
@@ -520,7 +522,7 @@ def get_model(model_name: str):
 # =============================================================================


-def register_filter(name):
+def register_filter(name: str):
     """Decorator to register a filter class.

     Args:
@@ -550,12 +552,12 @@ def get_filter(filter_name: str | Callable) -> Callable:
         The filter class/function

     Raises:
-        KeyError: If filter name is not found and is not callable
+        KeyError: If a filter name is not found and is not callable
     """
     if callable(filter_name):
         return filter_name
     try:
-        return filter_registry.get(filter_name)
+        return filter_registry.get(cast("str", filter_name))
     except KeyError as e:
         eval_logger.warning(f"filter `{filter_name}` is not registered!")
         raise e
@@ -574,7 +576,7 @@ def register_metric(**args):
     """Decorator to register a metric function.

     Args:
-        **args: Keyword arguments including:
+        **args: Keyword arguments including
             - metric: Name to register the metric under (required)
             - higher_is_better: Whether higher scores are better
             - aggregation: Name of aggregation function to use
@@ -609,7 +611,7 @@ def get_metric(name: str, hf_evaluate_metric: bool = False) -> Callable | None:

     Args:
         name: The metric name
-        hf_evaluate_metric: If True, skip local registry and use HF evaluate
+        hf_evaluate_metric: If True, skip the local registry and use HF evaluate

     Returns:
         The metric compute function, or None if not found