
Commit 28001d2

remove all; reformat table (#3107)
1 parent 71d0289 commit 28001d2

File tree: 6 files changed (+65 / -89 lines)

6 files changed

+65
-89
lines changed

.github/workflows/new_tasks.yml

Lines changed: 2 additions & 2 deletions
@@ -50,12 +50,12 @@ jobs:
         with:
           python-version: 3.9
           cache: 'pip'
-          cache-dependency-path: setup.py
+          cache-dependency-path: pyproject.toml
       - name: Install dependencies
         if: steps.changed-tasks.outputs.tasks_any_modified == 'true' || steps.changed-tasks.outputs.api_any_modified == 'true'
         run: |
           python -m pip install --upgrade pip
-          pip install -e '.[dev,ifeval]' --extra-index-url https://download.pytorch.org/whl/cpu
+          pip install -e '.[dev,ifeval,unitxt]' --extra-index-url https://download.pytorch.org/whl/cpu
           # Install optional git dependencies
           # pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt
           # if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
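
Two changes land here: the pip cache is now keyed on `pyproject.toml` instead of `setup.py` (presumably because the project's dependency metadata lives in `pyproject.toml`), and `unitxt` is requested explicitly, since the `pyproject.toml` change below splits it out of the `dev` extra.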

.github/workflows/unit_tests.yml

Lines changed: 2 additions & 2 deletions
@@ -53,7 +53,7 @@ jobs:

       # Cache HuggingFace cache directory for CPU tests
       - name: Cache HuggingFace cache (CPU tests)
-        uses: actions/cache@v3
+        uses: actions/cache@v4
         id: cache-hf-cpu
         with:
           path: ~/.cache/huggingface
@@ -64,7 +64,7 @@
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
-          pip install -e '.[dev]' --extra-index-url https://download.pytorch.org/whl/cpu
+          pip install -e '.[dev,unitxt]' --extra-index-url https://download.pytorch.org/whl/cpu
           pip install hf_xet

       - name: Test with pytest
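
Same reasoning as in `new_tasks.yml`: with `unitxt` split out of `dev` (see the `pyproject.toml` hunk below), the test install must name it explicitly. The cache step also moves to `actions/cache@v4`, the current major version of the action.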

.pre-commit-config.yaml

Lines changed: 2 additions & 2 deletions
@@ -29,7 +29,7 @@ repos:
       - id: mixed-line-ending
         args: [--fix=lf]
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.11.10
+    rev: v0.12.2
     hooks:
       # Run the linter.
       - id: ruff
@@ -47,7 +47,7 @@ repos:
           )$
       args: [--check-filenames, --check-hidden, --ignore-words=ignore.txt]
   - repo: https://github.com/jackdewinter/pymarkdown
-    rev: v0.9.29
+    rev: v0.9.30
     hooks:
       - id: pymarkdown
         exclude: ^(lm_eval/tasks/.*|docs/footguns\.md)$

README.md

Lines changed: 18 additions & 31 deletions
@@ -599,37 +599,24 @@ The best way to get support is to open an issue on this repo or join the [Eleuth

 Extras dependencies can be installed via `pip install -e ".[NAME]"`

-| Name                 | Use                                                |
-| -------------------- | -------------------------------------------------- |
-| api                  | For using api models (Anthropic, OpenAI API)       |
-| audiolm_qwen         | For running Qwen2 audio models                     |
-| deepsparse           | For running NM's DeepSparse models                 |
-| dev                  | For linting PRs and contributions                  |
-| gptq                 | For loading models with AutoGPTQ                   |
-| gptqmodel            | For loading models with GPTQModel                  |
-| hf_transfer          | For speeding up HF Hub file downloads              |
-| ibm_watsonx_ai       | For using IBM watsonx.ai model apis                |
-| ifeval               | For running the IFEval task                        |
-| ipex                 | For running on optimum-intel ipex backend          |
-| japanese_leaderboard | For running Japanese LLM Leaderboard tasks         |
-| longbench            | For running LongBench tasks                        |
-| mamba                | For loading Mamba SSM models                       |
-| math                 | For running math task answer checking              |
-| multilingual         | For multilingual tokenizers                        |
-| neuronx              | For running on AWS inf2 instances                  |
-| optimum              | For running Intel OpenVINO models                  |
-| promptsource         | For using PromptSource prompts                     |
-| ruler                | For running RULER tasks                            |
-| sae_lens             | For using SAELens to steer models                  |
-| sentencepiece        | For using the sentencepiece tokenizer              |
-| sparseml             | For using NM's SparseML models                     |
-| sparsify             | For using Sparsify to steer models                 |
-| testing              | For running library test suite                     |
-| vllm                 | For loading models with vLLM                       |
-| wandb                | For integration with `Weights and Biases` platform |
-| zeno                 | For visualizing results with Zeno                  |
-| -------------------- | -------------------------------------------------- |
-| all                  | Loads all extras (not recommended)                 |
+| NAME                 | Description                    | NAME           | Description                           |
+|----------------------|--------------------------------|----------------|---------------------------------------|
+| tasks                | All task-specific dependencies | api            | API models (Anthropic, OpenAI, local) |
+| acpbench             | ACP Bench tasks                | audiolm_qwen   | Qwen2 audio models                    |
+| ifeval               | IFEval task                    | deepsparse     | DeepSparse models (CPU)               |
+| japanese_leaderboard | Japanese LLM tasks             | gptq           | AutoGPTQ models                       |
+| longbench            | LongBench tasks                | gptqmodel      | GPTQModel models                      |
+| math                 | Math answer checking           | hf_transfer    | Speed up HF downloads                 |
+| multilingual         | Multilingual tokenizers        | ibm_watsonx_ai | IBM watsonx.ai models                 |
+| ruler                | RULER tasks                    | ipex           | Intel IPEX backend                    |
+|                      |                                |                |                                       |
+| dev                  | Linting & contributions        | mamba          | Mamba SSM models                      |
+| promptsource         | PromptSource prompts           | neuronx        | AWS inf2 instances                    |
+| sentencepiece        | Sentencepiece tokenizer        | optimum        | Intel OpenVINO models                 |
+| testing              | Run test suite                 | sae_lens       | SAELens model steering                |
+| unitxt               | Run unitxt tasks               | sparseml       | SparseML models (CPU)                 |
+| wandb                | Weights & Biases               | sparsify       | Sparsify model steering               |
+| zeno                 | Result visualization           | vllm           | vLLM models                           |

 ## Cite as
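
Net effect, matching the commit title: the long single-column Name/Use table is folded into a two-pane NAME/Description layout, the catch-all `all` extra (previously marked "not recommended") is removed, and task-only dependencies are grouped under the new `tasks` extra, e.g. `pip install -e ".[tasks]"`.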

lm_eval/api/model.py

Lines changed: 38 additions & 29 deletions
@@ -3,15 +3,19 @@
 import json
 import logging
 import os
-from typing import Dict, List, Optional, Tuple, Type, TypeVar, Union
+from typing import TYPE_CHECKING, Any, Iterable, Optional, Type, TypeVar, Union

-import transformers
-from sqlitedict import SqliteDict
 from tqdm import tqdm

 from lm_eval import utils


+if TYPE_CHECKING:
+    from sqlitedict import SqliteDict
+
+    from lm_eval.api.instance import Instance
+
+
 eval_logger = logging.getLogger(__name__)

 T = TypeVar("T", bound="LM")
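
The import hunk moves `sqlitedict` (and, further down, `transformers`) off the module's import path. A minimal sketch of the same deferred-import pattern, with a hypothetical `open_db` helper standing in for the real call sites:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by static type checkers; never executed at runtime.
    from sqlitedict import SqliteDict


def open_db(path: str) -> "SqliteDict":
    # The runtime import is deferred to first use, so importing this
    # module succeeds even when sqlitedict is not installed.
    from sqlitedict import SqliteDict

    return SqliteDict(path, autocommit=True)
```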
@@ -27,10 +31,10 @@ def __init__(self) -> None:
         # set rank and world size to a single process, by default.
         self._rank = 0
         self._world_size = 1
-        self.cache_hook = CacheHook(None)
+        self.cache_hook: "CacheHook" = CacheHook(None)

     @abc.abstractmethod
-    def loglikelihood(self, requests) -> List[Tuple[float, bool]]:
+    def loglikelihood(self, requests) -> list[tuple[float, bool]]:
         """Compute log-likelihood of generating a continuation from a context.
         Downstream tasks should attempt to use loglikelihood instead of other
         LM calls whenever possible.
@@ -55,7 +59,7 @@ def loglikelihood(self, requests) -> List[Tuple[float, bool]]:
         pass

     @abc.abstractmethod
-    def loglikelihood_rolling(self, requests) -> List[float]:
+    def loglikelihood_rolling(self, requests) -> list[float]:
         """Compute full log-likelihood of a string, with no truncation, for perplexity computation
         - We will use the full max context length of the model.
         - For inputs that exceed the max context length, we divide the tokenized string into chunks of up to
@@ -97,7 +101,7 @@ def loglikelihood_rolling(self, requests) -> List[float]:

     # TODO: Add an optional max length
     @abc.abstractmethod
-    def generate_until(self, requests) -> List[str]:
+    def generate_until(self, requests) -> list[str]:
         """Generate greedily until a stopping sequence

         :param requests: list[Instance]
@@ -114,7 +118,7 @@ def generate_until(self, requests) -> List[str]:
         pass

     def apply_chat_template(
-        self, chat_history: List[Dict[str, str]], add_generation_prompt=True
+        self, chat_history: list[dict[str, str]], add_generation_prompt=True
     ) -> str:
         """
         Defines how to transform few-shot examples provided as chat history into a format that can be used as input to the LM.
@@ -165,8 +169,7 @@ def create_from_arg_obj(
         - Instance of the LM class.
         """

-        additional_config = {} if additional_config is None else additional_config
-        additional_config = {
+        additional_config = additional_config or {} | {
             k: v for k, v in additional_config.items() if v is not None
         }

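A parsing note on the new one-liner: in Python, `|` (dict union, 3.9+) binds tighter than `or`, so the expression groups as `additional_config or ({} | {...})`, and a truthy `additional_config` is returned as-is. A quick standalone check of that precedence:

```python
# Precedence check: dict-union `|` binds tighter than `or` (Python 3.9+).
cfg = {"a": 1, "b": None}
result = cfg or {} | {k: v for k, v in cfg.items() if v is not None}
assert result == {"a": 1, "b": None}  # truthy left operand short-circuits `or`
```
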
@@ -204,56 +207,58 @@ def chat_template(self, chat_template: Union[bool, str] = False) -> Optional[str]:

         return ""

-    def set_cache_hook(self, cache_hook) -> None:
+    def set_cache_hook(self, cache_hook: "CacheHook") -> None:
         self.cache_hook = cache_hook


 ### SQLite-based caching of LM responses
-def hash_args(attr, args):
+def hash_args(attr: str, args: Iterable[Any]) -> str:
     dat = json.dumps([attr] + list(args))
     return hashlib.sha256(dat.encode("utf-8")).hexdigest()


 class CacheHook:
-    def __init__(self, cachinglm) -> None:
+    def __init__(self, cachinglm: Optional["CachingLM"]) -> None:
         if cachinglm is None:
-            self.dbdict = None
+            self.dbdict: Optional["SqliteDict"] = None
             return

         self.dbdict = cachinglm.dbdict

-    def add_partial(self, attr, req, res) -> None:
+    def add_partial(self, attr: str, req: Iterable[Any], res: Any) -> None:
         if self.dbdict is None:
             return
         hsh = hash_args(attr, req)
         self.dbdict[hsh] = res


 class CachingLM:
-    def __init__(self, lm, cache_db) -> None:
+    def __init__(self, lm: LM, cache_db: str) -> None:
         """LM wrapper that returns cached results if they exist, and uses the underlying LM if not.

         :param lm: LM
             Underlying LM
         :param cache_db: str
             Path to cache db
         """
-        self.lm = lm
-        self.cache_db = cache_db
+        from sqlitedict import SqliteDict
+
+        self.lm: LM = lm
+        self.cache_db: str = cache_db
         if os.path.dirname(cache_db):
             os.makedirs(os.path.dirname(cache_db), exist_ok=True)
         self.dbdict = SqliteDict(cache_db, autocommit=True)

         # add hook to lm
         lm.set_cache_hook(self.get_cache_hook())

-    def __getattr__(self, attr: str):
+    def __getattr__(self, attr: str) -> Any:
         lm_attr = getattr(self.lm, attr)
         if attr not in ["loglikelihood", "loglikelihood_rolling", "generate_until"]:
             eval_logger.debug(f"Passing through attribute '{attr}' to underlying LM")
             return lm_attr

-        def fn(requests):
+        def _fn(requests: list["Instance"]) -> list["Instance"]:
             res = []
             remaining_reqs = []
             warned = False
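
To make the cache-key scheme concrete, here is `hash_args` rerun standalone, exactly as defined in the hunk above (the sample request args are hypothetical):

```python
import hashlib
import json


def hash_args(attr, args):
    dat = json.dumps([attr] + list(args))
    return hashlib.sha256(dat.encode("utf-8")).hexdigest()


# Identical (method, args) pairs always map to the same SQLite key,
# which is what lets CacheHook.add_partial and the lookup in _fn agree.
key = hash_args("generate_until", [["Q: 2+2 =", {"until": ["\n"]}]])
assert key == hash_args("generate_until", [["Q: 2+2 =", {"until": ["\n"]}]])
```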
@@ -306,9 +311,9 @@ def fn(requests):

             return res

-        return fn
+        return _fn

-    def get_cache_hook(self):
+    def get_cache_hook(self) -> "CacheHook":
         return CacheHook(self)

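Putting the pieces together, a minimal caller-side sketch (the `EchoLM` stub is hypothetical; only `LM`, `CachingLM`, and the SQLite path behave as shown in this diff):

```python
from lm_eval.api.model import LM, CachingLM


class EchoLM(LM):
    # Hypothetical stub implementing the three abstract request methods.
    def loglikelihood(self, requests):
        return [(0.0, True) for _ in requests]

    def loglikelihood_rolling(self, requests):
        return [0.0 for _ in requests]

    def generate_until(self, requests):
        return ["" for _ in requests]


# Wrapping memoizes request results in SQLite across runs; CachingLM also
# installs its cache hook on the wrapped model, as the diff shows.
lm = CachingLM(EchoLM(), "cache/echo.db")
```
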
@@ -331,19 +336,23 @@ def prefix_token_id(self):
         return self.eot_token_id

     @abc.abstractmethod
-    def tok_encode(self, string: str, **kwargs) -> List[int]:
+    def tok_encode(self, string: str, **kwargs) -> list[int]:
         """
         Tokenize a string using the model's tokenizer and return a list of token IDs.
         """
         pass

     @abc.abstractmethod
-    def _loglikelihood_tokens(self, requests, **kwargs) -> List[Tuple[float, bool]]:
+    def _loglikelihood_tokens(
+        self, requests: list["Instance"], **kwargs
+    ) -> list[tuple[float, bool]]:
         pass

     def _encode_pair(
         self, context: str, continuation: str
-    ) -> Tuple[List[int], List[int]]:
+    ) -> tuple[list[int], list[int]]:
+        import transformers
+
         n_spaces = len(context) - len(context.rstrip())
         if n_spaces > 0:
             continuation = context[-n_spaces:] + continuation
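
The context lines show the trailing-whitespace handoff in `_encode_pair`; traced by hand with hypothetical values (the matching trim of `context` sits on the next line of the file, outside this hunk):

```python
context, continuation = "Question: 2+2 = ", "4"
n_spaces = len(context) - len(context.rstrip())        # 1 trailing space
if n_spaces > 0:
    continuation = context[-n_spaces:] + continuation  # " 4"
    context = context[:-n_spaces]                      # "Question: 2+2 =" (assumed next line)
```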
@@ -364,8 +373,8 @@ def _encode_pair(
         return context_enc, continuation_enc

     def loglikelihood(
-        self, requests, disable_tqdm: bool = False
-    ) -> List[Tuple[float, bool]]:
+        self, requests: list["Instance"], disable_tqdm: bool = False
+    ) -> list[tuple[float, bool]]:
         new_reqs = []
         for context, continuation in [req.args for req in requests]:
             if context == "":
@@ -384,11 +393,11 @@
     @abc.abstractmethod
     def loglikelihood_rolling(
         self, requests, disable_tqdm: bool = False
-    ) -> List[float]:
+    ) -> list[float]:
         pass

     @abc.abstractmethod
-    def generate_until(self, requests, disable_tqdm: bool = False) -> List[str]:
+    def generate_until(self, requests, disable_tqdm: bool = False) -> list[str]:
         pass

     def chat_template(self, chat_template: Union[bool, str] = False) -> Optional[str]:
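
Taken together, the file-wide pattern is: `typing.List`/`Tuple`/`Dict` annotations become the builtin `list`/`tuple`/`dict` generics (PEP 585, which requires Python 3.9+ at runtime and matches the 3.9 pin in the workflow above), and `transformers` and `sqlitedict` become deferred imports, so importing `lm_eval.api.model` no longer requires either package to be installed.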

pyproject.toml

Lines changed: 3 additions & 23 deletions
@@ -61,7 +61,7 @@ acpbench = ["lark>=1.1.9", "tarski[clingo]==0.8.2", "pddl==0.4.2", "kstar-planne
 api = ["requests", "aiohttp", "tenacity", "tqdm", "tiktoken"]
 audiolm_qwen = ["librosa", "soundfile"]
 deepsparse = ["deepsparse-nightly[llm]>=1.8.0.20240404"]
-dev = ["pytest", "pytest-cov", "pytest-xdist", "pre-commit", "mypy", "unitxt==1.22.0", "requests", "aiohttp", "tenacity", "tqdm", "tiktoken", "sentencepiece"]
+dev = ["pytest", "pytest-cov", "pytest-xdist", "pre-commit", "requests", "aiohttp", "tenacity", "tqdm", "tiktoken", "sentencepiece"]
 gptq = ["auto-gptq[triton]>=0.6.0"]
 gptqmodel = ["gptqmodel>=1.0.9"]
 hf_transfer = ["hf_transfer"]
@@ -82,38 +82,18 @@ sentencepiece = ["sentencepiece>=0.1.98"]
 sparseml = ["sparseml-nightly[llm]>=1.8.0.20240404"]
 sparsify = ["sparsify"]
 testing = ["pytest", "pytest-cov", "pytest-xdist"]
+unitxt = ["unitxt==1.22.0"]
 vllm = ["vllm>=0.4.2"]
 wandb = ["wandb>=0.16.3", "pandas", "numpy"]
 zeno = ["pandas", "zeno-client"]
-all = [
+tasks = [
     "lm_eval[acpbench]",
-    "lm_eval[api]",
-    "lm_eval[audiolm_qwen]",
-    "lm_eval[deepsparse]",
-    "lm_eval[dev]",
-    "lm_eval[gptq]",
-    "lm_eval[gptqmodel]",
-    "lm_eval[hf_transfer]",
-    "lm_eval[ibm_watsonx_ai]",
     "lm_eval[ifeval]",
-    "lm_eval[ipex]",
     "lm_eval[japanese_leaderboard]",
     "lm_eval[longbench]",
-    "lm_eval[mamba]",
     "lm_eval[math]",
     "lm_eval[multilingual]",
-    "lm_eval[neuronx]",
-    "lm_eval[optimum]",
-    "lm_eval[promptsource]",
     "lm_eval[ruler]",
-    "lm_eval[sae_lens]",
-    "lm_eval[sentencepiece]",
-    "lm_eval[sparseml]",
-    "lm_eval[sparsify]",
-    "lm_eval[testing]",
-    "lm_eval[vllm]",
-    "lm_eval[wandb]",
-    "lm_eval[zeno]",
 ]

 [tool.pymarkdown]