[BUG] MixEval crashes with KeyError: TaskID(task_name='extended|mixeval_hard:multichoice|0', task_hash='5029516ccc122911', sampling_method=<SamplingMethod.GENERATIVE: 'GENERATIVE'>) #1005

@lewtun

Description

Describe the bug

While trying to evaluate MixEval, I get the error below after generation completes (possibly in the scorer?):

(EngineCore_0 pid=3460416) WARNING 10-03 20:40:28 [cudagraph_dispatcher.py:101] cudagraph dispatching keys are not initialized. No cudagraph will be used.
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 10319.61it/s]
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [02:14<00:00,  3.73it/s, est. speed input: 564.50 toks/s, output: 403.32 toks/s]
Splits: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [02:14<00:00, 134.30s/it]
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────── Traceback (most recent call last) ────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ /fsx/lewis/git/hf/lighteval/src/lighteval/main_vllm.py:129 in vllm                                                                                                                                                                                               │
│                                                                                                                                                                                                                                                                  │
│   126 │   │   metric_options=metric_options,                                                                                                                                                                                                                     │
│   127 │   )                                                                                                                                                                                                                                                      │
│   128 │                                                                                                                                                                                                                                                          │
│ ❱ 129 │   pipeline.evaluate()                                                                                                                                                                                                                                    │
│   130 │                                                                                                                                                                                                                                                          │
│   131 │   pipeline.show_results()                                                                                                                                                                                                                                │
│   132                                                                                                                                                                                                                                                            │
│                                                                                                                                                                                                                                                                  │
│ /fsx/lewis/git/hf/lighteval/src/lighteval/pipeline.py:282 in evaluate                                                                                                                                                                                            │
│                                                                                                                                                                                                                                                                  │
│   279 │   │   │   │   )                                                                                                                                                                                                                                          │
│   280 │   │   │   │   outputs = self._run_model()                                                                                                                                                                                                                │
│   281 │   │   else:                                                                                                                                                                                                                                              │
│ ❱ 282 │   │   │   outputs = self._run_model()                                                                                                                                                                                                                    │
│   283 │   │                                                                                                                                                                                                                                                      │
│   284 │   │   if self.is_main_process():                                                                                                                                                                                                                         │
│   285 │   │   │   self._post_process_outputs(outputs)                                                                                                                                                                                                            │
│                                                                                                                                                                                                                                                                  │
│ /fsx/lewis/git/hf/lighteval/src/lighteval/pipeline.py:335 in _run_model                                                                                                                                                                                          │
│                                                                                                                                                                                                                                                                  │
│   332 │   │   if self.model.is_async:                                                                                                                                                                                                                            │
│   333 │   │   │   outputs = asyncio.run(self._run_model_async())                                                                                                                                                                                                 │
│   334 │   │   else:                                                                                                                                                                                                                                              │
│ ❱ 335 │   │   │   outputs = self._run_model_sync()                                                                                                                                                                                                               │
│   336 │   │                                                                                                                                                                                                                                                      │
│   337 │   │   # Cleaning up the model before running metrics                                                                                                                                                                                                     │
│   338 │   │   self.model.cleanup()                                                                                                                                                                                                                               │
│                                                                                                                                                                                                                                                                  │
│ /fsx/lewis/git/hf/lighteval/src/lighteval/pipeline.py:316 in _run_model_sync                                                                                                                                                                                     │
│                                                                                                                                                                                                                                                                  │
│   313 │   │   │   logger.info(f"Running {sampling_method} requests")                                                                                                                                                                                             │
│   314 │   │   │   match sampling_method:                                                                                                                                                                                                                         │
│   315 │   │   │   │   case SamplingMethod.GENERATIVE:                                                                                                                                                                                                            │
│ ❱ 316 │   │   │   │   │   model_outputs = self.model.greedy_until(docs)                                                                                                                                                                                          │
│   317 │   │   │   │   │   outputs[sampling_method] = model_outputs                                                                                                                                                                                               │
│   318 │   │   │   │   case SamplingMethod.LOGPROBS:                                                                                                                                                                                                              │
│   319 │   │   │   │   │   model_outputs = self.model.loglikelihood(docs)                                                                                                                                                                                         │
│                                                                                                                                                                                                                                                                  │
│ /fsx/lewis/git/hf/lighteval/src/lighteval/utils/cache_management.py:405 in wrapper                                                                                                                                                                               │
│                                                                                                                                                                                                                                                                  │
│   402 │   │   │   │   new_results = func(self, docs_not_cached, *args, **kwargs)                                                                                                                                                                                 │
│   403 │   │   │   │                                                                                                                                                                                                                                              │
│   404 │   │   │   │   # Store new results in file cache                                                                                                                                                                                                          │
│ ❱ 405 │   │   │   │   cache.cache_samples(                                                                                                                                                                                                                       │
│   406 │   │   │   │   │   docs=docs_not_cached,                                                                                                                                                                                                                  │
│   407 │   │   │   │   │   results=new_results,                                                                                                                                                                                                                   │
│   408 │   │   │   │   │   task_ids=task_ids,                                                                                                                                                                                                                     │
│                                                                                                                                                                                                                                                                  │
│ /fsx/lewis/git/hf/lighteval/src/lighteval/utils/cache_management.py:308 in cache_samples                                                                                                                                                                         │
│                                                                                                                                                                                                                                                                  │
│   305 │   │   │   task_id = self.get_task_id(doc.task_name, sampling_method)                                                                                                                                                                                     │
│   306 │   │   │   sample = self._dump_sample(result)                                                                                                                                                                                                             │
│   307 │   │   │                                                                                                                                                                                                                                                  │
│ ❱ 308 │   │   │   processed_data[task_id].append({"sample_id": doc.id, "sample": sample})                                                                                                                                                                        │
│   309 │   │   processed_data = {task_id: task_data for task_id, task_data in processed_data.it                                                                                                                                                                   │
│   310 │   │                                                                                                                                                                                                                                                      │
│   311 │   │   # Concatenate it with existing data and save to file                                                                                                                                                                                               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: TaskID(task_name='extended|mixeval_hard:multichoice|0', task_hash='5029516ccc122911', sampling_method=<SamplingMethod.GENERATIVE: 'GENERATIVE'>)
[rank0]:[W1003 20:42:43.820905213 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[2025-10-03 20:42:43,783] [   ERROR]: Engine core proc EngineCore_0 died unexpectedly, shutting down client. (core_client.py:562)
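From the traceback alone, the failure mode can be sketched as follows. Note this is a minimal stand-alone reproduction, not lighteval's actual code: the `TaskID` dataclass, the hash values, and the pre-population step below are assumptions inferred from the error message. The crash suggests `processed_data` is a plain dict whose keys were built from one set of `TaskID`s, while `cache_samples` recomputes a `TaskID` (via `get_task_id`) whose `task_hash` no longer matches any key, so `processed_data[task_id]` raises `KeyError`:

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical stand-in for lighteval's TaskID (frozen so it is hashable
# and usable as a dict key).
@dataclass(frozen=True)
class TaskID:
    task_name: str
    task_hash: str
    sampling_method: str

# processed_data is pre-keyed with a TaskID carrying one hash...
pre_key = TaskID("extended|mixeval_hard:multichoice|0", "aaaa", "GENERATIVE")
processed_data = {pre_key: []}

# ...but at cache time the TaskID is recomputed with a different hash
# (the value from the reported error), so the lookup key is not present.
lookup_key = TaskID(
    "extended|mixeval_hard:multichoice|0", "5029516ccc122911", "GENERATIVE"
)

try:
    processed_data[lookup_key].append({"sample_id": 0, "sample": "..."})
except KeyError as exc:
    print("KeyError:", exc)

# A defaultdict(list) would tolerate the missing key, although the hash
# mismatch itself may still indicate a real caching bug upstream:
tolerant = defaultdict(list)
tolerant[lookup_key].append({"sample_id": 0, "sample": "..."})
```

If this reading is right, either `processed_data` should be a `defaultdict(list)` in `cache_samples`, or the `task_hash` computation diverges between the point where keys are created and the point where results are stored.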

To Reproduce

Run this

lighteval vllm "model_name=Qwen/Qwen3-4B-Instruct-2507" "extended|mixeval_hard:multichoice|0"

Expected behavior

The MixEval evaluation completes without crashing.

Version info

absl-py==2.3.1
accelerate==1.10.1
aenum==3.1.15
aiohappyeyeballs==2.6.1
aiohttp==3.12.15
aiosignal==1.4.0
annotated-types==0.7.0
antlr4-python3-runtime==4.13.2
anyio==4.10.0
astor==0.8.1
attrs==25.3.0
blake3==1.0.6
blis==1.3.0
cachetools==6.2.0
catalogue==2.0.10
cbor2==5.7.0
certifi==2025.8.3
cffi==2.0.0
cfgv==3.4.0
chardet==5.2.0
charset-normalizer==3.4.3
click==8.3.0
cloudpathlib==0.22.0
cloudpickle==3.1.1
colorama==0.4.6
colorlog==6.9.0
compressed-tensors==0.10.2
confection==0.1.5
cupy-cuda12x==13.6.0
cymem==2.0.11
dataproperty==1.1.0
datasets==4.1.1
deepdiff==8.6.1
depyf==0.19.0
dill==0.4.0
diskcache==5.6.3
distlib==0.4.0
distro==1.9.0
dnspython==2.8.0
einops==0.8.1
email-validator==2.3.0
emoji==2.15.0
fastapi==0.117.1
fastapi-cli==0.0.13
fastapi-cloud-cli==0.2.0
fastrlock==0.8.3
filelock==3.19.1
frozendict==2.4.6
frozenlist==1.7.0
fsspec==2025.9.0
gguf==0.17.1
gitdb==4.0.12
gitpython==3.1.45
h11==0.16.0
hf-xet==1.1.10
httpcore==1.0.9
httptools==0.6.4
httpx==0.28.1
huggingface-hub==0.35.0
identify==2.6.14
idna==3.10
iniconfig==2.1.0
interegular==0.3.3
jieba==0.42.1
jinja2==3.1.6
jiter==0.11.0
joblib==1.5.2
jsonschema==4.25.1
jsonschema-specifications==2025.9.1
langcodes==3.5.0
langdetect==1.0.9
language-data==1.3.0
lark==1.2.2
latex2sympy2-extended==1.0.6
-e file:///fsx/lewis/git/hf/lighteval
llguidance==0.7.30
llvmlite==0.44.0
lm-format-enforcer==0.10.12
lxml==6.0.2
marisa-trie==1.3.1
markdown-it-py==4.0.0
markupsafe==3.0.2
mbstrdecoder==1.1.4
mdurl==0.1.2
mistral-common==1.8.5
more-itertools==10.8.0
mpmath==1.3.0
msgpack==1.1.1
msgspec==0.19.0
multidict==6.6.4
multiprocess==0.70.16
murmurhash==1.0.13
natto-py==1.0.1
networkx==3.5
ninja==1.13.0
nltk==3.9.1
nodeenv==1.9.1
numba==0.61.2
numpy==2.2.6
nvidia-cublas-cu12==12.6.4.1
nvidia-cuda-cupti-cu12==12.6.80
nvidia-cuda-nvrtc-cu12==12.6.77
nvidia-cuda-runtime-cu12==12.6.77
nvidia-cudnn-cu12==9.5.1.17
nvidia-cufft-cu12==11.3.0.4
nvidia-cufile-cu12==1.11.1.6
nvidia-curand-cu12==10.3.7.77
nvidia-cusolver-cu12==11.7.1.2
nvidia-cusparse-cu12==12.5.4.2
nvidia-cusparselt-cu12==0.6.3
nvidia-nccl-cu12==2.26.2
nvidia-nvjitlink-cu12==12.6.85
nvidia-nvtx-cu12==12.6.77
openai==1.108.1
openai-harmony==0.0.4
opencv-python-headless==4.12.0.88
orderly-set==5.5.0
outlines-core==0.2.10
packaging==25.0
pandas==2.3.2
partial-json-parser==0.2.1.1.post6
pathvalidate==3.3.1
pillow==11.3.0
pip==25.2
platformdirs==4.4.0
pluggy==1.6.0
portalocker==3.2.0
pre-commit==4.3.0
preshed==3.0.10
prometheus-client==0.23.1
prometheus-fastapi-instrumentator==7.1.0
propcache==0.3.2
protobuf==6.32.1
psutil==7.1.0
py-cpuinfo==9.0.0
pyarrow==21.0.0
pybase64==1.4.2
pycountry==24.6.1
pycparser==2.23
pydantic==2.11.9
pydantic-core==2.33.2
pydantic-extra-types==2.10.5
pygments==2.19.2
pytablewriter==1.2.1
pytest==8.4.2
pythainlp==5.1.2
python-crfsuite==0.9.11
python-dateutil==2.9.0.post0
python-dotenv==1.1.1
python-json-logger==3.3.0
python-multipart==0.0.20
pytz==2025.2
pyvi==0.1.1
pyyaml==6.0.2
pyzmq==27.1.0
ray==2.49.2
referencing==0.36.2
regex==2025.9.18
requests==2.32.5
rich==14.1.0
rich-toolkit==0.15.1
rignore==0.6.4
rouge-score==0.1.2
rpds-py==0.27.1
ruff==0.13.1
sacrebleu==2.5.1
safetensors==0.6.2
scikit-learn==1.7.2
scipy==1.16.2
sentencepiece==0.2.1
sentry-sdk==2.38.0
setproctitle==1.3.7
setuptools==80.9.0
shellingham==1.5.4
six==1.17.0
sklearn-crfsuite==0.5.0
smart-open==7.3.1
smmap==5.0.2
sniffio==1.3.1
soundfile==0.13.1
soxr==1.0.0
spacy==3.8.7
spacy-legacy==3.0.12
spacy-loggers==1.0.5
srsly==2.5.1
stanza==1.10.1
starlette==0.48.0
sudachidict-core==20250825
sudachipy==0.6.10
syllapy==0.7.2
sympy==1.14.0
tabledata==1.3.4
tabulate==0.9.0
tcolorpy==0.1.7
termcolor==2.3.0
thinc==8.3.6
threadpoolctl==3.6.0
tiktoken==0.11.0
tokenizers==0.22.1
torch==2.7.1
torchaudio==2.7.1
torchvision==0.22.1
tqdm==4.67.1
transformers==4.56.2
triton==3.3.1
typepy==1.3.4
typer==0.19.1
typing-extensions==4.15.0
typing-inspection==0.4.1
tzdata==2025.2
urllib3==2.5.0
uvicorn==0.36.0
uvloop==0.21.0
virtualenv==20.34.0
vllm==0.10.1.1
wasabi==1.1.3
watchfiles==1.1.0
weasel==0.4.1
websockets==15.0.1
wrapt==1.17.3
xformers==0.0.31
xgrammar==0.1.21
xxhash==3.5.0
yarl==1.20.1
