
Conversation

@NathanHB (Member) commented Oct 31, 2025

to run:

lighteval endpoint inference-providers "model_name=openai/gpt-oss-20b,provider=hyperbolic,generation_parameters={max_new_tokens:8192}" "lighteval|mmlu_pro|0" --save-details

@NathanHB requested a review from Copilot on October 31, 2025 13:13
Copilot AI (Contributor) left a comment


Pull Request Overview

This PR adds support for the MMLU Pro benchmark, a multiple-choice question answering task from the TIGER-Lab/MMLU-Pro dataset.

  • Introduces a new MMLU Pro task configuration
  • Implements a custom prompt function for MMLU Pro questions
  • Configures evaluation on the test split, using the validation split for few-shot examples
Comments suppressed due to low confidence (8)

src/lighteval/tasks/tasks/mmlu_pro.py:74

  • The task configuration is missing the generation_size parameter, which is required for generative metrics like gpqa_instruct_metric. Based on similar tasks using this metric (e.g., gpqa.py lines 57, 73, 89), a value like generation_size=30 or generation_size=32768 should be specified depending on whether reasoning traces are expected.
  • The task configuration is missing the stop_sequence parameter. Based on the generative nature of the task and similar configurations (e.g., gpqa.py lines 59, 75, 91), stop_sequence=[] should be explicitly set to use the EOS token.

src/lighteval/tasks/tasks/mmlu_pro.py:23

  • Import of 'LogLikelihoodAccMetric' is not used.

src/lighteval/tasks/tasks/mmlu_pro.py:25

  • Import of 'LogProbCharNorm' is not used.
  • Import of 'LogProbPMINorm' is not used.
  • Import of 'LogProbTokenNorm' is not used.

src/lighteval/tasks/tasks/mmlu_pro.py:27

  • Import of 'get_metrics_for_formulation' is not used.

src/lighteval/tasks/tasks/mmlu_pro.py:29

  • Import of 'get_mcq_prompt_function' is not used.

src/lighteval/tasks/tasks/mmlu_pro.py:34

  • Import of 'CFFormulation' is not used.
  • Import of 'HybridFormulation' is not used.
  • Import of 'MCFFormulation' is not used.

src/lighteval/tasks/tasks/mmlu_pro.py:35

  • Import of 'Language' is not used.
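Applied together, the two configuration suggestions above would look roughly like the following sketch. It is illustrative only: all other arguments are elided, and the parameter names simply follow the gpqa.py precedent the review cites.

# Sketch of the suppressed suggestions applied; every other field elided.
mmlu_pro = LightevalTaskConfig(
    name="mmlu_pro",
    prompt_function=mmlu_pro_prompt_function,
    # ...dataset, split, and metric arguments as in the PR...
    generation_size=32768,  # or 30 if no reasoning trace is expected
    stop_sequence=[],  # empty list falls back to the EOS token
)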



mmlu_pro = LightevalTaskConfig(
name="mmlu_pro",
prompt_function=mmlu_pro_prompt_function,

Copilot AI Oct 31, 2025


The variable choices on line 59 is a string (created by str.join() on line 49), so len(choices) returns the string length rather than the number of options. This should be len(line["options"]) to correctly slice the uppercase letters corresponding to the actual number of answer choices.
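Concretely, a sketch of the fix (the joined choices string and the line row are as described in the review; the rest of the prompt function, and the file's existing ascii_uppercase import, are assumed):

# Build the rendered choices from the dataset row's options.
options = line["options"]
choices = "\n".join(f"{letter}. {option}" for letter, option in zip(ascii_uppercase, options))
# Slice by the number of options, not by the length of the joined string:
letters = ascii_uppercase[: len(options)]  # was: ascii_uppercase[: len(choices)]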


Answer:""".strip()



Copilot AI Oct 31, 2025


The template hardcodes 'ABCD' as the possible letters, but MMLU Pro typically has 10 answer choices (A-J). The instruction should be updated to reflect the actual range of possible letters, such as 'where LETTER is one of A through J' or made dynamic based on the number of options.
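One way to make it dynamic, as a sketch: inject the letter range through an extra placeholder (the {letters} placeholder is hypothetical, not part of the PR).

TEMPLATE = """
Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of {letters}. Think step by step before answering.

{question}

{choices}

Answer:""".strip()

# e.g. "A through J" for the 10 options MMLU Pro typically has
letters = f"A through {ascii_uppercase[len(line['options']) - 1]}"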

Comment on lines +21 to +35
https://arxiv.org/abs/2406.01574
"""
from string import ascii_uppercase

from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


TEMPLATE = """
Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering.

{question}

{choices}

Copilot AI Oct 31, 2025


Multiple imports are unused: LogLikelihoodAccMetric, LogProbCharNorm, LogProbPMINorm, LogProbTokenNorm, get_metrics_for_formulation, get_mcq_prompt_function, CFFormulation, HybridFormulation, MCFFormulation, and Language. These should be removed to keep the imports clean and maintainable.
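These are exactly what a linter's unused-import rule flags; assuming ruff is the project's linter (not verified here), a check like the following would catch all of them at once:

ruff check src/lighteval/tasks/tasks/mmlu_pro.py --select F401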

@HuggingFaceDocBuilderDev (Collaborator) commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
