Conversation

@farook-edev
Contributor

@farook-edev farook-edev commented Jan 6, 2026

This is an overhaul of the IFEval implementation, based on comparisons with Google's reference implementation, which is written in Python. This PR introduces the following changes:

  • The repeat_prompt instruction now checks that the prompt occurs at the very start of the response.
  • The two_responses instruction is now more flexible, treating the delimiter as an indicator that may appear in the response rather than as a strict delimiter.
  • NumberPlaceholders allows multi-word placeholders.
  • ConstrainedResponse no longer requires the response to be exactly one of the three options (commas, periods, newlines, and similar characters are allowed).
  • JsonFormat allows Markdown backticks around the JSON.
  • MultipleSections now requires a section number after the word "SECTION" and accepts Roman numerals.
  • MultipleSections accounts for two-response instructions (where the model puts 2n sections in total across the responses).
  • NumberBulletLists accepts Markdown-style hyphen (-) bullets in addition to asterisk (*) bullets.
  • KeywordFrequency uses Snowball-based stemming and a plural word map to count plurals and other variations of the keyword (e.g. for the keyword war, **war**s is counted but for**war**d is not).
  • NthParagraphFirstWord considers the word itself, without any punctuation marks that may be attached.
  • NumberParagraphs no longer miscounts separators at the start or end of the response.
  • NumberSentences no longer counts characters such as periods as sentence enders when they are part of numbering, an abbreviation, and so on.
  • EndChecker is slightly more flexible, tolerating things like punctuation marks.
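
To make a few of these rules concrete, here is a minimal Python sketch of four of the checks described above. It is not the code in this PR: the function names are hypothetical, and the keyword counter uses a simple optional plural suffix as a stand-in for the Snowball stemming and plural word map mentioned above.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence, built this way to avoid nesting fences


def starts_with_prompt(response: str, prompt: str) -> bool:
    """repeat_prompt: the prompt must appear at the very start of the response."""
    return response.strip().lower().startswith(prompt.strip().lower())


def count_bullets(response: str) -> int:
    """NumberBulletLists: count lines that open with a '*' or '-' bullet."""
    return len(re.findall(r"^[ \t]*[*-][ \t]+\S", response, flags=re.MULTILINE))


def count_keyword(response: str, keyword: str) -> int:
    """KeywordFrequency: whole-word matches plus a simple plural suffix.

    Assumption: 's'/'es' suffixes only; the PR uses stemming and a plural map.
    """
    pattern = rf"\b{re.escape(keyword)}(?:s|es)?\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE))


def is_valid_json(response: str) -> bool:
    """JsonFormat: accept JSON that may be wrapped in Markdown backticks."""
    text = response.strip()
    if len(text) >= 6 and text.startswith(FENCE) and text.endswith(FENCE):
        text = text[3:-3].strip()
        if text.lower().startswith("json"):  # optional language tag
            text = text[4:]
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True
```

Under these assumptions, `count_keyword("War and wars go forward.", "war")` counts "War" and "wars" but skips "forward", because the word-boundary anchors reject matches inside a longer word.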

With these changes, the reported accuracy from this implementation is considerably more reliable than the reference implementation that Google provides. Merging this PR should finally close #1060 and #1058.

A Python version of this implementation is also complete and has been submitted here.

@farook-edev farook-edev requested review from a team and anhappdev as code owners January 6, 2026 00:16
@github-actions
github-actions bot commented Jan 6, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@farook-edev farook-edev changed the title from "Farook/ifeval evaluation fix" to "IFEval evaluation code overhaul" Jan 6, 2026
@farook-edev farook-edev marked this pull request as draft January 6, 2026 02:39
Development

Successfully merging this pull request may close these issues.

  • LLM IFEval Dataset Implementation
  • LLM Dataset Implementation