IFEval evaluation code overhaul #1093

farook-edev · 2026-01-06T00:16:31Z

This PR overhauls the IFEval implementation based on a comparison with Google's reference implementation, which is written in Python. It introduces the following changes:

repeat_prompt instruction now checks to make sure the prompt occurs at the very start of the response.
two_responses instruction is now more flexible, allowing the response to contain the delimiter as an indicator rather than strictly a delimiter.
NumberPlaceholders allows for multi-word placeholders.
ConstrainedResponse no longer requires the response to be strictly one of the three options (commas, dots, newlines, etc. are allowed).
JsonFormat allows the JSON to be wrapped in Markdown backticks.
MultipleSections now requires a section number after the word "SECTION", but allows roman numerals.
MultipleSections accounts for the two-response instructions (where the model will put 2n sections in total across the responses).
NumberBulletLists accepts markdown style (-) bullets along with star (*) bullets.
KeywordFrequency uses stemming (Snowball based) and a plural word map to count plural words along with variations of the keyword (e.g. for a keyword war, **war**s will be allowed but for**war**d will not).
NthParagraphFirstWord compares the word itself, without any punctuation marks that may be attached to it.
NumberParagraphs doesn't miscount separators at the start or end of the response.
NumberSentences no longer counts characters such as dots as sentence enders when they are numerators, part of an abbreviation, etc.
EndChecker is slightly more flexible to account for things like punctuation marks.
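
To make a few of the relaxed checks above more concrete, here is a minimal Python sketch of how some of them could look. It is not the code from this PR: every function name, the `splitter` parameter, and the exact matching rules (for example, ignoring case and leading whitespace in repeat_prompt) are assumptions made for illustration.

```python
import json
import re
import string

# Hypothetical helpers for illustration only; none of these names come from the PR.

FENCE = "`" * 3  # a Markdown code fence


def follows_repeat_prompt(prompt: str, response: str) -> bool:
    """repeat_prompt: the prompt must occur at the very start of the response.

    Assumption: leading whitespace and letter case are ignored.
    """
    return response.lstrip().lower().startswith(prompt.strip().lower())


def is_valid_json(response: str) -> bool:
    """JsonFormat: accept JSON that is optionally wrapped in a Markdown fence."""
    text = response.strip()
    if len(text) >= 2 * len(FENCE) and text.startswith(FENCE) and text.endswith(FENCE):
        text = text[len(FENCE):-len(FENCE)].strip()
        # Drop an optional "json" language tag left over from the opening fence.
        if text[:4].lower() == "json":
            text = text[4:]
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


def count_sections(response: str, splitter: str = "SECTION") -> int:
    """MultipleSections: count splitters followed by an Arabic or Roman number."""
    pattern = rf"\b{re.escape(splitter)}\s+(?:\d+|[IVXLCDM]+)\b"
    return len(re.findall(pattern, response))


def count_bullets(response: str) -> int:
    """NumberBulletLists: count lines that start with a '*' or '-' bullet."""
    return len(re.findall(r"^[ \t]*[*-][ \t]+\S", response, flags=re.MULTILINE))


def first_word(paragraph: str) -> str:
    """NthParagraphFirstWord: first word, lowercased, without attached punctuation."""
    words = paragraph.split()
    return words[0].strip(string.punctuation).lower() if words else ""
```

The Roman-numeral pattern in this sketch is deliberately loose: a line such as "SECTION I think" would still be counted, so a stricter implementation would need to validate the numeral or its position in the line.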

With these changes, the reported accuracy from this implementation is considerably more reliable than the reference implementation that Google provides. Merging this PR should finally close #1060 and #1058.

A Python version of this implementation is also complete and has been submitted here.

github-actions · 2026-01-06T00:16:41Z

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

farook-edev added 3 commits December 22, 2025 04:49

fixed ifeval evaluation implementation bugs

d31bb7a

further improved sentence counter

adda75f

overhauled keyword evaluation by using stemming and a plural word map

7d14e30

farook-edev requested review from a team and anhappdev as code owners January 6, 2026 00:16

farook-edev changed the title ~~Farook/ifeval evaluation fix~~ IFEval evaluation code overhaul Jan 6, 2026

This was linked to issues Jan 6, 2026

LLM Dataset Implementation #1058

Open

LLM IFEval Dataset Implementation #1060

Open

farook-edev marked this pull request as draft January 6, 2026 02:39
