IFEval evaluation code overhaul #1093
Draft
This PR overhauls the IFEval implementation, based on a comparison with Google's reference implementation, which is written in Python. It introduces the following changes:
- The `repeat_prompt` instruction now checks that the prompt occurs at the very start of the response.
- The `two_responses` instruction is now more flexible: the response may contain the delimiter as an indicator rather than strictly as a delimiter.
- `NumberPlaceholders` allows multi-word placeholders.
- `ConstrainedResponse` no longer requires the response to be exactly one of the three options (commas, dots, newlines, etc. are allowed).
- `JsonFormat` allows markdown ticks.
- `MultipleSections` now requires a section number after the word "SECTION", but allows Roman numerals.
- `MultipleSections` accounts for the two-response instruction (where the model puts 2n sections in total across the responses).
- `NumberBulletLists` accepts markdown-style (-) bullets along with star (*) bullets.
- `KeywordFrequency` uses stemming (Snowball based) and a plural word map to count plural words along with variations of the keyword (e.g. for the keyword war, **war**s will be allowed but for**war**d will not).
- `NthParagraphFirstWord` counts the word itself, without any punctuation marks that may be attached.
- `NumberParagraphs` doesn't miscount separators at the start or end of the response.
- `NumberSentences` properly excludes characters such as dots from being counted as sentence enders when they are part of a numeral, an abbreviation, etc.
- `EndChecker` is slightly more flexible, to account for things like punctuation marks.

With these changes, the accuracy reported by this implementation is considerably more reliable than that reported by the reference implementation Google provides. Merging this PR should finally close #1060 and #1058.
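To make a few of the relaxed checks above concrete, here is a minimal Python sketch of how the `JsonFormat`, `NumberBulletLists` and `MultipleSections` behaviours could look. It is an illustration only, not the code in this PR: the function names, the `FENCE` constant and the `section_word` parameter are hypothetical, and the regular expressions are one plausible reading of the descriptions above.

```python
import json
import re

# Hypothetical constant: a markdown code fence, built this way so the
# example does not contain a literal fence of its own.
FENCE = "`" * 3


def is_valid_json_response(response: str) -> bool:
    """Accept JSON that may be wrapped in a markdown code fence (assumed behaviour)."""
    text = response.strip()
    # Drop an opening fence (optionally followed by a language tag) and a closing fence.
    text = re.sub(r"^" + re.escape(FENCE) + r"[A-Za-z]*\s*", "", text)
    text = re.sub(r"\s*" + re.escape(FENCE) + r"$", "", text)
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


def count_bullets(response: str) -> int:
    """Count lines that begin with a '*' or '-' bullet followed by a space and text."""
    return len(re.findall(r"^[ \t]*[*-][ \t]+\S", response, flags=re.MULTILINE))


def count_sections(response: str, section_word: str = "SECTION") -> int:
    """Count section headers followed by an Arabic or Roman numeral."""
    pattern = rf"\b{re.escape(section_word)}\s+(?:\d+|[IVXLCDM]+)\b"
    return len(re.findall(pattern, response))
```

For example, `count_bullets` treats `* one` and `- two` as bullets but ignores a line that starts with `**bold**`, and `count_sections` counts `SECTION 1` and `SECTION II` but not a bare `SECTION` with no number after it.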
A Python version of this implementation is also complete and has been submitted here.