Parsed Model Answer logging + MCQ Answer analysis #110
Open
mnishant2 wants to merge 1 commit into MedARC-AI:main from
Conversation
This PR does two things:
Adds the capability to log the model's parsed answer, along with the parsing method used, as part of the info_dict. You can extend this to any environment not yet covered by adding one line (info=info in the accuracy function call; see the README).
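As a rough illustration of the logging idea, here is a minimal sketch of an accuracy function that records the parsed answer and the parsing method in an info dict. The function name, the parsing helper, and the info_dict keys (`parsed_answer`, `parsing_method`) are assumptions for illustration, not the repository's actual API:

```python
import re

def parse_mcq_answer(text: str):
    """Extract an MCQ option letter and report which strategy matched.

    Hypothetical helper: tries an explicit "Answer: X" pattern first,
    then falls back to the first standalone option letter in the text.
    """
    m = re.search(r"[Aa]nswer\s*[:\-]?\s*([A-E])\b", text)
    if m:
        return m.group(1).upper(), "answer_pattern"
    m = re.search(r"\b([A-E])\b", text)
    if m:
        return m.group(1).upper(), "first_letter_fallback"
    return None, "unparsed"

def accuracy(prediction: str, target: str, info: dict = None) -> float:
    """Score an MCQ prediction; if an info dict is passed, log how it was parsed."""
    parsed, method = parse_mcq_answer(prediction)
    if info is not None:
        # The one-line hook described above: passing info=info makes the
        # parsed answer and parsing method part of the logged info_dict.
        info["parsed_answer"] = parsed
        info["parsing_method"] = method
    return float(parsed == target)
```

With this shape, environments that already call the accuracy function only need to thread their existing info dict through the `info=` keyword to get the extra fields in their logs.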
Adds an answer-analysis script that performs a comprehensive analysis of the model's answers, including variability, semantic consistency across rollouts, positional bias, a confusion metric, and an overall performance measure, using the model logs (with or without the parsed answer logged). The benchmark list is currently hardcoded; feel free to adjust it. All analysis output files are written to the specified output directory, and you can use them to create various plots and tables.
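To make two of the described metrics concrete, here is a small sketch of how variability across rollouts and positional bias could be computed from logged parsed answers. The function name, the input shape (question id mapped to one parsed answer per rollout), and the metric names are illustrative assumptions, not the script's actual interface:

```python
from collections import Counter

def analyze_rollouts(rollouts: dict) -> dict:
    """Summarize parsed answers across rollouts.

    rollouts: question id -> list of parsed option letters, one per rollout.
    Returns a variation rate (fraction of questions where rollouts disagree)
    and per-option selection frequencies (a crude positional-bias signal).
    """
    n_varied = 0
    option_counts = Counter()
    for answers in rollouts.values():
        if len(set(answers)) > 1:
            n_varied += 1              # rollouts disagree on this question
        option_counts.update(answers)  # tally option letters across all rollouts
    total = sum(option_counts.values())
    return {
        "variation_rate": n_varied / len(rollouts),
        "option_frequencies": {opt: c / total for opt, c in option_counts.items()},
    }
```

A strongly skewed `option_frequencies` distribution (e.g. the model picking "A" far more often than chance) is one simple way positional bias would show up in such a summary.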
There is also a lightweight visualization script that creates a few heatmaps and scatter plots covering the model's variation rate, semantic consistency, win rate, etc.; feel free to expand on it.
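For the heatmap side, a minimal sketch of the kind of plot such a script might produce, assuming matplotlib is available; the function name, input shape, and labels are hypothetical, not taken from the actual script:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

def plot_option_heatmap(freqs, models, options, out_path):
    """Save a models-by-options heatmap of answer-selection frequencies.

    freqs: model name -> {option letter -> selection frequency}
    (hypothetical structure, e.g. the output of an analysis pass).
    """
    data = np.array([[freqs[m].get(o, 0.0) for o in options] for m in models])
    fig, ax = plt.subplots()
    im = ax.imshow(data, cmap="viridis", vmin=0.0, vmax=1.0)
    ax.set_xticks(range(len(options)), options)
    ax.set_yticks(range(len(models)), models)
    fig.colorbar(im, ax=ax, label="selection frequency")
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
```

A uniform row indicates no positional preference for that model; a bright single column is the visual signature of positional bias.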
Both scripts live in the scripts folder.