Parsed Model Answer logging + MCQ Answer analysis #110
Open
mnishant2 wants to merge 1 commit into MedARC-AI:main from
Conversation
This PR does two things:
Adds the capability to log the model's parsed answer, along with the parsing method used, as part of the info_dict. You can extend this to any environment not yet covered by adding one line (info=info in the accuracy function call; see the README).
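As a rough illustration of the logging idea, here is a minimal sketch of an accuracy function that records the parsed answer and the parsing method in an info dict. The function name, the parsing helper, and the info_dict keys (`parsed_answer`, `parsing_method`) are assumptions for illustration, not the repository's actual API:

```python
import re

def parse_mcq_answer(text: str):
    """Extract an MCQ option letter and report which strategy matched.

    Hypothetical helper: tries an explicit "Answer: X" pattern first,
    then falls back to the first standalone option letter in the text.
    """
    m = re.search(r"[Aa]nswer\s*[:\-]?\s*([A-E])\b", text)
    if m:
        return m.group(1).upper(), "answer_pattern"
    m = re.search(r"\b([A-E])\b", text)
    if m:
        return m.group(1).upper(), "first_letter_fallback"
    return None, "unparsed"

def accuracy(prediction: str, target: str, info: dict = None) -> float:
    """Score an MCQ prediction; if an info dict is passed, log how it was parsed."""
    parsed, method = parse_mcq_answer(prediction)
    if info is not None:
        # The one-line hook described above: passing info=info makes the
        # parsed answer and parsing method part of the logged info_dict.
        info["parsed_answer"] = parsed
        info["parsing_method"] = method
    return float(parsed == target)
```

With this shape, environments that already call the accuracy function only need to thread their existing info dict through the `info=` keyword to get the extra fields in their logs.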
Adds an answer-analysis script that performs a comprehensive analysis of the model's answers, including variability, semantic consistency across rollouts, positional bias, a confusion metric, and an overall performance measure, using the model logs (with or without the parsed answer logged). The benchmark list is currently hardcoded; feel free to adjust it. All analysis output files are written to the specified output directory, and you can use them to create various plots and tables.
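To make two of the described metrics concrete, here is a small sketch of how variability across rollouts and positional bias could be computed from logged parsed answers. The function name, the input shape (question id mapped to one parsed answer per rollout), and the metric names are illustrative assumptions, not the script's actual interface:

```python
from collections import Counter

def analyze_rollouts(rollouts: dict) -> dict:
    """Summarize parsed answers across rollouts.

    rollouts: question id -> list of parsed option letters, one per rollout.
    Returns a variation rate (fraction of questions where rollouts disagree)
    and per-option selection frequencies (a crude positional-bias signal).
    """
    n_varied = 0
    option_counts = Counter()
    for answers in rollouts.values():
        if len(set(answers)) > 1:
            n_varied += 1              # rollouts disagree on this question
        option_counts.update(answers)  # tally option letters across all rollouts
    total = sum(option_counts.values())
    return {
        "variation_rate": n_varied / len(rollouts),
        "option_frequencies": {opt: c / total for opt, c in option_counts.items()},
    }
```

A strongly skewed `option_frequencies` distribution (e.g. the model picking "A" far more often than chance) is one simple way positional bias would show up in such a summary.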
There is also a lightweight visualization script that creates a few heatmaps and scatter plots covering the model's variation rate, semantic consistency, win rate, etc.; feel free to expand on it.
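For the heatmap side, a minimal sketch of the kind of plot such a script might produce, assuming matplotlib is available; the function name, input shape, and labels are hypothetical, not taken from the actual script:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

def plot_option_heatmap(freqs, models, options, out_path):
    """Save a models-by-options heatmap of answer-selection frequencies.

    freqs: model name -> {option letter -> selection frequency}
    (hypothetical structure, e.g. the output of an analysis pass).
    """
    data = np.array([[freqs[m].get(o, 0.0) for o in options] for m in models])
    fig, ax = plt.subplots()
    im = ax.imshow(data, cmap="viridis", vmin=0.0, vmax=1.0)
    ax.set_xticks(range(len(options)), options)
    ax.set_yticks(range(len(models)), models)
    fig.colorbar(im, ax=ax, label="selection frequency")
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
```

A uniform row indicates no positional preference for that model; a bright single column is the visual signature of positional bias.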
Both scripts live in the scripts folder.