Hi everyone,
I’m new to NLP and currently reviewing papers like SimPO that use AlpacaEval2 for evaluation. I have two questions:
- Is GPT-4-1106-preview the default judge model in AlpacaEval2?
  Many recent papers (e.g., SimPO) seem to rely on GPT-4 for evaluation. Is it specifically the gpt-4-1106-preview version, or another variant?
- If GPT-4-1106-preview is unavailable, what are the alternatives?
  For fairness and reproducibility, what models do researchers typically use instead?
I'd appreciate any insights or references to papers that address this. Thanks!