Conversation
OK, I fixed a bug and ran both vanilla Llama3.1-1B-Instruct and the sampler with this branch's settings on 200 samples of GSM8K. Here are the results. @xjdr-alt With the entropix sampler, it took ~10 min. These are preliminary results, of course.
|
After applying the chat template, the results differ significantly. I benchmarked the 3B-Instruct model and could reproduce Meta's original result from their blog post without the sampler. Without the sampler it took ~30 min at batch size 1; with the entropix sampler it took ~10 hours. Both were benchmarked on a single 4090. Is there anything wrong with this branch? @xjdr-alt
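For scale, the wall-clock numbers above imply roughly a 20x end-to-end slowdown with the sampler. This is only back-of-the-envelope arithmetic (it assumes both runs covered the same 200 GSM8K samples):

```python
# Back-of-the-envelope comparison of the reported wall-clock times.
# Numbers taken from the comment above; 200 samples assumed for both runs.
SAMPLES = 200
vanilla_minutes = 30        # vanilla, batch size 1, ~30 min
entropix_minutes = 10 * 60  # entropix sampler, ~10 hours

slowdown = entropix_minutes / vanilla_minutes
per_sample_vanilla = vanilla_minutes * 60 / SAMPLES    # seconds per sample
per_sample_entropix = entropix_minutes * 60 / SAMPLES  # seconds per sample

print(f"slowdown: {slowdown:.0f}x")
print(f"vanilla: {per_sample_vanilla:.0f} s/sample, "
      f"entropix: {per_sample_entropix:.0f} s/sample")
```

That works out to about 9 s/sample vanilla versus 180 s/sample with the sampler, so part of the gap may just be sampler overhead rather than a correctness issue.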
|
I ran the vanilla model with: And the entropix sampler with: Both with lm-eval-harness version
Based off https://github.com/xjdr-alt/entropix/blob/70B/entropix/eval_main.py
This is still WIP: for a correct comparison, `apply_chat_template()` must be implemented for the `CustomLLaMAModel` to use the `gsm8k_cot_llama` task. See also the official docs for `gsm8k` in `lm-evaluation-harness`. I will run the evaluation overnight without applying the chat template and without using multi-turn for few-shot.
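As a sketch of what's missing (the function name mirrors the harness's chat-template hook, but treat the details as assumptions, not the actual `CustomLLaMAModel` implementation): `apply_chat_template()` has to turn a list of role/content messages into the Llama 3 prompt format. With a Hugging Face tokenizer this can simply delegate to `tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)`; a minimal hand-rolled equivalent would look like:

```python
# Illustrative Llama-3-style chat template, roughly what apply_chat_template()
# would need to produce for CustomLLaMAModel. In practice, delegating to
# transformers' tokenizer.apply_chat_template(...) is the safer route.

def apply_chat_template(chat: list[dict[str, str]]) -> str:
    """Render [{"role": ..., "content": ...}, ...] as a Llama 3 prompt string."""
    parts = ["<|begin_of_text|>"]
    for msg in chat:
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    # Open the assistant header so the model generates the next turn.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)
```

Without this, the harness feeds the few-shot prompt to the instruct model as raw text, which would explain why the scores with and without the chat template diverge so sharply.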