Conversation
OK, I fixed a bug and ran both vanilla Llama3.1-1B-Instruct and the sampler with this branch's settings on 200 samples of GSM8K. Here are the results. @xjdr-alt With the entropix sampler, it took ~10 min. These are preliminary results, of course.
|
After applying the chat template, the results differ significantly. I benchmarked the 3B-Instruct model and could reproduce Meta's original result from their blog post without the sampler. Without the sampler it took ~30 min at batch size 1; with the entropix sampler it took ~10 hours. Both were benchmarked on a single 4090. Is there anything wrong with this branch? @xjdr-alt
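For scale, the wall-clock numbers above imply roughly a 20x end-to-end slowdown with the sampler. This is only back-of-the-envelope arithmetic (it assumes both runs covered the same 200 GSM8K samples):

```python
# Back-of-the-envelope comparison of the reported wall-clock times.
# Numbers taken from the comment above; 200 samples assumed for both runs.
SAMPLES = 200
vanilla_minutes = 30        # vanilla, batch size 1, ~30 min
entropix_minutes = 10 * 60  # entropix sampler, ~10 hours

slowdown = entropix_minutes / vanilla_minutes
per_sample_vanilla = vanilla_minutes * 60 / SAMPLES    # seconds per sample
per_sample_entropix = entropix_minutes * 60 / SAMPLES  # seconds per sample

print(f"slowdown: {slowdown:.0f}x")
print(f"vanilla: {per_sample_vanilla:.0f} s/sample, "
      f"entropix: {per_sample_entropix:.0f} s/sample")
```

That works out to about 9 s/sample vanilla versus 180 s/sample with the sampler, so part of the gap may just be sampler overhead rather than a correctness issue.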
|
I ran the vanilla model with: And the entropix sampler with: Both with lm-eval-harness version
Based off https://github.com/xjdr-alt/entropix/blob/70B/entropix/eval_main.py
This is still WIP: for a correct comparison, `apply_chat_template()` must be implemented for the `CustomLLaMAModel` to use the `gsm8k_cot_llama` task. See also the official docs for `gsm8k` in `lm-evaluation-harness`. I will run the evaluation overnight without applying the chat template and without using multi-turn for few-shot.
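As a sketch of what's missing (the function name mirrors the harness's chat-template hook, but treat the details as assumptions, not the actual `CustomLLaMAModel` implementation): `apply_chat_template()` has to turn a list of role/content messages into the Llama 3 prompt format. With a Hugging Face tokenizer this can simply delegate to `tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)`; a minimal hand-rolled equivalent would look like:

```python
# Illustrative Llama-3-style chat template, roughly what apply_chat_template()
# would need to produce for CustomLLaMAModel. In practice, delegating to
# transformers' tokenizer.apply_chat_template(...) is the safer route.

def apply_chat_template(chat: list[dict[str, str]]) -> str:
    """Render [{"role": ..., "content": ...}, ...] as a Llama 3 prompt string."""
    parts = ["<|begin_of_text|>"]
    for msg in chat:
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    # Open the assistant header so the model generates the next turn.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)
```

Without this, the harness feeds the few-shot prompt to the instruct model as raw text, which would explain why the scores with and without the chat template diverge so sharply.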