Results differ when running multiple tasks #569

@pumetu

Description

Hello, and thank you for your work! I'm running into a problem when evaluating tinyllava on multiple tasks: the results differ from when each task is run on its own. The commands are exactly the same between the two runs, differing only in the number of tasks run. I've attached the results from each run as an example.

Commands:
Two-task command:

python3 -m accelerate.commands.launch \
    --num_processes=4 \
    -m lmms_eval \
    --model tinyllava \
    --model_args pretrained="$PATH,conv_mode=$conv_mode" \
    --tasks mme,mmstar \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix $suffix \
    --output_path outputs/ \
    --verbosity=DEBUG

Multiple-task command:

python3 -m accelerate.commands.launch \
    --num_processes=4 \
    -m lmms_eval \
    --model tinyllava \
    --model_args pretrained="$PATH,conv_mode=$conv_mode" \
    --tasks mme,mmstar,mmvet,pope,scienceqa,textvqa,gqa,vizwiz_vqa,seedbench_2_plus \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix $suffix \
    --output_path outputs/ \
    --verbosity=DEBUG

Results:
Two tasks:

[screenshot: results table from the two-task run]

Multiple tasks:
[screenshot: results table from the multiple-task run]

Here the mme and mmstar results are completely different from the earlier screenshot, where only those two tasks were run: mmstar drops to 0% in the multiple-task run, versus 23% when running just mme and mmstar, and the same thing happens with mme.
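
For reference, one way to narrow this down would be to run every task in complete isolation and compare those scores against both runs above; if the single-task scores match the two-task run, the problem likely lies in how results are handled when multiple tasks share one run. A minimal sketch, reusing the same variables as the commands above (the per-task suffix and output directory are just my own convention for keeping the runs apart, not anything lmms_eval requires):

for task in mme mmstar mmvet pope scienceqa textvqa gqa vizwiz_vqa seedbench_2_plus; do
    # Same invocation as above, but one task per launch,
    # so no task's result can be influenced by the others.
    python3 -m accelerate.commands.launch \
        --num_processes=4 \
        -m lmms_eval \
        --model tinyllava \
        --model_args pretrained="$PATH,conv_mode=$conv_mode" \
        --tasks "$task" \
        --batch_size 1 \
        --log_samples \
        --log_samples_suffix "${suffix}_${task}" \
        --output_path "outputs/${task}/" \
        --verbosity=DEBUG
done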
