Results differ when running multiple tasks #569

@pumetu

Description

Hello, and thank you for your work! I'm running into a problem when evaluating tinyllava on multiple tasks: the results differ from when each task is run on its own. The commands are exactly the same between the two runs, differing only in the number of tasks run. I've attached the results from each run as an example.

Commands:
Two-task command:

python3 -m accelerate.commands.launch \
    --num_processes=4 \
    -m lmms_eval \
    --model tinyllava \
    --model_args pretrained="$PATH,conv_mode=$conv_mode" \
    --tasks mme,mmstar \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix $suffix \
    --output_path outputs/ \
    --verbosity=DEBUG

Multiple-task command:

python3 -m accelerate.commands.launch \
    --num_processes=4 \
    -m lmms_eval \
    --model tinyllava \
    --model_args pretrained="$PATH,conv_mode=$conv_mode" \
    --tasks mme,mmstar,mmvet,pope,scienceqa,textvqa,gqa,vizwiz_vqa,seedbench_2_plus \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix $suffix \
    --output_path outputs/ \
    --verbosity=DEBUG

Results:
Two tasks:

[screenshot: results table from the two-task run]

Multiple tasks:
[screenshot: results table from the multiple-task run]

Here the mme and mmstar results are completely different from the earlier screenshot, where only those two tasks were run: mmstar drops to 0% in the multiple-task run, versus 23% when running just mme and mmstar, and the same thing happens with mme.
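
For reference, one way to narrow this down would be to run every task in complete isolation and compare those scores against both runs above; if the single-task scores match the two-task run, the problem likely lies in how results are handled when multiple tasks share one run. A minimal sketch, reusing the same variables as the commands above (the per-task suffix and output directory are just my own convention for keeping the runs apart, not anything lmms_eval requires):

for task in mme mmstar mmvet pope scienceqa textvqa gqa vizwiz_vqa seedbench_2_plus; do
    # Same invocation as above, but one task per launch,
    # so no task's result can be influenced by the others.
    python3 -m accelerate.commands.launch \
        --num_processes=4 \
        -m lmms_eval \
        --model tinyllava \
        --model_args pretrained="$PATH,conv_mode=$conv_mode" \
        --tasks "$task" \
        --batch_size 1 \
        --log_samples \
        --log_samples_suffix "${suffix}_${task}" \
        --output_path "outputs/${task}/" \
        --verbosity=DEBUG
done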
