```bash
git clone https://github.com/physical-superintelligence-lab/PhysBench
```

To download the data from the 🤗 Dataset, you can prepare the dataset by executing the following commands. It is recommended to set `<your_path_for_dataset>` to `eval/physbench`; however, you may change this to a different path if necessary, in which case you must adjust the `--dataset_path` parameter accordingly.
```bash
cd <your_path_for_dataset>  # such as '/home/usr/dataset'
huggingface-cli download USC-GVL/PhysBench --local-dir . --local-dir-use-symlinks False --repo-type dataset
# Unzip the compressed files of videos and pictures
yes | unzip image.zip -d image
yes | unzip video.zip -d video
```
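After downloading and unzipping, the dataset directory should look roughly like the sketch below (assuming the zips unpack as in the commands above; the other annotation files shipped in the 🤗 repository sit alongside these folders):

```
<your_path_for_dataset>/        # e.g. eval/physbench
├── image/                      # unpacked from image.zip
├── video/                      # unpacked from video.zip
└── ...                         # remaining files from USC-GVL/PhysBench
```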
To avoid conflicts caused by the specific library dependencies of the video-llava and chat-univi models, which may interfere with other models, we created a separate Conda environment. If you do not intend to test these two models, you can skip this step. All other operations are performed within the `physbench` environment.

```bash
# main environment (recommended)
conda create --name physbench python=3.10
conda activate physbench
pip install -r requirements.txt

# separate environment for video-llava and chat-univi (optional)
conda create --name physbench_video python=3.10
conda activate physbench_video
pip install -r requirements-video.txt
```

Specifically, we have implemented 74 models within the `models` directory, which can be installed in a single step using the method outlined below.
- Closed-Source Models: To integrate closed-source models (such as GPT, Gemini, and Claude), you will need to configure your 🔑 API key in the file `eval/models/qa_model/imageqa_model.py`.
- Open-Source Models: All open-source models can be automatically downloaded via Hugging Face. You can refer to `eval/models/qa_model/imageqa_model.py` to identify the relevant keys. For example, the key `"Mantis-llava-7b"` will invoke the `Mantis` model class, which will automatically download the model from the Hugging Face repository `TIGER-Lab/Mantis-llava-7b`:

  ```python
  "Mantis-llava-7b": ("Mantis", "TIGER-Lab/Mantis-llava-7b"),
  ```
To adapt your model, you can refer to our implementation of 39 models within the `eval/models/qa_model/imageqa_model.py` file. You may create your own class, which should primarily include the methods `__init__` and `qa`, ensuring that both methods conform to the specified interfaces.
```python
class YourModel(QAModelInstance):
    def __init__(self, ckpt, torch_device=torch.device("cuda"), model_precision=torch.float16, num_video_frames=8):
        ### load your model
        self.model = ...
        self.tokenizer = ...

    def qa(self, image, prompt, mode):
        ### give an answer for each item
        output_ids = self.model.generate(
            input_ids,
            images=[images_tensor],
            do_sample=False,
            temperature=0.1,
            top_p=None,
            ...  # your other config
        )
        outputs = self.tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
        answer = ...
        return answer  # return the answer
```

Once implemented, you can add your model to the `imageqa_models` section at the top of the file, following the format below:
| model name | class | checkpoint path |
|---|---|---|
| "instructblip-flan-t5-xl" | "InstructBlip" | "./eval/models/checkpoints/instructblip-flan-t5-xl" |
For video-based models, you may refer to `eval/models/qa_model/videoqa_model.py`.
You will need to take the following steps:
- Add your `model_name` to the `task_split` section in the file located at `eval/eval_utils/task_evaluator.py`. For example, in `task_split`:

  ```python
  "instructblip-flan-t5-xl": "image-only",
  ```

  For each model value, there are 3 options: `image-only` means only one image is input, `image&video` means only one video is input (the actual test is image-only + image&video), and `general` means interleaved data (the actual test is image-only + image&video + general).
- In the `PhysionBenchEvaluator` class, modify the `test` method to invoke the `prompt` interface of your model (see the illustrative sketch after this list).
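The snippet below is purely illustrative of the kind of change meant in the second step; the real `PhysionBenchEvaluator.test` in `eval/eval_utils/task_evaluator.py` has its own loading, prompting, and saving logic, and the item schema and names here are placeholders, so adapt the idea to the surrounding code rather than copying it:

```python
# Illustrative only: route each benchmark item to your model's qa interface
# (the "prompt" interface mentioned above) and collect answers for the output JSON.
def run_model_on_items(model, items, mode):
    predictions = []
    for item in items:
        # "image" and "prompt" are placeholder keys, not the dataset's real schema.
        predictions.append(model.qa(item["image"], item["prompt"], mode))
    return predictions
```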
After completing these changes, you can execute your model using the provided script!
```bash
CUDA_VISIBLE_DEVICES=9 PYTHONPATH='./' python eval/test_benchmark.py --model_name gpt4o --dataset_path ./eval/physbench
```

After running the script, a file named `[model_name].json` will be generated in the `./eval/physbench/results` directory.
- [Common Case] We also provide a 📃 tested sample file, which can be referenced to understand the required JSON submission format.
- [For models that only support a single image input] If the model you are testing, such as LLaVA-1.5, only supports one image input, you need to select `image&video` instead of `general` above. In this case, you are actually testing the results with interleaved items removed, which corresponds to the experiment in the main table of our paper. We also provide a 📃 tested sample file w/o interleaved. Please select the Dev Phase in 🔗 EvalAI. (Pay special attention: for models like LLaVA-1.5 that only support one image input, you need to stitch the multiple frames of an `image&video` item, i.e. only one video input, into a single frame for input; see the sketch below. We do not support the evaluation of image-only models due to limitations of the EvalAI platform.)
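As a minimal sketch of the frame-stitching idea (assuming the sampled video frames are available as equally sized PIL images; PhysBench's own pipeline may sample, tile, or resize frames differently), one way to combine the frames of an `image&video` item into a single input image is:

```python
from PIL import Image

def stitch_frames(frames, cols=4):
    """Tile a list of equally sized PIL frames into one grid image."""
    rows = (len(frames) + cols - 1) // cols
    w, h = frames[0].size
    canvas = Image.new("RGB", (cols * w, rows * h))
    for i, frame in enumerate(frames):
        canvas.paste(frame, ((i % cols) * w, (i // cols) * h))
    return canvas
```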
Since uploading to EvalAI can be cumbersome, you can now use the following script to evaluate and print the scores directly, without uploading to the EvalAI platform (the results are consistent with those from EvalAI):
```bash
PYTHONPATH='./' python eval/print_test_score.py --test-file '[model_name].json'
```

- This script differs slightly from directly using the function calls within `eval`. To maintain consistency with EvalAI, it does not directly call...
Please upload this file to 🔗 EvalAI to automatically evaluate the results, and select the Test Phase in 🔗 EvalAI.
In the `case` folder, you can find examples of how to use `eval_utils` to get the results.
😀 We sincerely welcome any questions or inquiries and encourage open communication.