Below, we describe how we adapted the TempCompass repository for our experiments with Video-LLaVA.
To begin with, clone this repository and install the required packages:
git clone https://github.com/Emmakast/Modification-on-Video-LLaVA.git
cd Modification-on-Video-LLaVA
pip install -r requirements.txt
1. Task Instructions
The task instructions can be found in questions/.
Task Instruction Generation Procedure
- Generate Multi-Choice QA instructions (question_gen.py).
- Manually validate quality and rectify.
- Generate task instructions for Yes/No QA (question_gen_yes_no.py), Caption Matching (question_gen_caption_match.py) and Caption Generation (question_gen_captioning.py), based on the manually rectified Multi-Choice QA instructions.
- Manually validate quality and rectify.
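For reference, a possible invocation order is sketched below; the scripts' exact arguments, if any, are not documented here and may need to be checked in the repository:
python question_gen.py                 # Multi-Choice QA instructions
# ... manual validation and rectification ...
python question_gen_yes_no.py          # Yes/No QA
python question_gen_caption_match.py   # Caption Matching
python question_gen_captioning.py      # Caption Generation
# ... manual validation and rectification ...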
2. Videos
All the processed videos can be downloaded from Google Drive or Hugging Face. As an alternative, you can also download the raw videos and process them yourself by running the following commands. The videos will be saved to videos/.
cd utils
python download_video.py # Download raw videos
python process_videos.py # Construct conflicting videos
Note: If you encounter a MoviePy error when running the processing script, please refer to this issue.
We use Video-LLaVA to illustrate how to conduct MLLM inference on the benchmark.
1. Video-LLaVA
Enter run_video_llava and install the environment as instructed.
Then run the following commands. The prediction results will be saved to predictions/video-llava/<task_type>.
# select <task_type> from multi-choice, yes_no, caption_matching, (captioning)
python inference_dataset.py --task_type <task_type>
If you want to do inference with our modifications, you can run the following command:
# select <task_type> from multi-choice, yes_no, caption_matching
# and <modification> from prompt, timestamps, framesampling, fs_gradient, blackframes
python inference_dataset_<modification>.py --task_type <task_type>
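For example, to run the timestamps modification on the Yes/No task:
python inference_dataset_timestamps.py --task_type yes_no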
After obtaining the MLLM predictions, run the following commands to conduct automatic evaluation.
- Multi-Choice QA
  python eval_multi_choice.py --video_llm video-llava --disable_llm
- Yes/No QA
  python eval_yes_no.py --video_llm video-llava --disable_llm
- Caption Matching
  python eval_caption_matching.py --video_llm video-llava --disable_llm
- Caption Generation (NOTE: this method needs an API key for an LLM, which we haven't used)
  python eval_captioning.py --video_llm video-llava
The results of each data point will be saved to auto_eval_results/video-llava/<task_type>.json and the overall results on each temporal aspect will be printed out as follows:
{'action': 76.0, 'direction': 35.2, 'speed': 35.6, 'order': 37.7, 'attribute_change': 41.0, 'avg': 45.6}
{'fine-grained action': 58.8, 'coarse-grained action': 90.3, 'object motion': 36.2, 'camera motion': 32.6, 'absolute speed': 47.6, 'relative speed': 28.0, 'order': 37.7, 'color & light change': 43.6, 'size & shape change': 39.4, 'combined change': 41.7, 'other change': 38.9}
Match Success Rate=100.0
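To inspect the saved per-datapoint results programmatically, a minimal sketch such as the following can be used (the internal structure of the JSON file is an assumption and should be checked against the actual output):
import json

# Path follows the pattern auto_eval_results/video-llava/<task_type>.json
with open("auto_eval_results/video-llava/multi-choice.json") as f:
    results = json.load(f)

# Peek at the structure before processing it further.
print(json.dumps(results, indent=2)[:1000])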
For running frame sampling experiments yourself, we include an interactive Jupyter notebook. It can be found at run_video_llava/framesampling_visualizations.ipynb and produces frame sampling visualization plots.
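For intuition, the sketch below shows one way frame indices could be selected uniformly or non-uniformly across a video; it is illustrative only, and the function names, frame counts and sampling strategies are assumptions rather than the notebook's code:
import numpy as np

def uniform_frame_indices(total_frames: int, num_samples: int = 8) -> np.ndarray:
    """Evenly spaced frame indices across the whole video (illustrative only)."""
    return np.linspace(0, total_frames - 1, num_samples).round().astype(int)

def front_loaded_indices(total_frames: int, num_samples: int = 8, power: float = 2.0) -> np.ndarray:
    """Non-uniform sampling that concentrates frames near the start (illustrative only)."""
    t = np.linspace(0, 1, num_samples) ** power
    return (t * (total_frames - 1)).round().astype(int)

print(uniform_frame_indices(120))   # [  0  17  34  51  68  85 102 119]
print(front_loaded_indices(120))    # denser near the beginning of the video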
For running prompt engineering experiments, run_video_llava/inference_dataset_prompt_3runs.py can be used. The prompt can be adjusted as desired by running the following command:
# select <task_type> from multi-choice, yes_no, caption_matching, (captioning)
python inference_dataset_prompt_3runs.py --task_type <task_type> --answer_prompt <prompt for specific task>
Here, <prompt for specific task> should be a string.
For our research, we use the following prompts (an example invocation is shown after this list):
- "Approach the video by thinking about the reasons behind the actions and their order in time, and choose the most relevant option."
- "Approach the video by thinking about the reasons behind the actions and their order in time, and please answer with yes or no."
- "Analyze the video frame-by-frame for this event, answer yes or no:"
- "Choose the option that best matches the visual content of the video."
- "Does the video show this event happening? Answer yes or no, focusing on timing:"
- "Consider the beginning, middle, and end of the video. Which caption best summarizes the overall temporal narrative?"
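For example, to run the multi-choice task with the first prompt above:
python inference_dataset_prompt_3runs.py --task_type multi-choice --answer_prompt "Approach the video by thinking about the reasons behind the actions and their order in time, and choose the most relevant option."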
During evaluation, run the following command to avoid overwriting other results:
python eval_<task-type>.py --video_llm video-llava --disable_llm --input_path run_video_llava/predictions_prompt --output_path auto_eval_results_prompt
For running optical flow experiments, the variable mode should be set to optical_flow_arrow in processing_video.py. This file can be found in run_video_llava/llava/model/multimodal_encoder/languagebind/video. Optical flow arrows will then be added to the previously sampled frames for each video.
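For intuition, the sketch below shows one way optical flow arrows can be drawn onto a frame with OpenCV; it is illustrative only and is not the implementation used in processing_video.py:
import cv2
import numpy as np

def draw_flow_arrows(prev_frame: np.ndarray, next_frame: np.ndarray, step: int = 24) -> np.ndarray:
    """Overlay sparse optical flow arrows on next_frame (illustrative only)."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Dense Farneback flow: one (dx, dy) motion vector per pixel.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    vis = next_frame.copy()
    h, w = next_gray.shape
    for y in range(step // 2, h, step):      # draw arrows on a coarse grid
        for x in range(step // 2, w, step):
            dx, dy = flow[y, x]
            cv2.arrowedLine(vis, (x, y), (int(x + dx), int(y + dy)),
                            (0, 255, 0), 1, tipLength=0.3)
    return vis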
For running the time encoding experiments, the time_encoding_type in the file run_video_llava/llava/model/llava_arch.py should be set to None (for the vanilla model), Timestamps or FrameOrder.
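As a rough illustration of what these three settings mean, the sketch below derives per-frame time tags; the function, the exact tag format and whether the settings are strings are assumptions, not the code in llava_arch.py:
def time_tags(num_frames: int, video_duration_s: float, encoding) -> list:
    """Illustrative only: per-frame time tags for the three settings."""
    if encoding is None:                 # vanilla model: no time information
        return ["" for _ in range(num_frames)]
    if encoding == "Timestamps":         # absolute time of each sampled frame
        step = video_duration_s / max(num_frames - 1, 1)
        return [f"{i * step:.1f}s" for i in range(num_frames)]
    if encoding == "FrameOrder":         # ordinal position of each frame
        return [f"frame {i + 1} of {num_frames}" for i in range(num_frames)]
    raise ValueError(f"Unknown time_encoding_type: {encoding}")

print(time_tags(4, 9.0, "Timestamps"))   # ['0.0s', '3.0s', '6.0s', '9.0s']
print(time_tags(4, 9.0, "FrameOrder"))   # ['frame 1 of 4', ..., 'frame 4 of 4']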
This dataset is intended for academic research only and is released under the CC BY-NC 4.0 license.