Below, we describe how we adapted the TempCompass repository for our experiments with Video-LLaVA.
To begin with, clone this repository and install the required packages:
git clone https://github.com/Emmakast/Modification-on-Video-LLaVA.git
cd Modification-on-Video-LLaVA
pip install -r requirements.txt
1. Task Instructions
The task instructions can be found in questions/.
Task Instruction Generation Procedure
- Generate Multi-Choice QA instructions (question_gen.py).
- Manually validate quality and rectify.
- Generate task instructions for Yes/No QA (question_gen_yes_no.py), Caption Matching (question_gen_caption_match.py) and Caption Generation (question_gen_captioning.py), based on the manually rectified Multi-Choice QA instructions.
- Manually validate quality and rectify.
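For reference, a possible invocation order is sketched below; the scripts' exact arguments, if any, are not documented here and may need to be checked in the repository:
python question_gen.py                 # Multi-Choice QA instructions
# ... manual validation and rectification ...
python question_gen_yes_no.py          # Yes/No QA
python question_gen_caption_match.py   # Caption Matching
python question_gen_captioning.py      # Caption Generation
# ... manual validation and rectification ...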
2. Videos
All the processed videos can be downloaded from Google Drive or Hugging Face. As an alternative, you can also download the raw videos and process them yourself by running the following commands. The videos will be saved to videos/.
cd utils
python download_video.py # Download raw videos
python process_videos.py # Construct conflicting videos
Note: If you encounter a MoviePy error when running the processing script, please refer to this issue.
We use Video-LLaVA to illustrate how to conduct MLLM inference on the benchmark.
1. Video-LLaVA
Enter run_video_llava and install the environment as instructed.
Then run the following commands. The prediction results will be saved to predictions/video-llava/<task_type>.
# select <task_type> from multi-choice, yes_no, caption_matching, (captioning)
python inference_dataset.py --task_type <task_type>
If you want to do inference with our modifications, you can run the following command:
# select <task_type> from multi-choice, yes_no, caption_matching
# and <modification> from prompt, timestamps, framesampling, fs_gradient, blackframes
python inference_dataset_<modification>.py --task_type <task_type>
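For example, to run the timestamps modification on the Yes/No task:
python inference_dataset_timestamps.py --task_type yes_no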
After obtaining the MLLM predictions, run the following commands to conduct automatic evaluation.
- Multi-Choice QA
  python eval_multi_choice.py --video_llm video-llava --disable_llm
- Yes/No QA
  python eval_yes_no.py --video_llm video-llava --disable_llm
- Caption Matching
  python eval_caption_matching.py --video_llm video-llava --disable_llm
- Caption Generation (NOTE: this method needs an API key for an LLM, which we haven't used)
  python eval_captioning.py --video_llm video-llava
The results of each data point will be saved to auto_eval_results/video-llava/<task_type>.json and the overall results on each temporal aspect will be printed out as follows:
{'action': 76.0, 'direction': 35.2, 'speed': 35.6, 'order': 37.7, 'attribute_change': 41.0, 'avg': 45.6}
{'fine-grained action': 58.8, 'coarse-grained action': 90.3, 'object motion': 36.2, 'camera motion': 32.6, 'absolute speed': 47.6, 'relative speed': 28.0, 'order': 37.7, 'color & light change': 43.6, 'size & shape change': 39.4, 'combined change': 41.7, 'other change': 38.9}
Match Success Rate=100.0
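To inspect the saved per-datapoint results programmatically, a minimal sketch such as the following can be used (the internal structure of the JSON file is an assumption and should be checked against the actual output):
import json

# Path follows the pattern auto_eval_results/video-llava/<task_type>.json
with open("auto_eval_results/video-llava/multi-choice.json") as f:
    results = json.load(f)

# Peek at the structure before processing it further.
print(json.dumps(results, indent=2)[:1000])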
For running frame sampling experiments yourself, we include an interactive Jupyter notebook. It can be found at run_video_llava/framesampling_visualizations.ipynb and produces frame sampling visualization plots.
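For intuition, the sketch below shows one way frame indices could be selected uniformly or non-uniformly across a video; it is illustrative only, and the function names, frame counts and sampling strategies are assumptions rather than the notebook's code:
import numpy as np

def uniform_frame_indices(total_frames: int, num_samples: int = 8) -> np.ndarray:
    """Evenly spaced frame indices across the whole video (illustrative only)."""
    return np.linspace(0, total_frames - 1, num_samples).round().astype(int)

def front_loaded_indices(total_frames: int, num_samples: int = 8, power: float = 2.0) -> np.ndarray:
    """Non-uniform sampling that concentrates frames near the start (illustrative only)."""
    t = np.linspace(0, 1, num_samples) ** power
    return (t * (total_frames - 1)).round().astype(int)

print(uniform_frame_indices(120))   # [  0  17  34  51  68  85 102 119]
print(front_loaded_indices(120))    # denser near the beginning of the video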
For running prompt engineering experiments, run_video_llava/inference_dataset_prompt_3runs.py can be used. The prompt can be adjusted as desired by running the following command:
# select <task_type> from multi-choice, yes_no, caption_matching, (captioning)
python inference_dataset_prompt_3runs.py --task_type <task_type> --answer_prompt <prompt for specific task>
Here, <prompt for specific task> should be a string.
For our research, we use the following prompts (an example invocation is shown after this list):
- "Approach the video by thinking about the reasons behind the actions and their order in time, and choose the most relevant option."
- "Approach the video by thinking about the reasons behind the actions and their order in time, and please answer with yes or no."
- "Analyze the video frame-by-frame for this event, answer yes or no:"
- "Choose the option that best matches the visual content of the video."
- "Does the video show this event happening? Answer yes or no, focusing on timing:"
- "Consider the beginning, middle, and end of the video. Which caption best summarizes the overall temporal narrative?"
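For example, to run the multi-choice task with the first prompt above:
python inference_dataset_prompt_3runs.py --task_type multi-choice --answer_prompt "Approach the video by thinking about the reasons behind the actions and their order in time, and choose the most relevant option."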
During evaluation, run the following command to avoid overwriting other results:
python eval_<task-type>.py --video_llm video-llava --disable_llm --input_path run_video_llava/predictions_prompt --output_path auto_eval_results_prompt
For running optical flow experiments, the variable mode should be set to optical_flow_arrow in processing_video.py. This file can be found in run_video_llava/llava/model/multimodal_encoder/languagebind/video. Optical flow arrows will then be added to the previously sampled frames for each video.
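For intuition, the sketch below shows one way optical flow arrows can be drawn onto a frame with OpenCV; it is illustrative only and is not the implementation used in processing_video.py:
import cv2
import numpy as np

def draw_flow_arrows(prev_frame: np.ndarray, next_frame: np.ndarray, step: int = 24) -> np.ndarray:
    """Overlay sparse optical flow arrows on next_frame (illustrative only)."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Dense Farneback flow: one (dx, dy) motion vector per pixel.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    vis = next_frame.copy()
    h, w = next_gray.shape
    for y in range(step // 2, h, step):      # draw arrows on a coarse grid
        for x in range(step // 2, w, step):
            dx, dy = flow[y, x]
            cv2.arrowedLine(vis, (x, y), (int(x + dx), int(y + dy)),
                            (0, 255, 0), 1, tipLength=0.3)
    return vis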
For running the time encoding experiments, the time_encoding_type in the file run_video_llava/llava/model/llava_arch.py should be set to None (for the vanilla model), Timestamps or FrameOrder.
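As a rough illustration of what these three settings mean, the sketch below derives per-frame time tags; the function, the exact tag format and whether the settings are strings are assumptions, not the code in llava_arch.py:
def time_tags(num_frames: int, video_duration_s: float, encoding) -> list:
    """Illustrative only: per-frame time tags for the three settings."""
    if encoding is None:                 # vanilla model: no time information
        return ["" for _ in range(num_frames)]
    if encoding == "Timestamps":         # absolute time of each sampled frame
        step = video_duration_s / max(num_frames - 1, 1)
        return [f"{i * step:.1f}s" for i in range(num_frames)]
    if encoding == "FrameOrder":         # ordinal position of each frame
        return [f"frame {i + 1} of {num_frames}" for i in range(num_frames)]
    raise ValueError(f"Unknown time_encoding_type: {encoding}")

print(time_tags(4, 9.0, "Timestamps"))   # ['0.0s', '3.0s', '6.0s', '9.0s']
print(time_tags(4, 9.0, "FrameOrder"))   # ['frame 1 of 4', ..., 'frame 4 of 4']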
This dataset is intended for academic research only and is released under the CC BY-NC 4.0 license.