Diagnosing and Addressing Temporal Reasoning Limitations in Video-LLaVA

Arne Eichholtz, Jutte Vijverberg, Emma Kasteleyn, Freek Byrman, Daniel Uyterlinde and Caspar de Jong

The instructions below are adapted from the TempCompass repository for our experiments with Video-LLaVA:

🚀 Quick Start

To begin, clone this repository and install the required packages:

git clone https://github.com/Emmakast/Modification-on-Video-LLaVA.git
cd Modification-on-Video-LLaVA
pip install -r requirements.txt

Data Preparation

1. Task Instructions

The task instructions can be found in questions/.

Task Instruction Generation Procedure
  1. Generate Multi-Choice QA instructions (question_gen.py).

  2. Manually validate quality and rectify.

  3. Generate task instructions for Yes/No QA (question_gen_yes_no.py), Caption Matching (question_gen_caption_match.py) and Caption Generation (question_gen_captioning.py), based on the manually rectified Multi-Choice QA instructions (a sketch of this conversion is shown after this list).

  4. Manually validate quality and rectify.
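
To make the relation between the formats concrete, here is a minimal sketch of how a Yes/No item could be derived from a validated Multi-Choice item; the field names ("question", "options", "answer") are assumptions for illustration, not the repository's actual schema:

def multi_choice_to_yes_no(mc_item):
    # mc_item is assumed to look like:
    # {"question": "...", "options": ["...", "..."], "answer": "..."}
    yes_no_items = []
    for option in mc_item["options"]:
        yes_no_items.append({
            "question": f"{mc_item['question']} Is the answer '{option}'? Please answer yes or no.",
            "answer": "yes" if option == mc_item["answer"] else "no",
        })
    return yes_no_items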

2. Videos

All processed videos can be downloaded from Google Drive or Hugging Face.

Alternatively, you can download the raw videos and process them yourself.

Run the following commands. The videos will be saved to videos/.

cd utils
python download_video.py    # Download raw videos
python process_videos.py    # Construct conflicting videos

Note: If you encounter a MoviePy error when running the processing script, please refer to this issue.
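
For intuition, the sketch below shows one way a "conflicting" video could be constructed with MoviePy (1.x-style API): reversing a clip so that direction- and order-related questions get a contradictory counterpart. The file paths are placeholders and this is not necessarily the exact procedure in process_videos.py:

from moviepy.editor import VideoFileClip
from moviepy.video.fx.all import time_mirror

clip = VideoFileClip("videos/example.mp4")          # placeholder path
reversed_clip = clip.fx(time_mirror)                # play the clip backwards
reversed_clip.write_videofile("videos/example_reversed.mp4", audio=False)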

Run Inference

We use Video-LLaVA to illustrate how to conduct MLLM inference on the benchmark.

1. Video-LLaVA

Enter the run_video_llava directory and install the environment as instructed there.

Then run the following commands. The prediction results will be saved to predictions/video-llava/<task_type>.

# select <task_type> from multi-choice, yes_no, caption_matching, (captioning)
python inference_dataset.py --task_type <task_type>

If you want to run inference with our modifications, use the following command:

# select <task_type> from multi-choice, yes_no, caption_matching
# and <modification> from prompt, timestamps, framesampling, fs_gradient, blackframes
python inference_dataset_<modification>.py --task_type <task_type>

Run Evaluation

After obtaining the MLLM predictions, run the following commands to conduct automatic evaluation.

  • Multi-Choice QA: python eval_multi_choice.py --video_llm video-llava --disable_llm

  • Yes/No QA: python eval_yes_no.py --video_llm video-llava --disable_llm

  • Caption Matching: python eval_caption_matching.py --video_llm video-llava --disable_llm

  • Caption Generation (note: this evaluation requires an API key for an LLM, which we did not use): python eval_captioning.py --video_llm video-llava

The results of each data point will be saved to auto_eval_results/video-llava/<task_type>.json and the overall results on each temporal aspect will be printed out as follows:

{'action': 76.0, 'direction': 35.2, 'speed': 35.6, 'order': 37.7, 'attribute_change': 41.0, 'avg': 45.6}
{'fine-grained action': 58.8, 'coarse-grained action': 90.3, 'object motion': 36.2, 'camera motion': 32.6, 'absolute speed': 47.6, 'relative speed': 28.0, 'order': 37.7, 'color & light change': 43.6, 'size & shape change': 39.4, 'combined change': 41.7, 'other change': 38.9}
Match Success Rate=100.0
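
If you want to recompute these aggregates yourself, a minimal sketch along the following lines can be used; the JSON layout assumed here (each aspect mapped to a list of entries with a rating field) is an illustration and may differ from the evaluator's exact output schema:

import json

with open("auto_eval_results/video-llava/multi-choice.json") as f:   # path from the step above
    results = json.load(f)

for aspect, items in results.items():
    ratings = [item["rating"] for item in items]                     # assumed: 1 = correct, 0 = wrong
    print(aspect, round(100 * sum(ratings) / len(ratings), 1))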

Frame Sampling

For running frame sampling experiments yourself, we include an interactive Jupyter notebook at run_video_llava/framesampling_visualizations.ipynb, which produces frame sampling visualization plots.
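
As a rough illustration of what is being compared, the sketch below contrasts Video-LLaVA-style uniform frame sampling with a non-uniform schedule; the "gradient" schedule shown here is an assumption for illustration and may differ from the one implemented in inference_dataset_fs_gradient.py:

import numpy as np

def uniform_indices(total_frames, num_samples=8):
    # evenly spaced frame indices, as in standard Video-LLaVA preprocessing
    return np.linspace(0, total_frames - 1, num_samples).astype(int)

def gradient_indices(total_frames, num_samples=8, power=2.0):
    # hypothetical non-uniform schedule: indices cluster towards the end of the video
    t = np.linspace(0, 1, num_samples) ** (1.0 / power)
    return (t * (total_frames - 1)).astype(int)

print(uniform_indices(120))    # [  0  17  34  51  68  85 102 119]
print(gradient_indices(120))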

Prompt engineering

For running prompt engineering experiments, use run_video_llava/inference_dataset_prompt_3runs.py. The prompt can be adjusted as desired via the --answer_prompt argument:

# select <task_type> from multi-choice, yes_no, caption_matching, (captioning)
python inference_dataset_prompt_3runs.py --task_type <task_type> --answer_prompt <prompt for specific task>

Here, <prompt for specific task> should be a string.
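
The sketch below illustrates the assumed effect of --answer_prompt: the string is appended to each task instruction before it is passed to the model. The example strings are hypothetical, and the exact composition in inference_dataset_prompt_3runs.py may differ:

question = "What is the person doing at the start of the video?\nA. running\nB. sitting"   # hypothetical item
answer_prompt = "Please answer with the letter of the correct option."                      # hypothetical prompt
full_prompt = f"{question}\n{answer_prompt}"
print(full_prompt)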

For our research, we use the following prompts:

  • "Approach the video by thinking about the reasons behind the actions and their order in time, and choose the most relevant option."
  • "Approach the video by thinking about the reasons behind the actions and their order in time, and please answer with yes or no."
  • "Analyze the video frame-by-frame for this event, answer yes or no:"
  • "Choose the option that best matches the visual content of the video."
  • "Does the video show this event happening? Answer yes or no, focusing on timing:"
  • "Consider the beginning, middle, and end of the video. Which caption best summarizes the overall temporal narrative?"Choose the option that best matches the visual content of the video."

During evaluation, run the following command so that other results are not overwritten:

python eval_<task-type>.py --video_llm video-llava --disable_llm --input_path run_video_llava/predictions_prompt --output_path auto_eval_results_prompt

Optical flow

For running optical flow experiments, set the variable mode to optical_flow_arrow in processing_video.py, which can be found in run_video_llava/llava/model/multimodal_encoder/languagebind/video. Optical flow arrows will then be drawn on the previously sampled frames of each video.
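
The sketch below illustrates the general idea with OpenCV: dense optical flow is computed between two consecutive sampled frames and arrows are drawn on top of the later frame. It is an illustration of the technique, not the exact code in processing_video.py:

import cv2
import numpy as np

def draw_flow_arrows(prev_frame, frame, step=32):
    # dense Farneback flow between the two frames
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    out = frame.copy()
    h, w = gray.shape
    # draw one arrow per grid cell, pointing along the local motion
    for y in range(step // 2, h, step):
        for x in range(step // 2, w, step):
            dx, dy = flow[y, x]
            cv2.arrowedLine(out, (x, y), (int(x + dx), int(y + dy)),
                            (0, 0, 255), 2, tipLength=0.3)
    return out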

Time encoding

For running the time encoding experiments, set time_encoding_type in run_video_llava/llava/model/llava_arch.py to None (for the vanilla model), Timestamps, or FrameOrder.
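
Conceptually, both variants attach frame-order information to the per-frame visual features before they reach the language model. The sketch below illustrates that idea; the tensor shapes, module names, and the exact way the encoding is injected in llava_arch.py are assumptions:

import torch
import torch.nn as nn

num_frames, feat_dim = 8, 1024
frame_features = torch.randn(num_frames, feat_dim)   # hypothetical per-frame encoder output

# "FrameOrder"-style: add a learned embedding indexed by frame position
order_embedding = nn.Embedding(num_frames, feat_dim)
features_with_order = frame_features + order_embedding(torch.arange(num_frames))

# "Timestamps"-style: map (approximate) timestamps in seconds into the feature space
timestamps = torch.linspace(0, 10.0, num_frames).unsqueeze(-1)   # assumed 10-second clip
time_proj = nn.Linear(1, feat_dim)
features_with_time = frame_features + time_proj(timestamps)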

License

This dataset is intended for academic research only and is released under the CC BY-NC 4.0 license.
