Modifications in this fork:
- Changed stage 2 prompting to reflect SWISS TXT guidelines
- Added the character recognition module from AutoAD Zero
- Changed models:
  - Stage 1: Qwen2.5-VL instead of Qwen2-VL
  - Stage 2: Qwen3 instead of LLaMA3
Junyu Xie¹, Tengda Han¹, Max Bain¹, Arsha Nagrani¹, Eshika Khandelwal² ³, Gül Varol¹ ³, Weidi Xie¹ ⁴, Andrew Zisserman¹
¹ Visual Geometry Group, Department of Engineering Science, University of Oxford
² CVIT, IIIT Hyderabad
³ LIGM, École des Ponts, Univ Gustave Eiffel, CNRS
⁴ CMIC, Shanghai Jiao Tong University
In this work, we evaluate our model on common AD benchmarks including CMD-AD, MAD-Eval, and TV-AD.
- CMD-AD can be downloaded here.
- MAD-Eval can be downloaded here.
- TV-AD can be downloaded following instructions here.
- All annotations can be found in `resources/annotations/`.
- The AD predictions (by Qwen2-VL+LLaMA3 or GPT-4o+GPT-4o) can be downloaded here.
We propose a new evaluation metric, named "action score", that focuses on whether a specific ground truth (GT) action is captured within the prediction.
The detailed evaluation code can be found in `action_score/`.
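As a rough illustration of what the metric measures (this is not the repository's implementation in `action_score/`; the helper below is purely hypothetical), a GT action counts as captured when its content words all reappear in the predicted AD:

```python
# Naive illustration of the "action score" idea, not the actual metric:
# check whether the content words of a GT action appear in the predicted AD.
def action_captured(gt_action: str, prediction: str) -> bool:
    stop_words = {"a", "an", "the", "his", "her", "their", "to", "of"}
    gt_words = {w for w in gt_action.lower().split() if w not in stop_words}
    pred_words = set(prediction.lower().split())
    return gt_words <= pred_words  # all GT content words are present

print(action_captured("opens the door", "He slowly opens the door and steps inside"))  # True
```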
- Basic Dependencies: `python>=3.8`, `pytorch=2.1.2`, `transformers=4.46.0`, `Pillow`, `pandas`, `decord`, `opencv`
- For inference based on open-sourced models, set the cache path (for Qwen2-VL, LLaMA3, etc.) by modifying `os.environ['TRANSFORMERS_CACHE'] = "/path/to/cache/"` in `stage1/main_qwen2vl.py` and `stage2/main_llama3.py` (see the snippet after this list).
- For inference based on the proprietary GPT-4o models, set the API key by modifying `os.environ["OPENAI_API_KEY"] = "<openai-api-key>"` in `stage1/main_gpt4o.py` and `stage2/main_gpt4o.py` (see the snippet after this list).
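For reference, a minimal sketch of these two edits (all paths and the key value are placeholders to be replaced with your own):

```python
import os

# In stage1/main_qwen2vl.py and stage2/main_llama3.py: point the Hugging Face
# cache to a directory with enough disk space, before any model is loaded.
os.environ['TRANSFORMERS_CACHE'] = "/path/to/cache/"   # placeholder cache directory

# In stage1/main_gpt4o.py and stage2/main_gpt4o.py: set the OpenAI API key instead.
os.environ["OPENAI_API_KEY"] = "<openai-api-key>"      # placeholder API key
```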
To structure the context frames according to shots, as well as recognise characters in each shot, please refer to the guidelines in `preprocess/`.
(This step can be skipped by directly using the pre-computed results of the form `resources/annotations/{dataset}_anno_context-3.0-8.0_face-0.2-0.4.csv`.)
To predict the film grammar, including shot scales and thread structures, please follow the steps detailed in `film_grammar/`.
(This step can be skipped by directly using the pre-computed results of the form `resources/annotations/{dataset}_anno_context-3.0-8.0_face-0.2-0.4_scale_thread.csv`.)
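To sanity-check that these annotations are in place before running stage 1, they can be inspected with pandas; the CMD-AD file below is the pre-computed example named above, and the printed columns are simply whatever the csv provides:

```python
import pandas as pd

# Load the pre-computed CMD-AD annotations (character recognition + film grammar).
anno = pd.read_csv(
    "resources/annotations/cmdad_anno_context-3.0-8.0_face-0.2-0.4_scale_thread.csv"
)
print(anno.columns.tolist())  # available annotation fields
print(anno.head())            # preview the first few rows
```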
```bash
# Stage 1 (use stage1/main_gpt4o.py instead to run with GPT-4o).
# Example values: --dataset "cmdad",
#                 --anno_path "resources/annotations/cmdad_anno_context-3.0-8.0_face-0.2-0.4_scale_thread.csv",
#                 --charbank_path "resources/charbanks/cmdad_charbank.json"
python stage1/main_qwen2vl.py \
    --dataset={dataset} \
    --anno_path={anno_path} \
    --charbank_path={charbank_path} \
    --video_dir={video_dir} \
    --save_dir={save_dir} \
    --font_path="resources/fonts/times.ttf" \
    --shot_label
```
- `--dataset`: choices are `cmdad`, `madeval`, and `tvad`.
- `--anno_path`: path to AD annotations (with character recognition results and film grammar predictions), available in `resources/annotations/`.
- `--charbank_path`: path to external character banks, available in `resources/charbanks/`.
- `--video_dir`: directory of video datasets; example file structures can be found in `resources/example_file_structures` (files are empty, for reference only).
- `--save_dir`: directory to save the output csv.
- `--font_path`: path to the font file for shot labels (default is Times New Roman).
- `--shot_label`: add a shot number label at the top-left of each frame.
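For example, a concrete CMD-AD invocation using the pre-computed annotations could look as follows; the `--video_dir` and `--save_dir` values are placeholders for your own paths:

```bash
python stage1/main_qwen2vl.py \
    --dataset="cmdad" \
    --anno_path="resources/annotations/cmdad_anno_context-3.0-8.0_face-0.2-0.4_scale_thread.csv" \
    --charbank_path="resources/charbanks/cmdad_charbank.json" \
    --video_dir="/path/to/cmdad/videos" \
    --save_dir="outputs/stage1_cmdad" \
    --font_path="resources/fonts/times.ttf" \
    --shot_label
```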
```bash
# Stage 2 (use stage2/main_gpt4o.py instead to run with GPT-4o).
# Example values: --dataset "cmdad", --mode "single"
python stage2/main_llama3.py \
    --dataset={dataset} \
    --mode={mode} \
    --pred_path={pred_path} \
    --save_dir={save_dir}
```
- `--dataset`: choices are `cmdad`, `madeval`, and `tvad`.
- `--mode`: `single` for a single AD output; `assistant` for five candidate AD outputs.
- `--pred_path`: path to the csv file saved by stage 1.
- `--save_dir`: directory to save the output csv.
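Similarly, a concrete stage-2 invocation on CMD-AD could look as follows; the `--pred_path` and `--save_dir` values are placeholders (point `--pred_path` at whichever csv stage 1 actually saved in your `--save_dir`):

```bash
python stage2/main_llama3.py \
    --dataset="cmdad" \
    --mode="single" \
    --pred_path="outputs/stage1_cmdad/stage1_predictions.csv" \
    --save_dir="outputs/stage2_cmdad"
```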
If you find this repository helpful, please consider citing our work! 😊
```bibtex
@article{xie2025shotbyshot,
  title={Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation},
  author={Junyu Xie and Tengda Han and Max Bain and Arsha Nagrani and Eshika Khandelwal and G\"ul Varol and Weidi Xie and Andrew Zisserman},
  journal={arXiv preprint arXiv:2504.01020},
  year={2025}
}
```
Qwen2-VL: https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct
LLaMA3: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
GPT-4o: https://openai.com/api/