ZurichNLP/shot-by-shot

Adaptations for German/Italian movies

Modifications in this fork:

  • changed Stage II prompting to reflect the SWISS TXT guidelines
  • added the character recognition module from AutoAD Zero
  • changed models:

For more information, see the original README below.

Shot🎞️-by-Shot🎞️: Film-Grammar-Aware Training-Free Audio Description Generation

Junyu Xie1, Tengda Han1, Max Bain1, Arsha Nagrani1, Eshika Khandelwal2 3, Gül Varol1 3, Weidi Xie1 4, Andrew Zisserman1

1 Visual Geometry Group, Department of Engineering Science, University of Oxford
2 CVIT, IIIT Hyderabad
3 LIGM, École des Ponts, Univ Gustave Eiffel, CNRS
4 CMIC, Shanghai Jiao Tong University

Project page

Datasets and Results

In this work, we evaluate our model on common AD benchmarks including CMD-AD, MAD-Eval, and TV-AD.

Video Frames

  • CMD-AD can be downloaded here.
  • MAD-Eval can be downloaded here.
  • TV-AD can be downloaded following instructions here.

Ground Truth AD Annotations

  • All annotations can be found in resources/annotations/.

Predicted ADs

  • The AD predictions (by Qwen2-VL+LLaMA3 or GPT-4o+GPT-4o) can be downloaded here.

Action Score

We propose a new evaluation metric, named "action score", that focuses on whether a specific ground truth (GT) action is captured within the prediction.

The detailed evaluation code can be found in action_score/.

Audio Description (AD) Generation

Requirements

  • Basic Dependencies: python>=3.8, pytorch=2.1.2, transformers=4.46.0, Pillow, pandas, decord, opencv

  • For inference with open-source models, set the model cache path (for Qwen2-VL, LLaMA3, etc.) by modifying os.environ['TRANSFORMERS_CACHE'] = "/path/to/cache/" in stage1/main_qwen2vl.py and stage2/main_llama3.py.

  • For inference with the proprietary GPT-4o model, set the API key by modifying os.environ["OPENAI_API_KEY"] = <openai-api-key> in stage1/main_gpt4o.py and stage2/main_gpt4o.py (see the sketch below).
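
For reference, the environment setup at the top of those scripts looks roughly like the following (a minimal sketch; the cache directory and API key are placeholders for your own values):

import os

# Cache directory for open-source model weights (Qwen2-VL, LLaMA3, etc.),
# as set in stage1/main_qwen2vl.py and stage2/main_llama3.py
os.environ['TRANSFORMERS_CACHE'] = "/path/to/cache/"

# API key for the GPT-4o pipeline, as set in stage1/main_gpt4o.py and stage2/main_gpt4o.py
os.environ["OPENAI_API_KEY"] = "<openai-api-key>"  # replace with your own key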

Preprocessing

To structure the context frames according to shots and to recognise characters in each shot, please refer to the guidelines in preprocess/.
(This step can be skipped by directly referring to the pre-computed results of the form resources/annotations/{dataset}_anno_context-3.0-8.0_face-0.2-0.4.csv.)

Film Grammar Prediction

To predict the film grammar, including shot scales and thread structures, please follow the steps detailed in film_grammar/.
(This step can be skipped by directly referring to the pre-computed results of the form resources/annotations/{dataset}_anno_context-3.0-8.0_face-0.2-0.4_scale_thread.csv.)
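
If you use the pre-computed files, a quick way to inspect them is with pandas (a minimal sketch; the CMD-AD filename is taken from resources/annotations/, while the exact column names depend on the CSV schema):

import pandas as pd

# Load the pre-computed annotations with character recognition and
# film grammar (shot scale / thread structure) predictions for CMD-AD
anno = pd.read_csv("resources/annotations/cmdad_anno_context-3.0-8.0_face-0.2-0.4_scale_thread.csv")
print(anno.columns.tolist())  # inspect the available fields
print(anno.head())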

Inference

- Generating Dense Description by VLM (Stage I)
python stage1/main_qwen2vl.py \      # or stage1/main_gpt4o.py to run with GPT-4o
--dataset={dataset} \                # e.g., "cmdad"
--anno_path={anno_path} \            # e.g., "resources/annotations/cmdad_anno_context-3.0-8.0_face-0.2-0.4_scale_thread.csv"
--charbank_path={charbank_path} \    # e.g., "resources/charbanks/cmdad_charbank.json"
--video_dir={video_dir} \
--save_dir={save_dir} \
--font_path="resources/fonts/times.ttf" \
--shot_label

--dataset: choices are cmdad, madeval, and tvad.
--anno_path: path to AD annotations (with character recognition results and film grammar predictions), available in resources/annotations/.
--charbank_path: path to external character banks, available in resources/charbanks/.
--video_dir: directory of the video datasets; example file structures can be found in resources/example_file_structures (files are empty, for reference only).
--save_dir: directory to save the output csv.
--font_path: path to the font file used for shot labels (default: Times New Roman).
--shot_label: adds a shot number label at the top left of each frame (see the sketch below).
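
As an illustration of what --shot_label and --font_path control, the sketch below overlays a shot number at the top left of a frame using Pillow (this is not the repository's exact drawing code; the frame path, label position, and font size are assumptions):

from PIL import Image, ImageDraw, ImageFont

# Draw a shot number label at the top left of a single frame (illustrative only)
frame = Image.open("frame_0001.jpg")                        # placeholder frame path
font = ImageFont.truetype("resources/fonts/times.ttf", 32)  # font passed via --font_path
draw = ImageDraw.Draw(frame)
draw.text((10, 10), "Shot 1", fill="white", font=font)      # assumed position and label text
frame.save("frame_0001_labelled.jpg")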

- Generating AD Sentence by LLM (Stage II)
python stage2/main_llama3.py \  # or stage2/main_gpt4o.py to run with GPT-4o
--dataset={dataset} \           # e.g., "cmdad"
--mode={mode} \                 # e.g., "single"
--pred_path={pred_path} \       
--save_dir={save_dir} 

--dataset: choices are cmdad, madeval, and tvad.
--mode: single for a single AD output; assistant for five candidate AD outputs.
--pred_path: path to the csv file saved by Stage I.
--save_dir: directory to save the output csv.
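
For example, the two stages can be chained from Python via subprocess (a minimal sketch; the output directories and the Stage I output filename are placeholders and may differ from what the scripts actually write):

import subprocess

# Stage I: dense per-shot descriptions with Qwen2-VL (placeholder paths)
subprocess.run([
    "python", "stage1/main_qwen2vl.py",
    "--dataset=cmdad",
    "--anno_path=resources/annotations/cmdad_anno_context-3.0-8.0_face-0.2-0.4_scale_thread.csv",
    "--charbank_path=resources/charbanks/cmdad_charbank.json",
    "--video_dir=/path/to/cmdad/videos",
    "--save_dir=outputs/stage1",
    "--font_path=resources/fonts/times.ttf",
    "--shot_label",
], check=True)

# Stage II: turn the Stage I descriptions into a single AD sentence with LLaMA3
subprocess.run([
    "python", "stage2/main_llama3.py",
    "--dataset=cmdad",
    "--mode=single",
    "--pred_path=outputs/stage1/cmdad.csv",  # assumed Stage I output filename
    "--save_dir=outputs/stage2",
], check=True)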

Citation

If you find this repository helpful, please consider citing our work! 😊

@article{xie2025shotbyshot,
	title={Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation},
	author={Junyu Xie and Tengda Han and Max Bain and Arsha Nagrani and Eshika Khandelwal and G\"ul Varol and Weidi Xie and Andrew Zisserman},
	journal={arXiv preprint arXiv:2504.01020},
	year={2025}
}

References

Qwen2-VL: https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct
LLaMA3: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
GPT-4o: https://openai.com/api/
