ZurichNLP/shot-by-shot

Adaptations for German/Italian movies

Modifications in this fork:

  • changed Stage II prompting to reflect the SWISS TXT guidelines
  • added the character recognition module from AutoAD Zero
  • changed models:

For more information, see the original README below.

Shot🎞️-by-Shot🎞️: Film-Grammar-Aware Training-Free Audio Description Generation

Junyu Xie1, Tengda Han1, Max Bain1, Arsha Nagrani1, Eshika Khandelwal2 3, Gül Varol1 3, Weidi Xie1 4, Andrew Zisserman1

1 Visual Geometry Group, Department of Engineering Science, University of Oxford
2 CVIT, IIIT Hyderabad
3 LIGM, École des Ponts, Univ Gustave Eiffel, CNRS
4 CMIC, Shanghai Jiao Tong University

Project page

Datasets and Results

In this work, we evaluate our model on common AD benchmarks including CMD-AD, MAD-Eval, and TV-AD.

Video Frames

  • CMD-AD can be downloaded here.
  • MAD-Eval can be downloaded here.
  • TV-AD can be downloaded following instructions here.

Ground Truth AD Annotations

  • All annotations can be found in resources/annotations/.

Predicted ADs

  • The AD predictions (by Qwen2-VL+LLaMA3 or GPT-4o+GPT-4o) can be downloaded here.

Action Score

We propose a new evaluation metric, named "action score", that focuses on whether a specific ground truth (GT) action is captured within the prediction.

The detailed evaluation code can be found in action_score/.

Audio Description (AD) Generation

Requirements

  • Basic Dependencies: python>=3.8, pytorch=2.1.2, transformers=4.46.0, Pillow, pandas, decord, opencv

  • For inference with open-source models, set the model cache path (for Qwen2-VL, LLaMA3, etc.) by modifying os.environ['TRANSFORMERS_CACHE'] = "/path/to/cache/" in stage1/main_qwen2vl.py and stage2/main_llama3.py.

  • For inference with the proprietary GPT-4o model, set the API key by modifying os.environ["OPENAI_API_KEY"] = <openai-api-key> in stage1/main_gpt4o.py and stage2/main_gpt4o.py (see the sketch below).
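
For reference, the environment setup at the top of those scripts looks roughly like the following (a minimal sketch; the cache directory and API key are placeholders for your own values):

import os

# Cache directory for open-source model weights (Qwen2-VL, LLaMA3, etc.),
# as set in stage1/main_qwen2vl.py and stage2/main_llama3.py
os.environ['TRANSFORMERS_CACHE'] = "/path/to/cache/"

# API key for the GPT-4o pipeline, as set in stage1/main_gpt4o.py and stage2/main_gpt4o.py
os.environ["OPENAI_API_KEY"] = "<openai-api-key>"  # replace with your own key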

Preprocessing

To structure the context frames according to shots and to recognise characters in each shot, please refer to the guidelines in preprocess/.
(This step can be skipped by directly referring to the pre-computed results of the form resources/annotations/{dataset}_anno_context-3.0-8.0_face-0.2-0.4.csv.)

Film Grammar Prediction

To predict the film grammar, including shot scales and thread structures, please follow the steps detailed in film_grammar/.
(This step can be skipped by directly referring to the pre-computed results of the form resources/annotations/{dataset}_anno_context-3.0-8.0_face-0.2-0.4_scale_thread.csv.)
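
If you use the pre-computed files, a quick way to inspect them is with pandas (a minimal sketch; the CMD-AD filename is taken from resources/annotations/, while the exact column names depend on the CSV schema):

import pandas as pd

# Load the pre-computed annotations with character recognition and
# film grammar (shot scale / thread structure) predictions for CMD-AD
anno = pd.read_csv("resources/annotations/cmdad_anno_context-3.0-8.0_face-0.2-0.4_scale_thread.csv")
print(anno.columns.tolist())  # inspect the available fields
print(anno.head())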

Inference

- Generating Dense Description by VLM (Stage I)
python stage1/main_qwen2vl.py \      # or stage1/main_gpt4o.py to run with GPT-4o
--dataset={dataset} \                # e.g., "cmdad"
--anno_path={anno_path} \            # e.g., "resources/annotations/cmdad_anno_context-3.0-8.0_face-0.2-0.4_scale_thread.csv"
--charbank_path={charbank_path} \    # e.g., "resources/charbanks/cmdad_charbank.json"
--video_dir={video_dir} \
--save_dir={save_dir} \
--font_path="resources/fonts/times.ttf" \
--shot_label

--dataset: choices are cmdad, madeval, and tvad.
--anno_path: path to AD annotations (with character recognition results and film grammar predictions), available in resources/annotations/.
--charbank_path: path to external character banks, available in resources/charbanks/.
--video_dir: directory of the video datasets; example file structures can be found in resources/example_file_structures (files are empty, for reference only).
--save_dir: directory to save the output csv.
--font_path: path to the font file used for shot labels (default: Times New Roman).
--shot_label: adds a shot number label at the top left of each frame (see the sketch below).
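
As an illustration of what --shot_label and --font_path control, the sketch below overlays a shot number at the top left of a frame using Pillow (this is not the repository's exact drawing code; the frame path, label position, and font size are assumptions):

from PIL import Image, ImageDraw, ImageFont

# Draw a shot number label at the top left of a single frame (illustrative only)
frame = Image.open("frame_0001.jpg")                        # placeholder frame path
font = ImageFont.truetype("resources/fonts/times.ttf", 32)  # font passed via --font_path
draw = ImageDraw.Draw(frame)
draw.text((10, 10), "Shot 1", fill="white", font=font)      # assumed position and label text
frame.save("frame_0001_labelled.jpg")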

- Generating AD Sentence by LLM (Stage II)
python stage2/main_llama3.py \  # or stage2/main_gpt4o.py to run with GPT-4o
--dataset={dataset} \           # e.g., "cmdad"
--mode={mode} \                 # e.g., "single"
--pred_path={pred_path} \       
--save_dir={save_dir} 

--dataset: choices are cmdad, madeval, and tvad.
--mode: single for a single AD output; assistant for five candidate AD outputs.
--pred_path: path to the csv file saved by Stage I.
--save_dir: directory to save the output csv.
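
For example, the two stages can be chained from Python via subprocess (a minimal sketch; the output directories and the Stage I output filename are placeholders and may differ from what the scripts actually write):

import subprocess

# Stage I: dense per-shot descriptions with Qwen2-VL (placeholder paths)
subprocess.run([
    "python", "stage1/main_qwen2vl.py",
    "--dataset=cmdad",
    "--anno_path=resources/annotations/cmdad_anno_context-3.0-8.0_face-0.2-0.4_scale_thread.csv",
    "--charbank_path=resources/charbanks/cmdad_charbank.json",
    "--video_dir=/path/to/cmdad/videos",
    "--save_dir=outputs/stage1",
    "--font_path=resources/fonts/times.ttf",
    "--shot_label",
], check=True)

# Stage II: turn the Stage I descriptions into a single AD sentence with LLaMA3
subprocess.run([
    "python", "stage2/main_llama3.py",
    "--dataset=cmdad",
    "--mode=single",
    "--pred_path=outputs/stage1/cmdad.csv",  # assumed Stage I output filename
    "--save_dir=outputs/stage2",
], check=True)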

Citation

If you find this repository helpful, please consider citing our work! 😊

@article{xie2025shotbyshot,
	title={Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation},
	author={Junyu Xie and Tengda Han and Max Bain and Arsha Nagrani and Eshika Khandelwal and G\"ul Varol and Weidi Xie and Andrew Zisserman},
	journal={arXiv preprint arXiv:2504.01020},
	year={2025}
}

References

Qwen2-VL: https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct
LLaMA3: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
GPT-4o: https://openai.com/api/
