Query-based video summarization aims to produce a customized video summary conditioned on a semantic query provided by a user. CLIP-It formulates this task as a per-frame binary classification problem. However, we believe this formulation prevents architectures from giving appropriate consideration to the temporal context of the frames being scored, since keyframes are still images at a single timestep. Furthermore, a desirable video summary should avoid unnecessarily frequent jump cuts between non-contiguous keyframes: rapidly cycling through highly diverse images can leave viewers disoriented and unable to fully comprehend the summary. Contrastive learning has proven successful in domains with little labeled data, and TCLR demonstrated the effectiveness of contrastively learned video representations that account for the temporal characteristics unique to video data. We seek to leverage these learned representations to improve performance on the query-based video summarization task.
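To make the per-frame formulation concrete, here is a minimal sketch that scores each frame independently against a query embedding with a binary classifier. It is an illustration only, not CLIP-It's actual architecture; the embedding size, classifier, and 0.5 threshold are placeholder assumptions.

# Illustration only (not CLIP-It's architecture): per-frame binary
# classification over precomputed frame and query embeddings.
import torch
import torch.nn as nn

embed_dim = 512                                   # placeholder embedding size
frame_emb = torch.randn(300, embed_dim)           # one embedding per video frame
query_emb = torch.randn(embed_dim)                # embedding of the user's text query

classifier = nn.Linear(2 * embed_dim, 1)          # scores each frame independently
scores = torch.sigmoid(
    classifier(torch.cat([frame_emb, query_emb.expand_as(frame_emb)], dim=-1))
).squeeze(-1)
keyframes = (scores > 0.5).nonzero().squeeze(-1)  # frames kept in the summary

Because every frame is scored in isolation, nothing in this formulation encourages temporally coherent selections, which is the limitation that motivates using TCLR representations.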
- Set up your environment.
conda env create -f derek_tclr_env.yml
conda activate derek_tclr
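To confirm the environment activated correctly, a quick sanity check is to verify that PyTorch imports (the TCLR code saves its weights as .pth files, so PyTorch is assumed to be part of the environment):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"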
- Preparation: Download the TVSum and SumMe datasets. Define your data lists (text files containing the absolute paths of all videos used for training/testing) and store them in the data/splits folder; see the example below.
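For example, a data list such as data/splits/augmented_tvsum_80.txt (referenced in the pretraining command below) presumably contains one absolute video path per line; the paths shown here are placeholders, and the exact format should match what sumtclr_train_gen_all_step.py expects:

/absolute/path/to/tvsum/videos/video_1.mp4
/absolute/path/to/tvsum/videos/video_2.mp4
/absolute/path/to/tvsum/videos/video_3.mp4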
- Pretraining: Generate TCLR embeddings.
cd TCLR
python sumtclr_train_gen_all_step.py --run_id '[RUN_ID]' --num_epochs [NUMBER_OF_EPOCHS] --num_dataloader_workers [NUMBER_OF_WORKERS] --data_list ../data/splits/augmented_tvsum_80.txt --batch_size=8 | tee tclr_stdout.log
tclr_stdout.log will indicate where the trained model weights are stored; they should be saved as a .pth file.
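If you want to verify the checkpoint afterwards, a minimal check is sketched below; the path is a placeholder, so substitute the .pth location reported in tclr_stdout.log.

# Optional sanity check on the saved TCLR weights (placeholder path).
import torch

ckpt = torch.load("path/to/tclr_checkpoint.pth", map_location="cpu")
# Depending on how the weights were saved, this is either a state_dict or a
# dict wrapping one; printing the top-level keys shows which.
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])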
- Summarization: Train the Summarizer on TCLR embeddings.
cd Summarizer
python main.py --verbose | tee summarizer_training.log
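Once training finishes, turning the summarizer's per-frame scores into an actual summary should avoid the rapid jump cuts discussed above. The sketch below is not part of the Summarizer code; it only illustrates one way to group selected frames into contiguous segments, with random scores, an arbitrary threshold, and an arbitrary minimum length standing in for real values.

# Illustrative post-processing (not part of the Summarizer code): group
# selected frames into contiguous segments so the summary avoids rapid
# jump cuts between isolated keyframes.
import numpy as np

scores = np.random.rand(300)            # stand-in for per-frame scores from a trained model
selected = scores > 0.5                 # arbitrary threshold

segments, start = [], None
for i, keep in enumerate(selected):
    if keep and start is None:
        start = i                       # a new contiguous segment begins
    elif not keep and start is not None:
        segments.append((start, i - 1)) # the segment ended at the previous frame
        start = None
if start is not None:
    segments.append((start, len(selected) - 1))

# Drop very short segments (e.g. under 15 frames) that would read as jump cuts.
segments = [(s, e) for s, e in segments if e - s + 1 >= 15]
print(segments)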