This repo extends the attribution method from play-fair to explain action recognition models trained on the EPIC-KITCHENS-100 dataset. Such models fuse information from multiple frames within a video, through score aggregation or relational reasoning.
We fairly break down a model's class score into a sum of contributions from each frame. Our method adapts the Shapley value, an axiomatic solution to fair reward distribution in cooperative games, to elements in a variable-length sequence; we call this the Element Shapley Value (ESV). Critically, we propose a tractable approximation of the ESV that scales linearly with the number of frames in the sequence.
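To make the idea concrete, here is a minimal, self-contained sketch of an exact (exponential-time) Shapley computation over a toy 3-frame sequence. `model_score` is a stand-in for a real classifier's class score on a frame subset; everything here is illustrative and not part of the codebase:

```python
from itertools import combinations
from math import factorial

frames = [0, 1, 2]  # toy 3-frame "video"

def model_score(subset):
    """Stand-in for a model's class score on a frame subset.
    Here frame 2 is decisive and frames 0 and 1 help a little."""
    base = {0: 0.1, 1: 0.2, 2: 0.6}
    return sum(base[f] for f in subset)

def shapley_value(frame, frames):
    """Exact Shapley value: the frame's marginal contribution averaged
    over all subsets of the other frames, with the standard weighting."""
    others = [f for f in frames if f != frame]
    n = len(frames)
    value = 0.0
    for k in range(len(others) + 1):
        for subset in combinations(others, k):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            value += weight * (model_score(subset + (frame,)) - model_score(subset))
    return value

esvs = {f: shapley_value(f, frames) for f in frames}
print(esvs)                # per-frame contributions
print(sum(esvs.values()))  # equals model_score(tuple(frames)): the efficiency axiom
```

The exact computation enumerates every subset of the other frames, which is what makes it intractable for long sequences and why the linear-time approximation matters in practice.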
If you want to explore further, follow the setup guide below, extract features from the backbone models, and compute ESVs yourself.
You will always need to set your `PYTHONPATH` to include the `src` folder. We provide an `.envrc` for use with direnv, which will do this automatically when you `cd` into the project directory. Alternatively, just run:

```bash
$ export PYTHONPATH=$PWD/src
```

Create a conda environment from the environment file. You will also always need to activate this environment to use the correct packages:
```bash
$ conda env create -n epic-100 -f environment.yml
$ conda activate epic-100
```

Alternatively, add the activation to the `.envrc` file so it runs automatically (be careful that you have set the source of your conda installation either in your `.bashrc` or at the top of the `.envrc` file):

```bash
$ echo 'conda activate epic-100' | cat - .envrc > temp && mv temp .envrc
$ direnv allow
```

You will also need to install a version of ffmpeg with VP9 support; we suggest using the static builds provided by John Van Sickle:
```bash
$ wget "https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz"
$ tar -xvf "ffmpeg-git-amd64-static.tar.xz"
$ mkdir -p bin
$ mv ffmpeg-git-*-amd64-static/{ffmpeg,ffprobe} bin
```

We store our files in the gulpio2 format.
- Download the P01 frames from EPIC-KITCHENS-100. You can either do this manually or use the included script. The script uses the epic-downloader, which fetches the `.tar` files from the direct download link; download speeds for this may be (very) slow depending on your region, in which case we recommend downloading via Academic Torrents (find out more here).

  ```bash
  $ cd datasets
  $ bash ./download_p01_frames.sh
  ```
- Extract the frames. If you downloaded the frames using the script above, simply run:

  ```bash
  $ cd datasets
  $ bash ./extract_p01_frames.sh
  ```

  If you downloaded them externally, run the same script with the directory of the `/P01` folder as an argument:

  ```bash
  $ cd datasets
  $ bash ./extract_p01_frames.sh <path-to-downloaded-rgb-frames>/P01/
  ```

  Once extracted, check that they have all been extracted correctly:

  ```bash
  $ cd <path-to-rgb-frames>/P01
  $ tree -hd
  .
  ├── [3.9M] P01_01
  ├── [1.1M] P01_02
  ├── [272K] P01_03
  ├── [256K] P01_04
  ├── [3.0M] P01_05
  ├── [1.1M] P01_06
  ├── [412K] P01_07
  ├── [248K] P01_08
  ├── [8.2M] P01_09
  ├── [320K] P01_10
  ├── [3.6M] P01_101
  ├── [520K] P01_102
  ├── [364K] P01_103
  ├── [500K] P01_104
  ├── [4.1M] P01_105
  ├── [1.4M] P01_106
  ├── [252K] P01_107
  ├── [268K] P01_108
  ├── [10.0M] P01_109
  ├── [1.2M] P01_11
  ├── [456K] P01_12
  ├── [240K] P01_13
  ├── [3.2M] P01_14
  ├── [2.1M] P01_15
  ├── [460K] P01_16
  ├── [2.6M] P01_17
  ├── [8.4M] P01_18
  └── [1.1M] P01_19

  28 directories
  ```
- Gulp the dataset (we supply a labels pkl for P01 only; RGB P01 frames only):

  ```bash
  $ python src/scripts/gulp_data \
      /path/to/rgb/frames \
      datasets/epic-100/gulp/train \
      datasets/epic-100/labels/p01.pkl \
      rgb
  ```

  The first argument is the path to the RGB frames (e.g. `datasets/epic-100/frames`). If you need to write the gulp directory somewhere other than the path specified above, make sure to symlink it afterwards to `datasets/epic-100/gulp/train` so the configuration files don't need to be updated. You can sanity-check the result from Python as sketched after this list.

  ```bash
  $ ln -s /path/to/gulp/directory datasets/epic-100/gulp/train
  ```
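A minimal sketch for reading the gulped data back, assuming the gulpio2 `GulpDirectory` API (the segment id below is a hypothetical example; pick one that exists in your labels pkl):

```python
from gulpio2 import GulpDirectory

gulp_dir = GulpDirectory("datasets/epic-100/gulp/train")

# Indexing by segment id yields the frames and their metadata.
# "P01_101_0" is a hypothetical id; use one from your labels pkl.
frames, meta = gulp_dir["P01_101_0"]
print(len(frames), meta)
```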
We provide a TRN model pretrained on the training set of EPIC-KITCHENS-100:

```bash
$ cd checkpoints
$ bash ./download_checkpoints.sh
```

Check that it has downloaded:

```bash
$ tree -h
.
├── [ 150] download.sh
└── [103M] trn_rgb.ckpt

0 directories, 2 files
```

As computing ESVs is expensive, we work with temporal models that operate over features. These can be run in a reasonable amount of time, depending on the number of frames and whether approximate methods are used.
We provide a script to extract per-frame features and save them to a PKL file. Extract the features using the TRN checkpoint:

```bash
$ python src/scripts/extract_features.py \
    /path/to/gulp/directory \
    checkpoints/trn_rgb.ckpt \
    datasets/epic-100/features/p01_features.pkl
```

Optionally, you can change the number of workers for the PyTorch DataLoader with the `--num_workers` argument. If this script fails at any point, simply rerun it and it will continue from where it crashed.
For each input length from 1 to 8 frames, we train an MLP classifier that passes the concatenated frame features through two fully connected layers to produce predictions for the verb/noun classes.
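As an illustration, each fixed-length classifier might look roughly like the following PyTorch sketch; the feature dimension, hidden size, and class count below are made-up placeholders, not the repo's actual hyperparameters:

```python
import torch
import torch.nn as nn

class FrameFeatureMLP(nn.Module):
    """Two fully connected layers over concatenated per-frame features.
    All dimensions are illustrative placeholders, not the repo's settings."""

    def __init__(self, n_frames: int, feature_dim: int = 256,
                 hidden_dim: int = 512, n_classes: int = 97):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(n_frames * feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_classes),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, n_frames, feature_dim) -> concatenate along time
        return self.classifier(features.flatten(start_dim=1))

# e.g. a 4-frame classifier over a batch of 2 clips
model = FrameFeatureMLP(n_frames=4)
scores = model(torch.randn(2, 4, 256))
print(scores.shape)  # torch.Size([2, 97])
```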
We use TensorBoard to log extensive training results, which you can view during training or at any point afterwards:

```bash
$ tensorboard --logdir=datasets/epic-100/runs --reload_multifile True
```

In this example we train the verbs and the nouns separately, although the framework also supports training both in the same network. With the selected learning rate of 3e-4 and batch size of 512, the best testing accuracy was observed when training for 200 epochs. You can run the scripts one by one:
```bash
$ python src/scripts/train_mtrn.py \
    datasets/epic-100/features/p01_features.pkl \
    datasets/epic-100/models/ \
    --type "verb"
$ python src/scripts/train_mtrn.py \
    datasets/epic-100/features/p01_features.pkl \
    datasets/epic-100/models/ \
    --type "noun"
```

Alternatively, we provide a script that runs this automatically for you, with an argument for the number of epochs:
```bash
$ bash ./train_verb_noun.sh 200
```

| Argument | Description | Default |
|---|---|---|
| `--val_features_pkl` | Train/test on two distinct frame feature sets rather than using a train/test split | None |
| `--train-test-split` | Specify a train/test split between 0 and 1 | 0.3 |
| `--min-frames` | Minimum number of frames to train models for | 1 |
| `--max-frames` | Maximum number of frames to train models for (together with `--min-frames`, this is also useful for resuming after a training crash) | 8 |
| `--epoch` | How many iterations to train the models for | 200 |
| `--batch-size` | The size of the mini-batches fed to the model at a time | 512 |
We compute ESVs using a collection of models, each of which operates over a fixed-length input (e.g. TRN).
Regardless of whether your model supports variable-length inputs, we need to compute the class priors used in the ESV computation. We provide a script that does this by computing the empirical class frequency for both verbs and nouns in P01:
```bash
$ python src/scripts/compute_verb_class_priors.py \
    datasets/epic-100/labels/p01.pkl \
    datasets/epic-100/labels/verb_class_priors.csv
$ python src/scripts/compute_noun_class_priors.py \
    datasets/epic-100/labels/p01.pkl \
    datasets/epic-100/labels/noun_class_priors.csv
```
For models that don't support a variable-length input, we propose a way of ensembling a collection of fixed-length input models into a new meta-model for which we can compute ESVs. To make this explanation more concrete, we now describe the process in detail for TRN.

To start with, we train multiple TRN models for 1, 2, ..., n frames separately. By training these models separately we ensure that they are capable of acting alone (this also has the nice benefit of improving performance over joint training, in our experience!). At inference time, we compute all possible subsampled variants of the input video we wish to classify and pass each of them through the corresponding single-scale model. We aggregate the verb/noun scores so that each scale is given equal weighting in the final result.
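A minimal sketch of this multi-scale ensembling, with hypothetical names (`models` maps each input length k to a k-frame classifier; any callable from `(1, k * feature_dim)` to `(1, n_classes)` works):

```python
from itertools import combinations
import torch

def meta_model_score(frame_feats: torch.Tensor, models: dict) -> torch.Tensor:
    """Ensemble fixed-length models into a variable-length meta-model.

    frame_feats: (n_frames, feature_dim) features of one video.
    models: hypothetical mapping from input length k to a k-frame classifier.
    Each scale averages the scores of all length-k subsampled variants,
    then the scales are averaged so each gets equal weight.
    """
    n = frame_feats.shape[0]
    per_scale_scores = []
    for k, model in models.items():
        if k > n:
            continue
        subset_scores = [
            model(frame_feats[list(idxs)].flatten().unsqueeze(0))
            for idxs in combinations(range(n), k)
        ]
        per_scale_scores.append(torch.stack(subset_scores).mean(dim=0))
    return torch.stack(per_scale_scores).mean(dim=0)

# Hypothetical per-scale classifiers standing in for the trained MLPs.
feature_dim, n_classes = 256, 97
models = {k: torch.nn.Linear(k * feature_dim, n_classes) for k in range(1, 5)}
print(meta_model_score(torch.randn(6, feature_dim), models).shape)  # (1, 97)
```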
This is implemented in the OnlineShapleyAttributor class taken from play-fair.
We provide an example of how to do this for TRN, as the basic variant only supports a fixed-length input. (Make sure you have set up your environment, downloaded and prepared the dataset, downloaded the models, extracted the features, and trained the 1..n verb and noun models first.)
```bash
$ python src/scripts/compute_esvs.py \
    datasets/epic-100/features/p01_features.pkl \
    datasets/epic-100/models/ \
    datasets/epic-100/labels/verb_class_priors.csv \
    datasets/epic-100/labels/noun_class_priors.csv \
    datasets/epic-100/esvs/mtrn-esv-n_frames=8.pkl \
    --sample_n_frames 8
```

The second argument is the path where all the models, for all frame counts, for both verbs and nouns, are located.

We provide a dashboard to investigate model behaviour as we vary how many frames are fed to the model. The dashboard is powered by multiple sets of results produced by the `compute_esvs.py` script.
First, compute ESVs for 1-8 frame inputs:

```bash
$ for n in $(seq 1 8); do
    python src/scripts/compute_esvs.py \
        datasets/epic-100/features/p01_features.pkl \
        datasets/epic-100/models/ \
        datasets/epic-100/labels/verb_class_priors.csv \
        datasets/epic-100/labels/noun_class_priors.csv \
        datasets/epic-100/esvs/mtrn-esv-n_frames=$n.pkl \
        --sample_n_frames $n
done
```

Then we can collate them:
```bash
$ python src/scripts/collate_esvs.py \
    --dataset "Epic Kitchens 100 (P01)" \
    --model "MTRN" \
    datasets/epic-100/esvs/mtrn-esv-n_frames={1..8}.pkl \
    datasets/epic-100/esvs/mtrn-esv-min_frames=1-max_frames=8.pkl
```

Before we can run the dashboard, we need to dump the videos from the gulp directory as WebM files (gulping the files alters the FPS). If you replace `./bin/ffmpeg` with `ffmpeg`, watch out that you don't end up using the conda-bundled ffmpeg, which doesn't support VP9 encoding; check which one you are using by running `which ffmpeg`.
```bash
$ mkdir datasets/epic-100/video_frames
$ python src/scripts/dump_frames_from_gulp_dir.py \
    datasets/epic-100/gulp/train \
    datasets/epic-100/video_frames
$ for dir in datasets/epic-100/video_frames/*; do
    if [[ -f "$dir/frame_000000.jpg" && ! -f "$dir.webm" ]]; then
        ./bin/ffmpeg \
            -r 8 \
            -i "$dir/frame_%06d.jpg" \
            -c:v vp9 \
            -row-mt 1 \
            -auto-alt-ref 1 \
            -speed 4 \
            -b:v 200k \
            "$dir.webm" -y
    fi
done
```
```bash
$ mkdir datasets/epic-100/video_frames/videos
$ mv datasets/epic-100/video_frames/*.webm datasets/epic-100/video_frames/videos
```

While play-fair for Something-Something-v2 only predicts a single class label, we predict a verb and a noun label separately. To make the dashboard easier to use, we have to extract action sequence instances for all verb/noun combinations:
```bash
$ python src/scripts/extract_vert_noun_links.py \
    datasets/epic-100/gulp/train \
    datasets/epic-100/labels/verb_noun.pkl \
    datasets/epic-100/EPIC_100_verb_classes.csv \
    datasets/epic-100/EPIC_100_noun_classes.csv
$ python src/scripts/extract_vert_noun_links.py \
    datasets/epic-100/gulp/train \
    datasets/epic-100/labels/verb_noun_classes.pkl \
    datasets/epic-100/EPIC_100_verb_classes.csv \
    datasets/epic-100/EPIC_100_noun_classes.csv \
    --classes True
$ python src/scripts/extract_vert_noun_links.py \
    datasets/epic-100/gulp/train \
    datasets/epic-100/labels/verb_noun_classes_narration.pkl \
    datasets/epic-100/EPIC_100_verb_classes.csv \
    datasets/epic-100/EPIC_100_noun_classes.csv \
    --classes True \
    --narration-id True
```
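Conceptually, these links are just groupings of segment ids by verb/noun pair. A rough sketch of the idea, assuming a pandas DataFrame of labels with `verb_class` and `noun_class` columns and a `narration_id` index (as in the EPIC-KITCHENS-100 annotations; the actual script may differ):

```python
import pandas as pd

# Assumes EPIC-KITCHENS-100-style annotations with verb_class/noun_class
# columns; the real extract_vert_noun_links.py script may work differently.
labels = pd.read_pickle("datasets/epic-100/labels/p01.pkl")
verb_noun_links = (
    labels.reset_index()
    .groupby(["verb_class", "noun_class"])["narration_id"]
    .apply(list)
)
print(verb_noun_links.head())
```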
Now we can run the dashboard:

```bash
$ python src/apps/esv_dashboard/visualise_esvs.py \
    mtrn-esv-min_n_frames\=1-max_n_frames\=8-epoch=200.pkl \
    datasets/epic-100/video_frames \
    datasets/epic/labels/EPIC_100_verb_classes.csv \
    datasets/epic/labels/EPIC_100_noun_classes.csv \
    datasets/epic/labels/
```

Alternatively, if you trained for a different number of epochs or dumped the video frames to a different directory, you can run the dashboard using the script:

```bash
$ bash ./dashboard.sh <epochs> <path-to-dumped-video-frames>
```