BERT based Multiple Parallel Co-Attention Networks for Visual Question Answering

This repository contains the Keras implementation of the models trained for Visual Question Answering (VQA). The model architecture and other details are discussed in our paper BERT based Multiple Parallel Co-attention Model for Visual Question Answering.

[Figure: MCAN w/ BERT based Encoder architecture]

The models currently implemented for training and evaluation (scores are on the VQA v2.0 val split) are:

| Model | Yes/No | Number | Other | All |
|---|---|---|---|---|
| BERT based Hierarchical Co-Attention | 69.36 | 34.61 | 44.40 | 52.49 |
| BERT based HieCoAtt w/ Shared Co-Attention | 73.36 | 36.79 | 43.66 | 54.03 |
| BERT based Multiple Parallel Co-Attention | 76.44 | 37.24 | 48.15 | 57.84 |
| BERT based Multiple Parallel Co-Attention Large | 76.97 | 37.45 | 49.61 | 58.29 |
| BERT based HieAltCoAtt w/ Bottom-Up Image Features | 81.88 | 43.02 | 55.84 | 63.94 |
| MCAN w/ BERT based Encoder | 84.87 | 48.31 | 58.66 | 67.15 |

Our best model achieves an overall accuracy of 71.0% on the test-std split of the VQA v2.0 dataset as part of the VQA Challenge.
The performance on the test-dev split is reported as follows:

| Model | Yes/No | Number | Other | All |
|---|---|---|---|---|
| MCAN w/ BERT based Encoder | 86.96 | 52.87 | 60.53 | 70.55 |

Table of Contents

  1. Setup
  2. Training
  3. Evaluation
  4. Credit
  5. Citation

Setup

Hardware Requirements

You will need a machine with at least 1 GPU (minimum 4GB VRAM), 8-16GB RAM, and 100-250GB of free disk space. We recommend an SSD, especially for high-speed I/O during training.

Software Requirements

The implementation uses Python 3.10, CUDA 11.6, cuDNN 8.1, and TensorFlow 2.8.0 for training. Install the necessary packages using:

pip install -r requirements.txt
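
To verify that TensorFlow is installed correctly and can see your GPU, you can run a quick sanity check (a minimal sketch, not part of the repository):

import tensorflow as tf

# Should print 2.8.0 and list at least one GPU if CUDA and cuDNN are set up correctly.
print(tf.__version__)
print(tf.config.list_physical_devices("GPU"))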

Dataset

We use the VQA v2.0 dataset provided here for training and testing. Download all the question, annotation, and image files, then extract them into the data folder as shown:

|-- data
	|-- train
	|  |-- train2014
	|  |  |-- COCO_train2014_...jpg
	|  |  |-- ...
	|  |-- v2_OpenEnded_mscoco_train2014_questions.json
	|  |-- v2_mscoco_train2014_annotations.json
	|-- val
	|  |-- val2014
	|  |  |-- COCO_val2014_...jpg
	|  |  |-- ...
	|  |-- v2_OpenEnded_mscoco_val2014_questions.json
	|  |-- v2_mscoco_val2014_annotations.json
	|-- test
	|  |-- test2015
	|  |  |-- COCO_test2015_...jpg
	|  |  |-- ...
	|  |-- v2_OpenEnded_mscoco_test2015_questions.json
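
To confirm the layout before training, a short check script can help (a sketch assuming the default data/ directory above; run.py itself does not require it):

import os

# Paths taken from the directory layout above.
expected = [
    "data/train/train2014",
    "data/train/v2_OpenEnded_mscoco_train2014_questions.json",
    "data/train/v2_mscoco_train2014_annotations.json",
    "data/val/val2014",
    "data/val/v2_OpenEnded_mscoco_val2014_questions.json",
    "data/val/v2_mscoco_val2014_annotations.json",
    "data/test/test2015",
    "data/test/v2_OpenEnded_mscoco_test2015_questions.json",
]
missing = [p for p in expected if not os.path.exists(p)]
print("All dataset files found." if not missing else f"Missing: {missing}")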

For the questions and annotations required from the Visual Genome dataset, download the files from the links provided here, then extract them into the data folder as shown:

|-- data
    |-- vg
    |  |-- VG_annotations.json
    |  |-- VG_questions.json

For the bottom-up attention image features, download the files from here, then extract them into the data folder as shown:

|-- data
	|-- tsv
	|  |-- test2015
	|  |  |-- test2015_resnet101...tsv
	|  |-- trainval
	|  |  |-- karpathy_train_resnet101...tsv
	|  |  |-- ...
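
Each TSV row stores an image id, the image dimensions, the number of detected boxes, and base64-encoded box coordinates and features. A minimal reader sketch follows (the field names come from the original bottom-up-attention release; the file path is illustrative, so point it at one of your extracted TSV files):

import base64
import csv
import sys

import numpy as np

# The feature fields are very large, so raise the csv field size limit.
csv.field_size_limit(sys.maxsize)

# Field names from the bottom-up-attention TSV format.
FIELDNAMES = ["image_id", "image_w", "image_h", "num_boxes", "boxes", "features"]

# Illustrative path: replace with one of the extracted files under data/tsv/.
tsv_path = "data/tsv/trainval/karpathy_train_resnet101.tsv"

with open(tsv_path) as f:
    reader = csv.DictReader(f, delimiter="\t", fieldnames=FIELDNAMES)
    for row in reader:
        num_boxes = int(row["num_boxes"])
        # Decode the base64 float32 buffer into one feature vector per box.
        feats = np.frombuffer(base64.b64decode(row["features"]), dtype=np.float32)
        print(row["image_id"], feats.reshape(num_boxes, -1).shape)
        break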

Training

The following command starts training with the default parameters:

python run.py --RUN train --CONFIG bert_mcoatt

Training Parameters

See python run.py -h for more details. An example combining several of these flags is shown after the list.

  1. --RUN=str for the mode to run in. Either 'train', 'eval' or 'test'.
  2. --CONFIG=str loads the yaml config file used to build the model. See configs/ for the config files for the models given in our paper.
  3. --SPLIT=str for the data split on which the model is trained. Default is train.
  4. --NO_EVAL skips evaluation after training. Include this if you do not want to evaluate the model.
  5. --PRELOAD loads all image features directly into memory. Only do this if you have sufficient RAM.
  6. --SEED=int specifies the seed used by the random number generators during training.
  7. --VERSION=str for the model version to load, either to resume training or for evaluation. The version is based on the generated seed.
  8. --DATA_DIR=str for where the dataset and other files are stored. Default is data/.
  9. --OUTPUT_DIR=str for where the results are saved. Default is results/.
  10. --CHECKPOINT_DIR=str for where the model checkpoints are saved. Default is checkpoints/.
  11. --FEATURES_DIR=str for where the image features are stored. Default is data/{feature_type}.
  12. --START_EPOCH=int for the epoch to start training from. Useful for resuming training. Default is 0.
  13. --EPOCHS=int for how many epochs to train the model. Default is 10.
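
For example, to resume training a saved model version from epoch 5 with all image features preloaded into memory (the version string is a placeholder):

python run.py --RUN train --CONFIG bert_mcoatt --VERSION {str} --START_EPOCH 5 --PRELOAD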

Evaluation

The following command starts evaluation:

python run.py --RUN eval --CONFIG bert_mcoatt --VERSION {str} --START_EPOCH {int}

Validation only works on the VQA v2.0 val split. For test set evaluation, run the following command:

python run.py --RUN test --CONFIG bert_mcoatt --VERSION {str} --START_EPOCH {int}

The results file stored in results/bert_mcoatt_{version}_results.json can then be uploaded to Eval AI to get the scores on the test-dev and test-std splits.
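
The results file follows the standard VQA submission format: a JSON list of question id and answer pairs, for example (values are illustrative):

[
    {"question_id": 458752000, "answer": "yes"},
    {"question_id": 458752001, "answer": "2"}
]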

Credit

VQA Consortium for providing the VQA v2.0 dataset, as well as the API and evaluation code located at utils/vqaEvaluation and utils/vqaTools, available here and licensed under the MIT license.

Hierarchical Question-Image Co-Attention for Visual Question Answering for providing their code and implementation. You can see their paper here.

BERT (Bidirectional Encoder Representations from Transformers) for providing their pretrained language models. You can see their papers here and here.

Hugging Face Transformers library for providing the BERT implementation interface used in Keras/TensorFlow.

Deep Modular Co-Attention Networks (MCAN) for providing their code and implementation. You can see their paper here.

Bottom-Up Attention for providing the pretrained image features and the code for extracting them, as well as insights for improving performance during classification. You can see their papers here and here.

Citation

If this repository was useful for your work, it would be greatly appreciated if you could cite the following paper:

@INPROCEEDINGS{9788253,
    author={Dias, Mario and Aloj, Hansie and Ninan, Nijo and Koshti, Dipali},
    booktitle={2022 6th International Conference on Intelligent Computing and Control Systems (ICICCS)},
    title={BERT based Multiple Parallel Co-attention Model for Visual Question Answering},
    year={2022},
    pages={1531-1537},
    doi={10.1109/ICICCS53718.2022.9788253}
}
