A Comparative Evaluation of Transformer-Based Vision Encoder-Decoder Models for Brazilian Portuguese Image Captioning
Gabriel Bromonschenkel, HilΓ‘rio Oliveira, Thiago M. PaixΓ£o
The following resources are publicly available:

- 💾 Flickr30K Portuguese dataset (translated with the Google Translate API)
- 🥇 Swin-DistilBERTimbau (1st-place model on Flickr30K Portuguese)
- 🥈 Swin-GPorTuguese-2 (2nd-place model on Flickr30K Portuguese)

Alternatively, access our publicly available collection.
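As a quick sanity check, a released checkpoint can be loaded directly through the `transformers` vision encoder-decoder API. The snippet below is only a sketch: the Hub repository id and the image path are placeholders, not confirmed names.

```python
# Minimal captioning sketch. NOTE: the Hub id below is a placeholder for one of the
# released checkpoints (e.g. Swin-DistilBERTimbau); replace it with the actual repository name.
from PIL import Image
from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel

model_id = "your-org/swin-distilbertimbau"  # hypothetical repository id

model = VisionEncoderDecoderModel.from_pretrained(model_id)
image_processor = AutoImageProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("example.jpg").convert("RGB")  # any local test image
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

output_ids = model.generate(pixel_values, max_length=25, num_beams=1)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```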
To set up the environment and run the training and evaluation pipelines:

```bash
$ chmod +x setup.sh
$ ./setup.sh
$ python train.py
$ python eval.py
```
```yaml
config:
  encoder: "deit-base-224"
  decoder: "gpt2-small"
  dataset: "pracegover_63k"
  max_length: 25
  batch_size: 16
  evaluate_from_model: False
  turn_off_computer: False

generate_args:
  num_beams: 1
  no_repeat_ngram_size: 0
  early_stopping: False

training_args:
  predict_with_generate: True
  num_train_epochs: 1
  eval_steps: 200
  logging_steps: 200
  per_device_train_batch_size: 16
  per_device_eval_batch_size: 16
  learning_rate: 5.6e-5
  weight_decay: 0.01
  save_total_limit: 1
  logging_strategy: "epoch"
  evaluation_strategy: "epoch"
  save_strategy: "epoch"
  load_best_model_at_end: True
  metric_for_best_model: "rougeL"
  greater_is_better: True
  push_to_hub: False
  fp16: True

callbacks:
  early_stopping:
    patience: 1
    threshold: 0.0

encoder: # available options to select in config.encoder at the top of this document
  vit-base-224: "google/vit-base-patch16-224"
  vit-base-224-21k: "google/vit-base-patch16-224-in21k"
  vit-base-384: "google/vit-base-patch16-384"
  vit-large-384: "google/vit-large-patch16-384"
  vit-huge-224-21k: "google/vit-huge-patch14-224-in21k"
  swin-base-224: "microsoft/swin-base-patch4-window7-224"
  swin-base-224-22k: "microsoft/swin-base-patch4-window7-224-in22k"
  swin-base-384: "microsoft/swin-base-patch4-window12-384"
  swin-large-384-22k: "microsoft/swin-large-patch4-window12-384-in22k"
  beit-base-224: "microsoft/beit-base-patch16-224"
  beit-base-224-22k: "microsoft/beit-base-patch16-224-pt22k"
  beit-large-224-22k: "microsoft/beit-large-patch16-224-pt22k-ft22k"
  beit-large-512: "microsoft/beit-large-patch16-512"
  beit-large-640: "microsoft/beit-large-finetuned-ade-640-640"
  deit-base-224: "facebook/deit-base-patch16-224"
  deit-base-distil-224: "facebook/deit-base-distilled-patch16-224"
  deit-base-384: "facebook/deit-base-patch16-384"
  deit-base-distil-384: "facebook/deit-base-distilled-patch16-384"

decoder: # available options to select in config.decoder at the top of this document
  bert-base: "neuralmind/bert-base-portuguese-cased"
  bert-large: "neuralmind/bert-large-portuguese-cased"
  roberta-small: "josu/roberta-pt-br"
  distilbert-base: "adalbertojunior/distilbert-portuguese-cased"
  gpt2-small: "pierreguillou/gpt2-small-portuguese"
  bart-base: "adalbertojunior/bart-base-portuguese"
```
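For reference, the sketch below illustrates how the short encoder/decoder names above can be resolved into a `VisionEncoderDecoderModel`, and how the `training_args` block maps onto `Seq2SeqTrainingArguments`. It is an illustrative sketch (assuming the YAML above is saved as `config.yaml`), not the project's exact `train.py` / `src/utils/config.py` implementation.

```python
# Illustrative sketch only; the project's own config loading may differ.
import yaml
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    Seq2SeqTrainingArguments,
    VisionEncoderDecoderModel,
)

with open("config.yaml") as f:  # assumed file name for the YAML shown above
    cfg = yaml.safe_load(f)

# Map the short names selected in config.encoder / config.decoder to Hub checkpoints.
encoder_id = cfg["encoder"][cfg["config"]["encoder"]]  # e.g. "facebook/deit-base-patch16-224"
decoder_id = cfg["decoder"][cfg["config"]["decoder"]]  # e.g. "pierreguillou/gpt2-small-portuguese"

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(encoder_id, decoder_id)
image_processor = AutoImageProcessor.from_pretrained(encoder_id)
tokenizer = AutoTokenizer.from_pretrained(decoder_id)

# GPT-2 style decoders have no pad token; reuse EOS so batching and generation work.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.cls_token_id or tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id or tokenizer.eos_token_id

# The training_args block maps directly onto Seq2SeqTrainingArguments fields.
# Note: on recent transformers releases, evaluation_strategy is named eval_strategy.
training_args = Seq2SeqTrainingArguments(
    output_dir="models/deit-base-224-gpt2-small",  # assumed output location
    **cfg["training_args"],
)
```

With the resulting `model`, `tokenizer`, and `training_args`, a standard `Seq2SeqTrainer` setup (with `predict_with_generate=True`) covers both the training and evaluation loops.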
```
├── README.md            <- The top-level README for developers using this project.
├── data
│   └── pracegover_63k   <- The #PraCegoVer 63k dataset.
│       ├── test.hf          <- Test split.
│       ├── train.hf         <- Training split.
│       └── validation.hf    <- Validation split.
│
├── docs                 <- Default HTML documentation.
│
├── models               <- Trained models and their artifacts are saved here.
│
├── requirements.txt     <- The requirements file for reproducing the training and evaluation pipelines.
│
└── src                  <- Source code for use in this project.
    ├── __init__.py      <- Makes src a Python module.
    │
    ├── utils            <- Modules for configuration, split processing, and evaluation metrics (sketched below).
    │   ├── config.py
    │   ├── data_processing.py
    │   └── metrics.py
    │
    ├── eval.py
    └── train.py
```
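Because `metric_for_best_model` is set to `rougeL`, model selection relies on a ROUGE-L score computed over the generated captions. A possible `compute_metrics` function in the spirit of `src/utils/metrics.py`, using the `evaluate` library (the actual implementation may differ), is sketched below.

```python
# Sketch of a ROUGE-based compute_metrics for Seq2SeqTrainer; illustrative only.
import evaluate
import numpy as np

rouge = evaluate.load("rouge")

def build_compute_metrics(tokenizer):
    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        # Replace the -100 padding used for loss masking before decoding.
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
        scores = rouge.compute(predictions=decoded_preds, references=decoded_labels)
        return {"rougeL": scores["rougeL"]}  # key matches metric_for_best_model
    return compute_metrics
```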
```bibtex
@inproceedings{bromonschenkel2024comparative,
  title={A Comparative Evaluation of Transformer-Based Vision Encoder-Decoder Models for Brazilian Portuguese Image Captioning},
  author={Bromonschenkel, Gabriel and Oliveira, Hil{\'a}rio and Paix{\~a}o, Thiago M.},
  booktitle={2024 37th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI)},
  pages={1--6},
  year={2024},
  organization={IEEE}
}
```