# ViInfographicVQA: A Benchmark for Single and Multi-image Visual Question Answering on Vietnamese Infographics
This repository contains the official implementation and evaluation code for the paper "ViInfographicVQA: A Benchmark for Single and Multi-image Visual Question Answering on Vietnamese Infographics".
## News

- [2025/12] 🔥 Our paper has been accepted to the AI4Research Workshop @ AAAI 2026!
- [2025/12] We released the ViInfographicVQA dataset and the evaluation code.
## Overview

ViInfographicVQA is the first large-scale benchmark for Vietnamese infographic VQA, comprising 6,747 infographics and 20,409 QA pairs. It evaluates models on two distinct tasks:
- Single-image QA: Traditional VQA requiring layout understanding and OCR.
- Multi-image QA: A novel task requiring cross-image reasoning, evidence synthesis, and arithmetic operations across multiple semantically related infographics.
## Repository Structure

```
ViInfographicVQA/
├── data/              # Dataset placeholders
├── ft-vlm/            # Fine-tuning module configuration
├── results/           # Evaluation output
├── src/               # Main source code
│   ├── common/        # Shared utilities and metrics (ANLS, Accuracy)
│   └── inference/     # Inference engines
│       ├── single/    # Single-image benchmarks
│       └── multi/     # Multi-image benchmarks
├── config.py          # Centralized path configuration
├── pyproject.toml     # Project dependencies & build config
└── uv.lock            # Dependency lock file (ensures reproducibility)
```
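For orientation, `config.py` centralizes path handling. A minimal sketch of what such a module might look like, assuming it simply reads the `VQA_*` environment variables described in the setup section (the repository's actual defaults may differ):

```python
import os
from pathlib import Path

# Hypothetical sketch of centralized path configuration (config.py).
# The VQA_* variable names come from the setup section; defaults here
# are illustrative, not the repository's actual values.
DATA_DIR = Path(os.environ.get("VQA_DATA_DIR", "./data"))
IMAGES_DIR = Path(os.environ.get("VQA_IMAGES_DIR", str(DATA_DIR / "images")))
OUTPUT_DIR = Path(os.environ.get("VQA_OUTPUT_DIR", "./results"))

TRAIN_JSON = DATA_DIR / "train.json"
TEST_JSON = DATA_DIR / "test.json"
```

Centralizing paths this way lets every inference and evaluation script import the same locations instead of hard-coding them.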
## Installation

We use [uv](https://github.com/astral-sh/uv) for extremely fast dependency management. See its repository for installation instructions.
- Clone the repository:

```bash
git clone https://github.com/duongtruongbinh/ViInfographicVQA.git
cd ViInfographicVQA
```

- Sync the environment:
This command creates a virtual environment and installs all dependencies (including `torch`, `transformers`, and `flash-attn`-compatible packages) defined in `uv.lock`.
```bash
uv sync
```

Optional: to use Flash Attention 2 (recommended for speed):
```bash
uv pip install flash-attn --no-build-isolation
```

- Set up environment variables:
Create a `.env` file in the root directory:

```bash
export VQA_DATA_DIR="./data"
export VQA_IMAGES_DIR="/path/to/your/images"
export VQA_OUTPUT_DIR="./results"

# Optional: point to local model weights if not using the Hugging Face cache
export VQA_MODEL_QWENVL="Qwen/Qwen2.5-VL-7B-Instruct"
export VQA_MODEL_INTERNVL="OpenGVLab/InternVL3_5-8B"
```

## Dataset

Download the dataset from Hugging Face and organize it as follows:
```
data/
├── train.json
├── test.json
└── images/    # Directory containing all infographic images
    ├── image_01.jpg
    └── ...
```
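A split can then be loaded with plain `json`. The field names below (`image`, `question`, `answer`) are assumptions for illustration only; check the released files for the exact schema:

```python
import json
from pathlib import Path

def load_split(path):
    """Load a list of QA records from a JSON split file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def missing_images(records, images_dir):
    """Return referenced image files that are absent from images_dir.

    Assumes each record carries an "image" field (hypothetical schema).
    """
    return [r["image"] for r in records
            if not (Path(images_dir) / r["image"]).exists()]
```

A quick sanity check before running inference is `missing_images(load_split("data/test.json"), "data/images")`, which should come back empty if the layout above is correct.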
## Inference

You can run scripts with `uv run` (which automatically uses the project's virtual environment) or activate the environment first.
Single-image QA:

```bash
uv run python -m src.inference.single.run_inference \
    qwenvl \
    --data_path data/test.json \
    --image_folder /path/to/images \
    --output_dir results/single
```

Multi-image QA:

```bash
uv run python -m src.inference.multi.run_inference \
    qwenvl \
    --data_path data/multi_image_test.json \
    --output_dir results/multi
```

## Evaluation

After inference, calculate the Average Normalized Levenshtein Similarity (ANLS) and Accuracy:
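ANLS scores each prediction by normalized edit distance against the gold answers, zeroing out matches past a threshold (commonly τ = 0.5). A minimal reference sketch; the repository's scorer in `src/common` may normalize answers differently:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[len(b)]

def anls(pred: str, golds: list[str], tau: float = 0.5) -> float:
    """ANLS for one question: best thresholded similarity over gold answers."""
    best = 0.0
    for gold in golds:
        p, g = pred.strip().lower(), gold.strip().lower()
        nl = levenshtein(p, g) / max(len(p), len(g), 1)
        best = max(best, 1.0 - nl if nl < tau else 0.0)
    return best
```

The per-question scores are then averaged over the test set to give the reported ANLS.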
```bash
# For single-image results
uv run python -m src.inference.single.calculate_scores --results-dir results/single

# For multi-image results
uv run python -m src.inference.multi.calculate_scores --results-dir results/multi
```

## Fine-tuning

The fine-tuning module is integrated into the project. We use `ft-vlm` (built on TRL and PEFT) for efficient instruction tuning.
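The JSON files under `ft-vlm/configs/` drive training. As a hedged illustration only (the key names and values below are assumptions, not the repository's actual schema), a LoRA fine-tuning config typically covers the base model, adapter rank, and optimizer settings:

```python
import json

# Illustrative config fields -- the real schema lives in
# ft-vlm/configs/train_qwen25vl_7b_multi_image.json and may differ.
config = {
    "model_name_or_path": "Qwen/Qwen2.5-VL-7B-Instruct",
    "use_lora": True,
    "lora_r": 16,        # adapter rank
    "lora_alpha": 32,    # scaling factor
    "lora_dropout": 0.05,
    "learning_rate": 1e-4,
    "num_train_epochs": 1,
}
config_json = json.dumps(config, indent=2)
```

Keeping hyperparameters in JSON configs rather than CLI flags makes runs easy to version and reproduce.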
```bash
# Example: fine-tune on the multi-image task with LoRA
uv run ft-vlm-train \
    --config ft-vlm/configs/train_qwen25vl_7b_multi_image.json
```

Or using the Python module directly:
```bash
uv run python -m src.ft_vlm.fine_tuning.train \
    --config ft-vlm/configs/train_qwen25vl_7b_multi_image.json
```

## Citation

If you find this code or dataset useful for your research, please cite our paper:
```bibtex
@misc{vandinh2025viinfographicvqa,
  title={ViInfographicVQA: A Benchmark for Single and Multi-image Visual Question Answering on Vietnamese Infographics},
  author={Tue-Thu Van-Dinh and Hoang-Duy Tran and Truong-Binh Duong and Mai-Hanh Pham and Binh-Nam Le-Nguyen and Quoc-Thai Nguyen},
  year={2025},
  eprint={2512.12424},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.12424},
}
```