# ViInfographicVQA: A Benchmark for Single and Multi-image Visual Question Answering on Vietnamese Infographics
This repository contains the official implementation and evaluation code for the paper "ViInfographicVQA: A Benchmark for Single and Multi-image Visual Question Answering on Vietnamese Infographics".
## News

- [2025/12] 🔥 Our paper has been accepted to the AI4Research Workshop @ AAAI 2026!
- [2025/12] We released the ViInfographicVQA dataset and the evaluation code.
## Overview

ViInfographicVQA is the first large-scale benchmark for Vietnamese infographic VQA, comprising 6,747 infographics and 20,409 QA pairs. It evaluates models on two distinct tasks:
- Single-image QA: Traditional VQA requiring layout understanding and OCR.
- Multi-image QA: A novel task requiring cross-image reasoning, evidence synthesis, and arithmetic operations across multiple semantically related infographics.
## Repository Structure

```
ViInfographicVQA/
├── data/              # Dataset placeholders
├── ft-vlm/            # Fine-tuning module configuration
├── results/           # Evaluation output
├── src/               # Main source code
│   ├── common/        # Shared utilities and metrics (ANLS, Accuracy)
│   └── inference/     # Inference engines
│       ├── single/    # Single-image benchmarks
│       └── multi/     # Multi-image benchmarks
├── config.py          # Centralized path configuration
├── pyproject.toml     # Project dependencies & build config
└── uv.lock            # Dependency lock file (ensures reproducibility)
```
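For orientation, `config.py` centralizes path handling. A minimal sketch of what such a module might look like, assuming it simply reads the `VQA_*` environment variables described in the setup section (the repository's actual defaults may differ):

```python
import os
from pathlib import Path

# Hypothetical sketch of centralized path configuration (config.py).
# The VQA_* variable names come from the setup section; defaults here
# are illustrative, not the repository's actual values.
DATA_DIR = Path(os.environ.get("VQA_DATA_DIR", "./data"))
IMAGES_DIR = Path(os.environ.get("VQA_IMAGES_DIR", str(DATA_DIR / "images")))
OUTPUT_DIR = Path(os.environ.get("VQA_OUTPUT_DIR", "./results"))

TRAIN_JSON = DATA_DIR / "train.json"
TEST_JSON = DATA_DIR / "test.json"
```

Centralizing paths this way lets every inference and evaluation script import the same locations instead of hard-coding them.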
## Installation

We use [uv](https://github.com/astral-sh/uv) for extremely fast dependency management. See its repository for installation instructions.
- Clone the repository:

```bash
git clone https://github.com/duongtruongbinh/ViInfographicVQA.git
cd ViInfographicVQA
```

- Sync the environment:
This command creates a virtual environment and installs all dependencies (including `torch`, `transformers`, and `flash-attn`-compatible packages) defined in `uv.lock`.
```bash
uv sync
```

Optional: to use Flash Attention 2 (recommended for speed):
```bash
uv pip install flash-attn --no-build-isolation
```

- Set up environment variables:
Create a `.env` file in the root directory:

```bash
export VQA_DATA_DIR="./data"
export VQA_IMAGES_DIR="/path/to/your/images"
export VQA_OUTPUT_DIR="./results"

# Optional: point to local model weights if not using the Hugging Face cache
export VQA_MODEL_QWENVL="Qwen/Qwen2.5-VL-7B-Instruct"
export VQA_MODEL_INTERNVL="OpenGVLab/InternVL3_5-8B"
```

## Dataset

Download the dataset from Hugging Face and organize it as follows:
```
data/
├── train.json
├── test.json
└── images/    # Directory containing all infographic images
    ├── image_01.jpg
    └── ...
```
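A split can then be loaded with plain `json`. The field names below (`image`, `question`, `answer`) are assumptions for illustration only; check the released files for the exact schema:

```python
import json
from pathlib import Path

def load_split(path):
    """Load a list of QA records from a JSON split file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def missing_images(records, images_dir):
    """Return referenced image files that are absent from images_dir.

    Assumes each record carries an "image" field (hypothetical schema).
    """
    return [r["image"] for r in records
            if not (Path(images_dir) / r["image"]).exists()]
```

A quick sanity check before running inference is `missing_images(load_split("data/test.json"), "data/images")`, which should come back empty if the layout above is correct.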
## Inference

You can run scripts with `uv run` (which automatically uses the project's virtual environment) or activate the environment first.
Single-image QA:

```bash
uv run python -m src.inference.single.run_inference \
    qwenvl \
    --data_path data/test.json \
    --image_folder /path/to/images \
    --output_dir results/single
```

Multi-image QA:

```bash
uv run python -m src.inference.multi.run_inference \
    qwenvl \
    --data_path data/multi_image_test.json \
    --output_dir results/multi
```

## Evaluation

After inference, calculate the Average Normalized Levenshtein Similarity (ANLS) and Accuracy:
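ANLS scores each prediction by normalized edit distance against the gold answers, zeroing out matches past a threshold (commonly τ = 0.5). A minimal reference sketch; the repository's scorer in `src/common` may normalize answers differently:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[len(b)]

def anls(pred: str, golds: list[str], tau: float = 0.5) -> float:
    """ANLS for one question: best thresholded similarity over gold answers."""
    best = 0.0
    for gold in golds:
        p, g = pred.strip().lower(), gold.strip().lower()
        nl = levenshtein(p, g) / max(len(p), len(g), 1)
        best = max(best, 1.0 - nl if nl < tau else 0.0)
    return best
```

The per-question scores are then averaged over the test set to give the reported ANLS.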
```bash
# For single-image results
uv run python -m src.inference.single.calculate_scores --results-dir results/single

# For multi-image results
uv run python -m src.inference.multi.calculate_scores --results-dir results/multi
```

## Fine-tuning

The fine-tuning module is integrated into the project. We use `ft-vlm` (built on TRL and PEFT) for efficient instruction tuning.
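The JSON files under `ft-vlm/configs/` drive training. As a hedged illustration only (the key names and values below are assumptions, not the repository's actual schema), a LoRA fine-tuning config typically covers the base model, adapter rank, and optimizer settings:

```python
import json

# Illustrative config fields -- the real schema lives in
# ft-vlm/configs/train_qwen25vl_7b_multi_image.json and may differ.
config = {
    "model_name_or_path": "Qwen/Qwen2.5-VL-7B-Instruct",
    "use_lora": True,
    "lora_r": 16,        # adapter rank
    "lora_alpha": 32,    # scaling factor
    "lora_dropout": 0.05,
    "learning_rate": 1e-4,
    "num_train_epochs": 1,
}
config_json = json.dumps(config, indent=2)
```

Keeping hyperparameters in JSON configs rather than CLI flags makes runs easy to version and reproduce.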
```bash
# Example: fine-tune on the multi-image task with LoRA
uv run ft-vlm-train \
    --config ft-vlm/configs/train_qwen25vl_7b_multi_image.json
```

Or using the Python module directly:
```bash
uv run python -m src.ft_vlm.fine_tuning.train \
    --config ft-vlm/configs/train_qwen25vl_7b_multi_image.json
```

## Citation

If you find this code or dataset useful for your research, please cite our paper:
```bibtex
@misc{vandinh2025viinfographicvqa,
  title={ViInfographicVQA: A Benchmark for Single and Multi-image Visual Question Answering on Vietnamese Infographics},
  author={Tue-Thu Van-Dinh and Hoang-Duy Tran and Truong-Binh Duong and Mai-Hanh Pham and Binh-Nam Le-Nguyen and Quoc-Thai Nguyen},
  year={2025},
  eprint={2512.12424},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.12424},
}
```