
Official repository of

ViT-FIQA: Assessing Face Image Quality using Vision Transformers

Andrea Atzori 1, Fadi Boutros 1, Naser Damer 1,2

1 Fraunhofer IGD   2 Technische Universität Darmstadt

Accepted at ICCV Workshops 2025

Overview 🔎

An overview of the proposed ViT-FIQA for assessing the quality of face samples. 1) A face sample is divided into equally sized, non-overlapping patches of size $P \times P$; the patches are then flattened and linearly projected to extract the patch embeddings. 2) A learnable quality token is concatenated to the patch tokens, and the concatenated sequence is fed to a stack of Transformer encoder layers. 3) The final sequence of embeddings is then used as follows: the first token - the refined quality token - is passed to a regression layer to predict the utility value of the sample, while the remaining patch tokens are passed to a fully connected layer to obtain a final embedding representing the sample. 4) The two loss terms ($L_{FR}$ and $L_{FIQ}$) are computed and summed to obtain the final loss value.
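The pipeline in the figure can be sketched in plain PyTorch. This is a minimal illustrative model, not the released implementation: the class name, dimensions, and encoder depth below are assumptions chosen for demonstration, while the official code builds on InsightFace ViT backbones.

```python
import torch
import torch.nn as nn

class ViTFIQASketch(nn.Module):
    """Toy sketch of the two-head ViT-FIQA design: a learnable quality
    token is prepended to the patch tokens, the sequence runs through
    Transformer encoders, then the refined quality token feeds a
    regression head while the patch tokens feed an embedding head."""

    def __init__(self, img_size=112, patch=16, dim=256, depth=4, emb_dim=512):
        super().__init__()
        self.n_patches = (img_size // patch) ** 2
        # patch embedding: split into PxP patches, flatten, linearly project
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.quality_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # head 1: scalar utility score from the refined quality token
        self.quality_head = nn.Linear(dim, 1)
        # head 2: face embedding from the remaining patch tokens
        self.embed_head = nn.Linear(self.n_patches * dim, emb_dim)

    def forward(self, x):
        tokens = self.proj(x).flatten(2).transpose(1, 2)    # (B, N, dim)
        q = self.quality_token.expand(x.size(0), -1, -1)    # (B, 1, dim)
        seq = torch.cat([q, tokens], dim=1) + self.pos      # (B, N+1, dim)
        seq = self.encoder(seq)
        quality = self.quality_head(seq[:, 0]).squeeze(-1)  # (B,)
        embedding = self.embed_head(seq[:, 1:].flatten(1))  # (B, emb_dim)
        return quality, embedding
```

During training, the embedding would feed a margin-penalty softmax loss ($L_{FR}$) and the quality score a regression loss ($L_{FIQ}$), summed into the final objective.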

Abstract 🤏

Face Image Quality Assessment (FIQA) aims to predict the utility of a face image for face recognition (FR) systems. State-of-the-art FIQA methods mainly rely on convolutional neural networks (CNNs), leaving the potential of Vision Transformer (ViT) architectures underexplored. This work proposes ViT-FIQA, a novel approach that extends standard ViT backbones, originally optimized for FR, through a learnable quality token designed to predict a scalar utility score for any given face image. The learnable quality token is concatenated with the standard image patch tokens, and the whole sequence is processed via global self-attention by the ViT encoders to aggregate contextual information across all patches. At the output of the backbone, ViT-FIQA branches into two heads: (1) the patch tokens are passed through a fully connected layer to learn discriminative face representations via a margin-penalty softmax loss, and (2) the quality token is fed into a regression head to learn to predict the face sample's utility. Extensive experiments on challenging benchmarks and several FR models, including both CNN- and ViT-based architectures, demonstrate that ViT-FIQA consistently achieves top-tier performance. These results underscore the effectiveness of transformer-based architectures in modeling face image utility and highlight the potential of ViTs as a scalable foundation for future FIQA research.

Usage 🖥

Setup

Install all necessary packages in a Python >=3.8 environment:

   pip install torch torchvision opencv-python mxnet easydict scipy==1.8.1 numpy==1.23.1

Extract Face Image Quality Scores

To extract scores for images in a folder,

  1. download the pre-trained model weights from this link and place them in a location of your choice
  2. run python evaluation/getQualityScore.py with the arguments set accordingly
    usage: getQualityScore.py   [--data-dir DATA_DIR] 
                                [--pairs PAIRS] 
                                [--datasets DATASETS] 
                                [--gpu_id GPU_ID] 
                                [--model_path MODEL_PATH] 
                                [--backbone BACKBONE] 
                                [--score_file_name SCORE_FILE_NAME] 
                                [--color_channel COLOR_CHANNEL]
    ViT-FIQA
    
    options:
    --data-dir DATA_DIR   Root dir for the evaluation dataset.
    --pairs PAIRS         Path to the lfw pairs file.
    --datasets DATASETS   Comma-separated list of evaluation datasets, e.g. XQLFW,lfw,calfw,agedb_30,cfp_fp,cplfw,IJBC.
    --gpu_id GPU_ID       GPU id.
    --model_path MODEL_PATH
                        Path to the pretrained model weights.
    --backbone BACKBONE   vit_FC or iresnet100 or iresnet50.
    --score_file_name SCORE_FILE_NAME
                        Score file name; the file will be stored in the same data dir.
    --color_channel COLOR_CHANNEL
                        Input image color channel order: RGB or BGR.
    

Evaluation and EDC curves

Please refer to the CR-FIQA repository for evaluation and EDC curve plotting.
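For reference, an EDC (Error-versus-Discard Characteristic) curve reports the verification error that remains after discarding a growing fraction of the lowest-quality comparisons. The CR-FIQA repository contains the full evaluation code; the sketch below only illustrates the idea, and the function and argument names are hypothetical.

```python
import numpy as np

def edc_curve(pair_quality, pair_error, discard_fractions):
    """Minimal EDC sketch.

    pair_quality: quality of each genuine comparison (e.g. the min of the
    two samples' quality scores); pair_error: 1 if the comparison is a
    false non-match at the chosen threshold, else 0. Returns the error
    rate left after discarding the lowest-quality fraction of comparisons.
    """
    order = np.argsort(pair_quality)          # lowest quality first
    err = np.asarray(pair_error, float)[order]
    n = len(err)
    rates = []
    for d in discard_fractions:
        keep = err[int(round(d * n)):]        # drop the lowest-quality share
        rates.append(keep.mean() if keep.size else 0.0)
    return np.asarray(rates)
```

A good FIQA method yields a curve that drops quickly: the errors are concentrated in the comparisons it flags as low quality.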

Citation ✒

If you found this work helpful for your research, please cite it with the following BibTeX entry:

@InProceedings{Atzori_2025_ICCV,
    author    = {Atzori, Andrea and Boutros, Fadi and Damer, Naser},
    title     = {ViT-FIQA: Assessing Face Image Quality using Vision Transformers},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {5935-5945}
}

Acknowledgements

This work is based on InsightFace for ViTs and on CR-FIQA.

License

This project is licensed under the terms of the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.

Copyright (c) 2025 Fraunhofer Institute for Computer Graphics Research IGD, Darmstadt.
