Andrea Atzori 1 Fadi Boutros 1 Naser Damer 1,2
1 Fraunhofer IGD 2 Technische Universität Darmstadt
Accepted at ICCV Workshops 2025
Figure: An overview of the proposed ViT-FIQA for assessing the quality of face samples. A face sample is divided into equally sized, non-overlapping patches.
Face Image Quality Assessment (FIQA) aims to predict the utility of a face image for face recognition (FR) systems. State-of-the-art FIQA methods mainly rely on convolutional neural networks (CNNs), leaving the potential of Vision Transformer (ViT) architectures underexplored. This work proposes ViT-FIQA, a novel approach that extends standard ViT backbones, originally optimized for FR, through a learnable quality token designed to predict a scalar utility score for any given face image. The learnable quality token is concatenated with the standard image patch tokens, and the whole sequence is processed via global self-attention by the ViT encoders to aggregate contextual information across all patches. At the output of the backbone, ViT-FIQA branches into two heads: (1) the patch tokens are passed through a fully connected layer to learn discriminative face representations via a margin-penalty softmax loss, and (2) the quality token is fed into a regression head to learn to predict the face sample's utility. Extensive experiments on challenging benchmarks and several FR models, including both CNN- and ViT-based architectures, demonstrate that ViT-FIQA consistently achieves top-tier performance. These results underscore the effectiveness of transformer-based architectures in modeling face image utility and highlight the potential of ViTs as a scalable foundation for future FIQA research.
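The two-head design described above can be sketched as follows. This is a minimal, hypothetical PyTorch module (module names, dimensions, and head layouts are illustrative assumptions, not the official implementation): a learnable quality token is concatenated with the patch tokens, the whole sequence passes through the transformer encoder with global self-attention, and the output branches into an identity-embedding head over the patch tokens and a scalar regression head over the quality token.

```python
import torch
import torch.nn as nn

class ViTFIQASketch(nn.Module):
    """Illustrative sketch of the ViT-FIQA two-head architecture (not the official code)."""

    def __init__(self, img_size=112, patch_size=8, dim=256, depth=2, num_heads=4, emb_size=512):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Non-overlapping patch embedding via a strided convolution
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # Learnable quality token, concatenated with the patch tokens
        self.quality_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Head 1: patch tokens -> face embedding (trained with a margin-penalty softmax loss)
        self.embedding_head = nn.Linear(num_patches * dim, emb_size)
        # Head 2: quality token -> scalar utility score
        self.quality_head = nn.Linear(dim, 1)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        q = self.quality_token.expand(x.size(0), -1, -1)         # (B, 1, dim)
        # Global self-attention aggregates context across all patches and the quality token
        out = self.encoder(torch.cat([tokens, q], dim=1) + self.pos_embed)
        patch_out, q_out = out[:, :-1], out[:, -1]
        embedding = self.embedding_head(patch_out.flatten(1))    # (B, emb_size)
        quality = self.quality_head(q_out).squeeze(-1)           # (B,) utility score
        return embedding, quality
```

At inference, only the quality head's output is needed to rank or filter face samples by predicted utility.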
Install all necessary packages in a Python >= 3.8 environment:
pip install torch torchvision opencv-python mxnet easydict scipy==1.8.1 numpy==1.23.1
To extract scores for images in a folder,
- download the pre-trained model weights from this link and place them in a location of your choice
- run python evaluation/getQualityScore.py and set the arguments accordingly:

usage: getQualityScore.py [--data-dir DATA_DIR] [--pairs PAIRS] [--datasets DATASETS] [--gpu_id GPU_ID] [--model_path MODEL_PATH] [--backbone BACKBONE] [--score_file_name SCORE_FILE_NAME] [--color_channel COLOR_CHANNEL]

ViT-FIQA options:
- --data-dir DATA_DIR: root directory of the evaluation dataset
- --pairs PAIRS: LFW pairs file
- --datasets DATASETS: comma-separated list of evaluation datasets, e.g. XQLFW,lfw,calfw,agedb_30,cfp_fp,cplfw,IJBC
- --gpu_id GPU_ID: GPU id
- --model_path MODEL_PATH: path to the pre-trained model used for evaluation
- --backbone BACKBONE: vit_FC, iresnet100, or iresnet50
- --score_file_name SCORE_FILE_NAME: name of the score file; it will be stored in the same data directory
- --color_channel COLOR_CHANNEL: input image color channel, either RGB or BGR
Please refer to the CR-FIQA repository for evaluation and EDC plotting.
If you find this work helpful for your research, please cite it using the following BibTeX entry:
@InProceedings{Atzori_2025_ICCV,
author = {Atzori, Andrea and Boutros, Fadi and Damer, Naser},
title = {ViT-FIQA: Assessing Face Image Quality using Vision Transformers},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
month = {October},
year = {2025},
pages = {5935-5945}
}
This work builds on InsightFace for the ViT backbones and on CR-FIQA.
This project is licensed under the terms of the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
Copyright (c) 2025 Fraunhofer Institute for Computer Graphics Research IGD, Darmstadt.
