rekkles2/Gaze-CIFAR-10
👀 Gaze-Guided Learning: Avoiding Shortcut Bias in Visual Classification

arXiv Project Page Download Dataset Papers with Code

⭐ If you find our dataset and code useful, please consider starring this repository and citing our paper!

📋 BibTeX Citation
@article{li2025gaze,
  title={Gaze-Guided Learning: Avoiding Shortcut Bias in Visual Classification},
  author={Li, Jiahang and Xue, Shibo and Su, Yong},
  journal={arXiv preprint arXiv:2504.05583},
  year={2025}
}

📄 Abstract

Inspired by human visual attention, deep neural networks have widely adopted attention mechanisms to learn locally discriminative attributes for challenging visual classification tasks. However, existing approaches primarily emphasize the representation of such features while neglecting their precise localization, which often leads to misclassification caused by shortcut biases. This limitation becomes even more pronounced when models are evaluated on transfer or out-of-distribution datasets. In contrast, humans are capable of leveraging prior object knowledge to quickly localize and compare fine-grained attributes, a capability that is especially crucial in complex and high-variance classification scenarios. Motivated by this, we introduce Gaze-CIFAR-10, a human gaze time-series dataset, along with a dual-sequence gaze encoder that models the precise sequential localization of human attention on distinct local attributes. In parallel, a Vision Transformer (ViT) is employed to learn the sequential representation of image content. Through cross-modal fusion, our framework integrates human gaze priors with machine-derived visual sequences, effectively correcting inaccurate localization in image feature representations. Extensive qualitative and quantitative experiments demonstrate that gaze-guided cognitive cues significantly enhance classification accuracy.

Motivation Figure
Figure: A toy example illustrating shortcut bias: (a) DNN attention versus (b) human gaze under limited data scale and diversity.


🧠 Method Overview

Model Architecture
Figure: Gaze-guided cross-modal fusion network.
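The fusion idea in the figure can be sketched in a few lines. This is a minimal illustration under assumptions, not the paper's implementation: the dual-sequence gaze encoder and the ViT are stubbed out with random projections plus mean pooling, and all names and sizes (`encode_gaze`, `encode_image`, the shared dimension `D`, the fixation format) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64            # shared embedding size (assumed)
NUM_CLASSES = 10  # CIFAR-10 categories

def encode_gaze(fixations):
    """Stand-in gaze encoder: fixations is a (T, 3) array of
    (x, y, duration) samples -- a hypothetical format."""
    W = rng.standard_normal((fixations.shape[1], D)) / np.sqrt(fixations.shape[1])
    return (fixations @ W).mean(axis=0)   # mean-pool over the time axis

def encode_image(patch_tokens):
    """Stand-in ViT: patch_tokens is an (N, D) array of patch embeddings."""
    return patch_tokens.mean(axis=0)      # mean-pool over patches

def classify(fixations, patch_tokens, W_cls):
    # Cross-modal fusion by concatenating the two modality embeddings,
    # then a linear head over the fused vector.
    fused = np.concatenate([encode_gaze(fixations), encode_image(patch_tokens)])
    return fused @ W_cls                  # (NUM_CLASSES,) logits

W_cls = rng.standard_normal((2 * D, NUM_CLASSES)) / np.sqrt(2 * D)
logits = classify(rng.standard_normal((40, 3)),
                  rng.standard_normal((196, D)), W_cls)
print(logits.shape)  # (10,)
```

The actual model replaces both stubs with learned sequence encoders; the point here is only the shape of the pipeline: two sequences in, one fused vector, one set of class logits out.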


📂 Dataset

You can download the Gaze-CIFAR-10 dataset from the following link:

👉 Gaze-CIFAR-10 Dataset

Example Samples
Figure: Gaze data collection setup. (a) Overview of our data acquisition system. (b) Step 1: Reconstruct image resolution. Step 2: Participants freely view two randomly selected images from different categories. Step 3: One image is randomly re-sampled from the previously viewed categories and shown again for focused observation. Step 4: Gaze data is transmitted to the PC for processing.
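Gaze recordings like those collected above are naturally time series of screen coordinates. The snippet below shows one plausible way to preprocess a trial; the array layout, resolution, and field order are assumptions for illustration, not the dataset's actual on-disk format.

```python
import numpy as np

# Hypothetical layout for one trial: each row is one gaze sample,
# (x_px, y_px, t_ms) on the displayed image.
gaze = np.array([
    [112.0,  80.0,   0.0],
    [118.5,  84.2,  16.7],
    [240.1, 200.9,  33.3],
], dtype=np.float32)

H = W = 384  # assumed display resolution after resolution reconstruction

# Normalize pixel coordinates to [0, 1] so they are resolution-independent,
# and convert absolute timestamps to inter-sample gaps, a common input
# representation for sequence encoders.
xy = gaze[:, :2] / np.array([W, H], dtype=np.float32)
dt = np.diff(gaze[:, 2], prepend=gaze[0, 2])
sequence = np.column_stack([xy, dt])

print(sequence.shape)  # (3, 3)
```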


🧠 Pretrained Model

Download the pretrained Vision Transformer (ViT) model:

📥 ViT Pretrained Model


🚀 Training

To train the model, run:

python train.py

🔍 Evaluation

To evaluate the trained model, run:

python predict1.py

📈 Star History

Star History Chart
