Bill Psomas¹†, Dionysis Christopoulos²†, Eirini Baltzi², Ioannis Kakogeorgiou⁶
Tilemachos Aravanis¹, Nikos Komodakis³,⁴,⁵, Konstantinos Karantzalos², Yannis Avrithis, Giorgos Tolias¹
¹Visual Recognition Group, FEE, Czech Technical University in Prague; ²National Technical University of Athens; ³University of Crete; ⁴Archimedes, Athena RC; ⁵ACM-FORTH; ⁶IIT, NCSR “Demokritos”
Official PyTorch implementation and benchmark results for Efficient Probing.
TL;DR: We introduce efficient probing (EP), a lightweight multi-query cross-attention mechanism that improves accuracy of frozen pretrained encoders while yielding interpretable attention maps.
As fine-tuning becomes impractical at scale, probing is emerging as the preferred evaluation protocol. However, standard linear probing can understate the capability of models whose pre-training optimizes local representations rather than an explicit global representation. This motivates attentive probing, an alternative that uses attention to selectively aggregate patch-level features. Despite growing adoption, attentive probing is still underexplored: existing approaches are often over-parameterized and computationally inefficient.
In this work, we revisit attentive probing through the lens of the accuracy vs. parameter-efficiency trade-off. We present the first comprehensive study of existing methods, analyzing their design choices and benchmarking their performance. Building on these insights, we propose efficient probing (EP), a lightweight yet effective multi-query cross-attention mechanism that eliminates redundant projections and reduces the number of trainable parameters. Across multiple benchmarks and pre-training paradigms, EP consistently outperforms linear probing and previous attentive probing methods, and remains effective when combined with parameter-efficient fine-tuning. Beyond evaluation, our analysis uncovers emerging properties of EP, including complementary attention maps, which open new directions for leveraging probing beyond protocol design.
We jointly visualize the attention maps of EP_8 (EP with 8 queries). An emerging property of EP is that its queries specialize in different object regions, yielding complementary and interpretable attention patterns. Queries consistently attend to distinct parts, producing stable semantic correspondences (e.g., tails, beaks, feet) across images and a structured decomposition of visual cues.
Dependencies are listed in requirements.txt.
Use Efficient Probing (EP) as a lightweight attentive pooling over patch tokens from a frozen backbone (e.g., ViT). EP learns a small set of queries, attends to tokens with a single key projection, uses identity values (no V/O projections), and averages per-query outputs into one descriptor. It returns both the pooled descriptor and interpretable attention maps.
from poolings.ep import EfficientProbing
# ---- Minimal integration example ----
# In your model.__init__:
self.ep = EfficientProbing(dim=embed_dim, num_queries=32) # EP_32
# In your model.forward(...):
# 'tokens' are the outputs of a FROZEN backbone (e.g., ViT):
# shape (B, 1+N, D) if a [CLS] token exists, else (B, N, D)
#
# Use only patch tokens (default in our paper/code):
patch_tokens = tokens[:, 1:, :] # or 'tokens' if you have no [CLS]
#
# Optional: include [CLS] among the values by passing all tokens:
# patch_tokens = tokens # uncomment to include [CLS]
#
pooled = self.ep(patch_tokens) # pooled: (B, D)
logits = self.head(pooled) # your classifier head

- Freeze the backbone; train only `EfficientProbing` and your classification head.
- `num_queries` controls speed/accuracy (e.g., 8, 16, 32). EP averages across queries, so the output stays `(B, D)`.
- Inputs & shapes: `tokens` are `(B, N, D)`, or `(B, 1+N, D)` if a `[CLS]` token exists.
- Default usage: pass patch tokens only (`tokens[:, 1:, :]` when `[CLS]` is present).
- To include `[CLS]` among values, pass all tokens instead.
- Outputs: `pooled` is `(B, D)` for your head; optional `attn` is `(B, Q, N)` for visualization/analysis.
- Repro tip: set seeds to make the learned query initialization reproducible.
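The mechanism described above can be sketched in a few lines of PyTorch. The snippet below is only an illustration written from the description in this README, not the code in `poolings/ep.py`: the class name `EPSketch`, the 0.02 query initialization scale, the bias-free key projection, the 1/sqrt(D) score scaling, and the `(pooled, attn)` return tuple are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class EPSketch(nn.Module):
    """Illustrative multi-query attentive pooling (hypothetical, not the repo class)."""

    def __init__(self, dim: int, num_queries: int = 8):
        super().__init__()
        # Learnable queries, one row per query (assumed small random init).
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        # Single key projection; values are the tokens themselves (identity).
        self.key = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, N, D) patch tokens from a frozen backbone
        keys = self.key(tokens)                                    # (B, N, D)
        scores = torch.einsum("qd,bnd->bqn", self.queries, keys)   # (B, Q, N)
        attn = (scores * self.scale).softmax(dim=-1)               # each query sums to 1 over N
        per_query = attn @ tokens                                  # (B, Q, D), no V/O projection
        pooled = per_query.mean(dim=1)                             # average over queries: (B, D)
        return pooled, attn


torch.manual_seed(0)                       # fixes the random query initialization
tokens = torch.randn(2, 1 + 196, 768)      # stand-in for ViT-B/16 outputs with [CLS]
ep = EPSketch(dim=768, num_queries=8)
pooled, attn = ep(tokens[:, 1:, :])        # drop [CLS], keep patch tokens
print(pooled.shape, attn.shape)            # torch.Size([2, 768]) torch.Size([2, 8, 196])
```

In this sketch the only trainable pooling parameters are the queries and the key projection (8·768 + 768² = 595,968 for the sizes above). Each row of `attn` can be reshaped to the patch grid (14×14 here) to inspect what a single query attends to.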
torchrun --nproc_per_node=4 --nnodes=1 \
main_linprobe.py --amp bfloat16 --num_workers=12 --dataloader_affinity_hack \
--epochs=90 --accum_iter=1 --optimizer=lars --batch_size=1024 \
--model vit_base_patch16 --finetune vit_base_patch16_224.mae \
--dataset_name imagenet1k --nb_classes 1000 --data_path /mnt/data/Public_datasets/imagenet/imagenet_pytorch \
--output_dir /home/psomava1/code/beyond_cls/outputs/linprobe_mae_vitb_ep_imagenet1k \
--cls_features=ep

- To perform standard linear probing (LP):
  - Use `--cls_features cls` to utilize the class token from the pre-trained model.
  - Use `--cls_features pos` to utilize the patch tokens (via global average pooling).
- To perform full finetuning (FT), use the `--finetuning` flag.
- Supported attentive pooling methods (as described in the paper): `abmilp`, `simpool`, `clip`, `siglip`, `aim`, `ep`, `cbam`, `coca`, `cait`, `dinovit`, `jepa`, `dolg`, `cae`
  - These can be passed via the `--cls_features` argument.
  - Note: Appending the suffix `_all` to any pooling type (e.g., `ep_all`) will include both patch tokens and the class token as input to the selected attentive pooling. By default, only patch tokens are used.
- Experiment with more datasets in any setup of your choice by adjusting the `--dataset_name`, `--nb_classes`, and `--data_path` arguments accordingly.
  - Supported datasets: ImageNet-1k, Places365, CIFAR-100, StanfordCars, Food101, FGVCAircraft, SUN397, DTD, OxfordIIITPet, CUB200
- Try CAPI and DINOv2 pre-trained models (from PyTorch Hub) by adjusting the `--model` argument based on their official repositories.
  - The `--finetune` argument is not needed in this case.
- Try SimMIM, BEiTv2, and iBOT by passing the checkpoint path to the `--finetune` argument.
  - Pretrained weights are provided via Google Drive.
- Instructions on how to use pre-trained models from OpenCLIP are provided in the following subsection.
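To make the `--cls_features` options listed above easier to keep apart, here is a small, purely illustrative helper that reports what a given value selects. The function `describe_cls_features` is hypothetical and is not part of this repository; it only encodes the rules stated in this README (the `cls` and `pos` linear-probing modes, the supported pooling names, and the `_all` suffix).

```python
# Names copied from the list of supported attentive poolings in this README.
ATTENTIVE_POOLINGS = {
    "abmilp", "simpool", "clip", "siglip", "aim", "ep", "cbam",
    "coca", "cait", "dinovit", "jepa", "dolg", "cae",
}
LINEAR_PROBING = {"cls", "pos"}


def describe_cls_features(value: str) -> dict:
    """Hypothetical helper: explain what a --cls_features value selects."""
    if value in LINEAR_PROBING:
        source = "class token" if value == "cls" else "patch tokens (global average pooling)"
        return {"mode": "linear probing", "pooling": value, "input": source}
    # The "_all" suffix adds the class token to the pooling input.
    include_cls = value.endswith("_all")
    name = value[: -len("_all")] if include_cls else value
    if name not in ATTENTIVE_POOLINGS:
        raise ValueError(f"unknown --cls_features value: {value!r}")
    tokens = "patch tokens + class token" if include_cls else "patch tokens only"
    return {"mode": "attentive probing", "pooling": name, "input": tokens}


for value in ("cls", "pos", "ep", "ep_all"):
    print(value, "->", describe_cls_features(value))
```

The script's own argument handling may differ; the helper is only meant as a quick reference for which tokens each setting feeds into the probe.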
torchrun --nproc_per_node=4 --nnodes=1 \
main_linprobe.py --amp bfloat16 --num_workers=12 --dataloader_affinity_hack \
--epochs=90 --accum_iter=1 --optimizer=lars --batch_size=1024 \
--model ViT-L-14 --openclip_pretrain openai --openclip \
--dataset_name imagenet1k --nb_classes 1000 --data_path /mnt/data/Public_datasets/imagenet/imagenet_pytorch \
--output_dir /home/psomava1/code/beyond_cls/outputs/linprobe_clip_openai_vitl_ep_imagenet1k \
--cls_features=ep

- To evaluate alternative pre-trained OpenCLIP models, adjust the `--model` and `--openclip_pretrain` arguments accordingly. Available combinations can be found in the official OpenCLIP repository.

Example alternative:
--model ViT-L-16-SigLIP-256 --openclip_pretrain webli --openclip
This codebase is based on the official MAE, SimMIM and Beyond [cls] implementations.
We thank the authors for open-sourcing them.
This repository is released under the Apache 2.0 license as found in the LICENSE file.
If you find this repository useful, please consider giving a star 🌟 and citation:
@inproceedings{
psomas2026attention,
title={Attention, Please! Revisiting Attentive Probing Through the Lens of Efficiency},
author={Bill Psomas and Dionysis Christopoulos and Eirini Baltzi and Ioannis Kakogeorgiou and Tilemachos Aravanis and Nikos Komodakis and Konstantinos Karantzalos and Yannis Avrithis and Giorgos Tolias},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=PXo0gtT7Al}
}

