[WACV 2026] ClusterMine: Robust Label-Free Visual Out-Of-Distribution Detection via Concept Mining from Text Corpora
[Arxiv] [BibTeX] [Quick Start]
Nikolas Adaloglou, Diana Petrusheva, Mohamed Asker, Felix Michels and Prof. Markus Kollmann
Mathematical Modeling of Biological Systems lab (MMBS), Heinrich Heine University of Düsseldorf

An overview of the label mining framework for OOD detection using CLIP. Given a text corpus and its representation, ClusterMine aims to extract in-distribution-related class names in the shared vision-language space of CLIP. Best viewed in color.
TL;DR: This repository contains the official implementation of ClusterMine, a novel method for visual out-of-distribution (OOD) detection that leverages CLIP's text-image embedding space to mine label names from large text corpora.
Large-scale visual out-of-distribution (OOD) detection has witnessed remarkable progress by leveraging vision-language models such as CLIP. However, a significant limitation of current methods is their reliance on a pre-defined set of in-distribution (ID) ground-truth label names (positives). These fixed label names can be unavailable, unreliable at scale, or become less relevant due to in-distribution shifts after deployment. Towards truly unsupervised OOD detection, we utilize widely available text corpora for positive label mining under a general concept mining paradigm, bypassing the need for positives. Within this framework, we propose ClusterMine, a novel positive label mining method. ClusterMine is the first method to achieve state-of-the-art OOD detection performance without access to positive labels. It extracts positive concepts from a large text corpus by combining visual-only sample consistency (via clustering) and zero-shot image-text consistency.
The proposed method, cluster-based positive mining (ClusterMine), consists of the following steps:
- **Visual feature-based clustering:** We cluster CLIP's visual features into $C=4000$ clusters. In practice, we apply TEMI clustering, as it has shown significant improvements in clustering accuracy over $k$-means, even at large scales, and we use its default parameters ($\beta=0.6$, 50 heads). In contrast to the clustering downstream task, we are only interested in a rough overestimation of $C$, not in recovering the ground-truth classes.
- **Vision-language inference:** For all samples that fall into the same cluster, we apply zero-shot inference using the text corpus $\mathcal{Y}_{corpus}$.
- **Cluster voting:** Each cluster's label name is then determined by majority voting, which effectively reduces the number of false-positive classes. Voting enforces visual consistency, as nearest neighbors in feature space likely share the same label. Because different clusters can be mapped to the same label name, $|\mathcal{Y}_{pos}| \leq C$.

By integrating visual consistency into the top-1 image-text matching:

- different clusters can be mapped to the same label name $y \in \mathcal{Y}_{pos}$, and
- text concepts that do not match a sample's visual neighborhood are rejected.
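The cluster-voting step above can be sketched in a few lines of numpy (function and variable names here are illustrative, not the repository's API):

```python
import numpy as np

def cluster_vote(cluster_ids, zero_shot_labels):
    """Majority-vote one corpus concept per cluster and return the mined positives.

    cluster_ids:      (N,) visual cluster assignment per image
    zero_shot_labels: (N,) top-1 corpus concept index per image
    """
    positives = set()
    for c in np.unique(cluster_ids):
        members = zero_shot_labels[cluster_ids == c]
        vals, counts = np.unique(members, return_counts=True)
        positives.add(int(vals[np.argmax(counts)]))  # majority concept of cluster c
    return positives
```

Since several clusters may vote for the same concept, the mined set satisfies $|\mathcal{Y}_{pos}| \leq C$.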
Experimental results demonstrate that ClusterMine achieves state-of-the-art robustness to covariate in-distribution shifts.

Semantic large-scale OOD detection AUROC/FPR95 per dataset using CLIP ViT-H dfn5b. The WordNet corpus (nouns and adjectives) is used. None of the methods requires training or fine-tuning.
Additionally, the extracted positive class names overlap strongly with the ground-truth labels. In the left plot below, we compute the top-1 text-text similarity with the GT labels and find the shortest path (minimum number of hops) to the GT label in the WordNet tree.
On the right plot below, we measure the OOD detection robustness across multiple ID shifts (x-axis) compared to ImageNet using CLIP ViT-H dfn5b. The relative AUROC difference in % of each method compared to its ImageNet score is shown on top of each bar.
Choose between conda or uv for dependency management:

**conda**

```shell
# Create environment from YAML file
conda env create -f environment.yml
# Activate the environment
conda activate clustermine
```

**uv**

```shell
# Install dependencies using uv
uv sync
# Activate the virtual environment
source .venv/bin/activate
```

Note: The conda environment includes CUDA-optimized packages (PyTorch, FAISS-GPU), which are essential for efficient computation with large embeddings.
Before running experiments, configure your dataset paths by editing `dataset_loaders/paths.json`:

```json
{
    "DEFAULT_PATH": "/path/to/your/datasets",
    "PRECOMPUTED_PATH": "/path/to/store/embeddings",
    "IMAGENET_PATH": "/path/to/imagenet",
    "PRECOMPUTED_TEXT_PATH": "./data/text_embeddings",
    "BASE_PATH_CORPORA": "./data/corpora"
}
```

Required datasets for result reproduction:
- **In-distribution:** ImageNet-1K (`IN1K`)
- **OOD Benchmarks:** NINCO (`NINCO`), ImageNet-O (`IN_O`), OpenImages-O (`openimage_o`), iNaturalist (`inat`), ImageNet-21K-OOD (`IN21OOD`), Textures subset (`texturev2`)
- **Robustness Benchmarks:** ImageNet-V2 (`IN_V2`), ImageNet-A (`IN_A`), ImageNet-R (`IN_R`), ImageNet-C subset (`IN_C`), ImageNet-Sketch (`sketch`)
The paths of these vision datasets can be configured in `dataset_loaders/data_paths.py`. We assume that `DEFAULT_PATH` contains these dataset folders. You need to download the data yourself; for instance, this repo provides instructions for downloading `inat`, and NINCO can be downloaded directly from Zenodo.
Our method uses pre-computed text embeddings from different corpora. We primarily use clip:8 which corresponds to CLIP ViT-H-14 DFN-5B architecture. Here WN corresponds to WordNet nouns only and WN-NA includes nouns and adjectives. In the paper we report results with WN-NA unless otherwise specified.
To compare with other baselines that assume access to the in-distribution label names you also need to compute the ImageNet-1K text embeddings.
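Once text and image embeddings are cached, zero-shot corpus inference reduces to a cosine-similarity argmax in the shared embedding space. A minimal numpy sketch (names are illustrative, not the repository's API):

```python
import numpy as np

def zero_shot_top1(image_embeds, text_embeds):
    """Return the top-1 corpus concept index for each image embedding."""
    # L2-normalize so that the dot product equals cosine similarity
    img = image_embeds / np.linalg.norm(image_embeds, axis=1, keepdims=True)
    txt = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return np.argmax(img @ txt.T, axis=1)
```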
```shell
# Generate text embeddings for WordNet nouns
python gen_text_embeds.py --arch clip:8 --corpus_name WN
# Generate text embeddings for WordNet nouns + adjectives
python gen_text_embeds.py --arch clip:8 --corpus_name WN-NA
# Generate ImageNet-1K class embeddings (required for baseline methods like MCM)
python gen_text_embeds.py --arch clip:8 --corpus_name IN1K
```

Pre-computing image embeddings without augmentation significantly speeds up experiments:
```shell
# Example: Generate embeddings for NINCO dataset
python gen_embeds.py --arch clip:8 --dataset NINCO --no_eval_knn --no_compute_knn
# For multiple datasets, use the batch script
bash bash/gen_img_emb.sh
```

Parameter explanation:

- `--no_eval_knn`: Skip KNN evaluation during embedding generation
- `--no_compute_knn`: Skip KNN index computation (saves time and storage)
ClusterMine requires the TEMI clustering head. We provide weights for clip:8 trained on ImageNet-1K:
```shell
# Create weights directory
mkdir -p weights
# Download and extract clustering weights (~160MB)
cd weights && wget https://uni-duesseldorf.sciebo.de/s/jZ7dwn7EmKJxxmG/download/clip_8.zip
unzip clip_8.zip
rm clip_8.zip
cd ..
```

Refer to the TEMI repository instructions for training your own clustering head.

Custom clustering path: We store the weights in the repository folder. To use a different path, modify `CLUSTER_BASE_PATH = './weights'` in `ood/clustermine.py`.
The main OOD detection experiments evaluate AUROC and FPR95 on six OOD benchmarks: `['NINCO', 'IN_O', 'openimage_o', 'inat', 'IN21OOD', 'texturev2']`. The reported averages are computed over these six datasets, using `IN1K` as the in-distribution.
```shell
python main_ood.py --arch clip:8 --dataset "IN1K" \
    --corpus_name "WN" --method "clustermine" --out_dim 4000 \
    --step 1024 --save_path_df "./data/results/clustermine"
```

Key parameters:

- `--out_dim 4000`: Number of clusters (must match the downloaded weights)
- `--step 1024`: Batch size for processing (adjust based on GPU memory)
- `--corpus_name "WN"`: Corpus for mining concepts
```shell
python main_ood.py \
    --arch clip:8 \
    --dataset "IN1K" \
    --corpus_name "WN" \
    --method "posmine" \
    --threshold 0.00008 \
    --save_path_df "./data/results/posmine"
```

Key parameters:

- `--threshold 0.00008`: Similarity threshold for positive mining, i.e., the minimum fraction of samples assigned to a concept for that concept to be considered in-distribution.
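The role of the threshold can be illustrated with a small sketch (hypothetical helper, not the repository's implementation): a corpus concept survives only if at least that fraction of images pick it as their top-1 match.

```python
import numpy as np

def posmine_select(zero_shot_labels, threshold=0.00008):
    """Keep corpus concepts matched by at least `threshold` fraction of images."""
    labels, counts = np.unique(zero_shot_labels, return_counts=True)
    keep = counts / zero_shot_labels.size >= threshold
    return set(labels[keep].tolist())
```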
Run ClusterMine, PosMine, and baseline methods (MCM, NegLabel) together:
```shell
python main_ood.py \
    --arch clip:8 \
    --dataset "IN1K" \
    --corpus_name "WN" \
    --method "all" \
    --save_path_df "./data/results/all_methods"
```

Baseline methods included:

- **MCM**: Maximum Concept Matching
- **NegLabel**: Negative label mining
Test OOD detection performance across distribution shifts:
```shell
# Required robustness datasets
DATASETS=("IN_V2" "IN_A" "IN_R" "IN_C" "sketch")
# Generate embeddings for each dataset
for dataset in "${DATASETS[@]}"; do
    python gen_embeds.py \
        --arch clip:8 \
        --dataset $dataset \
        --no_eval_knn \
        --no_compute_knn \
        --test_only
done
```

Dataset descriptions:

- `IN_V2`: ImageNet-V2
- `IN_A`: ImageNet-A (natural adversarial examples)
- `IN_R`: ImageNet-R (artistic renditions)
- `IN_C`: ImageNet-C (corrupted images)
- `sketch`: ImageNet-Sketch (sketch drawings)
```shell
python main_robust.py \
    --arch clip:8 \
    --dataset IN1K \
    --save_path_df "./data/results/robustness" \
    --step 1024 \
    --corpus_name "WN-NA"
```

After running the experiments, you should see results similar to our paper (using our processed WN-NA corpus):
| Method | NINCO | IN_O | openimage_o | inat | IN21OOD | texturev2 | Average |
|---|---|---|---|---|---|---|---|
| PosMine | 92.56 | 93.13 | 97.04 | 98.83 | 91.36 | 93.81 | 94.46 |
| ClusterMine | 92.87 | 93.57 | 96.93 | 99.00 | 91.53 | 93.45 | 94.56 |
| NegLabel | 90.26 | 90.10 | 95.79 | 98.64 | 88.40 | 90.44 | 92.27 |
| MCM | 88.78 | 91.3 | 96.64 | 96.62 | 89.65 | 91.75 | 92.46 |
Results are saved as CSV files in the directory specified by `--save_path_df`. Below we show the results with respect to robustness under ID shifts.
| Method | ID | NINCO | IN_O | openimage_o | inat | IN21OOD | texturev2 | Average |
|---|---|---|---|---|---|---|---|---|
| ClusterMine | IN_A | 83.53 | 84.45 | 91.88 | 97.17 | 80.04 | 83.46 | 86.76 |
| NegLabel | IN_A | 79.70 | 80.82 | 90.74 | 96.64 | 78.14 | 80.56 | 84.43 |
| MCM | IN_A | 62.95 | 68.79 | 84.49 | 83.75 | 64.15 | 69.05 | 72.20 |
| MCM | IN_C | 70.01 | 73.90 | 84.58 | 83.76 | 70.76 | 74.43 | 76.24 |
| NegLabel | IN_C | 82.68 | 83.18 | 91.54 | 96.49 | 80.95 | 83.17 | 86.33 |
| ClusterMine | IN_C | 86.40 | 87.02 | 92.81 | 97.11 | 83.75 | 86.54 | 88.94 |
| NegLabel | IN_R | 89.30 | 89.48 | 95.44 | 98.59 | 87.71 | 89.68 | 91.70 |
| MCM | IN_R | 77.67 | 81.64 | 91.08 | 90.74 | 78.77 | 82.11 | 83.66 |
| ClusterMine | IN_R | 88.34 | 89.13 | 94.50 | 98.17 | 85.91 | 88.61 | 90.78 |
| MCM | IN_V2 | 84.72 | 87.79 | 94.79 | 94.66 | 85.66 | 88.21 | 89.31 |
| NegLabel | IN_V2 | 87.91 | 87.97 | 94.60 | 98.17 | 86.09 | 88.22 | 90.49 |
| ClusterMine | IN_V2 | 90.70 | 91.39 | 95.65 | 98.49 | 88.85 | 91.10 | 92.70 |
| MCM | sketch | 84.88 | 87.73 | 94.18 | 94.01 | 85.77 | 88.22 | 89.13 |
| NegLabel | sketch | 91.21 | 91.11 | 96.31 | 98.85 | 89.50 | 91.41 | 93.06 |
| ClusterMine | sketch | 90.35 | 91.14 | 95.60 | 98.52 | 88.43 | 90.76 | 92.47 |
For some datasets, using WN leads to superior results!
- **CUDA Out of Memory:** Reduce the `--step` parameter or use smaller batch sizes.
- **Missing Dataset Paths:** Ensure `dataset_loaders/paths.json` points to the correct locations.

To evaluate on a custom dataset:

1. Add the dataset path to `DATASET_PATHS` in `dataset_loaders/data_paths.py`
2. Generate embeddings: `python gen_embeds.py --arch clip:8 --dataset YOUR_DATASET`
3. Run evaluation: `python main_ood.py --dataset IN1K --save_path_df "./results/your_experiment"`
Supported architectures in model_builders/backbones/:
```python
preset_models = {
    "clip:0"  : ('MobileCLIP-S1', 'datacompdr'),
    "clip:1"  : ('MobileCLIP-B', 'datacompdr'),
    "clip:2"  : ('ViT-B-16', 'openai'),
    "clip:3"  : ('ViT-B-16-SigLIP', 'webli'),
    "clip:4"  : ('ViT-L-14', 'metaclip_400m'),
    "clip:5"  : ('ViT-L-14', 'openai'),
    "clip:6"  : ('ViT-L-16-SigLIP-256', 'webli'),
    "clip:7"  : ('ViT-H-14', 'metaclip_fullcc'),
    "clip:8"  : ('ViT-H-14', 'dfn5b'),
    "clip:9"  : ('ViT-bigG-14', 'laion2b_s39b_b160k'),
    "clip:10" : ('ViT-bigG-14', 'metaclip_fullcc'),
    "clip:11" : ('ViT-bigG-14-CLIPA', 'datacomp1b')
}
```

Note: For each new CLIP model you need to produce the image and text embeddings!
If you find our work useful, please consider citing us in your work:
```bibtex
@inproceedings{adaloglou2026clustermine,
    author    = {Adaloglou, Nikolaos and Petrusheva, Diana and Asker, Mohamed and Michels, Felix and Kollmann, Markus},
    title     = {ClusterMine: Robust Label-Free Visual Out-Of-Distribution Detection via Concept Mining from Text Corpora},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    year      = {2026}
}
```

This project is licensed under the Apache License, Version 2.0; see the LICENSE file for details. Also take into account the licenses of prior work, such as WordNet and OOD datasets like NINCO.
If you liked our work and find it useful, please consider starring (⭐) the repository so that it can reach a broader audience of like-minded people. It would be highly appreciated!



