Code accompanying the WACV 2026 paper "ClusterMine: Robust Label-Free Visual Out-Of-Distribution Detection via Concept Mining from Text Corpora"

HHU-MMBS/clustermine_wacv_official


[WACV 2026 🔥] ClusterMine: Robust Label-Free Visual Out-Of-Distribution Detection via Concept Mining from Text Corpora

[📜 Arxiv] [📖 BibTeX] [🚀 Quick Start]

Nikolas Adaloglou, Diana Petrusheva, Mohamed Asker, Felix Michels and Prof. Markus Kollmann

Mathematical Modeling of Biological Systems lab (MMBS), Heinrich Heine University Düsseldorf

Figure: An overview of the label mining framework for OOD detection using CLIP. Given a text corpus and its representation, ClusterMine aims to extract in-distribution-related class names in the shared vision-language space of CLIP. Best viewed in color.

TL;DR: This repository contains the official implementation of ClusterMine, a novel method for visual out-of-distribution (OOD) detection that leverages CLIP's text-image embedding space to mine label names from large text corpora.

Abstract and Method Overview

Large-scale visual out-of-distribution (OOD) detection has witnessed remarkable progress by leveraging vision-language models such as CLIP. However, a significant limitation of current methods is their reliance on a pre-defined set of in-distribution (ID) ground-truth label names (positives). These fixed label names can be unavailable, unreliable at scale, or become less relevant due to in-distribution shifts after deployment. Towards truly unsupervised OOD detection, we utilize widely available text corpora for positive label mining under a general concept mining paradigm, bypassing the need for pre-defined positives. Within this framework, we propose ClusterMine, a novel positive label mining method. ClusterMine is the first method to achieve state-of-the-art OOD detection performance without access to positive labels. It extracts positive concepts from a large text corpus by combining visual-only sample consistency (via clustering) and zero-shot image-text consistency.

πŸ—ΊοΈ Cluster-based positive mining (ClusterMine)

The proposed method, cluster-based positive mining (ClusterMine), consists of the following steps:


  1. Visual feature-based clustering: We cluster the visual features of CLIP's image encoder into $C=4000$ clusters. In practice, we apply TEMI clustering, as it has shown significant improvements in clustering accuracy over $k$-means, even at large scales. We use the default parameters ($\beta=0.6$, 50 heads). In contrast to the clustering downstream task, we are only interested in a rough overestimation of $C$, not in recovering the ground-truth classes.
  2. Vision-language inference: For all samples that fall into the same cluster, we apply zero-shot inference using the text corpus $\mathcal{Y}_{corpus}$.
  3. Cluster voting: Each cluster's label name is then determined by majority voting, effectively reducing the number of false-positive classes. This enforces visual consistency, as nearest neighbors in feature space likely share the same label. Crucially, because different clusters can be mapped to the same label name, $|\mathcal{Y}_{pos}| \leq C$.
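
The three steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not the repository's implementation: the feature arrays are stand-ins, and the TEMI clustering of step 1 is replaced by a precomputed assignment.

```python
import numpy as np

def clustermine_vote(image_feats, text_feats, cluster_ids):
    """Majority-vote a corpus concept for each visual cluster.

    image_feats: (N, D) L2-normalized image embeddings
    text_feats:  (K, D) L2-normalized corpus concept embeddings
    cluster_ids: (N,)   cluster assignment per image (step 1)
    Returns the set of mined positive concept indices, |Y_pos| <= C.
    """
    # Step 2: zero-shot inference -- top-1 corpus concept per sample
    top1 = (image_feats @ text_feats.T).argmax(axis=1)
    # Step 3: majority voting inside each cluster
    positives = set()
    for c in np.unique(cluster_ids):
        votes = top1[cluster_ids == c]
        positives.add(int(np.bincount(votes).argmax()))
    return positives
```

Because several clusters can vote for the same concept, the mined set can be much smaller than the number of clusters.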

⭐ Advantages of ClusterMine and Main Results


By integrating visual consistency into the top-1 image-text concept from $\mathcal{Y}_{corpus}$:

  • Different clusters can be mapped to the same label name $y \in \mathcal{Y}_{pos}$,
  • Text concepts that do not match the samples' neighborhood are rejected.

Thus, $|\mathcal{Y}_{pos}|$ remains relatively insensitive as $C$ grows far beyond the true number of semantic categories. In practice, $C$ can be chosen as an overestimate of the real number of semantic categories even with minimal to no domain knowledge. Compared to MCM, ClusterMine leverages a corpus to decide which concepts are positive and which are negative. Compared to NegLabel, ClusterMine extracts the positives and implicitly defines the negative concepts without a pre-defined explicit threshold, which is hard to determine a priori.
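
Once a positive set is mined, a standard zero-shot score can be applied on top of it. As a hedged illustration (a simplified sketch, not the repository's code; the temperature value is illustrative), here is an MCM-style maximum-softmax score computed over mined positive concepts:

```python
import numpy as np

def mcm_score(image_feat, pos_text_feats, temperature=0.01):
    """MCM-style OOD score: max softmax probability over the mined
    positive concepts. Higher score => more likely in-distribution.

    image_feat:     (D,)   L2-normalized image embedding
    pos_text_feats: (P, D) L2-normalized embeddings of Y_pos
    """
    sims = pos_text_feats @ image_feat   # cosine similarities
    logits = sims / temperature
    logits -= logits.max()               # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs.max()
```

An image aligned with a single positive concept receives a high score, while an image that matches no concept more than the others scores low.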

Experimental results demonstrate that ClusterMine achieves state-of-the-art robustness to covariate in-distribution shifts.

Semantic large-scale OOD detection AUROC/FPR95 per dataset using CLIP ViT-H dfn5b. The WordNet corpus (nouns and adjectives) is used. None of the methods requires training or fine-tuning.

Additionally, the extracted positive class names have high overlap with the ground-truth labels. In the left plot below, we compute the top-1 text-text similarity with the GT labels and find the shortest path (minimum number of hops) to the GT label in the WordNet tree.

In the right plot below, we measure OOD detection robustness across multiple ID shifts (x-axis) compared to ImageNet using CLIP ViT-H dfn5b. The relative AUROC difference (in %) of each method compared to its ImageNet score is shown on top of each bar.

(Left: overlap of mined concepts with GT labels. Right: robustness to ID shifts.)
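
The relative difference annotated on each bar is simply the percentage change with respect to the method's clean-ImageNet score; a one-liner for clarity (using, as an example, ClusterMine's average AUROCs from the results tables in this README):

```python
def rel_auroc_diff(auroc_shifted, auroc_imagenet):
    """Relative AUROC change (%) of a method under an ID shift,
    with respect to its score on clean ImageNet."""
    return 100.0 * (auroc_shifted - auroc_imagenet) / auroc_imagenet
```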

Quick Start 🚀

1. Environment Setup

Choose between conda or uv for dependency management:

Option A: Conda environment (Recommended)

# Create environment from YAML file
conda env create -f environment.yml

# Activate the environment
conda activate clustermine

Option B: uv environment (Faster alternative)

# Install dependencies using uv
uv sync

# Activate the virtual environment
source .venv/bin/activate

Note: The conda environment includes CUDA-optimized packages (PyTorch, FAISS-GPU) which are essential for efficient computation with large embeddings.

2. Dataset Configuration

Before running experiments, configure your dataset paths by editing dataset_loaders/paths.json:

{
  "DEFAULT_PATH": "/path/to/your/datasets",
  "PRECOMPUTED_PATH": "/path/to/store/embeddings",
  "IMAGENET_PATH": "/path/to/imagenet",
  "PRECOMPUTED_TEXT_PATH": "./data/text_embeddings",
  "BASE_PATH_CORPORA": "./data/corpora"
}

Required Datasets for Result Reproduction:

  • In-distribution: ImageNet-1K (IN1K)
  • OOD Benchmarks: NINCO (NINCO), ImageNet-O (IN_O), OpenImages-O (openimage_o), iNaturalist (inat), ImageNet-21K-OOD (IN21OOD), Textures subset (texturev2)
  • Robustness Benchmarks: ImageNet-V2 (IN_V2), ImageNet-A (IN_A), ImageNet-R (IN_R), ImageNet-C subset (IN_C), ImageNet-Sketch (sketch)

The paths of these vision datasets can be configured in dataset_loaders/data_paths.py. We assume that DEFAULT_PATH contains these dataset folders. You need to download the data yourself. For instance, this repo provides instructions for downloading inat. NINCO can be downloaded directly from Zenodo.

3. Generate Text and Image Embeddings

Text Embeddings Generation

Our method uses pre-computed text embeddings from different corpora. We primarily use clip:8, which corresponds to the CLIP ViT-H-14 DFN-5B architecture. Here, WN denotes WordNet nouns only, and WN-NA includes nouns and adjectives. In the paper, we report results with WN-NA unless otherwise specified.

To compare with other baselines that assume access to the in-distribution label names, you also need to compute the ImageNet-1K text embeddings.

# Generate text embeddings for WordNet nouns
python gen_text_embeds.py --arch clip:8 --corpus_name WN

# Generate text embeddings for WordNet nouns + adjectives 
python gen_text_embeds.py --arch clip:8 --corpus_name WN-NA

# Generate ImageNet-1K class embeddings (required for baseline methods like MCM)
python gen_text_embeds.py --arch clip:8 --corpus_name IN1K

Compute Image Embeddings

Pre-computing image embeddings without augmentation significantly speeds up experiments:

# Example: Generate embeddings for NINCO dataset
python gen_embeds.py --arch clip:8 --dataset NINCO --no_eval_knn --no_compute_knn

# For multiple datasets, use the batch script
bash bash/gen_img_emb.sh

Parameters Explanation:

  • --no_eval_knn: Skip KNN evaluation during embedding generation
  • --no_compute_knn: Skip KNN index computation (saves time and storage)

4. Download Pre-trained Clustering Weights

ClusterMine requires the TEMI clustering head. We provide weights for clip:8 trained on ImageNet-1K:

# Create weights directory
mkdir -p weights

# Download and extract clustering weights (~160MB)
cd weights && wget https://uni-duesseldorf.sciebo.de/s/jZ7dwn7EmKJxxmG/download/clip_8.zip
unzip clip_8.zip
rm clip_8.zip
cd ..

Refer to the TEMI repository instructions for training your own clustering head.

Custom Clustering Path: We store the weights in the repository folder. To use a different path, modify CLUSTER_BASE_PATH = './weights' in ood/clustermine.py.

🧪 Running Experiments

Out-of-Distribution Detection

The main OOD detection experiments evaluate AUROC and FPR95 on six OOD benchmarks: ['NINCO', 'IN_O', 'openimage_o', 'inat', 'IN21OOD', 'texturev2']. The reported averages are computed over these six datasets' scores, using IN1K as the in-distribution.

ClusterMine (Our Main Method)

python main_ood.py --arch clip:8 --dataset "IN1K" \
  --corpus_name "WN" --method "clustermine" --out_dim 4000 \
  --step 1024 --save_path_df "./data/results/clustermine"

Key Parameters:

  • --out_dim 4000: Number of clusters for the TEMI clustering head (must match the downloaded weights)
  • --step 1024: Batch size for processing (adjust based on GPU memory)
  • --corpus_name "WN": Corpus for mining concepts

PosMine

python main_ood.py \
    --arch clip:8 \
    --dataset "IN1K" \
    --corpus_name "WN" \
    --method "posmine" \
    --threshold 0.00008 \
    --save_path_df "./data/results/posmine"

Key Parameters:

  • --threshold 0.00008: Threshold for positive mining: the minimum fraction of samples assigned to a concept for that concept to be considered in-distribution.
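
The threshold's meaning can be illustrated directly: a corpus concept is kept only if at least that fraction of images picks it as the top-1 match. A simplified sketch, not the repository's posmine implementation:

```python
import numpy as np

def posmine(image_feats, text_feats, threshold=8e-5):
    """Keep a corpus concept if the fraction of images whose top-1
    match is that concept reaches the threshold.

    image_feats: (N, D) L2-normalized image embeddings
    text_feats:  (K, D) L2-normalized corpus concept embeddings
    """
    top1 = (image_feats @ text_feats.T).argmax(axis=1)
    counts = np.bincount(top1, minlength=text_feats.shape[0])
    frac = counts / len(image_feats)
    return set(np.flatnonzero(frac >= threshold).tolist())
```

Lowering the threshold admits rarer concepts into the positive set; raising it keeps only concepts matched by many samples.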

Benchmark All Methods

Run ClusterMine, PosMine, and baseline methods (MCM, NegLabel) together:

python main_ood.py \
    --arch clip:8 \
    --dataset "IN1K" \
    --corpus_name "WN" \
    --method "all" \
    --save_path_df "./data/results/all_methods"

Baseline Methods Included:

  • MCM: Maximum Concept Matching
  • NegLabel: Negative label mining

Evaluation of Robustness to In-Distribution Shifts

Test OOD detection performance across distribution shifts:

Step 1: Generate Robustness Dataset Embeddings

# Required robustness datasets
DATASETS=("IN_V2" "IN_A" "IN_R" "IN_C" "sketch")

# Generate embeddings for each dataset
for dataset in "${DATASETS[@]}"; do
    python gen_embeds.py \
        --arch clip:8 \
        --dataset $dataset \
        --no_eval_knn \
        --no_compute_knn \
        --test_only
done

Dataset Descriptions: IN_V2: ImageNet-V2, IN_A: ImageNet-A (adversarial examples), IN_R: ImageNet-R (artistic renditions), IN_C: ImageNet-C (corrupted images), sketch: ImageNet-Sketch (sketch drawings)

Step 2: Run Robustness Evaluation

python main_robust.py \
    --arch clip:8 \
    --dataset IN1K \
    --save_path_df "./data/results/robustness" \
    --step 1024 \
    --corpus_name "WN-NA"

📊 Expected Results using the WN-NA corpus

After running the experiments, you should see results similar to our paper (using our processed WN-NA corpus):

| Method | NINCO | IN_O | openimage_o | inat | IN21OOD | texturev2 | Average |
|---|---|---|---|---|---|---|---|
| PosMine | 92.56 | 93.13 | 97.04 | 98.83 | 91.36 | 93.81 | 94.46 |
| ClusterMine | 92.87 | 93.57 | 96.93 | 99.00 | 91.53 | 93.45 | 94.56 |
| NegLabel | 90.26 | 90.10 | 95.79 | 98.64 | 88.40 | 90.44 | 92.27 |
| MCM | 88.78 | 91.30 | 96.64 | 96.62 | 89.65 | 91.75 | 92.46 |

Results are saved as CSV files in your specified --save_path_df directory. Below, we show the results with respect to robustness to ID shifts.

| Method | ID | NINCO | IN_O | openimage_o | inat | IN21OOD | texturev2 | Average |
|---|---|---|---|---|---|---|---|---|
| ClusterMine | IN_A | 83.53 | 84.45 | 91.88 | 97.17 | 80.04 | 83.46 | 86.76 |
| NegLabel | IN_A | 79.70 | 80.82 | 90.74 | 96.64 | 78.14 | 80.56 | 84.43 |
| MCM | IN_A | 62.95 | 68.79 | 84.49 | 83.75 | 64.15 | 69.05 | 72.20 |
| MCM | IN_C | 70.01 | 73.90 | 84.58 | 83.76 | 70.76 | 74.43 | 76.24 |
| NegLabel | IN_C | 82.68 | 83.18 | 91.54 | 96.49 | 80.95 | 83.17 | 86.33 |
| ClusterMine | IN_C | 86.40 | 87.02 | 92.81 | 97.11 | 83.75 | 86.54 | 88.94 |
| NegLabel | IN_R | 89.30 | 89.48 | 95.44 | 98.59 | 87.71 | 89.68 | 91.70 |
| MCM | IN_R | 77.67 | 81.64 | 91.08 | 90.74 | 78.77 | 82.11 | 83.66 |
| ClusterMine | IN_R | 88.34 | 89.13 | 94.50 | 98.17 | 85.91 | 88.61 | 90.78 |
| MCM | IN_V2 | 84.72 | 87.79 | 94.79 | 94.66 | 85.66 | 88.21 | 89.31 |
| NegLabel | IN_V2 | 87.91 | 87.97 | 94.60 | 98.17 | 86.09 | 88.22 | 90.49 |
| ClusterMine | IN_V2 | 90.70 | 91.39 | 95.65 | 98.49 | 88.85 | 91.10 | 92.70 |
| MCM | sketch | 84.88 | 87.73 | 94.18 | 94.01 | 85.77 | 88.22 | 89.13 |
| NegLabel | sketch | 91.21 | 91.11 | 96.31 | 98.85 | 89.50 | 91.41 | 93.06 |
| ClusterMine | sketch | 90.35 | 91.14 | 95.60 | 98.52 | 88.43 | 90.76 | 92.47 |

For some datasets, using WN leads to superior results!

πŸ› οΈ πŸ“ˆ Troubleshooting and Extending the Framework

  1. CUDA Out of Memory: Reduce --step parameter or use smaller batch sizes
  2. Missing Dataset Paths: Ensure dataset_loaders/paths.json points to correct locations.

Adding New Datasets

  1. Add dataset path to DATASET_PATHS in dataset_loaders/data_paths.py
  2. Generate embeddings: python gen_embeds.py --arch clip:8 --dataset YOUR_DATASET
  3. Run evaluation: python main_ood.py --dataset IN1K --save_path_df "./results/your_experiment"

Using Different CLIP Models

Supported architectures in model_builders/backbones/:

preset_models = {
    "clip:0"  : ('MobileCLIP-S1', 'datacompdr'),
    "clip:1"  : ('MobileCLIP-B', 'datacompdr'),
    "clip:2"  : ('ViT-B-16', 'openai'),
    "clip:3"  : ('ViT-B-16-SigLIP', 'webli'),
    "clip:4"  : ('ViT-L-14', 'metaclip_400m'),
    "clip:5"  : ('ViT-L-14', 'openai'),
    "clip:6"  : ('ViT-L-16-SigLIP-256', 'webli'),
    "clip:7"  : ('ViT-H-14', 'metaclip_fullcc'),
    "clip:8"  : ('ViT-H-14', 'dfn5b'), 
    "clip:9"  : ('ViT-bigG-14', 'laion2b_s39b_b160k'),
    "clip:10" : ('ViT-bigG-14', 'metaclip_fullcc'),
    "clip:11" : ('ViT-bigG-14-CLIPA', 'datacomp1b')
}

PS: For each new CLIP model, you need to produce the image and text embeddings!
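
A preset key like clip:8 resolves to an (architecture, pretrained-tag) pair that can then be passed to OpenCLIP. A sketch of the lookup (preset_models is abbreviated here, and the open_clip call is left as a comment since it downloads weights):

```python
# Abbreviated copy of the preset table for illustration
preset_models = {
    "clip:2": ("ViT-B-16", "openai"),
    "clip:8": ("ViT-H-14", "dfn5b"),
}

def resolve_arch(key):
    """Map a preset key to the (model, pretrained) pair used by open_clip."""
    model_name, pretrained = preset_models[key]
    # model, _, preprocess = open_clip.create_model_and_transforms(
    #     model_name, pretrained=pretrained)  # downloads the checkpoint
    return model_name, pretrained
```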

Citation 📄

If you find our work useful, please consider citing us in your work:

@inproceedings{adaloglou2026clustermine,
    author    = {Adaloglou, Nikolaos and Petrusheva, Diana and Asker, Mohamed and Michels, Felix and Kollmann, Markus},
    title     = {ClusterMine: Robust Label-Free Visual Out-Of-Distribution Detection via Concept Mining from Text Corpora},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    year      = {2026}
}

This project is licensed under the Apache License, Version 2.0 - see the LICENSE file for details. Also take into account the licenses of prior work, such as WordNet and the OOD datasets (e.g., NINCO).

If you liked our work and find it useful, please consider starring (★) it, so that it can reach a broader audience of like-minded people. It would be highly appreciated!
