In many computer vision projects, you need to tag or classify massive amounts of images.
- VLMs (Vision-Language Models) can handle this, but they are often too slow and expensive to run at scale.
- CLIP (Contrastive Language-Image Pre-training) is lightning fast, allowing for zero-shot classification by comparing image and text vectors.
The Challenge: Raw CLIP similarity scores are inconsistent across different concepts. A similarity score of 0.23 might be a perfect match for "abstract art" but a complete miss for "golden retriever".
If you pick a single static threshold (e.g., "everything > 0.25 is a match"), you will get poor results—either missing images or over-labeling everything.
This project provides a data-driven framework to turn CLIP from a "best effort" guesser into a calibrated classification system.
It consists of two main modules:
- The Collector: Automatically builds a ground-truth dataset by harvesting images for your specific labels from the web.
- The Calibrator: Uses that dataset to mathematically determine the optimal threshold ($T$) for each label, maximizing the F1 score.
Building a validation dataset manually is tedious. This tool automates it using a Dual-Pass Strategy to ensure your model handles both obvious and difficult examples.
- Pass 1 (Contextual): Searches for `{label} {context}` (e.g., "golden retriever park"). This ensures the images match the specific domain you care about.
- Pass 2 (Generic): Searches for `{label}` (e.g., "golden retriever"). This adds variety (different angles, lighting, backgrounds) to prevent overfitting.
Result: A structured folder of images (data/images_dataset/golden_retriever/, data/images_dataset/labrador/) ready for analysis.
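The Dual-Pass Strategy can be sketched as a small query builder. This is a hypothetical helper for illustration (`build_passes` is not part of the project; the real download logic lives in collector.py):

```python
from pathlib import Path

def build_passes(label: str, context: str, per_pass: int = 50):
    """Return (query, target_dir, count) tuples for one label.

    Pass 1 is contextual ("{label} {context}"), Pass 2 is generic
    ("{label}"); both download into the same per-label folder.
    """
    folder = Path("data/images_dataset") / label.replace(" ", "_")
    return [
        (f"{label} {context}", folder, per_pass),  # Pass 1: contextual
        (label, folder, per_pass),                 # Pass 2: generic
    ]
```

Each tuple can then be handed to an image crawler to populate the ground-truth folders.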
Once the data is collected, the calibration tool runs the images through your CLIP model (default: MetaCLIP) to find the "Sweet Spot".
For every label, the system:
- Generates embeddings for all collected images.
- Calculates similarity scores against the text label.
- Performs a grid search over thresholds ($0.10 \to 0.50$).
- Selects the threshold that maximizes the F1 score (harmonic mean of precision and recall).
To align with production systems, raw cosine similarity is often mapped to a probability curve. This project uses a Sigmoid Transformation:

$$P = \frac{1}{1 + e^{-k\,(s - T)}}$$

where $s$ is the raw cosine similarity, $T$ is the label's calibrated threshold, and $k$ is a steepness factor controlling how sharply the probability rises around $T$.
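A sigmoid mapping of this kind can be sketched as follows; the default steepness `k = 20.0` is purely illustrative, not the project's actual value:

```python
import math

def calibrated_probability(similarity: float, threshold: float, k: float = 20.0) -> float:
    """Map a raw cosine similarity to a pseudo-probability.

    The sigmoid is centered on the label's calibrated threshold T, so a
    score exactly at T yields 0.5, well above T approaches 1.0, and well
    below T approaches 0.0.
    """
    return 1.0 / (1.0 + math.exp(-k * (similarity - threshold)))
```

This makes scores comparable across labels: 0.5 always means "at the decision boundary", regardless of each label's raw similarity range.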
The pipeline generates a report.html dashboard allowing you to:
- Visualize Distributions: See exactly why "sports car" might be confused with "sedan".
- Simulate Optimization: An interactive "Digital Twin" lets you disable or merge labels to see how it affects global accuracy in real-time.
- Detect Redundancy: Identifies labels that are synonyms or subsets of others (e.g., removing "puppy" if "dog" covers 99% of the same images).
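Redundancy detection can be expressed as a set-overlap ratio between the images matched by two labels. This is a hypothetical sketch (the dashboard's actual logic may differ):

```python
def redundancy_ratio(parent_hits: set, child_hits: set) -> float:
    """Fraction of the child label's matched images already covered by
    the parent label; near 1.0 suggests the child is redundant
    (e.g., "puppy" when "dog" matches almost everything it does)."""
    if not child_hits:
        return 0.0
    return len(child_hits & parent_hits) / len(child_hits)
```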
Figure 1: The main dashboard showing F1 scores per label and overall health status.
Figure 2: Confusion matrix highlighting cross-class overlaps.
Figure 3: The dataset explorer showing images with their computed similarity scores.
Figure 4: The interactive simulator allows enabling/disabling labels to see real-time impact on Global F1.
```bash
pip install -r requirements.txt
# Requires PyTorch and Transformers for CLIP embedding generation
```

Edit keywords.json to define the categories and labels you want to detect.
Harvest images to build your ground truth.
```bash
# Collect images for all labels in keywords.json
python run_collection.py

# Collect specific labels only
python run_collection.py --labels "cat" "dog"
```

Analyze the dataset to find optimal thresholds.
```bash
cd threshold_calibration
python run_calibration.py
```

If the analysis has already been run at least once, you can skip embedding generation, which can be slow:

```bash
python run_calibration.py --skip-embeddings
```

Outputs are saved to data/calibration_results/:
- optimal_thresholds.json: The machine-readable thresholds for your production app.
- report.html: The interactive dashboard.
- calibrated_keywords.json: A ready-to-use config file with the new thresholds applied.
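Consuming the calibrated thresholds in a production app can be sketched as below. The JSON schema (a flat label-to-threshold mapping) is an assumption about the output format, and `classify` is a hypothetical helper:

```python
import json

def classify(image_scores: dict, thresholds_path: str) -> list:
    """Return the labels whose CLIP similarity meets their calibrated
    threshold for one image.

    image_scores maps label -> cosine similarity; labels missing from
    the thresholds file are never matched (default threshold 1.0).
    """
    with open(thresholds_path) as f:
        thresholds = json.load(f)
    return [label for label, score in image_scores.items()
            if score >= thresholds.get(label, 1.0)]
```

With per-label thresholds, a score of 0.25 can match one label and miss another, which is exactly the inconsistency the calibration step corrects for.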
```
.
├── collector.py                  # Core logic for image downloading (icrawler wrapper)
├── config.py                     # Configuration for the collection process
├── keywords.json                 # Taxonomy definition (categories and labels)
├── label_parser.py               # Utilities to parse the keywords JSON
├── requirements.txt              # Python dependencies
├── run_collection.py             # CLI entry point for downloading images
├── data/                         # Directory where data is stored
│   ├── images_dataset/           # Downloaded images organized by label
│   └── calibration_results/      # Output files (reports, thresholds, matrices)
└── threshold_calibration/        # Module for analyzing and calibrating thresholds
    ├── config.py                 # Configuration for the calibration process
    ├── embeddings.py             # Generates CLIP embeddings for images and text
    ├── optimizer.py              # Calculates optimal F1 thresholds and health metrics
    ├── run_calibration.py        # CLI entry point for the calibration process
    └── visualizer.py             # Generates the interactive HTML dashboard
```
Note: This tool is domain-agnostic. You can use it to calibrate detectors for retail products, wildlife, architectural styles, or any other visual taxonomy.