In many computer vision projects, you need to tag or classify massive amounts of images.
- VLMs (Vision-Language Models) can handle this, but they are often too slow and expensive to run at scale.
- CLIP (Contrastive Language-Image Pre-training) is lightning fast, allowing for zero-shot classification by comparing image and text vectors.
The Challenge: Raw CLIP similarity scores are inconsistent across different concepts. A similarity score of 0.23 might be a perfect match for "abstract art" but a complete miss for "golden retriever".
If you pick a single static threshold (e.g., "everything > 0.25 is a match"), you will get poor results—either missing images or over-labeling everything.
This project provides a data-driven framework to turn CLIP from a "best effort" guesser into a calibrated classification system.
It consists of two main modules:
- The Collector: Automatically builds a ground-truth dataset by harvesting images for your specific labels from the web.
- The Calibrator: Uses that dataset to mathematically determine the optimal threshold ($T$) for each label, maximizing the F1 score.
Building a validation dataset manually is tedious. This tool automates it using a Dual-Pass Strategy to ensure your model handles both obvious and difficult examples.
- Pass 1 (Contextual): Searches for `{label} {context}` (e.g., "golden retriever park"). This ensures the images match the specific domain you care about.
- Pass 2 (Generic): Searches for `{label}` (e.g., "golden retriever"). This adds variety (different angles, lighting, backgrounds) to prevent overfitting.
Result: A structured folder of images (data/images_dataset/golden_retriever/, data/images_dataset/labrador/) ready for analysis.
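The Dual-Pass Strategy can be sketched as a small query builder. This is a hypothetical helper for illustration (`build_passes` is not part of the project; the real download logic lives in collector.py):

```python
from pathlib import Path

def build_passes(label: str, context: str, per_pass: int = 50):
    """Return (query, target_dir, count) tuples for one label.

    Pass 1 is contextual ("{label} {context}"), Pass 2 is generic
    ("{label}"); both download into the same per-label folder.
    """
    folder = Path("data/images_dataset") / label.replace(" ", "_")
    return [
        (f"{label} {context}", folder, per_pass),  # Pass 1: contextual
        (label, folder, per_pass),                 # Pass 2: generic
    ]
```

Each tuple can then be handed to an image crawler to populate the ground-truth folders.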
Once the data is collected, the calibration tool runs the images through your CLIP model (default: MetaCLIP) to find the "Sweet Spot".
For every label, the system:
- Generates embeddings for all collected images.
- Calculates similarity scores against the text label.
- Performs a grid search over thresholds ($0.10 \to 0.50$).
- Selects the threshold that maximizes the F1 score (harmonic mean of precision and recall).
To align with production systems, raw cosine similarity is often mapped to a probability curve. This project uses a Sigmoid Transformation:

$$P = \frac{1}{1 + e^{-k\,(s - T)}}$$

where $s$ is the raw cosine similarity, $T$ is the label's calibrated threshold, and $k$ is a steepness factor controlling how sharply the probability rises around $T$.
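A sigmoid mapping of this kind can be sketched as follows; the default steepness `k = 20.0` is purely illustrative, not the project's actual value:

```python
import math

def calibrated_probability(similarity: float, threshold: float, k: float = 20.0) -> float:
    """Map a raw cosine similarity to a pseudo-probability.

    The sigmoid is centered on the label's calibrated threshold T, so a
    score exactly at T yields 0.5, well above T approaches 1.0, and well
    below T approaches 0.0.
    """
    return 1.0 / (1.0 + math.exp(-k * (similarity - threshold)))
```

This makes scores comparable across labels: 0.5 always means "at the decision boundary", regardless of each label's raw similarity range.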
The pipeline generates a report.html dashboard allowing you to:
- Visualize Distributions: See exactly why "sports car" might be confused with "sedan".
- Simulate Optimization: An interactive "Digital Twin" lets you disable or merge labels to see how it affects global accuracy in real-time.
- Detect Redundancy: Identifies labels that are synonyms or subsets of others (e.g., removing "puppy" if "dog" covers 99% of the same images).
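Redundancy detection can be expressed as a set-overlap ratio between the images matched by two labels. This is a hypothetical sketch (the dashboard's actual logic may differ):

```python
def redundancy_ratio(parent_hits: set, child_hits: set) -> float:
    """Fraction of the child label's matched images already covered by
    the parent label; near 1.0 suggests the child is redundant
    (e.g., "puppy" when "dog" matches almost everything it does)."""
    if not child_hits:
        return 0.0
    return len(child_hits & parent_hits) / len(child_hits)
```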
Figure 1: The main dashboard showing F1 scores per label and overall health status.
Figure 2: Confusion matrix highlighting cross-class overlaps.
Figure 3: The dataset explorer showing images with their computed similarity scores.
Figure 4: The interactive simulator allows enabling/disabling labels to see real-time impact on Global F1.
```bash
pip install -r requirements.txt
# Requires PyTorch and Transformers for CLIP embedding generation
```

Edit keywords.json to define the categories and labels you want to detect.
Harvest images to build your ground truth.
```bash
# Collect images for all labels in keywords.json
python run_collection.py

# Collect specific labels only
python run_collection.py --labels "cat" "dog"
```

Analyze the dataset to find optimal thresholds.
```bash
cd threshold_calibration
python run_calibration.py
```

If the analysis has already been run at least once, you can skip embedding generation, which can be slow:

```bash
python run_calibration.py --skip-embeddings
```

Outputs are saved to data/calibration_results/:
- optimal_thresholds.json: The machine-readable thresholds for your production app.
- report.html: The interactive dashboard.
- calibrated_keywords.json: A ready-to-use config file with the new thresholds applied.
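Consuming the calibrated thresholds in a production app can be sketched as below. The JSON schema (a flat label-to-threshold mapping) is an assumption about the output format, and `classify` is a hypothetical helper:

```python
import json

def classify(image_scores: dict, thresholds_path: str) -> list:
    """Return the labels whose CLIP similarity meets their calibrated
    threshold for one image.

    image_scores maps label -> cosine similarity; labels missing from
    the thresholds file are never matched (default threshold 1.0).
    """
    with open(thresholds_path) as f:
        thresholds = json.load(f)
    return [label for label, score in image_scores.items()
            if score >= thresholds.get(label, 1.0)]
```

With per-label thresholds, a score of 0.25 can match one label and miss another, which is exactly the inconsistency the calibration step corrects for.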
```
.
├── collector.py                  # Core logic for image downloading (icrawler wrapper)
├── config.py                     # Configuration for the collection process
├── keywords.json                 # Taxonomy definition (categories and labels)
├── label_parser.py               # Utilities to parse the keywords JSON
├── requirements.txt              # Python dependencies
├── run_collection.py             # CLI entry point for downloading images
├── data/                         # Directory where data is stored
│   ├── images_dataset/           # Downloaded images organized by label
│   └── calibration_results/      # Output files (reports, thresholds, matrices)
└── threshold_calibration/        # Module for analyzing and calibrating thresholds
    ├── config.py                 # Configuration for the calibration process
    ├── embeddings.py             # Generates CLIP embeddings for images and text
    ├── optimizer.py              # Calculates optimal F1 thresholds and health metrics
    ├── run_calibration.py        # CLI entry point for the calibration process
    └── visualizer.py             # Generates the interactive HTML dashboard
```
Note: This tool is domain-agnostic. You can use it to calibrate detectors for retail products, wildlife, architectural styles, or any other visual taxonomy.