Explorations in unsupervised learning on herbarium specimens, using deep learning models for plant species classification.
This repository contains experiments on herbarium specimen classification using SWIN Transformers, CLIP, and BioCLIP models. The project runs on Boston University's Shared Computing Cluster (SCC).
Dataset management and preprocessing utilities.
- dataset.py - HerbariaClassificationDataset class for loading and preprocessing herbarium images
- constants.py - Path definitions for Kaggle 2021 and 2022 herbarium datasets
- merge_datasets.py / merge_datasets.ipynb - Tools for combining multiple datasets
- Supports flexible label columns (species, family, genus) and integrates with HuggingFace AutoImageProcessor
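As a rough illustration of what "flexible label columns" can mean in practice, the sketch below builds an integer label mapping from whichever taxonomic column is selected. The function name build_label_mapping and the toy records are hypothetical and are not taken from dataset.py; the actual HerbariaClassificationDataset may handle labels differently.

```python
from typing import Dict, List


def build_label_mapping(records: List[dict], label_column: str) -> Dict[str, int]:
    """Map each distinct value in `label_column` to a stable integer id.

    Hypothetical helper for illustration only; not part of dataset.py.
    """
    if not records:
        raise ValueError("records is empty")
    for index, record in enumerate(records):
        if label_column not in record:
            raise KeyError(f"column {label_column!r} missing from record {index}")
    # Sorting keeps the ids reproducible across runs.
    names = sorted({record[label_column] for record in records})
    return {name: idx for idx, name in enumerate(names)}


# Toy metadata rows; real rows would come from the Kaggle annotation files.
records = [
    {"image": "a.jpg", "family": "Asteraceae", "genus": "Aster", "species": "Aster alpinus"},
    {"image": "b.jpg", "family": "Asteraceae", "genus": "Bellis", "species": "Bellis perennis"},
    {"image": "c.jpg", "family": "Rosaceae", "genus": "Rosa", "species": "Rosa canina"},
]

family_ids = build_label_mapping(records, "family")
species_ids = build_label_mapping(records, "species")
```

Switching the classification target from species to family or genus then only requires changing the column name passed in.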
Key Datasets:
- Kaggle Herbarium 2021 dataset
- Kaggle Herbarium 2022 dataset
Model training and evaluation scripts for multiple architectures.
SWIN Transformer model training and evaluation.
- SWIN_finetuning.py - Primary training script using HuggingFace Trainer with WandB logging
- train.py - Custom training loop with layer freezing support
- eval.py - Model evaluation utilities
- Base model: microsoft/swin-base-patch4-window12-384
BioCLIP zero-shot evaluation for biological domain.
- zero_shot.py - Zero-shot evaluation on herbarium data
- train_evaluation.py - Training set evaluation
Hybrid model combining SWIN visual features with CLIP text-image alignment.
- train.py - Main training script
- modular_model.py - Modular architecture implementation
- trainer.py - Custom trainer implementation
- train_baseline.py - Baseline model training
Configuration:
- Model checkpoints saved in output/SWIN/kaggle22/
- WandB integration: project herbdl, entity bu-spark-ml
- Environment variables control freezing, learning rate schedules, and run identification
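The README does not name the environment variables involved, so the following is only a sketch of how such settings could be read. The variable names HERBDL_FREEZE_BACKBONE, HERBDL_LR_SCHEDULE, and HERBDL_RUN_ID, and their defaults, are assumptions made for illustration; check the training scripts for the names actually used.

```python
import os
from dataclasses import dataclass


@dataclass
class RunConfig:
    freeze_backbone: bool
    lr_schedule: str
    run_id: str


def load_run_config(env=None) -> RunConfig:
    """Read run settings from environment variables.

    All variable names and defaults below are hypothetical.
    """
    env = os.environ if env is None else env
    raw_freeze = env.get("HERBDL_FREEZE_BACKBONE", "0")
    freeze = raw_freeze.strip().lower() in {"1", "true", "yes"}
    return RunConfig(
        freeze_backbone=freeze,
        lr_schedule=env.get("HERBDL_LR_SCHEDULE", "linear"),
        run_id=env.get("HERBDL_RUN_ID", "default-run"),
    )


# A plain dict stands in for os.environ so the example is reproducible.
config = load_run_config(
    {"HERBDL_FREEZE_BACKBONE": "true", "HERBDL_RUN_ID": "swin-kaggle22-001"}
)
```

Passing the environment as an argument keeps the parsing logic testable without modifying the real process environment.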
Zero-shot evaluation experiments using OpenAI CLIP.
- CLIP_0shot.ipynb - Primary evaluation notebook
- Tests species identification with/without visible text labels
- Explores phenology detection (flowers, buds, leaves)
- Documents CLIP's OCR behavior on specimen labels
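For readers unfamiliar with zero-shot classification, the sketch below shows the general scoring step: compare one image embedding against one text embedding per candidate prompt using cosine similarity, then apply a softmax. The three-dimensional vectors are toy values rather than real CLIP outputs, and the prompts and the scaling factor of 100.0 are assumptions; the notebook itself may be organized differently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, text_embeddings, scale=100.0):
    """Return (best_prompt, probabilities) for one image embedding."""
    prompts = list(text_embeddings)
    logits = [scale * cosine(image_embedding, text_embeddings[p]) for p in prompts]
    # Subtract the maximum logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    probs = {p: e / total for p, e in zip(prompts, exps)}
    return max(probs, key=probs.get), probs


# Toy embeddings; a real run would obtain these from the CLIP encoders.
text_embeddings = {
    "a herbarium specimen with flowers": [0.9, 0.1, 0.0],
    "a herbarium specimen with buds": [0.1, 0.9, 0.0],
    "a herbarium specimen with only leaves": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

label, probs = zero_shot_classify(image_embedding, text_embeddings)
```

The same scoring applies whether the prompts describe species names or phenology states; only the prompt set changes.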
Interactive visualization of learned representations and outlier detection.
Main Workflows:
- kaggle22_clustering.ipynb - Feature extraction, PCA/t-SNE dimensionality reduction, and visualization generation
- asteraceae_outliers.ipynb - Euclidean distance-based outlier detection
- outlier_detection/asteraceae_outliers.ipynb - Advanced Mahalanobis distance-based outlier detection
- generate_thumbnails.py - Pre-generates optimized thumbnails for fast hover previews
- index.html - Interactive Plotly-based web interface with filtering, search, and image preview
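One common form of Euclidean distance-based outlier detection flags points whose distance to the cluster centroid is unusually large. The sketch below uses that approach with a z-score cutoff; the function name, the threshold of 2.0, and the toy 2-D points are assumptions, and the notebooks may use a different rule.

```python
import math
from statistics import mean, pstdev


def euclidean_outliers(points, z_threshold=2.0):
    """Return indices of points far from the centroid.

    A point is flagged when its centroid distance exceeds the mean
    distance by more than `z_threshold` standard deviations.
    """
    dims = len(points[0])
    centroid = [mean(p[d] for p in points) for d in range(dims)]
    distances = [math.dist(p, centroid) for p in points]
    mu, sigma = mean(distances), pstdev(distances)
    if sigma == 0:
        return []  # All points are equally far from the centroid.
    return [i for i, d in enumerate(distances) if (d - mu) / sigma > z_threshold]


# Toy 2-D embeddings: a tight cluster plus one distant point (index 7).
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (-0.1, 0.0),
    (0.0, -0.1), (0.1, 0.1), (-0.1, -0.1), (10.0, 10.0),
]

outliers = euclidean_outliers(points)
```

Because the centroid and the standard deviation are themselves pulled toward extreme points, this simple rule can miss outliers in small or heavily contaminated clusters. A Mahalanobis-based variant additionally accounts for the covariance of the features.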
Features:
- Click points to view herbarium specimens (stacks up to 3 images)
- Hover preview with optimized thumbnails (10-20x faster loading)
- Search/filter controls for species clusters
- Axis locking for consistent zoom levels
- Outlier visualization with different markers
Text description generation and web scraping utilities.
- scrape_ncsu.py - Scrape plant descriptions from NCSU database
- wikipedia_scrape.py - Extract plant information from Wikipedia
- generate_conv.py - Generate conversational descriptions
- playground.ipynb - Experimentation notebook
Kaggle competition evaluation scripts and results.
- evaluation.py - Evaluation metrics computation
- CLIP_explain.ipynb - CLIP model interpretability analysis
- evaluation_result/ - JSON files with family, genus, and species-level results
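To illustrate how results at the family, genus, and species levels can be derived from the same predictions, the sketch below computes top-1 accuracy per level and serializes it to JSON. The record layout and function name are hypothetical, and evaluation.py may compute different metrics.

```python
import json


def accuracy_by_level(predictions, targets, levels=("family", "genus", "species")):
    """Compute top-1 accuracy separately for each taxonomic level."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets differ in length")
    if not targets:
        raise ValueError("no samples to evaluate")
    scores = {}
    for level in levels:
        correct = sum(p[level] == t[level] for p, t in zip(predictions, targets))
        scores[level] = correct / len(targets)
    return scores


# Toy ground truth and predictions for four specimens.
targets = [
    {"family": "Asteraceae", "genus": "Aster", "species": "Aster alpinus"},
    {"family": "Asteraceae", "genus": "Bellis", "species": "Bellis perennis"},
    {"family": "Rosaceae", "genus": "Rosa", "species": "Rosa canina"},
    {"family": "Rosaceae", "genus": "Rubus", "species": "Rubus idaeus"},
]
predictions = [
    {"family": "Asteraceae", "genus": "Aster", "species": "Aster alpinus"},  # all correct
    {"family": "Asteraceae", "genus": "Aster", "species": "Aster alpinus"},  # family only
    {"family": "Rosaceae", "genus": "Rosa", "species": "Rosa rugosa"},       # family and genus
    {"family": "Asteraceae", "genus": "Aster", "species": "Aster alpinus"},  # all wrong
]

scores = accuracy_by_level(predictions, targets)
report = json.dumps(scores, indent=2)
```

Reporting all three levels shows whether species-level errors still land in the correct genus or family.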
General utility scripts.
- resize_images.py - Batch image resizing
- image_install_parallel.py - Parallel image downloading
- notifications.py - Job notification system
- compression.sh - Image compression utilities
- labeling.ipynb - Data labeling tools
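Parallel downloading is often implemented with a thread pool, since the work is I/O-bound. The sketch below shows that pattern with an injected fetch function so it runs without network access. It is not the implementation in image_install_parallel.py; the function names and the example.org URLs are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, max_workers=8):
    """Apply `fetch` to every URL concurrently.

    Returns a dict mapping each URL to its result, or to the exception
    raised for it, so one failure does not abort the whole batch.
    """
    def safe_fetch(url):
        try:
            return url, fetch(url)
        except Exception as exc:  # Keep going; record the failure instead.
            return url, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(safe_fetch, urls))


def fake_fetch(url):
    """Stand-in for a real HTTP request (hypothetical, no network access)."""
    if "missing" in url:
        raise FileNotFoundError(url)
    return f"bytes-for-{url}".encode()


urls = [
    "https://example.org/img/1.jpg",
    "https://example.org/img/2.jpg",
    "https://example.org/img/missing.jpg",
]
results = download_all(urls, fake_fetch)
failed = [u for u, r in results.items() if isinstance(r, Exception)]
```

In a real run, fetch would perform the HTTP request and write the image to disk, and the failed list could be retried or logged.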
# Create and activate virtual environment
virtualenv venv
source venv/bin/activate
# Install dependencies for finetuning
pip install -r finetuning/requirements.txt
# Install dependencies for clustering visualization
pip install -r clustering_viz/requirements.txt

cd finetuning/SWIN
python SWIN_finetuning.py \
--output_dir ../output/SWIN/kaggle22/ \
--model_name_or_path "microsoft/swin-base-patch4-window12-384" \
--train_file ../datasets/train_22_scientific.json \
--do_train --do_eval \
--per_device_train_batch_size 8 \
--learning_rate 1e-4 \
--num_train_epochs 3

cd clustering_viz
jupyter notebook kaggle22_clustering.ipynb
# Configure checkpoint path and validation dataset
# Generate thumbnails: python generate_thumbnails.py <plot_json_file>
# Update index.html with JSON filepath
# View via SCC OnDemand

- Base project: /projectnb/herbdl/
- Image data: /projectnb/herbdl/data/kaggle-herbaria/
- Model checkpoints: /projectnb/herbdl/workspaces/<username>/herbdl/finetuning/output/
- CLAUDE.md - Detailed project guide for Claude Code
- Subdirectory READMEs in datasets/, utils/, CLIP/, and finetuning/output/
See LICENSE file for details.