An interactive tool for classifying images with a pretrained model and exploring clustering results in 2D space.

Imageomics/emb-explorer

emb-explorer

emb-explorer is a Streamlit-based visual exploration and clustering tool for image datasets and pre-calculated image embeddings.

🎯 Demo Screenshots

📊 Embed & Explore Images

  • Embedding Interface: Embed your images using pre-trained models
  • Cluster Summary: Analyze clustering results and representative images

🔍 Explore Pre-calculated Embeddings

  • Smart Filtering: Apply filters to pre-calculated embeddings
  • Interactive Exploration: Explore clusters with interactive visualization
  • Taxonomy Tree Navigation: Browse hierarchical taxonomy structure

Features

Embed & Explore Images from Upload

  • Batch Image Embedding: Efficiently embed large collections of images using a pretrained model (e.g., CLIP, BioCLIP) on CPU or, preferably, GPU, with customizable batch size and parallelism.
  • Clustering: Reduce embedding vectors to 2D using PCA, t-SNE, or UMAP. Perform K-Means clustering and display the result as an interactive scatter plot. Click on data points to preview images and details.
  • Cluster-Based Repartitioning: Copy/repartition images into cluster-specific folders with a single click. Generates a summary CSV for downstream use.
  • Clustering Summary: Displays cluster sizes, variances, and representative images for each cluster, helping you evaluate clustering quality.
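
The repartitioning step can be pictured with a short sketch. This is not the app's own code: the function name `repartition_by_cluster`, the `cluster_<label>` folder naming, and the `summary.csv` file name and columns are all assumptions made for illustration. It uses only the Python standard library.

```python
import csv
import shutil
from collections import Counter
from pathlib import Path


def repartition_by_cluster(image_paths, labels, out_dir):
    """Copy each image into out_dir/cluster_<label>/ and write a summary CSV.

    Hypothetical helper: the folder layout and CSV columns are assumptions.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = []
    for src, label in zip(image_paths, labels):
        src = Path(src)
        cluster_dir = out_dir / f"cluster_{label}"
        cluster_dir.mkdir(exist_ok=True)
        dst = cluster_dir / src.name
        shutil.copy2(src, dst)  # copy, so the source folder is left untouched
        rows.append({"source": str(src), "cluster": label, "destination": str(dst)})

    # One row per image, so downstream tools can map files back to clusters.
    summary_path = out_dir / "summary.csv"
    with summary_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["source", "cluster", "destination"])
        writer.writeheader()
        writer.writerows(rows)
    return Counter(labels)  # cluster label -> number of images
```

Note that this sketch keeps the original file names, so two source images with the same name in the same cluster would overwrite each other; a real implementation would need a rule for such collisions.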

Explore Pre-computed Embeddings

  • Parquet File Support: Load precomputed embeddings with associated metadata from Parquet files. Compatible with various embedding formats and metadata schemas.
  • Advanced Filtering: Filter datasets by taxonomic hierarchy, source datasets, and custom metadata fields. Combine multiple filter criteria for precise data selection.
  • Clustering: Reduce embedding vectors to 2D using PCA, UMAP, or t-SNE. Perform K-Means clustering and display the result as an interactive scatter plot. Click on points to preview images and explore metadata details.
  • Taxonomy Tree Navigation: Browse hierarchical biological classifications with an interactive tree view. Expand and collapse taxonomic nodes to explore different classification levels.
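
As a rough illustration of the clustering feature, the sketch below projects embeddings to 2D and runs K-Means with scikit-learn. It is a minimal sketch, not the app's implementation: the function name `reduce_and_cluster` is hypothetical, the input is synthetic data standing in for real image embeddings, and clustering in the original embedding space (rather than on the 2D projection) is an assumption. PCA is used for the projection; t-SNE or UMAP would take its place in the same step.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def reduce_and_cluster(embeddings, n_clusters, seed=0):
    """Return (coords_2d, labels) for an (n_samples, n_dims) embedding array."""
    # Project to 2D for plotting only.
    coords_2d = PCA(n_components=2, random_state=seed).fit_transform(embeddings)
    # Cluster in the original embedding space (an assumption of this sketch).
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(embeddings)
    return coords_2d, labels


# Synthetic stand-in for image embeddings: three well-separated blobs in 64-D.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(3, 64))
embeddings = np.vstack([c + rng.normal(scale=0.5, size=(50, 64)) for c in centers])

coords_2d, labels = reduce_and_cluster(embeddings, n_clusters=3)
print(coords_2d.shape, np.bincount(labels))  # 2D coordinates and cluster sizes
```

The 2D coordinates and labels are what a scatter plot needs: one point per image, colored by cluster.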

Installation

uv is a fast Python package installer and resolver. Install uv first if you haven't already:

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

Then install the project:

# Clone the repository
git clone https://github.com/Imageomics/emb-explorer.git
cd emb-explorer

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e .

GPU Support (Optional)

For GPU acceleration, you'll need CUDA 12.0+ installed on your system.

# Full GPU support with RAPIDS (cuDF + cuML)
uv pip install -e ".[gpu]"

# Minimal GPU support (PyTorch + FAISS only)
uv pip install -e ".[gpu-minimal]"
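
After installing one of the GPU extras, a quick check like the following can confirm whether PyTorch sees a CUDA device. This is a sketch with a hypothetical helper name (`pick_device`); whether emb-explorer performs a similar check internally is not stated here.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed in this environment
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Embedding would run on: {pick_device()}")
```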

Development

# Install with development tools
uv pip install -e ".[dev]"

Usage

Running the Application

# Activate virtual environment (if not already activated)
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Run the Streamlit app
streamlit run app.py

An example dataset (example_1k.parquet) is provided in the data/ folder for testing the pre-calculated embeddings features. This Parquet file contains metadata and BioCLIP 2 embeddings for a 1,000-image subset of TreeOfLife-200M.

Command Line Tools

The project also provides command-line utilities:

# List all available models
python list_models.py --format table

# List models in JSON format
python list_models.py --format json --pretty

# List models as names only
python list_models.py --format names

# Get help for the list models command
python list_models.py --help

Running on Remote Compute Nodes

If running the app on a remote compute node (e.g., HPC cluster), you'll need to set up port forwarding to access the Streamlit interface from your local machine.

  1. Start the app on the compute node:

    # On the remote compute node
    streamlit run app.py

    Note the port number (default is 8501) and the compute node hostname.

  2. Set up SSH port forwarding from your local machine:

    # From your local machine
    ssh -N -L 8501:<COMPUTE_NODE>:8501 <USERNAME>@<LOGIN_NODE>

    Example:

    ssh -N -L 8501:c0828.ten.osc.edu:8501 <USERNAME>@cardinal.osc.edu

    Replace:

    • <COMPUTE_NODE> with the actual compute node hostname (e.g., c0828.ten.osc.edu)
    • <USERNAME> with your username
    • <LOGIN_NODE> with the login node address (e.g., cardinal.osc.edu)
  3. Access the app: Open your web browser and navigate to http://localhost:8501

The -N flag prevents SSH from executing remote commands, and -L sets up the local port forwarding.
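
If the browser cannot reach the app, it can help to check whether anything is listening on the forwarded local port. The sketch below is a small standard-library check with a hypothetical helper name (`port_is_open`). It only confirms that the local end of the tunnel accepts connections, not that Streamlit is running on the compute node.

```python
import socket


def port_is_open(host="localhost", port=8501, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False  # refused, timed out, or host not resolvable


if __name__ == "__main__":
    if port_is_open():
        print("Something is listening on localhost:8501")
    else:
        print("Nothing is listening on localhost:8501; is the SSH tunnel running?")
```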

Notes on Implementation

More notes on different implementation methods and approaches are available in the implementation summary doc.

Acknowledgements

