emb-explorer is a Streamlit-based visual exploration and clustering tool for image datasets and pre-calculated image embeddings.
- Batch Image Embedding: Efficiently embed large collections of images using a pretrained model (e.g., CLIP, BioCLIP) on CPU or, preferably, GPU, with customizable batch size and parallelism.
- Clustering: Reduce embedding vectors to 2D using PCA, t-SNE, or UMAP, then run K-Means clustering and explore the result in an interactive scatter plot. Click on data points to preview images and details.
- Cluster-Based Repartitioning: Copy/repartition images into cluster-specific folders with a single click. Generates a summary CSV for downstream use.
- Clustering Summary: Displays cluster sizes, variances, and representative images for each cluster, helping you evaluate clustering quality.
- Parquet File Support: Load precomputed embeddings with associated metadata from parquet files. Compatible with various embedding formats and metadata schemas.
- Advanced Filtering: Filter datasets by taxonomic hierarchy, source datasets, and custom metadata fields. Combine multiple filter criteria for precise data selection.
- Clustering: Reduce embedding vectors to 2D using PCA, UMAP, or t-SNE, then run K-Means clustering and explore the result in an interactive scatter plot. Click on points to preview images and explore metadata details.
- Taxonomy Tree Navigation: Browse hierarchical biological classifications with an interactive tree view. Expand and collapse taxonomic nodes to explore the data at different classification levels.
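To make the Clustering Summary feature concrete, here is a minimal sketch of how per-cluster sizes, variances, and representative items could be derived from embeddings and cluster labels. It is not emb-explorer's code: the function name `summarize_clusters`, the definition of variance (mean squared Euclidean distance to the centroid), and the choice of representative (the member closest to the centroid) are all assumptions made for illustration.

```python
from collections import defaultdict
from math import fsum


def summarize_clusters(embeddings, labels):
    """Return {label: {"size", "variance", "representative"}} per cluster.

    Assumptions (not taken from emb-explorer): variance is the mean squared
    Euclidean distance to the cluster centroid, and the representative is
    the index of the member closest to that centroid.
    """
    # Group row indices by their cluster label.
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)

    summary = {}
    for label, idxs in members.items():
        dim = len(embeddings[idxs[0]])
        # Centroid = per-dimension mean of the cluster's members.
        centroid = [
            fsum(embeddings[i][d] for i in idxs) / len(idxs) for d in range(dim)
        ]
        # Squared distance of every member to the centroid.
        sq_dists = {
            i: fsum((embeddings[i][d] - centroid[d]) ** 2 for d in range(dim))
            for i in idxs
        }
        summary[label] = {
            "size": len(idxs),
            "variance": fsum(sq_dists.values()) / len(idxs),
            "representative": min(sq_dists, key=sq_dists.get),
        }
    return summary


# Toy 2D embeddings: three points near the origin, two far away.
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.5, 0.0], [10.0, 10.0], [12.0, 10.0]]
labels = [0, 0, 0, 1, 1]
summary = summarize_clusters(embeddings, labels)
```

A tight cluster shows a small variance relative to the others, and its representative index can be used to look up the image to preview.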
uv is a fast Python package installer and resolver. Install uv first if you haven't already:
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
Then install the project:
# Clone the repository
git clone https://github.com/Imageomics/emb-explorer.git
cd emb-explorer
# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv pip install -e .
For GPU acceleration, you'll need CUDA 12.0+ installed on your system.
# Full GPU support with RAPIDS (cuDF + cuML)
uv pip install -e ".[gpu]"
# Minimal GPU support (PyTorch + FAISS only)
uv pip install -e ".[gpu-minimal]"
# Install with development tools
uv pip install -e ".[dev]"
# Activate virtual environment (if not already activated)
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Run the Streamlit app
streamlit run app.py
An example dataset (example_1k.parquet) is provided in the data/ folder for testing the pre-calculated embeddings features. This parquet contains metadata and the BioCLIP 2 embeddings for a one-thousand-image subset of TreeOfLife-200M.
The project also provides command-line utilities:
# List all available models
python list_models.py --format table
# List models in JSON format
python list_models.py --format json --pretty
# List models as names only
python list_models.py --format names
# Get help for the list models command
python list_models.py --help
If running the app on a remote compute node (e.g., HPC cluster), you'll need to set up port forwarding to access the Streamlit interface from your local machine.
- Start the app on the compute node:
# On the remote compute node
streamlit run app.py
Note the port number (default is 8501) and the compute node hostname.
- Set up SSH port forwarding from your local machine:
# From your local machine
ssh -N -L 8501:<COMPUTE_NODE>:8501 <USERNAME>@<LOGIN_NODE>
Example:
ssh -N -L 8501:c0828.ten.osc.edu:8501 <USERNAME>@cardinal.osc.edu
Replace:
  - <COMPUTE_NODE> with the actual compute node hostname (e.g., c0828.ten.osc.edu)
  - <USERNAME> with your username
  - <LOGIN_NODE> with the login node address (e.g., cardinal.osc.edu)
- Access the app: Open your web browser and navigate to http://localhost:8501
The -N flag prevents SSH from executing remote commands, and -L sets up the local port forwarding.
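If you set up the tunnel often, the steps above can be scripted. The following is a small sketch, not part of emb-explorer: the helper names `build_tunnel_command` and `port_is_open` and the username `alice` are hypothetical. It assembles the same `ssh -N -L` command from its three parts and offers a quick check that something is listening on the forwarded local port.

```python
import shlex
import socket


def build_tunnel_command(compute_node, username, login_node, port=8501):
    """Build the `ssh -N -L ...` argument list for the port forward."""
    return [
        "ssh",
        "-N",  # do not execute a remote command
        "-L",  # forward local port -> compute node port via the login node
        f"{port}:{compute_node}:{port}",
        f"{username}@{login_node}",
    ]


def port_is_open(host="localhost", port=8501, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# "alice" is a placeholder username used only for this example.
cmd = build_tunnel_command("c0828.ten.osc.edu", "alice", "cardinal.osc.edu")
print(shlex.join(cmd))
```

The argument list can be passed to `subprocess.run(cmd)` to open the tunnel. Once it is up, `port_is_open()` should return `True` if the app is reachable at http://localhost:8501.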
More notes on different implementation methods and approaches are available in the implementation summary doc.




