
[NeurIPS 2025] TL;DR: Aligning pretrained unimodal models with the proposed framework using limited paired data yields ~52% gains in cross-modality zero-shot classification and ~92% in retrieval.


mlbio-epfl/STRUCTURE


[NeurIPS 2025] With Limited Data for Multimodal Alignment, Let the STRUCTURE Guide You

Paper OpenReview Project Page

Fabian Gröger* · Shuo Wen* · Huyen Le · Maria Brbić


Overview

STRUCTURE Teaser

Multimodal models have demonstrated powerful capabilities in complex tasks requiring multimodal alignment, including zero-shot classification and cross-modal retrieval. However, existing models typically rely on millions of paired multimodal samples, which are prohibitively expensive or infeasible to obtain in many domains.

In this work, we explore the feasibility of building multimodal models with limited amounts of paired data by aligning pretrained unimodal foundation models. We show that high-quality alignment is possible with as few as tens of thousands of paired samples — less than 1% of the data typically used in the field.
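The setup can be sketched as follows: keep both pretrained encoders frozen and train small alignment heads on the limited paired data with a CLIP-style symmetric InfoNCE loss. This is an illustrative toy example, not the repository's code; the feature dimensions, head sizes, and hyperparameters are all assumptions.

```python
import torch
import torch.nn.functional as F

def clip_loss(za, zb, temperature=0.07):
    """Symmetric InfoNCE (CLIP-style) loss over a batch of aligned pairs."""
    za = F.normalize(za, dim=-1)
    zb = F.normalize(zb, dim=-1)
    logits = za @ zb.t() / temperature
    labels = torch.arange(za.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Stand-ins for frozen-encoder features of N paired samples.
torch.manual_seed(0)
feats_img = torch.randn(512, 768)    # e.g. vision encoder output
feats_txt = torch.randn(512, 1024)   # e.g. language encoder output

# Small trainable heads map both modalities into a shared space.
head_img = torch.nn.Linear(768, 256)
head_txt = torch.nn.Linear(1024, 256)
opt = torch.optim.Adam(
    list(head_img.parameters()) + list(head_txt.parameters()), lr=1e-3)

loss_start = clip_loss(head_img(feats_img), head_txt(feats_txt)).item()
for _ in range(10):
    opt.zero_grad()
    loss = clip_loss(head_img(feats_img), head_txt(feats_txt))
    loss.backward()
    opt.step()
loss_end = clip_loss(head_img(feats_img), head_txt(feats_txt)).item()
```

Because only the two linear heads are trained, this style of alignment needs far less paired data than training a multimodal model end to end.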

Key Contributions

  • STRUCTURE Regularization: A lightweight regularizer that preserves the neighborhood geometry of the unimodal encoders' latent spaces during alignment
  • Layer Selection: Evidence that aligning the last layers is often suboptimal; aligning the layers with the highest representational similarity performs better
  • Strong Results: 51.6% average relative improvement in classification and 91.8% in retrieval across 24 benchmarks
  • Broad Applicability: Readily incorporated into existing alignment methods
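One plausible form of a neighborhood-preserving regularizer is sketched below: it penalizes the aligned embeddings `z` for changing the cosine-similarity structure of the frozen pretrained features `x` over each sample's k nearest neighbors. The function and parameter names are illustrative assumptions, not the repository's API; see the paper for the actual formulation.

```python
import torch
import torch.nn.functional as F

def structure_reg(x, z, k=10):
    """Penalize deviation of aligned-space similarities from
    pretrained-space similarities on each row's k nearest neighbors."""
    x = F.normalize(x, dim=-1)
    z = F.normalize(z, dim=-1)
    sim_x = x @ x.t()   # similarity structure of the pretrained space
    sim_z = z @ z.t()   # similarity structure after alignment
    # k nearest neighbors in the pretrained space, excluding self
    knn = sim_x.topk(k + 1, dim=-1).indices[:, 1:]
    diff = sim_x.gather(1, knn) - sim_z.gather(1, knn)
    return diff.pow(2).mean()

torch.manual_seed(0)
x = torch.randn(64, 128)   # frozen pretrained features
z = torch.randn(64, 32)    # features after an alignment head
reg = structure_reg(x, z)
```

A term like this can simply be added to the alignment loss with a weight, which is how such a regularizer slots into existing alignment methods.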

Installation

Requirements

  • Python 3.8+
  • PyTorch 2.1.2+
  • CUDA 11.8+ (for GPU support)

Setup

  1. Clone the repository:
git clone https://github.com/mlbio-epfl/STRUCTURE.git
cd STRUCTURE
  2. Install dependencies:
pip install -r requirements.txt

Quick Start

1. Data Preparation

Prepare your datasets using the provided scripts. For example, for COCO:

# COCO dataset will be downloaded automatically
# Place in ./data/coco/

For other datasets, see the src/dataset_preparation/ directory for preparation scripts.

2. Training

Train an alignment model with limited data:

python src/train_subsampled_alignment.py --config_path configs/losses_lin/clip_base_best.yaml

Train with full alignment:

python src/train_alignment.py --config_path configs/losses_lin/clip_base_best.yaml

3. Evaluation

Extract features:

python src/extract_features.py --config_path configs/default.yaml

Measure alignment quality:

python src/measure_alignment.py --config_path configs/metrics/clip_mutual_knn_rice.yaml
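In the spirit of the mutual-kNN metric config above, alignment quality between two embedding spaces can be scored by the overlap of each paired sample's k nearest neighbors in both spaces. The sketch below is an assumption about the metric's general shape, not the repository's implementation:

```python
import torch
import torch.nn.functional as F

def mutual_knn_score(za, zb, k=10):
    """Mean fraction of shared k-nearest-neighbor indices between
    two embedding spaces over the same set of paired samples."""
    za = F.normalize(za, dim=-1)
    zb = F.normalize(zb, dim=-1)
    knn_a = (za @ za.t()).topk(k + 1, dim=-1).indices[:, 1:]
    knn_b = (zb @ zb.t()).topk(k + 1, dim=-1).indices[:, 1:]
    overlap = torch.zeros(za.size(0))
    for i in range(za.size(0)):
        shared = set(knn_a[i].tolist()) & set(knn_b[i].tolist())
        overlap[i] = len(shared) / k
    return overlap.mean().item()

torch.manual_seed(0)
za = torch.randn(100, 64)   # e.g. image embeddings
zb = torch.randn(100, 32)   # e.g. text embeddings for the same samples
score = mutual_knn_score(za, zb)
```

A score near 1 means the two spaces induce nearly identical neighborhoods; near 0 means their local geometry is unrelated.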

Configuration

The repository uses YAML configuration files located in configs/:

  • configs/default.yaml - Base configuration
  • configs/losses_lin/ - Linear alignment layer configurations
  • configs/losses_mlp/ - MLP alignment layer configurations
  • configs/ablations/ - Ablation study configurations
  • configs/csa/ - CSA configurations
  • configs/metrics/ - Alignment metrics configurations
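For orientation, a config in this layout might look like the following. All keys, values, and model names here are hypothetical assumptions for illustration; consult configs/default.yaml for the actual schema.

```yaml
# Hypothetical example only; not the repository's real schema.
model:
  vision_encoder: dinov2_vitb14
  text_encoder: all-roberta-large-v1
alignment:
  type: linear            # or: mlp
  embed_dim: 256
loss:
  name: clip
  structure_weight: 1.0   # strength of the STRUCTURE regularizer
train:
  num_pairs: 30000        # limited paired data
  lr: 1.0e-3
```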

Project Structure

representation-alignment/
├── configs/              # Configuration files
├── src/
│   ├── alignment/       # Alignment layer implementations
│   ├── trainers/        # Training logic
│   ├── models/          # Model implementations
│   ├── loss/            # Loss functions
│   ├── evaluation/      # Evaluation metrics
│   ├── dataset_preparation/  # Dataset preparation scripts
│   └── utils/           # Utility functions
├── data/                # Dataset directory (created during setup)
└── requirements.txt     # Python dependencies

Citation

If you find this work useful, please cite our paper:

@inproceedings{groger2025structure,
  title={With Limited Data for Multimodal Alignment, Let the {STRUCTURE} Guide You},
  author={Gr{\"o}ger, Fabian and Wen, Shuo and Le, Huyen and Brbic, Maria},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=IkvQqD7hk3}
}
