This repository contains the Python code for reproducing the experiments from our ICTAI 2025 conference paper by Andreas Schliebitz, Heiko Tapken and Martin Atzmueller.
This project uses Poetry for managing Python dependencies (see `pyproject.toml`). Follow the steps below to install the code from this repository as a standalone `simclr_ae` Python package:
- Install Poetry:

  ```bash
  curl -sSL https://install.python-poetry.org | python3 -
  ```

- Create and activate a virtual environment:

  ```bash
  poetry shell
  ```

- Install the requirements:

  ```bash
  poetry lock
  poetry install
  ```

This project is mainly subdivided into two experiments. In the first experiment, we train the autoencoder embeddings used for replacing the input layer of SimCLR's default projection head. In the second experiment, we train and evaluate our modified projectors as part of the SimCLR framework following standard protocol.
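For orientation, here is a minimal, hypothetical PyTorch sketch of the idea behind the second experiment: the input layer of SimCLR's default projection head is replaced by the encoder of a pretrained autoencoder. All class and parameter names below are illustrative assumptions, not the repository's actual implementation:

```python
import torch
from torch import nn

class AEProjectionHead(nn.Module):
    """Illustrative sketch: a SimCLR projection head whose input layer is
    replaced by a pretrained autoencoder encoder (assumption, not the
    repository's actual class)."""

    def __init__(self, ae_encoder: nn.Module, latent_dim: int, projection_dim: int):
        super().__init__()
        self.encoder = ae_encoder  # pretrained AE encoder, e.g. latent_dim in {128, 256, 512}
        for p in self.encoder.parameters():
            p.requires_grad = False  # assumption: the AE embedding stays frozen
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Linear(latent_dim, projection_dim),  # projection space, e.g. 32, 64 or 128
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Map the backbone representation through the AE embedding, then project.
        return self.head(self.encoder(h))
```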
In order to reproduce our results, you will have to prepare the following five image classification datasets so that Torchvision's datasets module can load them:
Note: We advise you to download and extract these datasets into this project's `datasets` directory. We also recommend using MLflow Tracking to record all training and evaluation runs. Alternatively, tracking via Lightning's CSVLogger is enabled by default.
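As an example of what "loadable by Torchvision" means, the following sketch loads a dataset from the `datasets` directory. CIFAR-10 is used here purely as an illustrative stand-in, not necessarily one of the five datasets:

```python
from torchvision import datasets, transforms

# Illustrative stand-in: CIFAR-10 as one example of a dataset that
# torchvision.datasets can load from this project's datasets directory.
transform = transforms.ToTensor()
train_set = datasets.CIFAR10(root="datasets", train=True, download=True, transform=transform)
print(len(train_set))  # number of training images
```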
After that, clone this repository to a location of your choice:

```bash
git clone https://github.com/andreas-schliebitz/simclr-ae.git \
    && cd simclr-ae
```

First, train the 15 autoencoder embeddings (five datasets, each with latent dimensions 128, 256 and 512). We'll use these embeddings in the next section to perform our SimCLR training and evaluation runs.
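To make the setup concrete, here is a minimal, hypothetical sketch of an autoencoder with a configurable latent dimension; the architecture and names below are illustrative assumptions, not the repository's actual model:

```python
import torch
from torch import nn

class ToyAutoencoder(nn.Module):
    """Hypothetical sketch of an autoencoder whose bottleneck width is
    configurable (e.g. 128, 256 or 512); not the repository's model."""

    def __init__(self, input_dim: int = 3 * 32 * 32, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(input_dim, latent_dim),
            nn.ReLU(),
        )
        self.decoder = nn.Linear(latent_dim, input_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

# One reconstruction-loss training step on random data, just to show the flow.
model = ToyAutoencoder(latent_dim=256)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(8, 3, 32, 32)
loss = nn.functional.mse_loss(model(x), x.flatten(1))
loss.backward()
opt.step()
```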
- Navigate into the directory of the `ae` experiment:

  ```bash
  cd simclr_ae/experiments/ae
  ```

- Optional: If applicable, provide your MLflow Tracking credentials in the `ae` experiment's `.env` file. If you've placed the datasets into a different directory, change `DATASET_DIR` to that path.

- Execute the experiment's `run_experiments.sh` helper script. If you have multiple GPUs at your disposal, specify the GPU's ID as the first parameter; otherwise use `0` as the ID of your single GPU. The second parameter can be either a comma-separated list of latent dimensions or a single latent dimension. By default, each GPU trains the autoencoder with the specified latent dimensions on all datasets:

  ```bash
  # Train autoencoder on GPU 0 with all three latent dimensions
  ./run_experiments.sh 0 128,256,512

  # Train autoencoder on three GPUs, parallelizing over latent dimensions
  ./run_experiments.sh 0 128
  ./run_experiments.sh 1 256
  ./run_experiments.sh 2 512
  ```

- Verify that all model checkpoints, hyperparameters and metrics are written into the `logs` directory.
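A quick way to verify this is to list the checkpoint and metrics files under `logs`. The layout assumed below (Lightning `.ckpt` checkpoints plus CSVLogger `metrics.csv` files) is an assumption about how the loggers write their output:

```python
from pathlib import Path

logs = Path("logs")

# Assumed layout: Lightning checkpoints (*.ckpt) and CSVLogger metrics files.
for ckpt in sorted(logs.rglob("*.ckpt")):
    print("checkpoint:", ckpt)
for metrics in sorted(logs.rglob("metrics.csv")):
    print("metrics:", metrics)
```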
- Navigate into the directory of the `simclr_ae` experiment:

  ```bash
  cd simclr_ae/experiments/simclr_ae
  ```

- Optional: If applicable, provide your MLflow Tracking credentials in the `simclr_ae` experiment's `.env` file. If you've placed the datasets into a different directory, change `DATASET_DIR` to that path.

- Because the run ID of each pretrained autoencoder embedding is randomly generated, you'll have to adapt the IDs in the `run_experiments.sh` helper script of the `simclr_ae` experiment for each dataset (variables `AE_WEIGHTS_128_PATH`, `AE_WEIGHTS_256_PATH` and `AE_WEIGHTS_512_PATH`); a sketch of how such checkpoint paths might be located follows after this list. The script will throw an error if no pretrained autoencoder checkpoint with matching latent dimensions is found for a given dataset.

- Execute the experiment's `run_experiments.sh` helper script. As the first argument, provide your GPU's ID; the second argument is the latent dimension of SimCLR's projection space (32, 64 or 128):

  ```bash
  # Train on a single GPU with a single latent dimension
  ./run_experiments.sh 0 32

  # Train on three GPUs with different latent dimensions
  ./run_experiments.sh 0 32
  ./run_experiments.sh 1 64
  ./run_experiments.sh 2 128
  ```

- Once again, verify that all model checkpoints, hyperparameters and metrics are written into the `logs` directory.
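As referenced above, the following hypothetical sketch shows one way to collect the checkpoint paths of the pretrained autoencoders by latent dimension. It assumes the `ae` experiment's logs encode the latent dimension somewhere in each run's path; the exact layout and run-ID scheme are assumptions:

```python
from pathlib import Path

AE_LOGS = Path("../ae/logs")  # assumed location of the ae experiment's logs

# Group checkpoint paths by the latent dimension appearing in their path.
# Assumption: each run's directory or file name contains the latent dimension.
for latent_dim in (128, 256, 512):
    for ckpt in sorted(AE_LOGS.rglob("*.ckpt")):
        if str(latent_dim) in str(ckpt):
            print(f"AE_WEIGHTS_{latent_dim}_PATH candidate: {ckpt}")
```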
You can now verify our results either by inspecting the CSV files in the `logs` directories of the `ae` and `simclr_ae` experiments or by visiting the web interface of your MLflow Tracking instance. As a basis for comparison, we provide our MLflow runs as CSV exports in the `results` directory of each experiment.
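For instance, a run's metrics can be set side by side with one of our exports in a few lines of pandas; the file names below are placeholders, since the actual paths depend on your run and on the export you pick from the `results` directory:

```python
import pandas as pd

# Placeholder paths: substitute your run's metrics.csv and one of the
# CSV exports shipped in the experiment's results directory.
ours = pd.read_csv("logs/version_0/metrics.csv")   # hypothetical path
reference = pd.read_csv("results/example_export.csv")  # hypothetical name

print(ours.tail())       # final metrics of your run
print(reference.head())  # reference metrics from our exported runs
```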