This repository contains the Python code for reproducing the experiments from our ICTAI 2025 conference paper by Andreas Schliebitz, Heiko Tapken and Martin Atzmueller.
This project uses Poetry for managing Python dependencies (see `pyproject.toml`). Follow the steps below to install the code from this repository as a standalone `simclr_ae` Python package:
- Install Poetry:

  ```bash
  curl -sSL https://install.python-poetry.org | python3 -
  ```

- Create and activate a virtual environment:

  ```bash
  poetry shell
  ```

- Install the requirements:

  ```bash
  poetry lock
  poetry install
  ```

This project is mainly subdivided into two experiments. In the first experiment, we train the autoencoder embeddings used for replacing the input layer of SimCLR's default projection head. In the second experiment, we train and evaluate our modified projectors as part of the SimCLR framework following standard protocol.
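For orientation, here is a minimal, hypothetical PyTorch sketch of the idea behind the second experiment: the input layer of SimCLR's default projection head is replaced by the encoder of a pretrained autoencoder. All class and parameter names below are illustrative assumptions, not the repository's actual implementation:

```python
import torch
from torch import nn

class AEProjectionHead(nn.Module):
    """Illustrative sketch: a SimCLR projection head whose input layer is
    replaced by a pretrained autoencoder encoder (assumption, not the
    repository's actual class)."""

    def __init__(self, ae_encoder: nn.Module, latent_dim: int, projection_dim: int):
        super().__init__()
        self.encoder = ae_encoder  # pretrained AE encoder, e.g. latent_dim in {128, 256, 512}
        for p in self.encoder.parameters():
            p.requires_grad = False  # assumption: the AE embedding stays frozen
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Linear(latent_dim, projection_dim),  # projection space, e.g. 32, 64 or 128
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Map the backbone representation through the AE embedding, then project.
        return self.head(self.encoder(h))
```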
In order to reproduce our results, you will have to prepare the following five image classification datasets so that Torchvision's datasets module can load them:
Note: We advise you to download and extract these datasets into this project's `datasets` directory. We also recommend using MLflow Tracking to record all training and evaluation runs. Alternatively, tracking via Lightning's CSVLogger is enabled by default.
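As an example of what "loadable by Torchvision" means, the following sketch loads a dataset from the `datasets` directory. CIFAR-10 is used here purely as an illustrative stand-in, not necessarily one of the five datasets:

```python
from torchvision import datasets, transforms

# Illustrative stand-in: CIFAR-10 as one example of a dataset that
# torchvision.datasets can load from this project's datasets directory.
transform = transforms.ToTensor()
train_set = datasets.CIFAR10(root="datasets", train=True, download=True, transform=transform)
print(len(train_set))  # number of training images
```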
After that, clone this repository to a location of your choice:

```bash
git clone https://github.com/andreas-schliebitz/simclr-ae.git \
    && cd simclr-ae
```

First, train the 15 autoencoder embeddings (five datasets, each with latent dimensions 128, 256 and 512). We'll use these embeddings in the next section to perform our SimCLR training and evaluation runs.
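To make the setup concrete, here is a minimal, hypothetical sketch of an autoencoder with a configurable latent dimension; the architecture and names below are illustrative assumptions, not the repository's actual model:

```python
import torch
from torch import nn

class ToyAutoencoder(nn.Module):
    """Hypothetical sketch of an autoencoder whose bottleneck width is
    configurable (e.g. 128, 256 or 512); not the repository's model."""

    def __init__(self, input_dim: int = 3 * 32 * 32, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(input_dim, latent_dim),
            nn.ReLU(),
        )
        self.decoder = nn.Linear(latent_dim, input_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

# One reconstruction-loss training step on random data, just to show the flow.
model = ToyAutoencoder(latent_dim=256)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(8, 3, 32, 32)
loss = nn.functional.mse_loss(model(x), x.flatten(1))
loss.backward()
opt.step()
```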
- Navigate into the directory of the `ae` experiment:

  ```bash
  cd simclr_ae/experiments/ae
  ```

- Optional: If applicable, provide your MLflow Tracking credentials in the `ae` experiment's `.env` file. If you've placed the datasets into a different directory, change `DATASET_DIR` to that path.

- Execute the experiment's `run_experiments.sh` helper script. If you have multiple GPUs at your disposal, specify the GPU's ID as the first parameter; otherwise use `0` as the ID of your single GPU. The second parameter can be either a comma-separated list of latent dimensions or a single latent dimension. By default, each GPU trains the autoencoder with the specified latent dimensions on all datasets:

  ```bash
  # Train autoencoder on GPU 0 with all three latent dimensions
  ./run_experiments.sh 0 128,256,512

  # Train autoencoder on three GPUs, parallelizing over latent dimensions
  ./run_experiments.sh 0 128
  ./run_experiments.sh 1 256
  ./run_experiments.sh 2 512
  ```

- Verify that all model checkpoints, hyperparameters and metrics are written into the `logs` directory.
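A quick way to verify this is to list the checkpoint and metrics files under `logs`. The layout assumed below (Lightning `.ckpt` checkpoints plus CSVLogger `metrics.csv` files) is an assumption about how the loggers write their output:

```python
from pathlib import Path

logs = Path("logs")

# Assumed layout: Lightning checkpoints (*.ckpt) and CSVLogger metrics files.
for ckpt in sorted(logs.rglob("*.ckpt")):
    print("checkpoint:", ckpt)
for metrics in sorted(logs.rglob("metrics.csv")):
    print("metrics:", metrics)
```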
- Navigate into the directory of the `simclr_ae` experiment:

  ```bash
  cd simclr_ae/experiments/simclr_ae
  ```

- Optional: If applicable, provide your MLflow Tracking credentials in the `simclr_ae` experiment's `.env` file. If you've placed the datasets into a different directory, change `DATASET_DIR` to that path.

- Because the run ID of each pretrained autoencoder embedding is randomly generated, you'll have to adapt the IDs in the `run_experiments.sh` helper script of the `simclr_ae` experiment for each dataset (variables `AE_WEIGHTS_128_PATH`, `AE_WEIGHTS_256_PATH` and `AE_WEIGHTS_512_PATH`); a sketch of how such checkpoint paths might be located follows after this list. The script will throw an error if no pretrained autoencoder checkpoint with matching latent dimensions is found for a given dataset.

- Execute the experiment's `run_experiments.sh` helper script. As the first argument, provide your GPU's ID; the second argument is the latent dimension of SimCLR's projection space (32, 64 or 128):

  ```bash
  # Train on a single GPU with a single latent dimension
  ./run_experiments.sh 0 32

  # Train on three GPUs with different latent dimensions
  ./run_experiments.sh 0 32
  ./run_experiments.sh 1 64
  ./run_experiments.sh 2 128
  ```

- Once again, verify that all model checkpoints, hyperparameters and metrics are written into the `logs` directory.
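As referenced above, the following hypothetical sketch shows one way to collect the checkpoint paths of the pretrained autoencoders by latent dimension. It assumes the `ae` experiment's logs encode the latent dimension somewhere in each run's path; the exact layout and run-ID scheme are assumptions:

```python
from pathlib import Path

AE_LOGS = Path("../ae/logs")  # assumed location of the ae experiment's logs

# Group checkpoint paths by the latent dimension appearing in their path.
# Assumption: each run's directory or file name contains the latent dimension.
for latent_dim in (128, 256, 512):
    for ckpt in sorted(AE_LOGS.rglob("*.ckpt")):
        if str(latent_dim) in str(ckpt):
            print(f"AE_WEIGHTS_{latent_dim}_PATH candidate: {ckpt}")
```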
You can now verify our results either by inspecting the CSV files in the `logs` directories of the `ae` and `simclr_ae` experiments or by visiting the web interface of your MLflow Tracking instance. As a basis for comparison, we provide our MLflow runs as CSV exports in the `results` directory of each experiment.
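For instance, a run's metrics can be set side by side with one of our exports in a few lines of pandas; the file names below are placeholders, since the actual paths depend on your run and on the export you pick from the `results` directory:

```python
import pandas as pd

# Placeholder paths: substitute your run's metrics.csv and one of the
# CSV exports shipped in the experiment's results directory.
ours = pd.read_csv("logs/version_0/metrics.csv")   # hypothetical path
reference = pd.read_csv("results/example_export.csv")  # hypothetical name

print(ours.tail())       # final metrics of your run
print(reference.head())  # reference metrics from our exported runs
```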