Bottleneck MAE: $\ \text{Image} \rightarrow \mathbb{R}^{1024} \rightarrow \hat{\text{Image}}$

Mapping an image to a 1024-dimensional feature vector and back.


Introduction

Vision foundation models such as Masked Autoencoders (MAE) and the DINO series (DINO, DINOv2, DINOv3) aim to extract generalizable image features that can be used for a wide range of downstream tasks.

However, ViT-based models produce large feature tensors of shape $B \times N \times D$, where $N$ is the number of patches and $D$ is the feature dimension. For tasks with a temporal dimension, sequences of images have shape $B \times T \times N \times D$, making the latent space too large to work with efficiently.
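
As a rough, back-of-the-envelope comparison (assuming a ViT-Base/16 backbone at 224×224, i.e. 196 patch tokens of dimension 768; the batch and sequence lengths below are made up for illustration):

# Rough size comparison: full patch-token features vs. one bottleneck vector per frame.
# The ViT-Base/16 numbers (196 patches, dim 768) are standard; B and T are arbitrary.
B, T, N, D = 8, 32, 196, 768          # batch, time steps, patches, feature dim
btnk_dim = 1024

patch_floats = B * T * N * D          # B x T x N x D tensor
btnk_floats = B * T * btnk_dim        # B x T x 1024 tensor

print(f"patch tokens: {patch_floats * 4 / 1e6:.0f} MB in fp32")   # ~154 MB
print(f"bottleneck:   {btnk_floats * 4 / 1e6:.1f} MB in fp32")    # ~1.0 MB
print(f"reduction:    {patch_floats // btnk_floats}x")            # 147x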

In addition, DINO models lack a decoder head to reconstruct images from features. For tasks that generate new features (e.g., world-model rollouts in image feature space), it can be useful to decode them back to image space for visualization, but the DINO series does not support this.

Thus, we finetune a pretrained MAE adapted from the official implementation. We add a bottleneck layer after the encoder and use a DETR-style reconstruction layer so that the encoder maps each image to a 1024-dimensional feature vector, while the decoder reconstructs the original image.
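
Conceptually, the forward pass looks like the sketch below. This is a minimal illustration of the idea only, not the repository's actual module; the class, attribute, and method names here (BottleneckMAESketch, encode, decode, the mean-pooled bottleneck, the learned queries) are hypothetical stand-ins for what lives in btnk_mae/.

import torch
import torch.nn as nn

class BottleneckMAESketch(nn.Module):
    """Minimal sketch of the bottleneck idea; names and details are hypothetical."""

    def __init__(self, encoder, decoder, embed_dim=768, btnk_dim=1024, num_patches=196):
        super().__init__()
        self.encoder = encoder                            # pretrained MAE ViT encoder
        self.decoder = decoder                            # MAE-style pixel decoder
        self.to_btnk = nn.Linear(embed_dim, btnk_dim)     # pooled tokens -> 1024-d vector
        self.from_btnk = nn.Linear(btnk_dim, embed_dim)   # 1024-d vector -> token space
        # DETR-style learned queries that re-expand one vector into N patch tokens
        self.queries = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def encode(self, imgs):
        tokens = self.encoder(imgs)                   # B x N x D patch features
        return self.to_btnk(tokens.mean(dim=1))       # B x 1024 bottleneck vector

    def decode(self, z):
        ctx = self.from_btnk(z).unsqueeze(1)          # B x 1 x D
        tokens = self.queries + ctx                   # broadcast to B x N x D
        return self.decoder(tokens)                   # reconstructed image

    def forward(self, imgs):
        return self.decode(self.encode(imgs))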

Citation

This project is based on the REMI paper, in which we modified the original MAE model to produce compact image representations. If you find this project useful, please cite the REMI paper:

@inproceedings{WangREMI2025,
   title = {REMI: Reconstructing Episodic Memory During Internally Driven Path Planning},
   author = {Wang, Zhaoze and Morris, Genela and Derdikman, Dori and Chaudhari, Pratik and Balasubramanian, Vijay},
   booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
   year = {2025},
   url = {https://arxiv.org/abs/2507.02064},
   eprint = {arXiv:2507.02064}
}

Since this project is based on the Masked Autoencoder (MAE) model, please also cite:

@Article{MaskedAutoencoders2021,
  author  = {Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Doll{\'a}r and Ross Girshick},
  journal = {arXiv:2111.06377},
  title   = {Masked Autoencoders Are Scalable Vision Learners},
  year    = {2021},
}

Examples

Run in Colab

Examples of how to use the BtnkMAE model to compress and reconstruct images from ImageNet-1k (224x224) are provided in the Colab notebook below.

Open In Colab

Pretrained Models

We provide two pretrained BtnkMAE models, both trained on ImageNet-1k and available on Hugging Face.

btnk_mae_vit_base_224.pth was trained on ImageNet-1k with a ViT-Base backbone and no activation function before the bottleneck layer.

btnk_mae_vit_large_relu_224.pth was trained on ImageNet-1k with a ViT-Large backbone and a ReLU activation right before the bottleneck layer.

| Model | ViT Backbone | Activation Function | Epochs (ImageNet-1k) | Download |
| --- | --- | --- | --- | --- |
| btnk_mae_vit_base_224.pth | ViT-Base | None | 1000 | Hugging Face |
| btnk_mae_vit_large_relu_224.pth | ViT-Large | ReLU | 200 | Hugging Face |

These models can be downloaded by running the following command:

wget -nc https://huggingface.co/zhaozewang56/btnk_mae/resolve/main/<model_name>.pth
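
After downloading, the checkpoint can be inspected with plain PyTorch. The snippet below is only a sketch: the checkpoint layout (e.g. whether the weights sit under a "model" key) and the constructor name build_btnk_mae are assumptions, not this repository's documented API.

import torch

# The "model" key is an assumption: some checkpoints store weights under it,
# others are a bare state dict. Adjust to what the file actually contains.
ckpt = torch.load("btnk_mae_vit_base_224.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)
print(list(state_dict.keys())[:10])               # peek at the first few parameter names

# model = build_btnk_mae(...)                     # hypothetical constructor from this repo
# model.load_state_dict(state_dict, strict=False)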

Compress and reconstruct images from ImageNet-1k (224x224)

Processing Panorama Images

The model can also handle images of other sizes. For instance, to process images of size 1024×512, we first resize them to 512×512. We then fine-tune the model pretrained on 224×224 images on the 512×512 images after recomputing the cos-sin positional embeddings.

The following results were obtained by fine-tuning the model on panorama images of a single scene in Habitat-Sim, as part of the REMI paper for which the bottleneck MAE was originally developed.
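
For reference, the sin-cos positional embeddings can be recomputed for a larger grid with the standard MAE-style recipe (the official MAE code ships get_2d_sincos_pos_embed in util/pos_embed.py). The simplified version below omits the class-token row and any repository-specific details:

import numpy as np

def get_1d_sincos_pos_embed(embed_dim, pos):
    """Standard sin-cos embedding for a 1D array of positions (embed_dim must be even)."""
    omega = np.arange(embed_dim // 2, dtype=np.float64)
    omega = 1.0 / 10000 ** (omega / (embed_dim / 2))
    out = np.einsum("m,d->md", pos.reshape(-1), omega)       # (M, embed_dim/2)
    return np.concatenate([np.sin(out), np.cos(out)], axis=1)

def get_2d_sincos_pos_embed(embed_dim, grid_size):
    """Half of embed_dim encodes the row index, the other half the column index."""
    grid = np.meshgrid(np.arange(grid_size, dtype=np.float64),
                       np.arange(grid_size, dtype=np.float64))
    grid = np.stack(grid, axis=0).reshape(2, -1)
    emb_h = get_1d_sincos_pos_embed(embed_dim // 2, grid[0])
    emb_w = get_1d_sincos_pos_embed(embed_dim // 2, grid[1])
    return np.concatenate([emb_h, emb_w], axis=1)             # (grid_size**2, embed_dim)

# 512x512 input with 16x16 patches -> 32x32 grid of positions
pos_embed_512 = get_2d_sincos_pos_embed(768, 512 // 16)
print(pos_embed_512.shape)   # (1024, 768)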

Getting Started

Environment

Clone the repository

git clone https://github.com/grasp-lyrl/btnk_mae.git
cd btnk_mae

Install the dependencies

python3.10 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e .

Install PyTorch

We recommend installing PyTorch by following the official instructions, as the RTX 50 series is currently supported only by the nightly builds.

Datasets

Tiny ImageNet

For local testing of the training scripts, the Tiny ImageNet dataset could be helpful. The dataset can be downloaded from the CS231n website. The following commands will download and unzip the dataset to <project_root>/data/.

cd <project_root>
mkdir data
wget http://cs231n.stanford.edu/tiny-imagenet-200.zip
unzip tiny-imagenet-200.zip -d data/ 
rm tiny-imagenet-200.zip

Set the data.dataset variable in the configs/config.yaml to "tiny-imagenet-200" to use the dataset.

ImageNet-1k

Set the data.dataset variable in the configs/config.yaml to "imagenet-1k" to use the dataset.

The Hugging Face CLI automatically handles the dataset download and split. The dataset is approximately 140 GB, and you may need to log in with the Hugging Face CLI (see here) to download it.

Using HuggingFace CLI
huggingface-cli login

Then follow the prompt to enter the access token.
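
For reference, the snippet below illustrates what the download amounts to under the hood, assuming the Hugging Face datasets library and the gated imagenet-1k dataset id; the training scripts in this repository handle this step for you.

from datasets import load_dataset

# Requires a prior `huggingface-cli login`, since ImageNet-1k is gated on the Hub.
# Streaming avoids pulling the full ~140 GB locally just to peek at a sample.
ds = load_dataset("imagenet-1k", split="validation", streaming=True)
sample = next(iter(ds))
print(sample["image"].size, sample["label"])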

Custom dataset

To use a custom dataset, set the data.dataset variable in configs/config.yaml to the absolute path of the dataset. For any other dataset, modify the dataset.py file to include it.

The current implementation of the custom dataset requires a dataset_config.json file in the dataset directory to enforce controllable dataset-loading behavior. You may modify the btnk_mae/utils/datasets.py file to remove this requirement. Example dataset_config.json:

{
    "dataset_name": "your_dataset_name",
    "dataset_type": "panorama",
    "dataset_path": "path/to/your/dataset"
}

The dataset_type can be panorama or custom. This setting is primarily used to determine whether the images in the dataset require special handling (such as resizing panorama images to 512×512).
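
A minimal sketch of such a loader is shown below, assuming images are stored as JPEGs under the dataset root; the repository's actual implementation in btnk_mae/utils/datasets.py may differ in structure and naming.

import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class CustomImageDataset(Dataset):
    """Sketch of a loader that honors dataset_config.json; details are assumptions."""

    def __init__(self, root):
        self.root = Path(root)
        cfg = json.loads((self.root / "dataset_config.json").read_text())
        # Assumption: panorama images are resized to 512x512, everything else to 224x224.
        size = 512 if cfg["dataset_type"] == "panorama" else 224
        self.transform = transforms.Compose([
            transforms.Resize((size, size)),
            transforms.ToTensor(),
        ])
        self.paths = sorted(self.root.glob("**/*.jpg"))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return self.transform(Image.open(self.paths[idx]).convert("RGB"))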

Training

Train the model

NOTE: We use the terms train and finetune with respect to the bottleneck structure added on top of the pretrained MAE. In both cases, we start by loading the pretrained weights from the original MAE paper. By training, we mean loading the parts of the architecture that overlap with the original MAE and then training on ImageNet-1k until the bottleneck layers are also trained. By finetuning, we mean starting from that trained model and further tuning it so that it "overfits" to a domain-specific dataset.

Enable Hydra Full Error

Highly recommended to enable this when debugging.

export HYDRA_FULL_ERROR=1
Single GPU

If the run name is not provided, it is set to an auto-generated datetime string.

CUDA_VISIBLE_DEVICES=<GPU_IDS> python train.py --config-name=<train/finetune> run_name=<RUN_NAME> [any Hydra overrides…]

Example:

CUDA_VISIBLE_DEVICES=0 python train.py --config-name=train run_name=btnk_mae
Multi GPU

Multiple GPUs with DDP

torchrun --nproc_per_node=<NUM_GPUS> train.py --config-name=<train/finetune> run_name=<RUN_NAME> [any Hydra overrides…]

Example:

torchrun --nproc_per_node=8 train.py --config-name=train
Override config values with Hydra
./scripts/train.sh 0 custom_run_name \
    train.batch_size=64 \
    epochs=50 \
    optimizer.lr=5e-4

Monitoring

This project uses Weights & Biases (wandb) for experiment tracking and visualization.

To use wandb:

  1. Make sure you have a wandb account
  2. Login to wandb in your terminal: wandb login
  3. Run training with the following parameters:
    • --wandb_project: Project name (default: "btnk_mae")
    • --wandb_entity: Your wandb username or team name (optional)
    • --wandb_run_name: Custom name for this run (optional)

Disable wandb

You can enable or disable wandb by setting the WANDB_MODE environment variable.

export WANDB_MODE=disabled  # Disable wandb
export WANDB_MODE=online    # Enable wandb

Related Works

This project was originally developed for REMI. The REMI paper develops an ANN model of the rat brain during spatial navigation tasks, with the goal of visualizing how the model virtually explores the environment during planning.

In the neuroscience setting, the paper presents a system-level computational model of how hippocampal and entorhinal cortex cells build spatial maps. In the machine learning setting, it presents a brain-inspired world model similar to Dreamer, using a bottleneck MAE to extract image features and to visualize imagined navigation during planning.

License

This project is licensed under the MIT License - see the LICENSE file for details.
