This is the code repository of the paper:
Single-pass Adaptive Image Tokenization for Minimum Program Search
Shivam Duggal, Sanghyun Byun, William T. Freeman, Antonio Torralba, Phillip Isola
MIT CSAIL
Keywords: Representation Learning, Adaptive Tokenization, Compression, Algorithmic Information Theory, Kolmogorov Complexity, Upside-Down Reinforcement Learning.
AIT (Adaptive Image Tokenization) meets AIT (Algorithmic Information Theory)!!
Abstract
Approach Overview
Setup
Datasets
Pretrained Checkpoints
Training
Evaluation
Citation
According to Algorithmic Information Theory (AIT), intelligent representations compress data into the shortest possible program that can reconstruct its content—exhibiting low Kolmogorov Complexity (KC). In contrast, most visual representation learning systems use fixed-length representations for all inputs, ignoring variations in complexity or familiarity. Recent adaptive tokenization methods address this by allocating variable-length representations but typically require test-time search over multiple encodings to find the most predictive one. Inspired by KC principles, we propose a single-pass adaptive tokenizer, KARL, which predicts the appropriate number of tokens for an image in a single forward pass, halting once its approximate KC is reached. The token count serves as a proxy for the minimum description length. KARL matches the performance of recent adaptive tokenizers while operating in a single pass. We present scaling laws for KARL, analyzing the role of encoder / decoder size, continuous vs. discrete tokenization and more. Additionally, we offer a conceptual study drawing an analogy between Adaptive Image Tokenization and Algorithmic Information Theory, examining the predicted image complexity (KC) across axes such as structure vs. noise and in- vs. out-of-distribution familiarity – revealing alignment with human intuition.
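One way to read the "approximate KC" in the abstract is sketched below. This is a hedged formalization for intuition, not necessarily the paper's exact definition: the encoder $E$, decoder $D$, reconstruction loss $\mathcal{L}$, and target error $\epsilon$ are symbols introduced here for illustration, and $T$ is the maximum token budget.

$$
\widehat{\mathrm{KC}}_{\epsilon}(x) \;=\; \min \left\{\, t \in \{1, \dots, T\} \;:\; \mathcal{L}\big(x,\; D(E(x)_{1:t})\big) \le \epsilon \,\right\}
$$

Under this reading, a search-based adaptive tokenizer would evaluate several values of $t$ at test time to find the minimum, whereas a single-pass tokenizer aims to predict it directly in one forward pass.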
mamba env create -f environment.yaml
mamba activate kolmogorov_tokenizer
Training the adaptive tokenizer requires pretrained checkpoints of the base 2D image tokenizers. We use VQGAN or VAE as the base tokenizers. We acknowledge Mage / Mar for releasing ImageNet-trained checkpoints of VQGAN / VAE. Run the following to download the pretrained base tokenizers to base_tokenizers/pretrained_models:
python base_tokenizers/pretrained_models/download.py
To use a custom base tokenizer, add the tokenizer code in base_tokenizers, the corresponding pretrained checkpoint in base_tokenizers/pretrained_models, and a wrapper in modules/base_tokenizers.py. See VQGANWrapper or LDMVAEWrapper for reference.
We mainly used ImageNet and ImageNet100 (a subset of ImageNet) for training. Download the ImageNet dataset and place it in $IMAGENET_DIR. To create the ImageNet100 subset, run the following:
python run_scripts/create_imagenet100.py --imagenet_dir $IMAGENET_DIR --imagenet100_dir datasets/imagenet100/
set -x IMAGENET100_DIR datasets/imagenet100/
Download the required checkpoint and place it at kolmogorov_tokenizers/pretrained_models/imagenet100/ or kolmogorov_tokenizers/pretrained_models/imagenet/. Optionally, run the following to download all the models:
python kolmogorov_tokenizers/pretrained_models/download.py
| Kolmogorov Tokenizer | Base Tokenizer | Dataset | Latent Quantization | Latent Factorization | Pretrained Checkpoint |
|---|---|---|---|---|---|
| karl_small | vqgan | ImageNet (1K) | | | Download Link |
| karl_small | vqgan | ImageNet100 | | | Download Link |
| karl_small | vqgan | ImageNet100 | | | Download Link |
| karl_small | vae | ImageNet100 | | | Download Link |
| karl_small | vae | ImageNet100 | | | Download Link |
KARL is trained in two stages: latent distillation pretrain and full finetuning (with GAN loss). The two stages differ only in which parameters are optimized and in whether a GAN loss is used. This mirrors how traditional VQGAN and its follow-ups are usually trained.
- The first latent distillation pretrain stage optimizes 2D tokens $\rightarrow$ 1D tokens latent distillation encoder/decoder modules via reconstruction losses at 2D image tokens level.
- The second full finetuning stage finetunes all the network weights with losses directly on image / pixel level, namely reconstruction loss, GAN loss, perceptual loss.
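The two stages can be summarized as a small lookup of what is trained and which losses apply. This is a minimal sketch for orientation only: the module names, loss labels, and the `StageConfig` / `is_trainable` helpers are illustrative assumptions, not identifiers from this repository.

```python
from dataclasses import dataclass

# Illustrative labels only; these are NOT the repository's actual module names.
LATENT_DISTILLATION_MODULES = ("latent_distillation_encoder", "latent_distillation_decoder")
IMAGE_MODULES = ("image_encoder", "image_decoder")


@dataclass(frozen=True)
class StageConfig:
    trainable_modules: tuple  # modules that receive gradient updates
    frozen_modules: tuple     # modules kept fixed
    losses: tuple             # loss terms used in this stage
    loss_domain: str          # where the losses are computed


STAGES = {
    # Stage 1: only the 2D -> 1D latent distillation modules are optimized.
    "latent_distillation_pretrain": StageConfig(
        trainable_modules=LATENT_DISTILLATION_MODULES,
        frozen_modules=IMAGE_MODULES,
        losses=("reconstruction",),
        loss_domain="2d_image_tokens",
    ),
    # Stage 2: all weights are finetuned with pixel-level losses.
    "full_finetuning": StageConfig(
        trainable_modules=LATENT_DISTILLATION_MODULES + IMAGE_MODULES,
        frozen_modules=(),
        losses=("reconstruction", "gan", "perceptual"),
        loss_domain="pixels",
    ),
}


def is_trainable(stage: str, module: str) -> bool:
    """Return True if `module` is optimized during `stage`."""
    return module in STAGES[stage].trainable_modules
```

For example, `is_trainable("latent_distillation_pretrain", "image_encoder")` is False in this sketch, while the same call for `"full_finetuning"` is True.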
We train the latent-distillation encoder / decoder modules in this stage, keeping image encoder / decoder fixed.
set -x TRAIN_DATA_DIR $IMAGENET100_DIR # Set to $IMAGENET_DIR, $IMAGENET100_DIR or some other dataset to change the training dataset.
bash run_scripts/latent_distillation_pretrain.sh
Reference guide for adaptive tokenizer arguments:
- --base_tokenizer selects the 2D image tokenizer; current options are vqgan or vae.
- --model selects the adaptive tokenizer configuration. Options: karl_small.
- --quantize_latent quantizes the learned 1D tokens before decoding (this helps create compressed image representations).
- --factorize_latent performs feature dimension factorization of the learned 1D tokens before quantization. If --quantize_latent is set True, --factorize_latent will be set True automatically.
- For the rest of the arguments, please refer to (and directly edit) the config files at kolmogorov_tokenizers/configs/karl_vqgan.yaml and kolmogorov_tokenizers/configs/karl_vae.yaml.
- See --output_dir for training logs and checkpoints.
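The interaction between --quantize_latent and --factorize_latent can be sketched with argparse. The flag names come from the guide above; everything else (boolean `store_true` flags, the defaults, the `build_parser` / `parse_args` helpers) is an assumption made for illustration and may differ from the actual training script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Flag names follow the README; types, defaults and choices are assumptions.
    parser = argparse.ArgumentParser(description="Adaptive tokenizer arguments (sketch)")
    parser.add_argument("--base_tokenizer", choices=["vqgan", "vae"], default="vqgan")
    parser.add_argument("--model", choices=["karl_small"], default="karl_small")
    parser.add_argument("--quantize_latent", action="store_true")
    parser.add_argument("--factorize_latent", action="store_true")
    parser.add_argument("--output_dir", default="./output")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    # Quantization implies factorization, as stated in the argument guide.
    if args.quantize_latent:
        args.factorize_latent = True
    return args


args = parse_args(["--base_tokenizer", "vae", "--quantize_latent"])
print(args.base_tokenizer, args.quantize_latent, args.factorize_latent)
```

Resolving the implied flag right after parsing keeps the rest of the code free of special cases: downstream code only needs to check `args.factorize_latent`.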
Performs full finetuning of the latent-distillation encoder / decoder and image encoder / decoder with GAN losses.
bash run_scripts/full_finetuning.sh
- --finetune loads the checkpoint trained in the previous stage (set the argument to the corresponding path accordingly).
- See --output_dir for training logs and checkpoints.
To resume training from an intermediate point in either stage, remember to load the weights using the --resume argument.
Note: These stages are different from the Estimate Image Complexity and Learning to Tokenize Complexity phases, which are the core contribution of the KARL tokenizer and are executed in every iteration of both stages.
(uploading the code soon)
(uploading the code soon)
We recommend keeping the input token budget T fixed at the maximum value (e.g., 256), and instead varying the desired reconstruction error (desired_reconstruction_quality):
karl_embedding, karl_reconstruction, _ = kolmogorov_tokenizer.encode(image_tensor, input_token_budget=256, desired_reconstruction_quality=0.05) # we recommend tuning desired_reconstruction_quality to meet task or dataset requirements
min_length_embedding, _, _ = kolmogorov_tokenizer.encode(image_tensor) # default input_token_budget=256, desired_reconstruction_quality=0.05
If the default of 0.05 does not yield satisfactory reconstructions (for example, if important details are lost), try lowering it to 0.03. Conversely, if 0.05 already produces near-perfect reconstructions but the token count remains unnecessarily high (e.g., above 32), consider increasing it to 0.07.
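To compare the thresholds mentioned above side by side, a small sweep helper can be handy. This is a minimal sketch under stated assumptions: `sweep_reconstruction_quality` and `StubTokenizer` are hypothetical names, the stub only mimics the `encode` call shown above so the example runs without the real model, and using `len(embedding)` as the token count is an assumption about the returned embedding.

```python
def sweep_reconstruction_quality(tokenizer, image, thresholds=(0.03, 0.05, 0.07), budget=256):
    """Encode `image` once per threshold and record how many tokens each run keeps.

    Assumes `tokenizer.encode` accepts the keyword arguments shown in the README
    and that `len(embedding)` equals the number of tokens (an assumption).
    """
    results = {}
    for eps in thresholds:
        embedding, _reconstruction, _ = tokenizer.encode(
            image, input_token_budget=budget, desired_reconstruction_quality=eps
        )
        results[eps] = len(embedding)
    return results


class StubTokenizer:
    """Hypothetical stand-in for the real tokenizer, for demonstration only."""

    def encode(self, image, input_token_budget=256, desired_reconstruction_quality=0.05):
        # Toy rule: a stricter (smaller) threshold keeps more tokens, capped at the budget.
        count = min(input_token_budget, max(1, round(2.0 / desired_reconstruction_quality)))
        embedding = [0.0] * count
        return embedding, image, None


counts = sweep_reconstruction_quality(StubTokenizer(), image=None)
for eps, n_tokens in counts.items():
    print(f"desired_reconstruction_quality={eps}: {n_tokens} tokens")
```

With the real tokenizer, replace `StubTokenizer()` with the loaded model and `image=None` with an image tensor, then pick the largest threshold whose reconstruction is still acceptable for the task.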
Unlike other adaptive tokenizers, KARL always remains adaptive in terms of tokens utilized at test time, unless explicitly disabled by setting
If you use our code or the paper, please consider citing the following:
@article{duggal2024KARL,
author = {Shivam Duggal and Sanghyun Byun and William T. Freeman and Antonio Torralba and Phillip Isola},
title = {Single-pass Adaptive Image Tokenization for Minimum Program Search},
journal= {arxiv},
year = {2025}
}



