SwinVQColor implements a Swin Transformer-based Vector Quantized Variational Autoencoder (VQ-VAE) designed for perceptually realistic image colorization in the CIE Lab color space.
Given a grayscale (L-channel) image, the model predicts the chrominance (ab) channels using learned discrete embeddings, delivering rich, vibrant, and structured colorizations.
CNN-based colorization models often produce blurry or desaturated outputs due to regression to the mean and limited context modeling.
SwinVQColor overcomes these limitations by integrating:
- Swin Transformer Encoder: Captures hierarchical, long-range context and spatial structure from the grayscale input.
- VQ-VAE with EMA Codebook: Learns robust discrete latent color representations, encouraging multimodal and vivid color synthesis.
- Advanced Loss Functions: Blends pixel, perceptual, vector-quantization, and color fidelity losses, ensuring sharp, visually plausible results.
```mermaid
flowchart LR
    L["Input L-channel (grayscale)"] --> E[("Swin Transformer<br>Encoder")]
    E --> Z["Latent Code (z_enc)"]
    Z --> QV["VQ-VAE Codebook<br>(EMA Quantization)"]
    QV --> D[("Decoder<br>(CNN/Transformer)")]
    D --> ab["Output ab channels<br>(colorized image)"]
```
| Module | Description |
|---|---|
| Swin Encoder | Hierarchical Vision Transformer backbone to extract multi-scale features from input L. |
| VQ-VAE (EMA) | Quantizes features into a discrete space using a codebook with EMA updates for stability. |
| Decoder | Maps quantized codes to color channels using upsampling, residual, and attention blocks. |
| Losses | Combines pixel, perceptual, VQ, and color consistency losses for best perceptual effect. |
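The loss blend above can be sketched as a weighted sum. This is a minimal illustration, not the repository's exact formulation: the weight values, the optional perceptual-feature arguments, and the mean-chrominance "color" term are assumptions.

```python
import torch
import torch.nn.functional as F

def colorization_loss(ab_pred, ab_gt, vq_loss, feat_pred=None, feat_gt=None,
                      w_pix=1.0, w_vq=1.0, w_perc=0.1, w_color=0.5):
    """Weighted blend of pixel, VQ, perceptual, and color-fidelity losses (illustrative)."""
    # Pixel loss: L1 between predicted and ground-truth ab channels
    pix = F.l1_loss(ab_pred, ab_gt)
    # Perceptual loss: MSE between features from a frozen network (e.g. VGG), if provided
    perc = F.mse_loss(feat_pred, feat_gt) if feat_pred is not None else ab_pred.new_zeros(())
    # Color fidelity (assumed form): match per-image mean chrominance to fight desaturation
    color = F.l1_loss(ab_pred.mean(dim=(2, 3)), ab_gt.mean(dim=(2, 3)))
    return w_pix * pix + w_vq * vq_loss + w_perc * perc + w_color * color
```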
- Input: `L` (grayscale), shape `1×H×W`
- Backbone: Swin-T/S/B (configurable)
- Output: hierarchical latent features `z_enc`
- Embedding codebook for discretizing the latent space, updated with Exponential Moving Average (EMA)
- Produces:
  - Quantized embeddings: `z_q`
  - VQ loss (`vq_loss`): codebook + commitment losses
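A minimal sketch of EMA codebook quantization in the style of the VQ-VAE paper (Oord et al., 2017); the class name, hyperparameters, and return signature are assumptions, not the repository's API.

```python
import torch
import torch.nn.functional as F

class EMACodebook(torch.nn.Module):
    """Nearest-neighbor quantizer whose codebook is updated by EMA, not gradients."""
    def __init__(self, num_codes=512, dim=64, decay=0.99, eps=1e-5):
        super().__init__()
        self.decay, self.eps = decay, eps
        embed = torch.randn(num_codes, dim)
        self.register_buffer("embed", embed)
        self.register_buffer("cluster_size", torch.zeros(num_codes))
        self.register_buffer("embed_sum", embed.clone())

    def forward(self, z):  # z: (N, dim) flattened encoder features
        # Assign each vector to its nearest codebook entry
        idx = torch.cdist(z, self.embed).argmin(dim=1)
        z_q = self.embed[idx]
        if self.training:
            onehot = F.one_hot(idx, self.embed.shape[0]).type(z.dtype)
            # EMA updates of per-code counts and summed assigned vectors
            self.cluster_size.mul_(self.decay).add_(onehot.sum(0), alpha=1 - self.decay)
            self.embed_sum.mul_(self.decay).add_(onehot.t() @ z, alpha=1 - self.decay)
            # Laplace smoothing keeps rarely used codes from collapsing to zero
            n = self.cluster_size.sum()
            size = (self.cluster_size + self.eps) / (n + self.embed.shape[0] * self.eps) * n
            self.embed.copy_(self.embed_sum / size.unsqueeze(1))
        commit = F.mse_loss(z, z_q.detach())   # commitment loss
        z_q = z + (z_q - z).detach()           # straight-through estimator
        return z_q, idx, commit
```

With EMA updates, only the commitment term contributes to `vq_loss` during backprop; the codebook itself moves toward the running mean of its assigned vectors.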
- Upsamples `z_q` to reconstruct `ab_pred`
- Utilizes residual and attention layers for fidelity and sharpness
- Output: `ab_pred` (shape: `2×H×W`)
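A minimal decoder sketch showing the upsample-plus-residual pattern described above; channel widths, the number of upsampling stages, and the `Tanh` output scaling are assumptions, and the attention blocks are omitted for brevity.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 3x3 convs with a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class Decoder(nn.Module):
    """Map quantized codes (B, code_dim, H/8, W/8) to ab channels (B, 2, H, W)."""
    def __init__(self, code_dim=64, ch=128, n_up=3):
        super().__init__()
        layers = [nn.Conv2d(code_dim, ch, 3, padding=1)]
        for _ in range(n_up):
            layers += [ResBlock(ch),
                       nn.Upsample(scale_factor=2, mode="nearest"),
                       nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(ch, 2, 3, padding=1), nn.Tanh()]  # ab in [-1, 1], rescaled downstream
        self.net = nn.Sequential(*layers)

    def forward(self, z_q):
        return self.net(z_q)
```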
```bash
pip install -r requirements.txt
```
- Expect images in CIE Lab format, or convert RGB to obtain L (input) and ab (target).
- You may use datasets like ImageNet or COCO.
- For custom data, update paths in your config (see next step).
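The RGB-to-Lab split can be done with scikit-image. This helper is a hypothetical sketch, not part of the repository:

```python
import numpy as np
from skimage import color

def rgb_to_lab_pair(rgb):
    """Split an RGB image (H, W, 3) in [0, 1] into the L input and ab target."""
    lab = color.rgb2lab(rgb)
    L = lab[..., :1]    # (H, W, 1), range roughly [0, 100]
    ab = lab[..., 1:]   # (H, W, 2), range roughly [-128, 127]
    return L, ab
```

For example, a pure-white image yields L close to 100 and ab close to 0.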
Edit the training config in `configs/swinvq_color.yaml` to fit your data paths and hyperparameters.
```bash
python train.py --config configs/swinvq_color.yaml
```

```yaml
model:
  encoder: swin_t
  codebook_size: 512
  code_dim: 64
  ...
data:
  train_dir: "path/to/train/images"
  val_dir: "path/to/val/images"
```

| Input (L) | Output (ab, pred) | Ground Truth (ab) |
|---|---|---|
| ![]() | ![]() | ![]() |
- VQ-VAE: Neural Discrete Representation Learning (Oord et al., 2017)
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (Liu et al., 2021)
- VQ-VAE-2: Generating Diverse High-Fidelity Images (Razavi et al., 2019)
- Colorful Image Colorization (Zhang et al., 2016)
This project is licensed under the MIT License.
See the LICENSE file for more details.
Built with inspiration from the official Swin Transformer, VQ-VAE, and pioneering colorization works.


