
🌈 SwinVQColor: Hierarchical VQ-VAE with Swin Transformer for Image Colorization

SwinVQColor implements a Swin Transformer-based Vector Quantized Variational Autoencoder (VQ-VAE) designed for perceptually realistic image colorization in the CIE Lab color space.
Given a grayscale (L-channel) image, the model predicts the chrominance (ab) channels using learned discrete embeddings, delivering rich, vibrant, and structured colorizations.


🧠 Motivation

CNN-based colorization models often produce blurry or desaturated outputs due to regression to the mean and limited context modeling.
SwinVQColor overcomes these limitations by integrating:

  • Swin Transformer Encoder: Captures hierarchical, long-range context and spatial structure from the grayscale input.
  • VQ-VAE with EMA Codebook: Learns robust discrete latent color representations, encouraging multimodal and vivid color synthesis.
  • Advanced Loss Functions: Blends pixel, perceptual, vector-quantization, and color fidelity losses, ensuring sharp, visually plausible results.

🧩 Model Overview

Pipeline

```mermaid
flowchart LR
    L["Input L-channel (grayscale)"] --> E[("Swin Transformer<br>Encoder")]
    E --> Z["Latent Code (z_enc)"]
    Z --> QV["VQ-VAE Codebook<br>(EMA Quantization)"]
    QV --> D[("Decoder<br>(CNN/Transformer)")]
    D --> ab["Output ab channels<br>(colorized image)"]
```

Key Components

| Module | Description |
| --- | --- |
| Swin Encoder | Hierarchical Vision Transformer backbone that extracts multi-scale features from the input L channel. |
| VQ-VAE (EMA) | Quantizes features into a discrete latent space using a codebook updated with an exponential moving average for stability. |
| Decoder | Maps quantized codes to color channels using upsampling, residual, and attention blocks. |
| Losses | Combines pixel, perceptual, VQ, and color-consistency losses for the best perceptual quality. |
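
As a rough illustration of how such loss terms can be combined, here is a minimal sketch. The weights and argument names below are illustrative assumptions, not the values or names used by SwinVQColor:

```python
# Sketch of a weighted multi-term colorization loss.
# Weights are hypothetical defaults, not the repository's settings.

def total_loss(pixel, perceptual, vq, color,
               w_pixel=1.0, w_perc=0.1, w_vq=0.25, w_color=0.5):
    """Combine the four loss terms into a single scalar."""
    return (w_pixel * pixel
            + w_perc * perceptual
            + w_vq * vq
            + w_color * color)
```

In practice each term would be a differentiable tensor and the weights would be tuned per dataset; the weighted sum itself is the standard pattern.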

⚙️ Architecture Details

1️⃣ Swin Transformer Encoder

  • Input: L (grayscale), shape (1×H×W)
  • Backbone: Swin-T/S/B (configurable)
  • Output: Hierarchical latent features z_enc
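
To make "hierarchical latent features" concrete, the sketch below computes the per-stage feature shapes, assuming the standard Swin-T configuration (patch size 4, channel dims 96/192/384/768, patch merging halving the resolution between stages). The actual repository config may differ:

```python
# Feature-map sizes for a Swin-T style encoder on a 1xHxW input.
# Uses standard Swin-T values; the repository's config may differ.

def swin_feature_shapes(h, w, patch=4, dims=(96, 192, 384, 768)):
    shapes = []
    h, w = h // patch, w // patch          # patch embedding downsamples by `patch`
    for i, c in enumerate(dims):
        shapes.append((c, h, w))
        if i < len(dims) - 1:              # patch merging halves H and W
            h, w = h // 2, w // 2
    return shapes
```

For a 224×224 input this yields `[(96, 56, 56), (192, 28, 28), (384, 14, 14), (768, 7, 7)]`, i.e. four progressively coarser, wider feature maps.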

2️⃣ Vector Quantization (VQ-VAE, EMA)

  • Embedding codebook for discretizing latent space, updated with Exponential Moving Average (EMA)
  • Produces:
    • Quantized embeddings: z_q
    • VQ Loss (vq_loss): codebook + commitment losses
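
The EMA update and commitment loss can be sketched in numpy as follows. This is a deliberately simplified illustration of the technique (van den Oord et al.), not the repository's implementation: it omits Laplace smoothing of the cluster counts and the straight-through gradient estimator, and the decay (0.99) and commitment weight (0.25) are conventional defaults:

```python
import numpy as np

# Minimal EMA vector-quantization sketch (illustrative, simplified).
class EMACodebook:
    def __init__(self, codebook_size=512, code_dim=64, decay=0.99, beta=0.25):
        rng = np.random.default_rng(0)
        self.codes = rng.normal(size=(codebook_size, code_dim))
        self.ema_count = np.ones(codebook_size)
        self.ema_sum = self.codes.copy()
        self.decay, self.beta = decay, beta

    def __call__(self, z_enc):
        # z_enc: (N, code_dim) flattened encoder features
        d = ((z_enc[:, None, :] - self.codes[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)                     # nearest code per vector
        z_q = self.codes[idx]
        # EMA codebook update (replaces the separate codebook loss term)
        one_hot = np.eye(len(self.codes))[idx]
        self.ema_count = self.decay * self.ema_count + (1 - self.decay) * one_hot.sum(0)
        self.ema_sum = self.decay * self.ema_sum + (1 - self.decay) * one_hot.T @ z_enc
        self.codes = self.ema_sum / self.ema_count[:, None]
        # commitment loss pulls the encoder output toward its chosen codes
        vq_loss = self.beta * ((z_enc - z_q) ** 2).mean()
        return z_q, idx, vq_loss
```

Because the codebook is updated by moving averages of assigned encoder outputs, only the commitment term contributes to `vq_loss` here; a non-EMA VQ-VAE would add an explicit codebook loss instead.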

3️⃣ Decoder

  • Upsamples z_q to reconstruct ab_pred
  • Utilizes residual and attention layers for fidelity and sharpness
  • Output: ab_pred (shape: 2×H×W)
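
At inference time, the predicted ab channels are recombined with the input L channel to form a full Lab image. A minimal sketch, assuming the common Lab value convention (L in [0, 100], ab roughly in [-128, 127]); the repository may normalize differently:

```python
import numpy as np

# Recombine input L with predicted ab into a channel-last Lab image.
def compose_lab(L, ab_pred):
    """L: (1, H, W), ab_pred: (2, H, W) -> Lab image (H, W, 3)."""
    ab = np.clip(ab_pred, -128, 127)       # keep ab in a valid range
    lab = np.concatenate([L, ab], axis=0)  # (3, H, W)
    return lab.transpose(1, 2, 0)          # channel-last for image libraries
```

The resulting array can then be converted to RGB with any Lab-aware image library for display.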

🚀 Getting Started

1. Install Requirements

```bash
pip install -r requirements.txt
```

2. Prepare Data

  • Expect images in CIE Lab format or convert to obtain L (input) and ab (target).
  • You may use datasets like ImageNet or COCO.
  • For custom data, update paths in your config (see next step).
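
As a sketch of the split step, the function below separates a Lab image into the model input (L) and target (ab) and normalizes both to roughly [-1, 1]. The scaling constants are a common convention, not necessarily what this repository's data loader uses:

```python
import numpy as np

# Split a channel-last Lab image into normalized model input and target.
def split_lab(lab):
    """lab: (H, W, 3) array with L in [0, 100], ab in [-128, 127]."""
    L = lab[..., :1].transpose(2, 0, 1)    # (1, H, W)
    ab = lab[..., 1:].transpose(2, 0, 1)   # (2, H, W)
    L_in = L / 50.0 - 1.0                  # [0, 100] -> [-1, 1]
    ab_t = ab / 110.0                      # approximately [-1, 1]
    return L_in, ab_t
```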

3. Training

Edit training configs in configs/swinvq_color.yaml to fit your data paths and hyperparameters.

```bash
python train.py --config configs/swinvq_color.yaml
```

Example Config Snippet

```yaml
model:
  encoder: swin_t
  codebook_size: 512
  code_dim: 64
...
data:
  train_dir: "path/to/train/images"
  val_dir: "path/to/val/images"
```

🖼️ Example Results

| Input (L) | Output (ab, pred) | Ground Truth (ab) |
| --- | --- | --- |
| L-channel | Colorized | GT |

📄 License

This project is licensed under the MIT License.
See the LICENSE file for more details.


✨ Acknowledgements

Built with inspiration from the official Swin Transformer, VQ-VAE, and pioneering colorization works.
