
🌈 SwinVQColor: Hierarchical VQ-VAE with Swin Transformer for Image Colorization

SwinVQColor implements a Swin Transformer-based Vector Quantized Variational Autoencoder (VQ-VAE) designed for perceptually realistic image colorization in the CIE Lab color space.
Given a grayscale (L-channel) image, the model predicts the chrominance (ab) channels using learned discrete embeddings, delivering rich, vibrant, and structured colorizations.


🧠 Motivation

CNN-based colorization models often produce blurry or desaturated outputs due to regression to the mean and limited context modeling.
SwinVQColor overcomes these limitations by integrating:

  • Swin Transformer Encoder: Captures hierarchical, long-range context and spatial structure from the grayscale input.
  • VQ-VAE with EMA Codebook: Learns robust discrete latent color representations, encouraging multimodal and vivid color synthesis.
  • Advanced Loss Functions: Blends pixel, perceptual, vector-quantization, and color fidelity losses, ensuring sharp, visually plausible results.

🧩 Model Overview

Pipeline

```mermaid
flowchart LR
    L["Input L-channel (grayscale)"] --> E[("Swin Transformer<br>Encoder")]
    E --> Z["Latent Code (z_enc)"]
    Z --> QV["VQ-VAE Codebook<br>(EMA Quantization)"]
    QV --> D[("Decoder<br>(CNN/Transformer)")]
    D --> ab["Output ab channels<br>(colorized image)"]
```

Key Components

| Module | Description |
| --- | --- |
| Swin Encoder | Hierarchical Vision Transformer backbone that extracts multi-scale features from the input L channel. |
| VQ-VAE (EMA) | Quantizes features into a discrete space using a codebook with EMA updates for training stability. |
| Decoder | Maps quantized codes to color channels using upsampling, residual, and attention blocks. |
| Losses | Combines pixel, perceptual, VQ, and color consistency losses for the best perceptual quality. |
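The overall training objective is a weighted sum of the four loss terms listed above. The weights below are hypothetical placeholders for illustration; the actual values belong in the training config:

```python
def total_loss(pix, perc, vq, color, w=(1.0, 0.1, 1.0, 0.5)):
    """Weighted sum of the four loss terms.

    w = (pixel, perceptual, VQ, color-consistency) weights.
    These values are illustrative, not the project's defaults.
    """
    return w[0] * pix + w[1] * perc + w[2] * vq + w[3] * color
```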

⚙️ Architecture Details

1️⃣ Swin Transformer Encoder

  • Input: L (grayscale), shape (1×H×W)
  • Backbone: Swin-T/S/B (configurable)
  • Output: Hierarchical latent features z_enc

2️⃣ Vector Quantization (VQ-VAE, EMA)

  • Embedding codebook for discretizing latent space, updated with Exponential Moving Average (EMA)
  • Produces:
    • Quantized embeddings: z_q
    • VQ Loss (vq_loss): codebook + commitment losses
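The EMA codebook update follows the standard VQ-VAE(-2) recipe: assign each encoder vector to its nearest codeword, then update per-code counts and embedding sums as exponential moving averages, with Laplace smoothing to keep rarely used codes alive. A minimal NumPy sketch of one update step (function name and signature are hypothetical, not this repo's API):

```python
import numpy as np

def vq_ema_step(z, codebook, cluster_size, embed_sum, decay=0.99, eps=1e-5):
    """One EMA codebook update.

    z:            (N, D) flattened encoder outputs
    codebook:     (K, D) current codewords
    cluster_size: (K,)   EMA of assignment counts
    embed_sum:    (K, D) EMA of summed assigned vectors
    """
    # Nearest-codeword assignment by squared Euclidean distance
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)      # (N, K)
    idx = d.argmin(1)
    onehot = np.eye(codebook.shape[0])[idx]                        # (N, K)

    # EMA statistics
    cluster_size = decay * cluster_size + (1 - decay) * onehot.sum(0)
    embed_sum = decay * embed_sum + (1 - decay) * onehot.T @ z

    # Laplace smoothing so empty clusters do not collapse to zero
    n = cluster_size.sum()
    smoothed = (cluster_size + eps) / (n + codebook.shape[0] * eps) * n
    codebook = embed_sum / smoothed[:, None]

    z_q = codebook[idx]   # during training, gradients pass via straight-through
    return z_q, codebook, cluster_size, embed_sum
```

With EMA updates, only the commitment term contributes to the gradient-based loss; the codebook itself is updated by the moving averages above.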

3️⃣ Decoder

  • Upsamples z_q to reconstruct ab_pred
  • Utilizes residual and attention layers for fidelity and sharpness
  • Output: ab_pred (shape: 2×H×W)
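Since the decoder only predicts the ab channels, the full Lab image is obtained by stacking the input L channel with `ab_pred`. A minimal sketch (helper name is hypothetical):

```python
import numpy as np

def assemble_lab(L, ab_pred):
    """Stack input L (1, H, W) with predicted ab (2, H, W) into a (3, H, W) Lab image."""
    assert L.shape[1:] == ab_pred.shape[1:], "L and ab must share spatial dims"
    return np.concatenate([L, ab_pred], axis=0)
```

The resulting Lab array can then be converted back to RGB for visualization (e.g. with `skimage.color.lab2rgb` after undoing any channel normalization).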

🚀 Getting Started

1. Install Requirements

```bash
pip install -r requirements.txt
```

2. Prepare Data

  • Images are expected in CIE Lab format; otherwise convert from RGB to obtain the L (input) and ab (target) channels.
  • You may use datasets like ImageNet or COCO.
  • For custom data, update paths in your config (see next step).
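Splitting a Lab image into the (L, ab) training pair might look like the sketch below. The normalization constants are common conventions, not necessarily this repo's; for the RGB-to-Lab step itself, `skimage.color.rgb2lab` is one option:

```python
import numpy as np

def split_lab(lab):
    """Split a (H, W, 3) CIE Lab image into a channel-first (L, ab) training pair.

    L is scaled from [0, 100] to [0, 1]; ab from roughly [-110, 110] to [-1, 1].
    Adjust the constants to match your pipeline.
    """
    L = lab[..., :1] / 100.0
    ab = lab[..., 1:] / 110.0
    return L.transpose(2, 0, 1), ab.transpose(2, 0, 1)
```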

3. Training

Edit training configs in configs/swinvq_color.yaml to fit your data paths and hyperparameters.

```bash
python train.py --config configs/swinvq_color.yaml
```

Example Config Snippet

```yaml
model:
  encoder: swin_t
  codebook_size: 512
  code_dim: 64
...
data:
  train_dir: "path/to/train/images"
  val_dir: "path/to/val/images"
```
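Reading a config of this shape in a training script is a one-liner with PyYAML (`yaml.safe_load`); this is a generic sketch, not necessarily how `train.py` does it:

```python
import yaml

def load_config(path):
    """Load a YAML training config into a nested dict."""
    with open(path) as f:
        return yaml.safe_load(f)
```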

🖼️ Example Results

| Input (L) | Output (ab, pred) | Ground Truth (ab) |
| --- | --- | --- |
| L-channel | Colorized | GT |

📄 License

This project is licensed under the MIT License.
See the LICENSE file for more details.


✨ Acknowledgements

Built with inspiration from the official Swin Transformer, VQ-VAE, and pioneering colorization works.