The official code for the paper "GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation"

xtudbxk/GPSToken

Zhengqiang Zhang1,2 | Rongyuan Wu1,2 | Lingchen Sun1,2 | Lei Zhang1,2,+

1 The Hong Kong Polytechnic University 2 OPPO Research Institute + Corresponding Author


⭐ Update

  • 2025.09.05: Updated the code for higher resolutions, adding GPS-token merging to reduce boundary artifacts and a resized GroupNorm layer to ease color shifts.

🎯 Motivation: Beyond Fixed Grids

Effective and efficient tokenization is crucial for image representation and generation. Conventional uniform 2D/1D grid tokenization lacks flexibility in handling regions with varying shapes, textures, and locations. We propose GPSToken, a Gaussian Parameterized Spatially-adaptive Tokenization framework, enabling non-uniform tokenization via parametric 2D Gaussians. Our method:

  • Partitions images into complexity-balanced regions of varying shapes and positions using an entropy-driven algorithm;
  • Represents each region as a 2D Gaussian (mean for position, covariance for shape) and texture features;
  • Trains a transformer to optimize Gaussian parameters and texture features for content-aware adaptation;
  • Reconstructs the image via a differentiable splatting-based renderer, enabling end-to-end training.

πŸ” Core Highlights

βœ… Spatially-Adaptive Representation

  • Iteratively splits the image into entropy-balanced regions of varying positions and shapes, with finer partitions in complex textures, and represents each region with a 2D Gaussian (mean for position, variance for extent) and corresponding texture features.

✅ Dynamic & Scalable

Furthermore, GPSToken supports:

  • User-Controllable Adjustment: Manually allocate more tokens to regions of interest for finer reconstruction.
  • Variable Token Count: Increase or decrease the token count per image for a better efficiency-fidelity balance.
  • Scalable to Higher Resolution: Maintains comparable performance at higher resolutions without retraining.

✅ Spatial-Texture Disentanglement

  • Each token encodes a disentangled representation: Gaussian parameters for spatial geometry and a separate vector for textural features, enabling independent manipulation for downstream tasks like generation.

✅ SOTA Performance

  • Achieves PSNR = 28.81, SSIM = 0.809, rFID = 0.22, and FID = 1.65 on image reconstruction with only 256 tokens, outperforming prior methods.

🎨 GPS-Tokens: Mathematical Form and CUDA-Based Rendering Algorithm

Each token is represented by a bounded 2D Gaussian function and an individual feature vector, encoding spatial geometry and texture separately.

πŸ“ Standard 2D Gaussian (Unnormalized)

The core form of the $i$-th Gaussian is:

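$$G_i(x, y) = \exp\left(-\frac{1}{2(1-\rho_i^2)}\left[\frac{(x-\mu_{x,i})^2}{\sigma_{x,i}^2} - \frac{2\rho_i\,(x-\mu_{x,i})(y-\mu_{y,i})}{\sigma_{x,i}\,\sigma_{y,i}} + \frac{(y-\mu_{y,i})^2}{\sigma_{y,i}^2}\right]\right)$$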

  • $(\mu_{x,i}, \mu_{y,i})$: center (position)
  • $\sigma_{x,i}, \sigma_{y,i} > 0$: standard deviations (scale)
  • $\rho_i \in [-1, 1]$: correlation coefficient (orientation)

This is the unnormalized density, which avoids the costly computation of the normalization constant $Z$.

πŸ“ Bounded Support for Efficiency

To focus on local regions and enable fast GPU rendering, we define the modified splatting kernel:

Bounded Gaussian Kernel

  • $s$: spatial support factor (empirically set to $s=5$), which covers more than 99.999% of the Gaussian mass, so the truncation error is negligible.
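
A minimal NumPy sketch of such a bounded kernel is given below. It evaluates the standard unnormalized bivariate Gaussian (which requires $|\rho| < 1$) and zeroes it outside a support region. The axis-aligned box of half-width $s\sigma$ and the function name `bounded_gaussian` are assumptions made for illustration; the exact support shape used in the `gscuda` implementation is not stated here.

```python
import numpy as np

def bounded_gaussian(x, y, mu_x, mu_y, sigma_x, sigma_y, rho, s=5.0):
    """Unnormalized 2D Gaussian, set to zero outside an assumed box of half-width s*sigma."""
    dx = (x - mu_x) / sigma_x
    dy = (y - mu_y) / sigma_y
    # Quadratic form of the correlated Gaussian; valid only for |rho| < 1.
    q = (dx**2 - 2.0 * rho * dx * dy + dy**2) / (1.0 - rho**2)
    value = np.exp(-0.5 * q)
    # Assumed support: |x - mu_x| <= s*sigma_x and |y - mu_y| <= s*sigma_y.
    inside = (np.abs(dx) <= s) & (np.abs(dy) <= s)
    return np.where(inside, value, 0.0)
```

The function accepts scalars or arrays for `x` and `y`, so it can be evaluated on a whole pixel grid at once. Because the kernel is exactly zero outside the box, a renderer only needs to visit the pixels inside each token's box.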

🧩 Token Representation

An image is encoded as $l$ GPS-tokens: $\mathbf{z} = \{\mathbf{z}_1, \dots, \mathbf{z}_l\}$, where each $\mathbf{z}_i = \{\mathbf{g}_i, \mathbf{f}_i\}$ contains:

| Component | Symbol & Type | Role |
| --- | --- | --- |
| Geometry | $\mathbf{g}_i = (\mu_x, \mu_y, \sigma_x, \sigma_y, \rho)$ | Spatial layout (2D Gaussian params) |
| Texture | $\mathbf{f}_i \in \mathbb{R}^{c-5}$ | Visual features (from CNN/Transformer) |

Disentangled design: geometry and texture can be manipulated independently.
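
As an illustration of this layout, the sketch below splits an array of $l$ tokens with $c$ channels into a geometry part and a texture part. Storing the five Gaussian parameters in the first five channels, and the helper name `split_tokens`, are assumptions for this example rather than the repository's actual memory layout.

```python
import numpy as np

def split_tokens(z):
    """Split tokens of shape (l, c) into geometry (l, 5) and texture (l, c - 5)."""
    z = np.asarray(z)
    if z.ndim != 2 or z.shape[1] <= 5:
        raise ValueError("expected shape (l, c) with c > 5")
    geometry = z[:, :5]  # assumed order: mu_x, mu_y, sigma_x, sigma_y, rho
    texture = z[:, 5:]   # remaining c - 5 texture channels
    return geometry, texture

# Illustrative sizes only: l = 256 tokens with c = 16 channels each.
tokens = np.zeros((256, 16))
geometry, texture = split_tokens(tokens)
```

Because the two parts are separate slices, the geometry can be edited (for example, moving a token) without touching its texture features, and vice versa.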

⚑ CUDA-Based Rendering Algorithm

We implement a CUDA-accelerated rendering algorithm to parallelize the forward and backward processes of the bounded Gaussian splatting kernel. Implementation details are provided in the gscuda folder.

πŸ—οΈ Framework: From Image to GPS-Tokens

GPSToken pipeline: Initialization β†’ Refinement β†’ Rendering β†’ Reconstruction

Spatially-adaptive Token Initialization

We use an iterative algorithm to partition the image into regions based on texture complexity. Each region's location and size initialize the Gaussian parameters of corresponding GPS-tokens, enabling a coarse spatially-adaptive representation.
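
The initialization idea can be sketched as a greedy partition. The version below is a simplified stand-in, not the authors' algorithm: it assumes a grayscale image in $[0, 1]$, scores each region by histogram entropy times area, always halves the highest-scoring region along its longer side, and sets each $\sigma$ to half the region extent with $\rho = 0$. All function names and these scoring and splitting rules are assumptions.

```python
import heapq
import numpy as np

def region_entropy(patch, bins=32):
    """Shannon entropy (bits) of the intensity histogram of a patch in [0, 1]."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def init_tokens(gray, num_tokens):
    """Greedy split into num_tokens boxes; returns (mu_x, mu_y, sigma_x, sigma_y, rho) rows."""
    h, w = gray.shape
    if num_tokens > h * w:
        raise ValueError("more tokens requested than pixels")

    def score(r):
        y0, x0, y1, x1 = r
        area = (y1 - y0) * (x1 - x0)
        if area <= 1:
            return -1.0  # single pixels cannot be split, so never prefer them
        return region_entropy(gray[y0:y1, x0:x1]) * area

    # Max-heap via negated scores: the most complex region is split first.
    heap = [(-score((0, 0, h, w)), (0, 0, h, w))]
    while len(heap) < num_tokens:
        _, (y0, x0, y1, x1) = heapq.heappop(heap)
        if y1 - y0 >= x1 - x0:  # split along the longer side
            m = (y0 + y1) // 2
            parts = [(y0, x0, m, x1), (m, x0, y1, x1)]
        else:
            m = (x0 + x1) // 2
            parts = [(y0, x0, y1, m), (y0, m, y1, x1)]
        for r in parts:
            heapq.heappush(heap, (-score(r), r))

    tokens = []
    for _, (y0, x0, y1, x1) in heap:
        mu_x, mu_y = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        sigma_x, sigma_y = (x1 - x0) / 2.0, (y1 - y0) / 2.0  # assumed: sigma = half extent
        tokens.append((mu_x, mu_y, sigma_x, sigma_y, 0.0))   # rho = 0: axis-aligned start
    return np.array(tokens)
```

On an image whose left half is noise and whose right half is flat, this sketch places most boxes on the noisy side, which mirrors the intended behavior of allocating finer partitions to complex textures.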

Spatially-adaptive Token Refinement

Given the initialized Gaussian parameters, a transformer-based encoder refines them for fine-grained spatial adaptation and simultaneously extracts the texture features $\mathbf{f}$ of each region using RoIAlign layers. After refinement, the Gaussian parameters better match local textures.

End-to-end Reconstruction

During decoding, we first render the GPS-tokens into a 2D feature map, then decode this feature map into the reconstructed image. Following existing works, we train with a combination of reconstruction loss $L_{\text{rec}}$, perceptual loss $L_{\text{perc}}$, and adversarial loss $L_{\text{adv}}$.
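
The rendering step can be illustrated with a slow NumPy reference. It assumes that the feature at each pixel is the kernel-weighted sum of the token features and that each kernel is truncated to an axis-aligned box of half-width $s\sigma$; both choices, and the name `render_tokens`, are assumptions for this sketch. The CUDA implementation in `gscuda` may differ, for example in how overlapping tokens are combined.

```python
import numpy as np

def render_tokens(geometry, texture, height, width, s=5.0):
    """Splat l tokens into a (height, width, d) feature map as a kernel-weighted sum."""
    # ys holds row indices, xs holds column indices, both of shape (height, width).
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    out = np.zeros((height, width, texture.shape[1]))
    for (mu_x, mu_y, sx, sy, rho), feat in zip(geometry, texture):
        dx, dy = (xs - mu_x) / sx, (ys - mu_y) / sy
        k = np.exp(-0.5 * (dx**2 - 2 * rho * dx * dy + dy**2) / (1 - rho**2))
        k[(np.abs(dx) > s) | (np.abs(dy) > s)] = 0.0  # bounded support (assumed box)
        out += k[..., None] * feat                    # weight the token feature per pixel
    return out

# One token centered at pixel (x=4, y=4) with a 2-channel feature.
geo = np.array([[4.0, 4.0, 1.0, 1.0, 0.0]])
tex = np.array([[2.0, -1.0]])
feature_map = render_tokens(geo, tex, 9, 9)
```

Every operation here is differentiable with respect to both the Gaussian parameters and the features, which is what allows gradients from the reconstruction losses to reach the encoder.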

📊 Experimental Results

1. Image Reconstruction ($256\times 256$ on the ImageNet validation set)

At 256 tokens, GPSToken-L256 outperforms the fixed-grid VAVAE (16×16) on every reported metric.

| Method | Token Count | Params (M) | PSNR | SSIM | LPIPS | rFID | FID |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SDXL-VAE | 32×32 | 83.6 | 25.55 | 0.727 | 0.066 | 0.73 | 2.35 |
| VAVAE | 16×16 | 69.8 | 25.76 | 0.742 | 0.050 | 0.27 | 1.74 |
| DCAE | 8×8 | 323.4 | 23.62 | 0.644 | 0.092 | 0.98 | 2.59 |
| TiTok-B64 | 64 | 204.8 | 17.01 | 0.390 | 0.263 | 1.75 | 2.50 |
| TiTok-S128 | 128 | 83.7 | 17.66 | 0.413 | 0.220 | 1.73 | 3.25 |
| MAETok | 128 | 173.9 | 23.25 | 0.626 | 0.096 | 0.65 | 2.01 |
| FlexTok | 256 | 949.7 | 17.69 | 0.475 | 0.257 | 4.02 | 4.88 |
| GPSToken-S64 | 64 | 127.5 | 22.18 | 0.578 | 0.111 | 1.31 | 3.02 |
| GPSToken-M128 | 128 | 127.8 | 24.06 | 0.657 | 0.080 | 0.65 | 2.18 |
| GPSToken-L256 | 256 | 128.7 | 28.81 | 0.809 | 0.043 | 0.22 | 1.65 |

2. Spatial-Adaptivity Visualization

Gaussian tokens automatically concentrate on high-complexity regions.

From left to right: initialized Gaussian parameters, refined Gaussian parameters, reconstructed images, and ground-truth images.

3. User-Controllable Adaptivity

Tokens can be manually guided to focus on regions of interest to the user.

From left to right: input image, initialized Gaussian parameters, reconstruction from the initialized parameters, adjusted Gaussian parameters, and reconstruction from the adjusted parameters.

4. Variable Token Count of GPS-Tokens

The number of tokens used to encode an image can be increased or decreased.

For demonstration, we use GPSToken-M128, which is trained with 128 tokens only.

5. Scales to Higher Resolutions

GPSToken generalizes to higher resolutions, e.g., $512\times 512$ or $1024\times 1024$, with models trained only on $256\times 256$ images.

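| Resolution | Method | Tokens | PSNR ↑ | SSIM ↑ | LPIPS ↓ | rFID ↓ | rec. sFID ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 512×512 | SDXL-VAE | 64×64 | 28.42 | 0.817 | 0.059 | 0.271 | 1.36 |
| 512×512 | VQVAE-f16 | 32×32 | 21.83 | 0.604 | 0.172 | 2.29 | 7.95 |
| 512×512 | GPSToken-M128 | 512 | 26.74 | 0.764 | 0.073 | 0.367 | 1.93 |
| 512×512 | GPSToken-L256 | 1024 | 32.00 | 0.887 | 0.039 | 0.175 | 0.699 |
| 1024×1024 | SDXL-VAE | 128×128 | 33.27 | 0.909 | 0.057 | 0.113 | 0.561 |
| 1024×1024 | VQVAE-f16 | 64×64 | 25.41 | 0.744 | 0.169 | 1.40 | 4.98 |
| 1024×1024 | GPSToken-M128 | 2048 | 31.22 | 0.873 | 0.072 | 0.236 | 1.24 |
| 1024×1024 | GPSToken-L256 | 4096 | 37.71 | 0.955 | 0.031 | 0.055 | 0.276 |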
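
The token counts in this table are consistent with keeping the per-$256\times 256$ token budget fixed and scaling it with image area. The short sketch below reproduces that arithmetic; the helper name `scaled_token_count` and the assumption that the image is covered by whole $256\times 256$ tiles are ours, not taken from the repository.

```python
def scaled_token_count(base_tokens, height, width, base_size=256):
    """Token count when the budget per base_size x base_size area is kept constant."""
    if height % base_size or width % base_size:
        raise ValueError("this sketch assumes sizes that are multiples of base_size")
    return base_tokens * (height // base_size) * (width // base_size)

# Token counts for the two models and two resolutions listed in the table.
counts = {
    ("GPSToken-M128", 512): scaled_token_count(128, 512, 512),
    ("GPSToken-L256", 512): scaled_token_count(256, 512, 512),
    ("GPSToken-M128", 1024): scaled_token_count(128, 1024, 1024),
    ("GPSToken-L256", 1024): scaled_token_count(256, 1024, 1024),
}
```

Doubling the side length therefore quadruples the token count, so the token density per pixel stays the same as at training resolution.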

🚀 Quick Start

Model Zoo

| Models | Token Count | Download |
| --- | --- | --- |
| GPSToken-S64 | 64 | baidupan/onedrive |
| GPSToken-M128 | 128 | baidupan/onedrive |
| GPSToken-L256 | 256 | baidupan/onedrive |

Inference scripts

python3 inference_gsptoken.py --model_path [model_path] --data_path [data_path] --config configs/gpstoken_l256.yaml --data_size 256 --output [xxx]

CITATION

@misc{zhang2025gpstokengaussianparameterizedspatiallyadaptive,
      title={GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation}, 
      author={Zhengqiang Zhang and Rongyuan Wu and Lingchen Sun and Lei Zhang},
      year={2025},
      eprint={2509.01109},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.01109}, 
}

CONTACT

Please open an issue or contact Zhengqiang at [email protected]
