Stronger Normalization-Free Transformers [CVPR 2026]

This repository provides a modular PyTorch implementation of Dynamic erf (Derf), from the following paper:

Stronger Normalization-Free Transformers
Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, and Zhuang Liu
Princeton, NYU, CMU [arXiv]


Dynamic erf (Derf) is a simple point-wise function: $\mathrm{Derf}(x) = \mathrm{erf}(\alpha x + s)$, where $\alpha$ and $s$ are learnable scalars.
Derf is designed as a drop-in replacement for normalization layers in Transformers, and achieves stronger performance than LayerNorm, RMSNorm, and Dynamic Tanh across a wide range of modalities and tasks.
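
Since Derf is a point-wise function with two learnable scalars, it fits in a few lines of PyTorch. The following is a minimal sketch, not the official module: the initialization of $\alpha$ and $s$ is an illustrative assumption, and the repository's version may differ (for example, by adding a channel-wise affine transform as normalization layers do).

```python
import torch
import torch.nn as nn

class Derf(nn.Module):
    """Dynamic erf: Derf(x) = erf(alpha * x + s) with learnable scalars.

    Minimal sketch; the initial values below are illustrative assumptions,
    not the official defaults.
    """
    def __init__(self, alpha_init: float = 1.0, s_init: float = 0.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))
        self.s = nn.Parameter(torch.tensor(s_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Purely element-wise: no mean or variance statistics are computed,
        # which is what makes this a normalization-free replacement.
        return torch.erf(self.alpha * x + self.s)
```

Unlike `nn.LayerNorm`, it takes no dimension argument and computes no batch or channel statistics, so it can stand wherever a normalization layer would.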

Implementation

We evaluate Derf across four representative Transformer model families and one other modern architecture, strictly following their official implementations.

  • DiT - Diffusion Transformer for image generation
  • ViT - Vision Transformer for image classification
  • Speech Model - Speech recognition and processing
  • DNA Model - Genomic sequence modeling
  • Language Model - Transformer-based language models

For detailed installation, implementation details, and usage instructions, please refer to the README in each model's directory.
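
To make the drop-in claim concrete, here is a hypothetical helper (not part of this repository) that recursively swaps every `nn.LayerNorm` in a model for the `Derf` module sketched above; the official per-model integrations follow each upstream codebase and may differ.

```python
import torch.nn as nn

def replace_layernorm_with_derf(module: nn.Module) -> nn.Module:
    """Hypothetical helper: replace every nn.LayerNorm submodule
    with a Derf layer (defined in the sketch above), in place."""
    for name, child in module.named_children():
        if isinstance(child, nn.LayerNorm):
            setattr(module, name, Derf())
        else:
            replace_layernorm_with_derf(child)
    return module
```

For example, `model = replace_layernorm_with_derf(model)` after constructing any Transformer built on `nn.LayerNorm`.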

In addition to Derf, we provide implementations of the other point-wise functions discussed in our paper, such as $\mathrm{satursin}(x)$, $\mathrm{isru}(x)$, $\mathrm{expsign}(x)$, and $\arctan(x)$, as sketched below.
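
These alternatives share Derf's dynamic, point-wise shape $f(\alpha x + s)$. Below is a minimal sketch of that shared pattern, instantiated only with $\arctan$, whose definition is standard; the exact forms of satursin, isru, and expsign are given in the paper, and assuming they use this same parameterization is an assumption of the sketch, not a statement about the official code.

```python
import torch
import torch.nn as nn

class DynamicPointwise(nn.Module):
    """Generic f(alpha * x + s) with learnable scalars alpha and s.

    Sketch only: whether every alternative in the paper uses exactly
    this parameterization is an assumption here.
    """
    def __init__(self, fn, alpha_init: float = 1.0, s_init: float = 0.0):
        super().__init__()
        self.fn = fn
        self.alpha = nn.Parameter(torch.tensor(alpha_init))
        self.s = nn.Parameter(torch.tensor(s_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fn(self.alpha * x + self.s)

# Example: a dynamic arctan unit (arctan's definition is unambiguous).
dyn_arctan = DynamicPointwise(torch.arctan)
```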

Results

We demonstrate the effectiveness of Derf across different model architectures and tasks:

Vision Transformer (ViT)

| acc@1 | LN | DyT | Derf | $\Delta_{\text{LN}}$ | $\Delta_{\text{DyT}}$ |
| --- | --- | --- | --- | --- | --- |
| ViT-B | 82.3% | 82.5% | 82.8% | ↑ 0.5% | ↑ 0.3% |
| ViT-L | 83.1% | 83.6% | 83.8% | ↑ 0.7% | ↑ 0.2% |

Diffusion Transformer (DiT)

| FID | LN | DyT | Derf | $\Delta_{\text{LN}}$ | $\Delta_{\text{DyT}}$ |
| --- | --- | --- | --- | --- | --- |
| DiT-B/4 | 64.93 | 63.94 | 63.23 | ↓ 1.70 | ↓ 0.71 |
| DiT-L/4 | 45.91 | 45.66 | 43.94 | ↓ 1.97 | ↓ 1.72 |
| DiT-XL/2 | 19.94 | 20.83 | 18.92 | ↓ 1.02 | ↓ 1.91 |

Speech Model (wav2vec 2.0)

| val loss | LN | DyT | Derf | $\Delta_{\text{LN}}$ | $\Delta_{\text{DyT}}$ |
| --- | --- | --- | --- | --- | --- |
| wav2vec 2.0 Base | 1.95 | 1.95 | 1.93 | ↓ 0.02 | ↓ 0.02 |
| wav2vec 2.0 Large | 1.92 | 1.91 | 1.90 | ↓ 0.02 | ↓ 0.01 |

DNA Model

| acc@1 | Norm | DyT | Derf | $\Delta_{\text{Norm}}$ | $\Delta_{\text{DyT}}$ |
| --- | --- | --- | --- | --- | --- |
| Hyena | 85.2% | 85.2% | 85.7% | ↑ 0.5% | ↑ 0.5% |
| Caduceus | 86.9% | 86.9% | 87.3% | ↑ 0.4% | ↑ 0.4% |

Language Model (GPT-2)

| val loss | LN | DyT | Derf | $\Delta_{\text{LN}}$ | $\Delta_{\text{DyT}}$ |
| --- | --- | --- | --- | --- | --- |
| GPT-2 | 2.94 | 2.97 | 2.94 | 0.00 | ↓ 0.03 |

Acknowledgement

This work builds upon several excellent open-source projects. We are grateful to the authors for their contributions:

  • This repository is built using the timm library
  • ViT implementation is based on ConvNeXt
  • DiT implementation is based on DiT
  • Speech Model implementation is based on fairseq
  • DNA Model implementation is based on Caduceus
  • Language Model implementation is based on nanoGPT
  • We compare our method with DyT (Dynamic Tanh) as one of our baselines

License

This project is released under the MIT license. Please see the LICENSE file for more information.

Citation

If you find this repository helpful, please consider citing:

```bibtex
@article{Derf,
  title={Stronger Normalization-Free Transformers},
  author={Chen, Mingzhi and Lu, Taiming and Zhu, Jiachen and Sun, Mingjie and Liu, Zhuang},
  journal={arXiv preprint arXiv:2512.10938},
  year={2025}
}
```
