This repository provides a modular PyTorch implementation of Dynamic erf (Derf), introduced in the following paper:
Stronger Normalization-Free Transformers
Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, and Zhuang Liu
Princeton, NYU, CMU [arXiv]
Dynamic erf (Derf) is a simple element-wise function.
Derf is designed as a drop-in replacement for normalization layers in Transformers, and achieves stronger performance than LayerNorm, RMSNorm, and Dynamic Tanh across a wide range of modalities and tasks.
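As a rough sketch of the idea (assuming Derf follows the same learnable-scale template as DyT, i.e. `gamma * erf(alpha * x) + beta` with a learnable scale `alpha` and affine parameters `gamma`, `beta`; the exact parameterization in the official code may differ):

```python
import math

def derf(x, alpha=1.0, gamma=1.0, beta=0.0):
    # Hypothetical scalar form: gamma * erf(alpha * x) + beta.
    # In the repository this would be applied element-wise as a PyTorch
    # module with learnable parameters, replacing the normalization layer.
    return gamma * math.erf(alpha * x) + beta

# erf is bounded in (-1, 1), so large activations are squashed much like
# a normalization layer squashes outliers, with no batch statistics.
print(derf(0.0))    # 0.0: erf is odd, so zero maps to beta (= 0 here)
print(derf(10.0))   # ~1.0: saturates toward gamma + beta
```

Because the function is point-wise, it needs no reductions over batch or channel dimensions, which is what makes it a drop-in replacement for normalization layers.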
We evaluate Derf across four representative Transformer model families and one other modern architecture, strictly following their official implementations.
- DiT - Diffusion Transformer for image generation
- ViT - Vision Transformer for image classification
- Speech Model - Speech recognition and processing
- DNA Model - Genomic sequence modeling
- Language Model - Transformer-based language models
For detailed installation, implementation details, and usage instructions, please refer to the README in each model's directory.
In addition to Derf, we also provide implementations of the other point-wise functions discussed in our paper.
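For intuition on how Derf relates to DyT: tanh and erf are both bounded, odd, S-shaped functions, but they squash at different rates. A quick numerical comparison (plain Python, for illustration only):

```python
import math

# DyT squashes with tanh(alpha * x); Derf swaps in erf(alpha * x).
# erf has a steeper slope at the origin (erf'(0) = 2/sqrt(pi) ~ 1.128
# versus tanh'(0) = 1) and saturates toward +/-1 more sharply.
for x in (0.5, 1.0, 2.0):
    print(f"x={x}: tanh={math.tanh(x):.4f}  erf={math.erf(x):.4f}")
```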
We demonstrate the effectiveness of Derf across different model architectures and tasks:
| acc@1 | LN | DyT | Derf | Δ vs LN | Δ vs DyT |
|---|---|---|---|---|---|
| ViT-B | 82.3% | 82.5% | 82.8% | ↑ 0.5% | ↑ 0.3% |
| ViT-L | 83.1% | 83.6% | 83.8% | ↑ 0.7% | ↑ 0.2% |
| FID | LN | DyT | Derf | Δ vs LN | Δ vs DyT |
|---|---|---|---|---|---|
| DiT-B/4 | 64.93 | 63.94 | 63.23 | ↓ 1.70 | ↓ 0.71 |
| DiT-L/4 | 45.91 | 45.66 | 43.94 | ↓ 1.97 | ↓ 1.72 |
| DiT-XL/2 | 19.94 | 20.83 | 18.92 | ↓ 1.02 | ↓ 1.91 |
| val loss | LN | DyT | Derf | Δ vs LN | Δ vs DyT |
|---|---|---|---|---|---|
| wav2vec 2.0 Base | 1.95 | 1.95 | 1.93 | ↓ 0.02 | ↓ 0.02 |
| wav2vec 2.0 Large | 1.92 | 1.91 | 1.90 | ↓ 0.02 | ↓ 0.01 |
| acc@1 | Norm | DyT | Derf | Δ vs Norm | Δ vs DyT |
|---|---|---|---|---|---|
| Hyena | 85.2% | 85.2% | 85.7% | ↑ 0.5% | ↑ 0.5% |
| Caduceus | 86.9% | 86.9% | 87.3% | ↑ 0.4% | ↑ 0.4% |
| val loss | LN | DyT | Derf | Δ vs LN | Δ vs DyT |
|---|---|---|---|---|---|
| GPT-2 | 2.94 | 2.97 | 2.94 | 0.00 | ↓ 0.03 |
This work builds upon several excellent open-source projects. We are grateful to the authors for their contributions:
- This repository is built using the timm library
- ViT implementation is based on ConvNeXt
- DiT implementation is based on DiT
- Speech Model implementation is based on fairseq
- DNA Model implementation is based on Caduceus
- Language Model implementation is based on nanoGPT
- We compare our method with DyT (Dynamic Tanh) as one of our baselines
This project is released under the MIT license. Please see the LICENSE file for more information.
If you find this repository helpful, please consider citing:
```bibtex
@article{Derf,
  title={Stronger Normalization-Free Transformers},
  author={Chen, Mingzhi and Lu, Taiming and Zhu, Jiachen and Sun, Mingjie and Liu, Zhuang},
  journal={arXiv preprint arXiv:2512.10938},
  year={2025}
}
```