# Fused LayerNorm CUDA Operator

High-performance CUDA implementation of Layer Normalization with an optional fused GELU variant. Works as a low-level extension and as a drop-in replacement module for torch.nn.LayerNorm.
## Highlights

- Simple, robust kernels that work with any hidden size (including odd or prime sizes)
- Forward kernel exported as a minimal extension by default
- Optional autograd-enabled module (`FusedLayerNorm`) with forward/backward support
- Benchmark scripts and tests included

Results will vary by GPU and workload. See Benchmarks to reproduce on your machine.
## Table of Contents

- Installation
- Quick start
- API reference
- Benchmarks
- Run tests
- Build options (minimal vs full)
- Project structure
- Troubleshooting
- License
## Installation

### Prerequisites

- Python 3.8+
- A CUDA-enabled PyTorch build that matches your local CUDA runtime and NVIDIA driver
- CUDA Toolkit 11.x or newer
- A C++ toolchain (on Windows, Visual Studio Build Tools) and a working NVCC
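As a quick sanity check before building, you can confirm that PyTorch sees your GPU and that a CUDA toolkit is visible to its extension builder. This is a minimal sketch; the values printed depend entirely on your environment:

```python
import torch
from torch.utils.cpp_extension import CUDA_HOME

# PyTorch build and the CUDA runtime it was compiled against
print("torch:", torch.__version__, "| torch CUDA:", torch.version.cuda)

# GPU visibility and compute capability of device 0
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0),
          "| capability:", torch.cuda.get_device_capability(0))

# CUDA_HOME is where torch's extension builder looks for nvcc
print("CUDA_HOME:", CUDA_HOME)
```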
### Steps

- Clone and install dependencies:

```bash
git clone https://github.com/JonSnow1807/Fused-LayerNorm-CUDA-Operator.git
cd Fused-LayerNorm-CUDA-Operator
pip install -r requirements.txt
```

- Build and install the CUDA extension (editable):

```bash
pip install -e .
```

This builds the minimal extension that exports the forward LayerNorm and the fused LayerNorm+GELU kernels.
## Quick start

### Minimal (extension functions)

```python
import torch
import fused_layernorm_cuda  # built by setup.py

# Shapes: (batch, hidden)
x = torch.randn(32, 4096, device='cuda', dtype=torch.float32)
gamma = torch.ones(4096, device='cuda', dtype=torch.float32)
beta = torch.zeros(4096, device='cuda', dtype=torch.float32)

# Forward LayerNorm (CUDA)
y = fused_layernorm_cuda.layernorm(x, gamma, beta, 1e-5)

# Fused LayerNorm + GELU (single kernel)
y_gelu = fused_layernorm_cuda.layernorm_gelu(x, gamma, beta, 1e-5)
```
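To sanity-check the kernel numerically, you can compare it against PyTorch's reference implementation. This is a minimal sketch; the tolerances below are illustrative:

```python
import torch
import torch.nn.functional as F
import fused_layernorm_cuda

x = torch.randn(32, 4096, device='cuda', dtype=torch.float32)
gamma = torch.randn(4096, device='cuda')
beta = torch.randn(4096, device='cuda')

y_cuda = fused_layernorm_cuda.layernorm(x, gamma, beta, 1e-5)
y_ref = F.layer_norm(x, (4096,), weight=gamma, bias=beta, eps=1e-5)

# Should agree with the PyTorch reference up to floating-point tolerance
torch.testing.assert_close(y_cuda, y_ref, rtol=1e-5, atol=1e-5)
```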
### Drop-in replacement module (optional, with autograd)

```python
import torch
from fused_layernorm import FusedLayerNorm  # requires the full build (see below)

ln = FusedLayerNorm(4096).cuda()
inp = torch.randn(32, 4096, device='cuda', requires_grad=True)
out = ln(inp)
out.sum().backward()
```

Note: the optional module relies on the forward/backward bindings in csrc/layernorm_cuda.cpp. See Build options to enable them.
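With the full build enabled, you can check that the module matches torch.nn.LayerNorm in both outputs and gradients. The sketch below assumes FusedLayerNorm exposes the same weight/bias parameter names as torch.nn.LayerNorm, as a drop-in replacement would:

```python
import torch
from fused_layernorm import FusedLayerNorm

torch.manual_seed(0)
x = torch.randn(4, 4096, device='cuda', dtype=torch.float32)

fused = FusedLayerNorm(4096).cuda()
ref = torch.nn.LayerNorm(4096).cuda()
ref.load_state_dict(fused.state_dict())  # assumes matching parameter names (weight, bias)

xa = x.clone().requires_grad_(True)
xb = x.clone().requires_grad_(True)

out_fused = fused(xa)
out_ref = ref(xb)
torch.testing.assert_close(out_fused, out_ref, rtol=1e-4, atol=1e-4)

# Gradients w.r.t. the input should also agree
out_fused.sum().backward()
out_ref.sum().backward()
torch.testing.assert_close(xa.grad, xb.grad, rtol=1e-4, atol=1e-4)
```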
## API reference

### Low-level CUDA extension (import fused_layernorm_cuda)

`layernorm(input, gamma, beta, epsilon: float) -> Tensor`

- input: 2D CUDA tensor of shape (batch, hidden)
- gamma, beta: 1D CUDA tensors of length hidden (optional but recommended)
- epsilon: numerical stability term

`layernorm_gelu(input, gamma, beta, epsilon: float) -> Tensor`

- Same signature as above; applies GELU after normalization
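For reference, the fused call should match running the plain kernel followed by GELU in PyTorch. The sketch below assumes the kernel implements the exact (erf-based) GELU; if it uses the tanh approximation, pass `approximate='tanh'` to F.gelu and loosen the tolerances:

```python
import torch
import torch.nn.functional as F
import fused_layernorm_cuda

x = torch.randn(16, 1024, device='cuda')
gamma = torch.ones(1024, device='cuda')
beta = torch.zeros(1024, device='cuda')

fused = fused_layernorm_cuda.layernorm_gelu(x, gamma, beta, 1e-5)
unfused = F.gelu(fused_layernorm_cuda.layernorm(x, gamma, beta, 1e-5))

torch.testing.assert_close(fused, unfused, rtol=1e-4, atol=1e-4)
```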
### Optional Python module (from fused_layernorm import ...)

`FusedLayerNorm(normalized_shape, eps=1e-5, elementwise_affine=True, device=None, dtype=None)`

- Drop-in replacement for torch.nn.LayerNorm

`fused_layer_norm(input, normalized_shape, weight=None, bias=None, eps=1e-5)`

- Functional API similar to torch.nn.functional.layer_norm

Utilities (when fully built): `replace_torch_layernorm()`, `restore_torch_layernorm()`, and profiling helpers in fused_layernorm/functional.py.
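A brief usage sketch of the functional API and the utilities, assuming the full build is installed. The import paths and the patch/restore semantics are assumptions based on the names above; check fused_layernorm/functional.py for the actual behavior:

```python
import torch
from fused_layernorm import fused_layer_norm, replace_torch_layernorm, restore_torch_layernorm

x = torch.randn(8, 2048, device='cuda')
w = torch.ones(2048, device='cuda')
b = torch.zeros(2048, device='cuda')

# Functional form, mirroring torch.nn.functional.layer_norm
y = fused_layer_norm(x, (2048,), weight=w, bias=b, eps=1e-5)

# Assumed to monkey-patch torch.nn.LayerNorm so existing models pick up the fused module
replace_torch_layernorm()
model = torch.nn.Sequential(torch.nn.Linear(2048, 2048), torch.nn.LayerNorm(2048)).cuda()
restore_torch_layernorm()
```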
## Benchmarks

You can reproduce the realistic and cache-friendly scenarios measured from this project:

```bash
python benchmarks/reproduce_speedup.py
```

Comprehensive suite (forward/backward, memory):

```bash
python benchmarks/benchmark_layernorm.py --quick   # or run without --quick for a full sweep
```

Benchmark results and plots are written under benchmarks/results/.
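If you prefer a quick ad-hoc measurement, a minimal CUDA-event timing sketch might look like the following (numbers are illustrative and depend on your GPU and shapes):

```python
import torch
import fused_layernorm_cuda

def time_cuda(fn, iters=200, warmup=20):
    """Return average milliseconds per call using CUDA events."""
    for _ in range(warmup):
        fn()
    start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

x = torch.randn(32, 4096, device='cuda')
gamma = torch.ones(4096, device='cuda')
beta = torch.zeros(4096, device='cuda')
ref = torch.nn.LayerNorm(4096).cuda()

print("fused kernel :", time_cuda(lambda: fused_layernorm_cuda.layernorm(x, gamma, beta, 1e-5)), "ms")
print("torch.nn     :", time_cuda(lambda: ref(x)), "ms")
```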
## Run tests

```bash
pytest -q
```

The tests cover correctness against PyTorch, gradient checks (when fully built), edge cases, determinism, and basic performance regression.
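As an illustration of the kind of edge case the suite exercises, a hypothetical test for odd and prime hidden sizes could look like this (the test name and parameters are illustrative, not taken from the repo's test files):

```python
import pytest
import torch
import torch.nn.functional as F
import fused_layernorm_cuda

@pytest.mark.parametrize("hidden", [1, 3, 127, 1031])  # odd and prime sizes
def test_layernorm_matches_pytorch(hidden):
    x = torch.randn(4, hidden, device='cuda')
    gamma = torch.randn(hidden, device='cuda')
    beta = torch.randn(hidden, device='cuda')
    y = fused_layernorm_cuda.layernorm(x, gamma, beta, 1e-5)
    ref = F.layer_norm(x, (hidden,), weight=gamma, bias=beta, eps=1e-5)
    torch.testing.assert_close(y, ref, rtol=1e-5, atol=1e-5)
```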
## Build options (minimal vs full)

This repo contains two bindings implementations:

- Minimal build (default via setup.py):
  - Sources: csrc/bindings.cpp, csrc/layernorm_cuda_kernel.cu
  - Exports: layernorm, layernorm_gelu (forward-only)
- Full build (autograd-enabled):
  - Sources: csrc/layernorm_cuda.cpp, csrc/layernorm_cuda_kernel.cu, csrc/layernorm_cuda_kernel_optimized.cu
  - Exports: forward, backward, plus the helpers get_memory_usage and get_performance_hints
  - Required by the FusedLayerNorm module and the functional API

To enable the full build, modify setup.py to compile the full set of sources above, then reinstall with pip install -e . (see the sketch below).
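A sketch of what the modified setup.py could look like, based only on the source list above; it is not the repo's exact build script, so keep the extension name and compile flags consistent with the existing setup.py:

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='fused_layernorm_cuda',
    ext_modules=[
        CUDAExtension(
            name='fused_layernorm_cuda',
            sources=[
                'csrc/layernorm_cuda.cpp',                  # full autograd bindings
                'csrc/layernorm_cuda_kernel.cu',            # simple, robust kernels
                'csrc/layernorm_cuda_kernel_optimized.cu',  # optimized forward path
            ],
            extra_compile_args={'cxx': ['-O3'], 'nvcc': ['-O3']},
        )
    ],
    cmdclass={'build_ext': BuildExtension},
)
```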
## Project structure

```text
Fused-LayerNorm-CUDA-Operator/
├─ csrc/
│  ├─ bindings.cpp                        # Minimal PyBind11 bindings (forward-only)
│  ├─ layernorm_cuda_kernel.cu            # Simple, robust kernel + fused GELU
│  ├─ layernorm_cuda.cpp                  # Full autograd bindings (forward/backward)
│  └─ layernorm_cuda_kernel_optimized.cu  # Optimized forward for large hidden sizes
├─ fused_layernorm/
│  ├─ __init__.py
│  ├─ layernorm.py                        # FusedLayerNorm module, functional API
│  └─ functional.py                       # Profiling and helpers
├─ benchmarks/
│  ├─ benchmark_layernorm.py              # End-to-end benchmark suite
│  ├─ reproduce_speedup.py                # Quick reproduction script
│  └─ visualize_results.py                # Plot generation
├─ tests/                                 # Correctness and performance tests
├─ docs/                                  # Architecture, optimization notes (WIP)
├─ setup.py                               # Build config (minimal by default)
└─ requirements.txt
```
## Troubleshooting

- Build toolchain on Windows: ensure Visual Studio Build Tools (MSVC) and NVCC are on PATH.
- PyTorch/CUDA mismatch: install a PyTorch build that matches your CUDA runtime and driver.
- Architecture flags: set TORCH_CUDA_ARCH_LIST before installing to control SASS generation, e.g. `TORCH_CUDA_ARCH_LIST="8.0;8.6" pip install -e .`.
- Out of memory during benchmarks: use --quick or smaller configs.
## License

MIT. See LICENSE.