# Fused LayerNorm CUDA Operator

High-performance CUDA implementation of Layer Normalization with an optional fused GELU variant. Works as a low-level extension and as a drop-in replacement module for torch.nn.LayerNorm.
## Highlights

- Simple, robust kernels that work with any hidden size (including odd or prime sizes)
- Forward kernel exported as a minimal extension by default
- Optional autograd-enabled module (`FusedLayerNorm`) with forward/backward support
- Benchmark scripts and tests included

Results will vary by GPU and workload. See Benchmarks to reproduce on your machine.
## Table of Contents

- Installation
- Quick start
- API reference
- Benchmarks
- Run tests
- Build options (minimal vs full)
- Project structure
- Troubleshooting
- License
## Installation

### Prerequisites

- Python 3.8+
- A CUDA-enabled PyTorch build that matches your local CUDA runtime and NVIDIA driver
- CUDA Toolkit 11.x or newer
- A C++ toolchain (on Windows, Visual Studio Build Tools) and a working NVCC
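As a quick sanity check before building, you can confirm that PyTorch sees your GPU and that a CUDA toolkit is visible to its extension builder. This is a minimal sketch; the values printed depend entirely on your environment:

```python
import torch
from torch.utils.cpp_extension import CUDA_HOME

# PyTorch build and the CUDA runtime it was compiled against
print("torch:", torch.__version__, "| torch CUDA:", torch.version.cuda)

# GPU visibility and compute capability of device 0
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0),
          "| capability:", torch.cuda.get_device_capability(0))

# CUDA_HOME is where torch's extension builder looks for nvcc
print("CUDA_HOME:", CUDA_HOME)
```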
### Steps

- Clone and install dependencies:

```bash
git clone https://github.com/JonSnow1807/Fused-LayerNorm-CUDA-Operator.git
cd Fused-LayerNorm-CUDA-Operator
pip install -r requirements.txt
```

- Build and install the CUDA extension (editable):

```bash
pip install -e .
```

This builds the minimal extension that exports the forward LayerNorm and the fused LayerNorm+GELU kernels.
## Quick start

### Minimal (extension functions)

```python
import torch
import fused_layernorm_cuda  # built by setup.py

# Shapes: (batch, hidden)
x = torch.randn(32, 4096, device='cuda', dtype=torch.float32)
gamma = torch.ones(4096, device='cuda', dtype=torch.float32)
beta = torch.zeros(4096, device='cuda', dtype=torch.float32)

# Forward LayerNorm (CUDA)
y = fused_layernorm_cuda.layernorm(x, gamma, beta, 1e-5)

# Fused LayerNorm + GELU (single kernel)
y_gelu = fused_layernorm_cuda.layernorm_gelu(x, gamma, beta, 1e-5)
```
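To sanity-check the kernel numerically, you can compare it against PyTorch's reference implementation. This is a minimal sketch; the tolerances below are illustrative:

```python
import torch
import torch.nn.functional as F
import fused_layernorm_cuda

x = torch.randn(32, 4096, device='cuda', dtype=torch.float32)
gamma = torch.randn(4096, device='cuda')
beta = torch.randn(4096, device='cuda')

y_cuda = fused_layernorm_cuda.layernorm(x, gamma, beta, 1e-5)
y_ref = F.layer_norm(x, (4096,), weight=gamma, bias=beta, eps=1e-5)

# Should agree with the PyTorch reference up to floating-point tolerance
torch.testing.assert_close(y_cuda, y_ref, rtol=1e-5, atol=1e-5)
```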
### Drop-in replacement module (optional, with autograd)

```python
import torch
from fused_layernorm import FusedLayerNorm  # requires the full build (see below)

ln = FusedLayerNorm(4096).cuda()
inp = torch.randn(32, 4096, device='cuda', requires_grad=True)
out = ln(inp)
out.sum().backward()
```

Note: the optional module relies on the forward/backward bindings in csrc/layernorm_cuda.cpp. See Build options to enable them.
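With the full build enabled, you can check that the module matches torch.nn.LayerNorm in both outputs and gradients. The sketch below assumes FusedLayerNorm exposes the same weight/bias parameter names as torch.nn.LayerNorm, as a drop-in replacement would:

```python
import torch
from fused_layernorm import FusedLayerNorm

torch.manual_seed(0)
x = torch.randn(4, 4096, device='cuda', dtype=torch.float32)

fused = FusedLayerNorm(4096).cuda()
ref = torch.nn.LayerNorm(4096).cuda()
ref.load_state_dict(fused.state_dict())  # assumes matching parameter names (weight, bias)

xa = x.clone().requires_grad_(True)
xb = x.clone().requires_grad_(True)

out_fused = fused(xa)
out_ref = ref(xb)
torch.testing.assert_close(out_fused, out_ref, rtol=1e-4, atol=1e-4)

# Gradients w.r.t. the input should also agree
out_fused.sum().backward()
out_ref.sum().backward()
torch.testing.assert_close(xa.grad, xb.grad, rtol=1e-4, atol=1e-4)
```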
## API reference

### Low-level CUDA extension (import fused_layernorm_cuda)

`layernorm(input, gamma, beta, epsilon: float) -> Tensor`

- input: 2D CUDA tensor of shape (batch, hidden)
- gamma, beta: 1D CUDA tensors of length hidden (optional but recommended)
- epsilon: numerical stability term

`layernorm_gelu(input, gamma, beta, epsilon: float) -> Tensor`

- Same signature as above; applies GELU after normalization
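For reference, the fused call should match running the plain kernel followed by GELU in PyTorch. The sketch below assumes the kernel implements the exact (erf-based) GELU; if it uses the tanh approximation, pass `approximate='tanh'` to F.gelu and loosen the tolerances:

```python
import torch
import torch.nn.functional as F
import fused_layernorm_cuda

x = torch.randn(16, 1024, device='cuda')
gamma = torch.ones(1024, device='cuda')
beta = torch.zeros(1024, device='cuda')

fused = fused_layernorm_cuda.layernorm_gelu(x, gamma, beta, 1e-5)
unfused = F.gelu(fused_layernorm_cuda.layernorm(x, gamma, beta, 1e-5))

torch.testing.assert_close(fused, unfused, rtol=1e-4, atol=1e-4)
```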
### Optional Python module (from fused_layernorm import ...)

`FusedLayerNorm(normalized_shape, eps=1e-5, elementwise_affine=True, device=None, dtype=None)`

- Drop-in replacement for torch.nn.LayerNorm

`fused_layer_norm(input, normalized_shape, weight=None, bias=None, eps=1e-5)`

- Functional API similar to torch.nn.functional.layer_norm

Utilities (when fully built): `replace_torch_layernorm()`, `restore_torch_layernorm()`, and profiling helpers in fused_layernorm/functional.py.
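A brief usage sketch of the functional API and the utilities, assuming the full build is installed. The import paths and the patch/restore semantics are assumptions based on the names above; check fused_layernorm/functional.py for the actual behavior:

```python
import torch
from fused_layernorm import fused_layer_norm, replace_torch_layernorm, restore_torch_layernorm

x = torch.randn(8, 2048, device='cuda')
w = torch.ones(2048, device='cuda')
b = torch.zeros(2048, device='cuda')

# Functional form, mirroring torch.nn.functional.layer_norm
y = fused_layer_norm(x, (2048,), weight=w, bias=b, eps=1e-5)

# Assumed to monkey-patch torch.nn.LayerNorm so existing models pick up the fused module
replace_torch_layernorm()
model = torch.nn.Sequential(torch.nn.Linear(2048, 2048), torch.nn.LayerNorm(2048)).cuda()
restore_torch_layernorm()
```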
## Benchmarks

You can reproduce the realistic and cache-friendly scenarios measured from this project:

```bash
python benchmarks/reproduce_speedup.py
```

Comprehensive suite (forward/backward, memory):

```bash
python benchmarks/benchmark_layernorm.py --quick   # or run without --quick for a full sweep
```

Benchmark results and plots are written under benchmarks/results/.
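If you prefer a quick ad-hoc measurement, a minimal CUDA-event timing sketch might look like the following (numbers are illustrative and depend on your GPU and shapes):

```python
import torch
import fused_layernorm_cuda

def time_cuda(fn, iters=200, warmup=20):
    """Return average milliseconds per call using CUDA events."""
    for _ in range(warmup):
        fn()
    start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

x = torch.randn(32, 4096, device='cuda')
gamma = torch.ones(4096, device='cuda')
beta = torch.zeros(4096, device='cuda')
ref = torch.nn.LayerNorm(4096).cuda()

print("fused kernel :", time_cuda(lambda: fused_layernorm_cuda.layernorm(x, gamma, beta, 1e-5)), "ms")
print("torch.nn     :", time_cuda(lambda: ref(x)), "ms")
```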
## Run tests

```bash
pytest -q
```

The tests cover correctness against PyTorch, gradient checks (when fully built), edge cases, determinism, and basic performance regression.
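As an illustration of the kind of edge case the suite exercises, a hypothetical test for odd and prime hidden sizes could look like this (the test name and parameters are illustrative, not taken from the repo's test files):

```python
import pytest
import torch
import torch.nn.functional as F
import fused_layernorm_cuda

@pytest.mark.parametrize("hidden", [1, 3, 127, 1031])  # odd and prime sizes
def test_layernorm_matches_pytorch(hidden):
    x = torch.randn(4, hidden, device='cuda')
    gamma = torch.randn(hidden, device='cuda')
    beta = torch.randn(hidden, device='cuda')
    y = fused_layernorm_cuda.layernorm(x, gamma, beta, 1e-5)
    ref = F.layer_norm(x, (hidden,), weight=gamma, bias=beta, eps=1e-5)
    torch.testing.assert_close(y, ref, rtol=1e-5, atol=1e-5)
```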
## Build options (minimal vs full)

This repo contains two bindings implementations:

- Minimal build (default via setup.py):
  - Sources: csrc/bindings.cpp, csrc/layernorm_cuda_kernel.cu
  - Exports: layernorm, layernorm_gelu (forward-only)
- Full build (autograd-enabled):
  - Sources: csrc/layernorm_cuda.cpp, csrc/layernorm_cuda_kernel.cu, csrc/layernorm_cuda_kernel_optimized.cu
  - Exports: forward, backward, plus the helpers get_memory_usage and get_performance_hints
  - Required by the FusedLayerNorm module and the functional API

To enable the full build, modify setup.py to compile the full set of sources above, then reinstall with pip install -e . (see the sketch below).
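A sketch of what the modified setup.py could look like, based only on the source list above; it is not the repo's exact build script, so keep the extension name and compile flags consistent with the existing setup.py:

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='fused_layernorm_cuda',
    ext_modules=[
        CUDAExtension(
            name='fused_layernorm_cuda',
            sources=[
                'csrc/layernorm_cuda.cpp',                  # full autograd bindings
                'csrc/layernorm_cuda_kernel.cu',            # simple, robust kernels
                'csrc/layernorm_cuda_kernel_optimized.cu',  # optimized forward path
            ],
            extra_compile_args={'cxx': ['-O3'], 'nvcc': ['-O3']},
        )
    ],
    cmdclass={'build_ext': BuildExtension},
)
```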
## Project structure

```text
Fused-LayerNorm-CUDA-Operator/
├─ csrc/
│  ├─ bindings.cpp                        # Minimal PyBind11 bindings (forward-only)
│  ├─ layernorm_cuda_kernel.cu            # Simple, robust kernel + fused GELU
│  ├─ layernorm_cuda.cpp                  # Full autograd bindings (forward/backward)
│  └─ layernorm_cuda_kernel_optimized.cu  # Optimized forward for large hidden sizes
├─ fused_layernorm/
│  ├─ __init__.py
│  ├─ layernorm.py                        # FusedLayerNorm module, functional API
│  └─ functional.py                       # Profiling and helpers
├─ benchmarks/
│  ├─ benchmark_layernorm.py              # End-to-end benchmark suite
│  ├─ reproduce_speedup.py                # Quick reproduction script
│  └─ visualize_results.py                # Plot generation
├─ tests/                                 # Correctness and performance tests
├─ docs/                                  # Architecture, optimization notes (WIP)
├─ setup.py                               # Build config (minimal by default)
└─ requirements.txt
```
## Troubleshooting

- Build toolchain on Windows: ensure Visual Studio Build Tools (MSVC) and NVCC are on PATH.
- PyTorch/CUDA mismatch: install a PyTorch build that matches your CUDA runtime and driver.
- Architecture flags: set TORCH_CUDA_ARCH_LIST before installing to control SASS generation, e.g. `TORCH_CUDA_ARCH_LIST="8.0;8.6" pip install -e .`.
- Out of memory during benchmarks: use --quick or smaller configs.
## License

MIT. See LICENSE.