Fused LayerNorm CUDA Operator (PyTorch)

High-performance CUDA implementation of Layer Normalization with an optional fused GELU variant. Works as a low-level extension and as a drop-in replacement module for torch.nn.LayerNorm.

Highlights:

  • Simple, robust kernels that work with any hidden size (including odd or prime sizes)
  • Forward kernel exported as a minimal extension by default
  • Optional autograd-enabled module (FusedLayerNorm) with forward/backward
  • Benchmark scripts and tests included

Results will vary by GPU and workload. See Benchmarks to reproduce on your machine.

Table of contents

  • Installation
  • Quick start
  • API reference
  • Benchmarks
  • Run tests
  • Build options (minimal vs full)
  • Project structure
  • Troubleshooting
  • License

Installation

Prerequisites:

  • Python 3.8+
  • A CUDA-enabled PyTorch build that matches your local CUDA toolkit and NVIDIA driver
  • CUDA Toolkit 11.x or newer
  • A C++ toolchain (on Windows, Visual Studio Build Tools) and a working NVCC

Steps:

  1. Clone and install dependencies:

git clone https://github.com/JonSnow1807/Fused-LayerNorm-CUDA-Operator.git
cd Fused-LayerNorm-CUDA-Operator
pip install -r requirements.txt

  2. Build and install the CUDA extension (editable):

pip install -e .

This builds the minimal extension that exports the forward LayerNorm and the fused LayerNorm+GELU kernels.
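
A minimal sanity check that the extension imports and a CUDA device is visible (assuming the install above succeeded):

import torch
import fused_layernorm_cuda  # raises ImportError if the build did not succeed

assert torch.cuda.is_available(), "a CUDA-capable GPU is required to run the kernels"
print("fused_layernorm_cuda loaded successfully")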

Quick start

Minimal (extension functions):

import torch
import fused_layernorm_cuda  # built by setup.py

# Shapes: (batch, hidden)
x = torch.randn(32, 4096, device='cuda', dtype=torch.float32)
gamma = torch.ones(4096, device='cuda', dtype=torch.float32)
beta = torch.zeros(4096, device='cuda', dtype=torch.float32)

# Forward LayerNorm (CUDA)
y = fused_layernorm_cuda.layernorm(x, gamma, beta, 1e-5)

# Fused LayerNorm + GELU (single kernel)
y_gelu = fused_layernorm_cuda.layernorm_gelu(x, gamma, beta, 1e-5)
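
As a quick sanity check, the outputs can be compared against PyTorch's reference ops (a sketch; the tolerances are illustrative, and the fused GELU may use a tanh approximation that needs a looser tolerance):

import torch.nn.functional as F

ref = F.layer_norm(x, (4096,), gamma, beta, 1e-5)
print(torch.allclose(y, ref, atol=1e-5))

ref_gelu = F.gelu(ref)
print(torch.allclose(y_gelu, ref_gelu, atol=1e-4))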

Drop-in replacement module (optional, with autograd):

import torch
from fused_layernorm import FusedLayerNorm  # requires the full build (see below)

ln = FusedLayerNorm(4096).cuda()
inp = torch.randn(32, 4096, device='cuda', requires_grad=True)
out = ln(inp)
out.sum().backward()

Note: The optional module relies on the forward/backward bindings in csrc/layernorm_cuda.cpp. See Build options to enable them.

API reference

Low-level CUDA extension (import fused_layernorm_cuda):

  • layernorm(input, gamma, beta, epsilon: float) -> Tensor
    • input: 2D CUDA tensor (batch, hidden)
    • gamma, beta: 1D CUDA tensors with length = hidden (optional but recommended)
    • epsilon: numerical stability term
  • layernorm_gelu(input, gamma, beta, epsilon: float) -> Tensor
    • Same signature as above, applies GELU after normalization

Optional Python module (from fused_layernorm import ...):

  • FusedLayerNorm(normalized_shape, eps=1e-5, elementwise_affine=True, device=None, dtype=None)
    • Drop-in replacement for torch.nn.LayerNorm
  • fused_layer_norm(input, normalized_shape, weight=None, bias=None, eps=1e-5)
    • Functional API similar to torch.nn.functional.layer_norm (see the usage sketch after this list)
  • Utilities (when fully built): replace_torch_layernorm(), restore_torch_layernorm() and profiling helpers in fused_layernorm/functional.py.
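
For illustration, a minimal use of the functional API based on the signature above (requires the full build; shapes are arbitrary):

import torch
from fused_layernorm import fused_layer_norm

x = torch.randn(8, 1024, device='cuda')
weight = torch.ones(1024, device='cuda')
bias = torch.zeros(1024, device='cuda')

# Same call pattern as torch.nn.functional.layer_norm
y = fused_layer_norm(x, (1024,), weight=weight, bias=bias, eps=1e-5)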

Benchmarks

To reproduce the realistic and cache-friendly scenarios measured for this project:

python benchmarks/reproduce_speedup.py

Comprehensive suite (forward/backward, memory):

python benchmarks/benchmark_layernorm.py --quick  # or run without --quick for full sweep

Benchmark results and plots are written under benchmarks/results/.
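
For a one-off measurement outside the provided scripts, CUDA events give reliable GPU timings. A minimal sketch (shapes and iteration counts are arbitrary):

import torch
import fused_layernorm_cuda

x = torch.randn(32, 4096, device='cuda')
gamma = torch.ones(4096, device='cuda')
beta = torch.zeros(4096, device='cuda')

for _ in range(10):  # warm-up
    fused_layernorm_cuda.layernorm(x, gamma, beta, 1e-5)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    fused_layernorm_cuda.layernorm(x, gamma, beta, 1e-5)
end.record()
torch.cuda.synchronize()
print(f"avg per call: {start.elapsed_time(end) / 100:.3f} ms")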

Run tests

pytest -q

The tests cover correctness against PyTorch, gradient checks (when fully built), edge cases, determinism, and basic performance-regression checks.
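
The core correctness check compares against PyTorch's own layer_norm; a standalone version of such a test (illustrative shapes and tolerances) looks roughly like:

import torch
import fused_layernorm_cuda

def test_matches_pytorch():
    x = torch.randn(16, 1023, device='cuda')  # odd hidden size on purpose
    gamma = torch.randn(1023, device='cuda')
    beta = torch.randn(1023, device='cuda')

    y = fused_layernorm_cuda.layernorm(x, gamma, beta, 1e-5)
    ref = torch.nn.functional.layer_norm(x, (1023,), gamma, beta, 1e-5)
    assert torch.allclose(y, ref, atol=1e-5, rtol=1e-4)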

Build options (minimal vs full)

This repo contains two binding implementations:

  • Minimal build (default via setup.py):
    • Sources: csrc/bindings.cpp, csrc/layernorm_cuda_kernel.cu
    • Exports: layernorm, layernorm_gelu (forward-only)
  • Full build (autograd-enabled):
    • Sources: csrc/layernorm_cuda.cpp, csrc/layernorm_cuda_kernel.cu, csrc/layernorm_cuda_kernel_optimized.cu
    • Exports: forward, backward, plus helpers get_memory_usage, get_performance_hints
    • Required by the FusedLayerNorm module and the functional API

To enable the full build, modify setup.py to compile the full set of sources listed above, then reinstall with pip install -e . (a sketch follows below).
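
A possible setup.py configuration for the full build, sketched with torch.utils.cpp_extension (the extension name and any compiler flags should match what the repo's setup.py already uses):

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='fused_layernorm_cuda',
    ext_modules=[
        CUDAExtension(
            name='fused_layernorm_cuda',
            sources=[
                'csrc/layernorm_cuda.cpp',                  # full autograd bindings
                'csrc/layernorm_cuda_kernel.cu',
                'csrc/layernorm_cuda_kernel_optimized.cu',
            ],
        ),
    ],
    cmdclass={'build_ext': BuildExtension},
)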

Project structure

Fused-LayerNorm-CUDA-Operator/
├─ csrc/
│  ├─ bindings.cpp                      # Minimal PyBind11 bindings (forward-only)
│  ├─ layernorm_cuda_kernel.cu          # Simple, robust kernel + fused GELU
│  ├─ layernorm_cuda.cpp                # Full autograd bindings (forward/backward)
│  └─ layernorm_cuda_kernel_optimized.cu # Optimized forward for large hidden sizes
├─ fused_layernorm/
│  ├─ __init__.py
│  ├─ layernorm.py                      # FusedLayerNorm module, functional API
│  └─ functional.py                     # Profiling and helpers
├─ benchmarks/
│  ├─ benchmark_layernorm.py            # End-to-end benchmark suite
│  ├─ reproduce_speedup.py              # Quick reproduction script
│  └─ visualize_results.py              # Plot generation
├─ tests/                               # Correctness and performance tests
├─ docs/                                # Architecture, optimization notes (WIP)
├─ setup.py                             # Build config (minimal by default)
└─ requirements.txt

Troubleshooting

  • Build toolchain on Windows: ensure Visual Studio Build Tools (MSVC) and NVCC are on PATH.
  • PyTorch/CUDA mismatch: install a PyTorch build that matches your CUDA runtime and driver.
  • Architecture flags: set TORCH_CUDA_ARCH_LIST before installing (e.g. TORCH_CUDA_ARCH_LIST="8.0;8.6" pip install -e .) to control which GPU architectures code is generated for.
  • Out of memory during benchmarks: use --quick or smaller configs.

License

MIT. See LICENSE.
