Home
gpt-oss-tvm is a project aimed at porting and compiling OpenAI's gpt-oss architecture using Apache TVM, a state-of-the-art compiler framework for deep learning.
The primary goal of this project is to achieve a TVM-based implementation that matches the PyTorch reference implementation with minimal numerical error, while exploring the engineering challenges and opportunities that arise when integrating modern LLM architectures into a compiler-based framework. While we adhere to the original network structure, naming conventions, and operation units of gpt-oss, we have adapted the execution flow to align better with TVM's compilation and runtime model.
Note
While some operation speed optimizations are applied, the main contribution of this project is the structural integration of the gpt-oss architecture into the TVM ecosystem, rather than raw performance tuning.
gpt-oss represents OpenAI's effort to provide an open-source, reference implementation of a modern large language model. Unlike many black-box commercial models, gpt-oss is designed with transparency and reproducibility in mind, offering researchers and practitioners a clear blueprint for training and deploying state-of-the-art generative models.
- Mixture-of-Experts (MoE) Scaling: Instead of densely scaling model width uniformly, gpt-oss employs a MoE architecture that activates only a subset of expert modules per token. This allows massive parameter counts (20B, 120B) without proportional increases in compute.
- Sliding Window Attention (SWA): Rather than computing attention over the entire sequence (O(n²) complexity), gpt-oss uses a per-layer controlled sliding window approach where some layers use full attention and others use local (windowed) attention. This dramatically reduces memory and compute requirements while maintaining model capacity.
- Advanced Positional Encoding (YaRN): gpt-oss employs YaRN (Yet another RoPE extensioN method), an advanced variant of Rotary Position Embeddings that improves performance on longer sequences through dynamic interpolation of embedding dimensions.
- Efficiency-First Design: Every component—from tokenization to the final logit computation—is architected for production-scale efficiency. This includes pre-trained attention sinks for superior attention calibration and load-balanced routing strategies to maximize MoE throughput.
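The top-k routing idea behind MoE scaling can be sketched in a few lines of plain Python. This is a toy illustration only: the expert count, the value of k, and the softmax-over-selected-logits normalization shown here are assumptions for the sketch, not gpt-oss's exact routing configuration.

```python
import math

def topk_route(router_logits, k=2):
    """Select the k experts with the largest router logits for one token,
    then renormalize their gate scores with a softmax over only those k.
    Toy sketch: expert count, k, and normalization are illustrative."""
    # indices of the k largest logits
    idx = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # numerically stable softmax restricted to the selected experts
    m = max(router_logits[i] for i in idx)
    exps = {i: math.exp(router_logits[i] - m) for i in idx}
    z = sum(exps.values())
    return {i: exps[i] / z for i in idx}

# 4 hypothetical experts; only the top 2 are activated for this token
weights = topk_route([0.1, 2.0, -1.0, 1.5], k=2)
```

Because only k experts run per token, compute scales with k rather than with the total number of experts, which is what lets the parameter count grow without a proportional increase in FLOPs.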
Apache TVM is an open-source compiler framework that automatically optimizes deep learning workloads across diverse hardware targets. Instead of manually writing GPU kernels or relying on monolithic frameworks like PyTorch, TVM abstracts the compilation pipeline into several layers:
- Frontend: Imports models from PyTorch, ONNX, TensorFlow, etc.
- High-Level IR (Relax): Framework-agnostic intermediate representation for neural networks.
- Low-Level IR (TIR): Loop-based tensor computation representation with explicit control over schedules and memory layouts.
- Backend Code Generation: Generates optimized machine code (CUDA, CPU, Vulkan, Metal, etc.) for the target hardware.
- Runtime: A lightweight inference engine with minimal dependencies.
- Multi-target Compilation: Write once, deploy everywhere.
- Automated Optimization: Features like AutoTVM and AutoScheduler use machine learning to find optimal schedules.
- Fine-grained Control: For advanced use cases, TIR allows hand-crafting kernels with explicit memory and thread management.
- Active Research Community: Continuous improvements and new operator support.
- Incomplete Operator Coverage: Some specialized operations lack native support.
- Compilation Overhead: Models must be compiled before inference; the first-run compilation can be slow.
- Learning Curve: Understanding TVM's multi-layer abstraction requires deeper systems knowledge compared to high-level frameworks.
Bringing gpt-oss into the TVM ecosystem is non-trivial. The model was designed and trained in PyTorch, where every architecture detail is supported out of the box. TVM, being a compiler-first framework, has a different operational philosophy:
- PyTorch: "Define computation dynamically, let the framework figure out execution."
- TVM: "Explicitly declare computation, let the compiler optimize it for your hardware."
This mismatch creates engineering challenges:
- Operator Mismatch: Some operations don't map directly to existing TVM operators.
- Memory Model Divergence: PyTorch's dynamic memory allocation differs from TVM's static, pre-allocated memory plans.
- Numerical Precision: Translating model weights and intermediate computations across frameworks while maintaining numerical consistency requires careful attention.
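To make the numerical-precision point concrete, cross-framework parity is typically checked by comparing outputs elementwise against the reference implementation. The sketch below is hypothetical: the function name, values, and tolerance are illustrative and not the project's actual test harness.

```python
def max_abs_err(ref, out):
    """Largest elementwise absolute difference between two flat output vectors."""
    return max(abs(r - o) for r, o in zip(ref, out))

# hypothetical logits from the PyTorch reference vs. the TVM port
ref_logits = [0.1200000, -3.4000000, 1.0700000]
tvm_logits = [0.1200000, -3.4000002, 1.0699999]
err = max_abs_err(ref_logits, tvm_logits)
# a small err (e.g. below ~1e-5 in fp32) suggests a faithful translation
```

Running such a check layer by layer helps isolate which operator translation introduced a divergence.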
While PyTorch provides excellent flexibility for research and training, compiled inference offers significant advantages for deployment:
- Performance Portability: A TVM-compiled model can run efficiently across diverse hardware: NVIDIA GPUs, AMD GPUs, Apple Silicon (Metal), Vulkan-enabled devices, and even CPUs.
- Memory Efficiency: TVM's compiler optimizations—operator fusion, memory layout optimization, and kernel scheduling—can reduce memory footprint and bandwidth consumption.
- Production Readiness: Compiled models are smaller (no framework overhead), faster (optimized kernels), and more suitable for edge and resource-constrained deployments.
We prioritize minimizing numerical error relative to the PyTorch reference implementation. While performance optimizations are applied where applicable, the focus is on a faithful translation of the architecture.
For more details, refer to the Design Philosophy.
Start with the Design Philosophy section to understand how we've adapted gpt-oss for TVM.
Jump to Architectural Implementations to see detailed code and design decisions.
The Low-Level Optimization section covers custom kernels, data types, and compiler-level tuning.
Refer to the Contribution Guide to understand how to extend or improve the codebase.
- Setup & Run — Get the environment running in minutes.
- Design Philosophy — Understanding our adaptation strategy.
- Architectural Implementations — Deep dive into each component.
This is an active research/engineering project. The codebase is under continuous refinement, and some features may still be in development. Please refer to the main repository README for the latest status and known limitations.
- gpt-oss-tvm
- gpt-oss
- Model Card
- Blog post
- GitHub
- [Hugging Face] gpt-oss-20b
- [Hugging Face] gpt-oss-120b
- TVM
- MLC LLM
- Attention & Sliding Window
  - Computing attentions in TVM
  - Sink Token Workaround
- Mixture-of-Experts (MoE)
  - TIR-based MoE Einsum
  - Gating Network Implementation
  - Comparison with Standard TVM Approaches
- RoPE with YaRN
  - What is YaRN?
  - Limitations in Existing TVM Implementations
  - Our Improvements
- TIR-based support for MXFP4
  - What is MXFP4?
  - MXFP4 TIR Implementation
  - Operator Fusion