Design Philosophy

Yeongjae Jang edited this page Feb 3, 2026 · 4 revisions

The Purpose and Focus

gpt-oss-tvm is a project that ports and compiles OpenAI’s gpt-oss architecture using Apache TVM.

The project aims to implement gpt-oss within TVM while preserving the original model structure and numerical behavior as closely as possible. Rather than introducing architectural changes, the focus is on adapting the execution model to fit TVM’s compilation and runtime constraints.

This repository is primarily concerned with correctness and structural alignment, and only secondarily with performance.


Primary Goal: Correctness with Minimal Numerical Error

The primary goal of gpt-oss-tvm is to match the behavior of the PyTorch reference implementation with minimal numerical error.

  • The PyTorch implementation is treated as the reference baseline.
  • Model structure, tensor shapes, naming conventions, and operation boundaries are preserved wherever possible.
  • Differences from the reference are introduced only when required by TVM’s execution or operator model.

Performance optimizations are applied selectively, but numerical consistency takes precedence over aggressive optimization.


Adapting gpt-oss to TVM’s Execution Model

While the gpt-oss network architecture is kept intact, its execution flow is adjusted to better align with TVM’s compilation model, including:

  • Static graph construction
  • Ahead-of-time compilation
  • Explicit TensorIR (TIR) lowering and scheduling

These adaptations are intended to make implicit behaviors in the PyTorch implementation explicit and compatible with TVM, without altering the semantic meaning of the model.

The goal is not to redesign gpt-oss, but to make its existing design executable and analyzable within a compiler-based framework.


Key Design Decisions and Contributions

This project has developed several novel approaches to bridge the gap between gpt-oss's design and TVM's capabilities:

  • Attention sink support: A mathematically equivalent reformulation using LogSumExp (LSE) and Sigmoid to express the sink term with operators TVM supports.
  • Flexible per-layer sliding window: Expands the span of sliding window attention (SWA) so long-context queries attend over the intended window, and makes the window configurable per layer as gpt-oss requires.
  • TIR-level MoE implementation: Direct TensorIR implementations of the MoE einsum and gating for fine-grained scheduling and optimization.
  • YaRN refinements: Corrected and improved positional-embedding logic, aligned with the reference implementation.
  • MXFP4 dequantization support: A TIR-based implementation of MXFP4 dequantization, fused with consuming operators for efficient low-precision computation.
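The attention-sink equivalence above can be checked numerically. A sink adds one extra logit to the softmax denominator; algebraically, the resulting weights equal the plain softmax rescaled by sigmoid(LSE(logits) − sink). The NumPy sketch below is illustrative only (function names and shapes are not the project's actual code) and verifies the two forms agree:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sink_attention_reference(logits, sink):
    # Reference form: append the sink logit as an extra column,
    # softmax over [logits, sink], then drop the sink column.
    pad = np.full(logits.shape[:-1] + (1,), sink)
    ext = np.concatenate([logits, pad], axis=-1)
    return softmax(ext)[..., :-1]

def sink_attention_lse(logits, sink):
    # Workaround form: plain softmax rescaled by sigmoid(LSE - sink),
    # which folds the sink term into the denominator.
    m = logits.max(axis=-1, keepdims=True)
    lse = m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
    gate = 1.0 / (1.0 + np.exp(-(lse - sink)))
    return softmax(logits) * gate

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 5))
ref = sink_attention_reference(logits, sink=0.3)
alt = sink_attention_lse(logits, sink=0.3)
```

The second form needs only softmax, LSE, and sigmoid, which is why it maps onto operators a compiler stack like TVM already provides.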
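Per-layer sliding window attention amounts to building a different attention mask per layer. A minimal sketch, with illustrative window sizes (not necessarily the ones gpt-oss ships with):

```python
import numpy as np

def causal_mask(q_len, kv_len, window=None):
    """Boolean mask: True where query i may attend key j.

    window=None means full causal attention; an integer restricts each
    query to the most recent `window` keys (sliding window attention).
    Queries are aligned so the last query sits at the last key.
    """
    q_pos = np.arange(kv_len - q_len, kv_len)[:, None]
    k_pos = np.arange(kv_len)[None, :]
    mask = k_pos <= q_pos                  # causal constraint
    if window is not None:
        mask &= k_pos > q_pos - window     # sliding-window constraint
    return mask

# Per-layer configuration (illustrative): alternate SWA and full layers.
layer_windows = [128, None, 128, None]
masks = [causal_mask(q_len=1, kv_len=4096, window=w) for w in layer_windows]
```

In a static-graph setting the mask (or the loop bounds it implies) is what gets baked into each layer's kernel, so making the window a per-layer parameter is a compile-time decision rather than a runtime branch.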
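For MXFP4, the decode step itself is simple: each element is a 4-bit E2M1 code, and each block of elements shares one power-of-two (E8M0) scale, per the OCP microscaling format. The NumPy sketch below shows the arithmetic only; the actual project implements this in TIR and fuses it with consumers, and the function names here are hypothetical:

```python
import numpy as np

# E2M1 value table: magnitude in the low 3 bits, sign in bit 3.
_FP4_VALUES = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32)

def dequantize_mxfp4(packed, exponents, block=32):
    """Decode MXFP4: 4-bit E2M1 codes with one shared power-of-two
    exponent per `block` elements.

    packed    : uint8 array, two codes per byte (low nibble first).
    exponents : per-block unbiased exponents (scale = 2**e).
    """
    codes = np.empty(packed.size * 2, dtype=np.uint8)
    codes[0::2] = packed & 0x0F      # low nibble
    codes[1::2] = packed >> 4        # high nibble
    vals = _FP4_VALUES[codes].reshape(-1, block)
    # Broadcast each block's scale over its elements.
    scales = np.exp2(exponents.astype(np.float32))[:, None]
    return (vals * scales).reshape(-1)

# One byte 0x72 packs codes 2 (-> 1.0) and 7 (-> 6.0); exponent 1 doubles both.
out = dequantize_mxfp4(np.array([0x72], dtype=np.uint8),
                       np.array([1]), block=2)
```

Because the decode is a table lookup plus a broadcast multiply, expressing it directly in TensorIR lets the compiler fuse it into the following matmul instead of materializing a full-precision weight tensor.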

What This Project Is — and Is Not

This project is:

  • A TVM-based implementation of the gpt-oss architecture
  • Focused on correctness and architectural fidelity
  • Intended as a reference for integrating modern LLM designs into TVM

This project is not:

  • A performance-optimized alternative to highly tuned PyTorch or CUDA kernels
  • A benchmark-driven inference framework

Getting Started

1. Architectural Implementations

2. Low-Level Optimization
