Design Philosophy

Yeongjae Jang edited this page Feb 3, 2026 · 4 revisions

The Purpose and Focus

gpt-oss-tvm is a project that ports and compiles OpenAI’s gpt-oss architecture using Apache TVM.

The project aims to implement gpt-oss within TVM while preserving the original model structure and numerical behavior as closely as possible. Rather than introducing architectural changes, the focus is on adapting the execution model to fit TVM’s compilation and runtime constraints.

This repository is primarily concerned with correctness and structural alignment, and only secondarily with performance.


Primary Goal: Correctness with Minimal Numerical Error

The primary goal of gpt-oss-tvm is to match the behavior of the PyTorch reference implementation with minimal numerical error.

  • The PyTorch implementation is treated as the reference baseline.
  • Model structure, tensor shapes, naming conventions, and operation boundaries are preserved wherever possible.
  • Differences from the reference are introduced only when required by TVM’s execution or operator model.

Performance optimizations are applied selectively, but numerical consistency takes precedence over aggressive optimization.


Adapting gpt-oss to TVM’s Execution Model

While the gpt-oss network architecture is kept intact, its execution flow is adjusted to better align with TVM’s compilation model, including:

  • Static graph construction
  • Ahead-of-time compilation
  • Explicit TensorIR (TIR) lowering and scheduling

These adaptations are intended to make implicit behaviors in the PyTorch implementation explicit and compatible with TVM, without altering the semantic meaning of the model.

The goal is not to redesign gpt-oss, but to make its existing design executable and analyzable within a compiler-based framework.


Key Design Decisions and Contributions

This project has developed several novel approaches to bridge the gap between gpt-oss's design and TVM's capabilities:

  • Attention sink support: A mathematically equivalent reformulation using LogSumExp (LSE) and Sigmoid to express the sink term with operators TVM supports.
  • Flexible per-layer sliding window: Expands the span of sliding window attention (SWA) so long-context queries attend over the intended window, and makes the window configurable per layer as gpt-oss requires.
  • TIR-level MoE implementation: Direct TensorIR implementations of the MoE einsum and gating for fine-grained scheduling and optimization.
  • YaRN refinements: Corrected and improved positional-embedding logic, aligned with the reference implementation.
  • MXFP4 dequantization support: A TIR-based implementation of MXFP4 dequantization, fused with consuming operators for efficient low-precision computation.
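The attention-sink equivalence above can be checked numerically. A sink adds one extra logit to the softmax denominator; algebraically, the resulting weights equal the plain softmax rescaled by sigmoid(LSE(logits) − sink). The NumPy sketch below is illustrative only (function names and shapes are not the project's actual code) and verifies the two forms agree:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sink_attention_reference(logits, sink):
    # Reference form: append the sink logit as an extra column,
    # softmax over [logits, sink], then drop the sink column.
    pad = np.full(logits.shape[:-1] + (1,), sink)
    ext = np.concatenate([logits, pad], axis=-1)
    return softmax(ext)[..., :-1]

def sink_attention_lse(logits, sink):
    # Workaround form: plain softmax rescaled by sigmoid(LSE - sink),
    # which folds the sink term into the denominator.
    m = logits.max(axis=-1, keepdims=True)
    lse = m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
    gate = 1.0 / (1.0 + np.exp(-(lse - sink)))
    return softmax(logits) * gate

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 5))
ref = sink_attention_reference(logits, sink=0.3)
alt = sink_attention_lse(logits, sink=0.3)
```

The second form needs only softmax, LSE, and sigmoid, which is why it maps onto operators a compiler stack like TVM already provides.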
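Per-layer sliding window attention amounts to building a different attention mask per layer. A minimal sketch, with illustrative window sizes (not necessarily the ones gpt-oss ships with):

```python
import numpy as np

def causal_mask(q_len, kv_len, window=None):
    """Boolean mask: True where query i may attend key j.

    window=None means full causal attention; an integer restricts each
    query to the most recent `window` keys (sliding window attention).
    Queries are aligned so the last query sits at the last key.
    """
    q_pos = np.arange(kv_len - q_len, kv_len)[:, None]
    k_pos = np.arange(kv_len)[None, :]
    mask = k_pos <= q_pos                  # causal constraint
    if window is not None:
        mask &= k_pos > q_pos - window     # sliding-window constraint
    return mask

# Per-layer configuration (illustrative): alternate SWA and full layers.
layer_windows = [128, None, 128, None]
masks = [causal_mask(q_len=1, kv_len=4096, window=w) for w in layer_windows]
```

In a static-graph setting the mask (or the loop bounds it implies) is what gets baked into each layer's kernel, so making the window a per-layer parameter is a compile-time decision rather than a runtime branch.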
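For MXFP4, the decode step itself is simple: each element is a 4-bit E2M1 code, and each block of elements shares one power-of-two (E8M0) scale, per the OCP microscaling format. The NumPy sketch below shows the arithmetic only; the actual project implements this in TIR and fuses it with consumers, and the function names here are hypothetical:

```python
import numpy as np

# E2M1 value table: magnitude in the low 3 bits, sign in bit 3.
_FP4_VALUES = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32)

def dequantize_mxfp4(packed, exponents, block=32):
    """Decode MXFP4: 4-bit E2M1 codes with one shared power-of-two
    exponent per `block` elements.

    packed    : uint8 array, two codes per byte (low nibble first).
    exponents : per-block unbiased exponents (scale = 2**e).
    """
    codes = np.empty(packed.size * 2, dtype=np.uint8)
    codes[0::2] = packed & 0x0F      # low nibble
    codes[1::2] = packed >> 4        # high nibble
    vals = _FP4_VALUES[codes].reshape(-1, block)
    # Broadcast each block's scale over its elements.
    scales = np.exp2(exponents.astype(np.float32))[:, None]
    return (vals * scales).reshape(-1)

# One byte 0x72 packs codes 2 (-> 1.0) and 7 (-> 6.0); exponent 1 doubles both.
out = dequantize_mxfp4(np.array([0x72], dtype=np.uint8),
                       np.array([1]), block=2)
```

Because the decode is a table lookup plus a broadcast multiply, expressing it directly in TensorIR lets the compiler fuse it into the following matmul instead of materializing a full-precision weight tensor.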

What This Project Is — and Is Not

This project is:

  • A TVM-based implementation of the gpt-oss architecture
  • Focused on correctness and architectural fidelity
  • Intended as a reference for integrating modern LLM designs into TVM

This project is not:

  • A performance-optimized alternative to highly tuned PyTorch or CUDA kernels
  • A benchmark-driven inference framework

Getting Started

1. Architectural Implementations

2. Low-Level Optimization
