Home
gpt-oss-tvm is a project aimed at porting and compiling OpenAI's gpt-oss architecture using Apache TVM, a state-of-the-art compiler framework for deep learning.
The primary goal of this project is to achieve a TVM-based implementation that matches the PyTorch reference implementation with minimal numerical error, while exploring the engineering challenges and opportunities that arise when integrating modern LLM architectures into a compiler-based framework. While we adhere to the original network structure, naming conventions, and operation units of gpt-oss, we have adapted the execution flow to align better with TVM's compilation and runtime model.
Note
While some operation speed optimizations are applied, the main contribution of this project is the structural integration of the gpt-oss architecture into the TVM ecosystem, rather than raw performance tuning.
gpt-oss represents OpenAI's effort to provide an open-source, reference implementation of a modern large language model. Unlike many black-box commercial models, gpt-oss is designed with transparency and reproducibility in mind, offering researchers and practitioners a clear blueprint for training and deploying state-of-the-art generative models.
- Mixture-of-Experts (MoE) Scaling: Instead of densely scaling model width uniformly, gpt-oss employs a MoE architecture that activates only a subset of expert modules per token. This allows massive parameter counts (20B, 120B) without proportional increases in compute.
- Sliding Window Attention (SWA): Rather than computing attention over the entire sequence (O(n²) complexity), gpt-oss uses a per-layer controlled sliding window approach where some layers use full attention and others use local (windowed) attention. This dramatically reduces memory and compute requirements while maintaining model capacity.
- Advanced Positional Encoding (YaRN): gpt-oss employs YaRN (Yet another RoPE extensioN method), an advanced variant of Rotary Position Embeddings that improves performance on longer sequences through dynamic interpolation of embedding dimensions.
- Efficiency-First Design: Every component—from tokenization to the final logit computation—is architected for production-scale efficiency. This includes pre-trained attention sinks for superior attention calibration and load-balanced routing strategies to maximize MoE throughput.
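The top-k routing idea behind MoE scaling can be sketched in a few lines of plain Python. This is a toy illustration only: the expert count, the value of k, and the softmax-over-selected-logits normalization shown here are assumptions for the sketch, not gpt-oss's exact routing configuration.

```python
import math

def topk_route(router_logits, k=2):
    """Select the k experts with the largest router logits for one token,
    then renormalize their gate scores with a softmax over only those k.
    Toy sketch: expert count, k, and normalization are illustrative."""
    # indices of the k largest logits
    idx = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # numerically stable softmax restricted to the selected experts
    m = max(router_logits[i] for i in idx)
    exps = {i: math.exp(router_logits[i] - m) for i in idx}
    z = sum(exps.values())
    return {i: exps[i] / z for i in idx}

# 4 hypothetical experts; only the top 2 are activated for this token
weights = topk_route([0.1, 2.0, -1.0, 1.5], k=2)
```

Because only k experts run per token, compute scales with k rather than with the total number of experts, which is what lets the parameter count grow without a proportional increase in FLOPs.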
Apache TVM is an open-source compiler framework that automatically optimizes deep learning workloads across diverse hardware targets. Instead of manually writing GPU kernels or relying on monolithic frameworks like PyTorch, TVM abstracts the compilation pipeline into several layers:
- Frontend: Imports models from PyTorch, ONNX, TensorFlow, etc.
- High-Level IR (Relax): Framework-agnostic intermediate representation for neural networks.
- Low-Level IR (TIR): Loop-based tensor computation representation with explicit control over schedules and memory layouts.
- Backend Code Generation: Generates optimized machine code (CUDA, CPU, Vulkan, Metal, etc.) for the target hardware.
- Runtime: A lightweight inference engine with minimal dependencies.
- Multi-target Compilation: Write once, deploy everywhere.
- Automated Optimization: Features like AutoTVM and AutoScheduler use machine learning to find optimal schedules.
- Fine-grained Control: For advanced use cases, TIR allows hand-crafting kernels with explicit memory and thread management.
- Active Research Community: Continuous improvements and new operator support.
- Incomplete Operator Coverage: Some specialized operations lack native support.
- Compilation Overhead: Models must be compiled before inference; the first-run compilation can be slow.
- Learning Curve: Understanding TVM's multi-layer abstraction requires deeper systems knowledge compared to high-level frameworks.
Bringing gpt-oss into the TVM ecosystem is non-trivial. The model was designed and trained in PyTorch, where every architecture detail is supported out of the box. TVM, being a compiler-first framework, has a different operational philosophy:
- PyTorch: "Define computation dynamically, let the framework figure out execution."
- TVM: "Explicitly declare computation, let the compiler optimize it for your hardware."
This mismatch creates engineering challenges:
- Operator Mismatch: Some operations don't map directly to existing TVM operators.
- Memory Model Divergence: PyTorch's dynamic memory allocation differs from TVM's static, pre-allocated memory plans.
- Numerical Precision: Translating model weights and intermediate computations across frameworks while maintaining numerical consistency requires careful attention.
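To make the numerical-precision point concrete, cross-framework parity is typically checked by comparing outputs elementwise against the reference implementation. The sketch below is hypothetical: the function name, values, and tolerance are illustrative and not the project's actual test harness.

```python
def max_abs_err(ref, out):
    """Largest elementwise absolute difference between two flat output vectors."""
    return max(abs(r - o) for r, o in zip(ref, out))

# hypothetical logits from the PyTorch reference vs. the TVM port
ref_logits = [0.1200000, -3.4000000, 1.0700000]
tvm_logits = [0.1200000, -3.4000002, 1.0699999]
err = max_abs_err(ref_logits, tvm_logits)
# a small err (e.g. below ~1e-5 in fp32) suggests a faithful translation
```

Running such a check layer by layer helps isolate which operator translation introduced a divergence.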
While PyTorch provides excellent flexibility for research and training, compiled inference offers significant advantages for deployment:
- Performance Portability: A TVM-compiled model can run efficiently across diverse hardware: NVIDIA GPUs, AMD GPUs, Apple Silicon (Metal), Vulkan-enabled devices, and even CPUs.
- Memory Efficiency: TVM's compiler optimizations—operator fusion, memory layout optimization, and kernel scheduling—can reduce memory footprint and bandwidth consumption.
- Production Readiness: Compiled models are smaller (no framework overhead), faster (optimized kernels), and more suitable for edge and resource-constrained deployments.
We prioritize minimizing numerical error relative to the PyTorch reference implementation. While performance optimizations are applied where applicable, the focus is on a faithful translation of the architecture.
For more details, refer to the Design Philosophy.
Start with the Design Philosophy section to understand how we've adapted gpt-oss for TVM.
Jump to Architectural Implementations to see detailed code and design decisions.
The Low-Level Optimization section covers custom kernels, data types, and compiler-level tuning.
Refer to the Contribution Guide to understand how to extend or improve the codebase.
- Setup & Run — Get the environment running in minutes.
- Design Philosophy — Understanding our adaptation strategy.
- Architectural Implementations — Deep dive into each component.
This is an active research/engineering project. The codebase is under continuous refinement, and some features may still be in development. Please refer to the main repository README for the latest status and known limitations.
- gpt-oss-tvm
- gpt-oss
- Model Card
- Blog post
- GitHub
- [Hugging Face] gpt-oss-20b
- [Hugging Face] gpt-oss-120b
- TVM
- MLC LLM
- Attention & Sliding Window
  - Computing attentions in TVM
  - Sink Token Workaround
- Mixture-of-Experts (MoE)
  - TIR-based MoE Einsum
  - Gating Network Implementation
  - Comparison with Standard TVM Approaches
- RoPE with YaRN
  - What is YaRN?
  - Limitations in Existing TVM Implementations
  - Our Improvements
- TIR-based support for MXFP4
  - What is MXFP4?
  - MXFP4 TIR Implementation
  - Operator Fusion