Setup & Run

Park Woorak edited this page Jan 27, 2026 · 3 revisions

Install dependencies

TVM & MLC-LLM

Update submodule

git submodule update --init --recursive

Apply patches

cd 3rdparty/mlc-llm/3rdparty/tvm
git apply ../../../../tvm_fix.patch
cd -

Build from source

Follow the official documentation below to build TVM & MLC-LLM.
(Cloning the repositories is already done by the git submodule command above.)

  • [TVM] Install from source

    You will perform tasks similar to the following (in 3rdparty/mlc-llm/3rdparty/tvm):

    mkdir build && cp cmake/config.cmake build/ && cd build
    # Now edit build/config.cmake as described in the document
    cmake .. && cmake --build . --parallel $(nproc)
  • [MLC-LLM] Build from source

    You will perform tasks similar to the following (in 3rdparty/mlc-llm):

    mkdir build && cd build
    python ../cmake/gen_cmake_config.py      # Answer the script's prompts to generate the configuration
    export CMAKE_POLICY_VERSION_MINIMUM=3.5  # Recommended to avoid a CMake error in `tokenizer-cpp`
    cmake .. && cmake --build . --parallel $(nproc)
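
Once both builds finish, a quick way to confirm that the shared libraries were produced (a sketch; the paths and the macOS `.dylib` suffix are assumptions, Linux builds emit `.so`):

```python
# Check that the TVM and MLC-LLM builds produced their shared libraries.
# Paths and the .dylib suffix are assumptions (use .so on Linux).
from pathlib import Path

for lib in [
    "3rdparty/mlc-llm/3rdparty/tvm/build/libtvm.dylib",
    "3rdparty/mlc-llm/build/libmlc_llm.dylib",
]:
    status = "found" if Path(lib).is_file() else "missing"
    print(f"{status}: {lib}")
```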

Install Python bindings

TVM FFI

# Install from source (recommended)
cd 3rdparty/mlc-llm/3rdparty/tvm/3rdparty/tvm-ffi
pip install -e .
cd -

# ...or install the version from PyPI
pip install "apache-tvm-ffi<=0.1.7"

TVM

cd 3rdparty/mlc-llm/3rdparty/tvm/python
pip install -e .
cd -

# Extra dependencies for tvm
# c.f. https://tvm.apache.org/docs/install/from_source.html#step-5-extra-python-dependencies
pip install psutil

MLC LLM

cd 3rdparty/mlc-llm/python
# Make sure that `flashinfer-python` is handled. (c.f. 'Important' admonition below)
pip install -e .
cd -

Important

Handle `flashinfer-python` in 3rdparty/mlc-llm/python/requirements.txt before installing:

  • exclude it from installation on unsupported platforms (e.g. macOS)
  • use the version constraint >=0.5.0 for better dependency resolution
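
Concretely, both adjustments can be expressed on a single requirements line with a PEP 508 environment marker (a sketch; the exact marker is an assumption, adjust the platform test as needed):

```text
flashinfer-python >= 0.5.0 ; sys_platform != "darwin"
```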

Python dependencies

pip install -r requirements.txt

Download model

Download the files for the gpt-oss reference torch implementation.

Note

While TVM supports multiple hardware backends, this project has been mainly tested with the metal target on macOS. As the model uses the original mxfp4 and bfloat16 weights without further quantization, an Apple Silicon Mac with 24 GB or more of unified memory is recommended.
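
The 24 GB recommendation can be sanity-checked with a rough back-of-envelope estimate. The ~20B parameter count and the 4.25 bits/parameter figure for mxfp4 (4-bit values plus one shared 8-bit scale per 32-element block) are approximations; the bfloat16 tensors, KV cache, and activations come on top:

```python
# Rough weight-memory estimate for gpt-oss-20b (approximate figures).
params = 20e9       # ~20B parameters
mxfp4_bits = 4.25   # 4-bit values + one 8-bit scale per 32-element block
weights_gib = params * mxfp4_bits / 8 / 2**30
print(f"~{weights_gib:.1f} GiB for the mxfp4 weights alone")
```

Even under this optimistic estimate, roughly 10 GiB goes to weights, so with the bfloat16 tensors and runtime buffers on top, 24 GB of unified memory leaves comfortable headroom.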

pip install huggingface_hub  # to use `hf` command
hf download openai/gpt-oss-20b --include "original/*" --local-dir gpt-oss-20b/

Compile & Run

Basic single-turn test

Run the simplest gpt-oss test

python run_gpt_oss.py

Multi-turn chat

Run a simple multi-turn chat example

python chat.py

Getting Started

1. Architectural Implementations

2. Low-Level Optimization

Clone this wiki locally