Manifest-driven ahead-of-time (AOT) builder for AITER kernels. QoLA wraps AITER's build_module() JIT compilation system with a declarative TOML manifest, producing either:
- pybind11 Python modules — standard
.sofiles importable from Python (requires PyTorch) - torch-free C-linkable shared libraries (
cpp_itfsmode) — plain.sofiles linked via HIP/ROCm with no PyTorch dependency
QoLA is designed for Transformer Engine to pre-build AITER attention (MHA) kernels at package install time, replacing hours-long JIT compilation with a structured, reproducible build.
- Declarative manifests — a single TOML file pins the AITER commit, target architectures, kernel modules, and MHA variant matrix
- torch-free builds —
cpp_itfsmode eliminates the PyTorch build dependency for C-linkable libraries - Symbol isolation — linker version scripts and C++ namespace wrapping prevent symbol collisions when multiple AITER-backed
.sofiles coexist in one process - Patch-based AITER deltas —
patches/aiter/*.patchis reapplied on top of the pinned AITER commit on every build, so bumping AITER becomes a patch rebase rather than maintaining a parallel fork branch - No
import aiter— QoLA reconstructs AITER's build namespace from a source-tree path alone, avoidingaiter/__init__.pyside effects and the torch import requirement
- Python >= 3.10
- ROCm / HIP toolchain (hipcc)
- AITER source tree (cloned on demand into
build/third_party/aiter/— git-ignored, not a submodule; the manifest's[qola] aiter_commitpins the SHA) - PyTorch (pybind mode only)
pip install -e .# Build all modules declared in a manifest (pybind mode).
# AITER is cloned into build/third_party/aiter/ on first use and checked out to the
# manifest's [qola] aiter_commit; --aiter-root is optional.
qola build \
--manifest example/te-manifest.toml \
--output-dir /tmp/qola-out
# Build in cpp_itfs mode (no PyTorch dependency)
qola build \
--manifest example/te-manifest.toml \
--output-dir /tmp/qola-out \
--mode cpp_itfs| Option | Description |
|---|---|
--manifest |
Path to the TOML manifest file |
--aiter-root |
Path to the AITER source tree (default: <QoLA repo>/build/third_party/aiter, cloned on demand) |
--aiter-commit |
AITER SHA / tag / branch to fetch and checkout (overrides manifest's [qola] aiter_commit) |
--patches-dir |
Directory of *.patch files applied on top of the AITER checkout (overrides manifest's [qola] patches_dir; defaults to <QoLA repo>/patches/aiter) |
--output-dir |
Directory for build artifacts |
--arch |
Target GPU architecture (repeatable, e.g. --arch gfx950) |
--mode |
Build mode: pybind (default) or cpp_itfs |
--skip-checkout |
Build against whatever is currently at --aiter-root instead of running the checkout + patch step. --aiter-commit and --patches-dir are ignored when set. See Skipping the checkout step below |
--verbose |
Enable verbose build output |
By default, every qola build runs the wipe-and-reapply checkout: git reset --hard to the manifest's pinned commit, force-resync submodules, then reapply patches/aiter/*.patch. Pass --skip-checkout to opt out:
# Run checkout once...
qola checkout --manifest example/te-manifest.toml
# ...then iterate on builds without the prep cost (or risk of clobbering local edits)
qola build --manifest example/te-manifest.toml --output-dir /tmp/qola-out --skip-checkout
# Or build against a custom AITER tree you've prepared yourself
qola build \
--manifest example/te-manifest.toml \
--output-dir /tmp/qola-out \
--aiter-root /path/to/my/aiter \
--skip-checkoutHard-fails if <aiter-root>/.git doesn't exist — --skip-checkout requires a real checkout already in place.
Runs only the AITER-prep phase of qola build: clones (if needed), fetches and checks out the requested commit, force-syncs submodules, and applies patches/aiter/*.patch. Useful when downstream consumers want a patched AITER source tree to inspect, run their own builds against, or feed into another tool.
# Manifest-driven: pin the commit + patches via [qola] in the manifest
qola checkout --manifest example/te-manifest.toml
# Explicit overrides without a manifest
qola checkout --aiter-commit d32b0cb62 --patches-dir patches/aiter
# Re-apply patches to whatever is currently checked out
# (only valid if build/third_party/aiter/ already exists)
qola checkoutPrints AITER ready at <abs-path> on success. Hard-fails on patch conflict — same git apply --3way semantics as qola build.
| Option | Description |
|---|---|
--manifest |
Optional. TOML manifest; [qola] aiter_commit and [qola] patches_dir are read as fallbacks for the corresponding flags |
--aiter-root |
Path to the AITER source tree (default: <QoLA repo>/build/third_party/aiter, cloned on demand) |
--aiter-commit |
AITER SHA / tag / branch to fetch and checkout (overrides manifest) |
--patches-dir |
Directory of *.patch files (overrides manifest; defaults to <QoLA repo>/patches/aiter; point at an empty dir to skip patching) |
The same logic is exposed for programmatic use:
from qola.build_tools import checkout_aiter
aiter_root = checkout_aiter(manifest_path="example/te-manifest.toml")
# aiter_root is now an absolute path to a clean, patched AITER checkoutqola build itself routes through checkout_aiter, so the two commands are guaranteed to produce identical AITER trees from identical inputs.
The manifest is a TOML file that declares what to build. See example/te-manifest.toml for a full example.
[qola]
aiter_commit = "33f2e6a..." # Pinned AITER commit
namespace = "te" # C++ namespace and .so prefix
patches_dir = "patches/aiter" # Optional; overrides the default <QoLA repo>/patches/aiter
rocm_versions = ["7.2"]
[build]
architectures = ["gfx950"]
# Static modules from AITER's optCompilerConfig.json
[[modules]]
name = "libmha_fwd"
mode = "cpp_itfs"
receipt = 700 # CK codegen filter (default: optCompilerConfig value, typically 600)
drop_srcs = ["mha_fwd_split.cu", "mha_fwd_batch_prefill.cu"]
drop_directions = ["fwd_splitkv", "batch_prefill"]
[[modules]]
name = "libmha_bwd"
mode = "cpp_itfs"
receipt = 700
# MHA variant matrix — Cartesian expansion of CK codegen filters
[[mha_fwd_variants]]
dtype = ["bf16", "fp16"]
has_lse = true
has_skip = false
[[mha_bwd_variants]]
dtype = ["bf16", "fp16"]Produces pybind11 .so modules importable from Python. Requires PyTorch at both build and runtime.
Produces torch-free C-linkable shared libraries. Each module exposes a C++ API under the configured namespace:
#include "qola_mha_fwd.h"
// With namespace = "te":
float ret = qola::te::mha_fwd(args, stream_config);Source replacement is driven by cpp_itfs/registry.toml: pybind entry points are swapped for thin C wrappers that expose a namespace-guarded C++ API.
output-dir/
lib/ # Compiled .so files
te_libmha_fwd.so
te_libmha_bwd.so
configs/ # AITER tuning CSVs
manifest.json # Build metadata and per-module results
| Module | Description | cpp_itfs API |
|---|---|---|
libmha_fwd |
Multi-head attention forward | qola::te::mha_fwd() |
libmha_bwd |
Multi-head attention backward | qola::te::mha_bwd() |
QoLA reconstructs AITER's build-time eval namespace from a source tree path alone, without ever running import aiter. This avoids AITER's __init__.py side effects and torch import requirements. See resolver.py.
Two layers prevent symbol leaks when multiple .so files coexist:
- C++ namespace wrapping —
QOLA_NS_BEGIN/QOLA_NS_ENDmacros place all public symbols underqola::<namespace>:: - Linker version script —
qola_exports.ldsforces all non-qola::*symbols local, including AITER symbols with explicitvisibility("default")
The manifest's [[mha_fwd_variants]] / [[mha_bwd_variants]] sections declare option dimensions (dtype, has_bias, has_mask, etc.) that are expanded into CK codegen filter patterns. This controls which of the ~34K possible kernel instances are actually compiled. See variant_matrix.py. This is currently only support for pybind11 output.
For cpp_itfs static modules, kernel pruning is instead controlled by the receipt manifest field, which overrides the --receipt N argument passed to CK's generate.py. AITER's optCompilerConfig.json defaults to receipt 600 (the generic aiter::mha_* C++ API filter); setting receipt = 700 selects the TransformerEngine-specific filter (fp16/bf16, row vlayout, has_lse, no skip/sink/logits/qscale), shrinking forward codegen by roughly 48×. See _rewrite_receipt in config.py for the implementation.
generate_embedded_hsa.py converts binary .co ASM blobs into a C++ header with compile-time byte arrays, enabling kernel distribution without a runtime AITER_ASM_DIR.
- CI support for building and publishing pre-built libraries from manifests
- Kernel filtering for
libmha— prune CK codegen instances based on manifest variant declarations incpp_itfsmode (currently pybind-only) - C-level JIT for
libmha— compile MHA variant.sofiles on first use at the C layer, avoiding ahead-of-time compilation of the full variant matrix
MIT License. See LICENSE for more details.