Fusilli is a C++ Graph API and JIT Frontend for IREE that leverages just-in-time compiled and code-generated kernels to accelerate training and inference workloads. Inspired by cuDNN's graph API, it exposes cuDNN-like primitives but is backed by the power of the IREE compiler and runtime stack.
We believe hand-authored GPU kernel libraries are great for highly tuned performance, but they are difficult to scale to different models or target architectures and painful to package and release efficiently. This project is founded on the overarching goal of complementing the ecosystem of ML frameworks and libraries with a JIT solution while remaining competitive with hand-authored kernel libraries. Beyond the core benefit of a compiler-backed JIT engine that gets progressively and pervasively better, a systemic benefit is reduced build times and binary sizes, which makes software easier to ship.
Warning
🚧 Fusilli is in early stages of development. The operator coverage is limited but growing. APIs may change. 🚧
Note
The name 'Fusilli' is inspired by the term 'fusion' - a bread-and-butter compiler optimization for improving performance.
Although optional, we recommend docker as the canonical development setup for a
no-fuss quick start, hermetic and reproducible builds, and consistency with CI.
Follow these steps to launch an
interactive docker container with the required dependencies pre-installed (and
skip to the Build and Test section below).
If you prefer a custom setup instead, the following dependencies need to be brought in to build/test Fusilli:
Build Requirements: cmake, ninja-build, clang, IREE
Test Requirements: catch2, lit, FileCheck, iree-opt, iree-compile
Fusilli interfaces with the IREE compiler through the CLI and C-API and with IREE runtime through its C-API. Selection between the C-API and CLI for the compiler can be controlled via an environment variable. The IREE compiler is a heavy dependency to build (due to MLIR/LLVM), so we recommend using a prebuilt release either from a python nightly package or shared library distribution. The IREE runtime on the other hand is much more lightweight and is designed to be built from source and statically linked in. IREE does not export a shared runtime library to allow for maximum flexibility with low-level and toolchain specific (LTO style) optimizations.
The easiest way to get lit and the iree-* CLI tools is through pip install.
FileCheck comes packaged with clang / llvm distributions. Everything else
should be available via an apt-based install.
Build and test Fusilli as follows:
cmake -GNinja -S. -Bbuild \
-DCMAKE_C_COMPILER=clang \
-DCMAKE_CXX_COMPILER=clang++ \
-DCMAKE_BUILD_TYPE=<Debug|Release|RelWithDebInfo> \
-DIREE_SOURCE_DIR=</path/to/iree/source>
cmake --build build --target all
ctest --test-dir build

When building on an AMD GPU system, specify -DFUSILLI_SYSTEMS_AMDGPU=ON to
enable the AMDGPU build.
To re-run failed tests verbosely:
ctest --test-dir build --rerun-failed --output-on-failure --verbose

To run tests in parallel (concurrently):
ctest --test-dir build --output-on-failure -j $(nproc)

Tests and samples are also built as standalone binary targets (in the
build/bin directory) to make debugging isolated failures easier.
To skip building tests and samples, specify the cmake flag
-DFUSILLI_BUILD_TESTS=OFF.
The benchmark driver is a command line tool that takes a set of args and sub-command args to run operation specific benchmarks:
build/bin/benchmarks/fusilli_benchmark_driver <ARGS> <SUB-COMMAND> <SUB-ARGS>

To dump compilation artifacts to disk (${HOME}/.cache/fusilli by default),
specify the --dump flag on the main driver (not the subcommand). The location
to dump to can be configured by setting the FUSILLI_CACHE_DIR environment
variable.
build/bin/benchmarks/fusilli_benchmark_driver --dump <ARGS> <SUB-COMMAND> <SUB-ARGS>

To benchmark on a specific GPU when multiple AMD GPUs are present, specify
the --device <int> flag corresponding to the device number from rocm-smi. For
example, this will run the benchmark on device 7 (when there are 8 GPUs):
build/bin/benchmarks/fusilli_benchmark_driver --device 7 <ARGS> <SUB-COMMAND> <SUB-ARGS>

An invalid device number should result in a runtime error like so:
RUNTIME_FAILURE: iree/runtime/src/iree/hal/drivers/hip/hip_device.c:499: FAILED_PRECONDITION; HIP driver error 'hipErrorInvalidDevice' (101): invalid device ordinal
The easiest way to benchmark on AMD GPU systems is using the rocprofv3 tool
(included in the docker image). Here's a sample command to dump a *.pftrace
file that may be opened using Perfetto for further
analysis.
rocprofv3 --output-format pftrace -r -- build/bin/benchmarks/fusilli_benchmark_driver --iter 10 conv -F 1 --bf16 -n 16 -c 288 --in_d 2 -H 48 -W 32 -k 288 --fil_d 2 -y 1 -x 1 --pad_d 0 -p 0 -q 0 --conv_stride_d 2 -u 1 -v 1 --dilation_d 1 -l 1 -j 1 --in_layout "NDHWC" --out_layout "NDHWC" --fil_layout "NDHWC" --spatial_dim 3

To save the benchmark results as csv, specify --output-format csv instead.
To skip building benchmarks, specify the cmake flag
-DFUSILLI_BUILD_BENCHMARKS=OFF.
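When sweeping over many driver configurations, it can help to assemble the rocprofv3 invocation in a script. The following is a minimal sketch and not part of Fusilli: the function name `rocprof_command` is hypothetical, and it uses only the rocprofv3 options that appear in the sample command above (`--output-format`, `-r`, `--`) and the two output formats mentioned in this README.

```python
import shlex

# Driver path as used throughout this README.
DRIVER = "build/bin/benchmarks/fusilli_benchmark_driver"


def rocprof_command(driver_args, output_format="pftrace", driver=DRIVER):
    """Return the argv list for profiling one benchmark driver invocation.

    driver_args is the string you would otherwise type after the driver
    binary, e.g. "--iter 10 matmul -M 1024 ...".
    """
    if output_format not in ("pftrace", "csv"):
        # This sketch only accepts the two formats mentioned in the README.
        raise ValueError(f"unsupported output format: {output_format}")
    return [
        "rocprofv3", "--output-format", output_format, "-r", "--",
        driver, *shlex.split(driver_args),
    ]


if __name__ == "__main__":
    argv = rocprof_command(
        "--iter 10 matmul -M 1024 -N 1024 -K 1024 "
        "--a_type bf16 --b_type bf16 --out_type bf16",
        output_format="csv",
    )
    # Print a copy-pasteable command line; pass argv to subprocess.run to execute it.
    print(shlex.join(argv))
```

Returning an argv list rather than a single string avoids quoting problems when the result is handed to `subprocess.run`.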
The Python benchmark wrapper (benchmarks/run_benchmark.py) provides a
convenient way to run multiple benchmarks from a commands file and collect
results:
python benchmarks/run_benchmark.py \
-f <commands_file> \
-o <output_csv> \
[--driver <path_to_driver>] \
[--Xiree-compile=<flag>]

Basic usage example:
# Create a commands file
cat > commands.txt <<EOF
--device 0 --iter 100 matmul -M 1024 -N 1024 -K 1024 --a_type bf16 --b_type bf16 --out_type bf16
--device 0 --iter 100 matmul -M 2048 -N 2048 -K 2048 --a_type bf16 --b_type bf16 --out_type bf16
EOF
# Run benchmarks and save results to CSV
python benchmarks/run_benchmark.py -f commands.txt -o results.csv

The wrapper automatically:
- Parses each command from the file (one per line)
- Runs each benchmark through the C++ driver
- Collects timing statistics using rocprofv3 (min, max, mean, stddev)
- Aggregates results into a CSV file
Key flags:
- -f, --commands-file: File containing benchmark commands (one per line)
- -o, --csv: Output CSV file for results (default: benchmark_results.csv)
- --driver: Path to benchmark driver (default: auto-detected)
- --Xiree-compile: Pass additional flags to iree-compile (repeatable, see next section)
- -d, --output-dir: Directory to save artifacts (default: temporary)
- --verbose: Enable verbose output
- -t, --timeout: Timeout in seconds per command (default: 30)
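For larger sweeps, a commands file like the one in the basic usage example can be generated rather than typed by hand. The sketch below is illustrative only: the helper names `matmul_commands` and `write_commands_file` are hypothetical, and the generated lines use just the driver flags shown in that example.

```python
def matmul_commands(sizes, device=0, iters=100, dtype="bf16"):
    """Return one benchmark command line per square matmul size."""
    lines = []
    for n in sizes:
        lines.append(
            f"--device {device} --iter {iters} matmul -M {n} -N {n} -K {n} "
            f"--a_type {dtype} --b_type {dtype} --out_type {dtype}"
        )
    return lines


def write_commands_file(path, lines):
    """Write the commands one per line, the layout the wrapper reads."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    # Preview the lines; call write_commands_file("commands.txt", ...) to save them.
    for line in matmul_commands([1024, 2048, 4096]):
        print(line)
```

The resulting file can then be passed to the wrapper with `-f commands.txt` as in the basic usage example.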
You can pass custom IREE compiler flags using the FUSILLI_EXTRA_COMPILER_FLAGS
environment variable or the --Xiree-compile flag with the Python benchmark
wrapper.
Using the C++ benchmark driver with environment variable:
FUSILLI_EXTRA_COMPILER_FLAGS="--iree-opt-level=O3" \
build/bin/benchmarks/fusilli_benchmark_driver --iter 100 \
matmul -M 8192 -N 2048 -K 4096 --transA \
--a_type bf16 --b_type bf16 --out_type bf16

Using the Python benchmark wrapper:
python benchmarks/run_benchmark.py \
--Xiree-compile="--iree-opt-level=O3" \
-o results.csv \
-f commands.txt

Passing multiple compiler flags:

Using environment variable:
FUSILLI_EXTRA_COMPILER_FLAGS="--iree-opt-level=O3 --iree-hal-dump-executable-files-to=/tmp/dump" \
build/bin/benchmarks/fusilli_benchmark_driver ...

Using Python wrapper:
python benchmarks/run_benchmark.py \
--Xiree-compile="--iree-opt-level=O3" \
--Xiree-compile="--iree-hal-dump-executable-files-to=/tmp/dump" \
-f commands.txt -o results.csv

Note
If an extra compiler flag is exposed via CLI but not the C API, please select
the CLI backend (set FUSILLI_COMPILE_BACKEND_USE_CLI=1). Currently,
--iree-codegen-tuning-spec-path requires this since it is not exposed
through the C API. This limitation is being addressed and will be lifted
shortly.
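The two forms above carry the same information: the environment variable holds a space-separated list, while the wrapper takes one `--Xiree-compile` argument per flag. A small sketch of converting between them is shown below. The helper names are hypothetical, and it assumes flags are separated by plain whitespace with no quoting, which is how the README describes the variable but may not cover every case Fusilli handles.

```python
def xiree_compile_args(extra_flags):
    """Turn a space-separated flag string into repeated --Xiree-compile args."""
    # Assumption: flags contain no embedded spaces, so whitespace splitting suffices.
    return [f"--Xiree-compile={flag}" for flag in extra_flags.split()]


def wrapper_command(extra_flags, commands_file, output_csv):
    """Build the argv for benchmarks/run_benchmark.py with the given flags."""
    return [
        "python", "benchmarks/run_benchmark.py",
        *xiree_compile_args(extra_flags),
        "-f", commands_file,
        "-o", output_csv,
    ]


if __name__ == "__main__":
    flags = "--iree-opt-level=O3 --iree-hal-dump-executable-files-to=/tmp/dump"
    print(" ".join(wrapper_command(flags, "commands.txt", "results.csv")))
```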
IREE tuning specs (transform dialect libraries) specify optimal compiler code generation parameters such as workgroup sizes, tile sizes, MMA intrinsics, and shared memory allocation suited for specific workloads. You can pass tuning specs using the custom compiler flags feature described above.
Example with C++ benchmark driver:
FUSILLI_COMPILE_BACKEND_USE_CLI=1 \
FUSILLI_EXTRA_COMPILER_FLAGS="--iree-codegen-tuning-spec-path=/path/to/tuning_spec.mlir" \
build/bin/benchmarks/fusilli_benchmark_driver --iter 100 \
matmul -M 8192 -N 2048 -K 4096 --transA \
--a_type bf16 --b_type bf16 --out_type bf16

Example with Python benchmark wrapper:
FUSILLI_COMPILE_BACKEND_USE_CLI=1 \
python benchmarks/run_benchmark.py \
--Xiree-compile="--iree-codegen-tuning-spec-path=/path/to/tuning_spec.mlir" \
-o results.csv \
-f commands.txt

Code coverage works with gcc builds (code coverage with clang instrumentation is future work).
To generate code coverage metrics:
cmake -GNinja -S. -Bbuild \
-DCMAKE_C_COMPILER=gcc \
-DCMAKE_CXX_COMPILER=g++ \
-DFUSILLI_CODE_COVERAGE=ON \
-DIREE_SOURCE_DIR=</path/to/iree/source>
cmake --build build --target all
ctest --test-dir build -T test -T coverage

This generates the *.gcda and *.gcno files with coverage info. At this
point one may use an IDE to visualize the coverage info inlaid with the source
code. If using VSCode's gcov-viewer extension: Hit Cmd+Shift+P -> Gcov
Viewer: Reload (Import gcda files) to load coverage info and Cmd+Shift+P ->
Gcov Viewer: Reset (Delete gcda files) to reset it.
To generate an HTML (interactive) coverage report:
lcov --capture --directory build --output-file build/coverage.info
# Exclude external sources from being reported in code coverage
# For example:
# /usr/include/c++/13/*
# /usr/include/x86_64-linux-gnu/c++/*
# /usr/local/include/catch2/*
lcov --remove build/coverage.info '/usr/*' '*/iree/*' --output-file build/coverage.info
genhtml build/coverage.info --output-directory coverage_report

This project is set up to use pre-commit hooks for
lint checks (such as clang-format for C++ and black for python sources). To
install it in your local clone, run pre-commit install. After this, hooks
will automatically run when making commits locally.
To manually run pre-commit on all files:
pre-commit run --all-files

To run clang-format standalone:
find . -path ./build -prune -o \( -type f \( -name "*.cpp" -o -name "*.h" \) -print \) | xargs clang-format -i

We also use clang-tidy for static analysis. To run clang-tidy during
compilation, specify the cmake flag -DFUSILLI_ENABLE_CLANG_TIDY=ON when
building Fusilli.
Fusilli records execution flow through the logging interface. This is disabled by default but can be enabled for debugging.
To configure logging behavior using environment variables:
| Set output stream \ Enable logging | FUSILLI_LOG_INFO = 0 | FUSILLI_LOG_INFO = 1 |
|---|---|---|
| FUSILLI_LOG_FILE not set | no logging | no logging |
| FUSILLI_LOG_FILE set to stdout or stderr | no logging | logging to cout / cerr |
| FUSILLI_LOG_FILE set to /path/to/file.txt | no logging | logging to file.txt |
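The table above can be read as a small decision function. The sketch below models it in Python purely for illustration; it is not Fusilli's implementation, and the function name `log_destination` is hypothetical. It assumes that an unset FUSILLI_LOG_INFO (or any value other than 1) means logging is off, consistent with logging being disabled by default, and that an empty FUSILLI_LOG_FILE counts as not set.

```python
def log_destination(env):
    """Return None (no logging), "cout", "cerr", or a file path.

    env is a mapping of environment variables, e.g. os.environ.
    """
    # Assumption: anything other than "1" leaves logging disabled.
    if env.get("FUSILLI_LOG_INFO", "0") != "1":
        return None
    target = env.get("FUSILLI_LOG_FILE")
    # Assumption: an empty value is treated the same as an unset variable.
    if not target:
        return None
    if target == "stdout":
        return "cout"
    if target == "stderr":
        return "cerr"
    # Any other value is taken to be a file path.
    return target


if __name__ == "__main__":
    print(log_destination({"FUSILLI_LOG_INFO": "1", "FUSILLI_LOG_FILE": "stderr"}))
```

Note that both variables are needed: enabling FUSILLI_LOG_INFO without choosing a stream still produces no output.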
Tests and samples that are built with the cmake flag
-DFUSILLI_ENABLE_LOGGING=ON have their environment variables automatically
configured for logging to cout.
Alternatively, one may call the logging API directly as needed:
- Calling fusilli::isLoggingEnabled() = <true|false> has the same effect as setting FUSILLI_LOG_INFO = 1|0.
- Calling fusilli::getStream() = <stream_name> has the same effect as setting the output stream using FUSILLI_LOG_FILE.

| Environment Variable | Description |
|---|---|
| FUSILLI_COMPILE_BACKEND_USE_CLI | Enables the use of the CLI tool to invoke compilation, otherwise uses CAPI |
| FUSILLI_EXTERNAL_IREE_COMPILE | Path to iree-compile binary |
| FUSILLI_EXTERNAL_IREE_COMPILER_LIB | Path to the IREE compiler dynamic library |
| FUSILLI_EXTERNAL_ROCM_AGENT_ENUMERATOR | Path to rocm_agent_enumerator binary |
| FUSILLI_EXTERNAL_AMD_SMI | Path to amd-smi binary (used for GPU SKU detection) |
| FUSILLI_EXTRA_COMPILER_FLAGS | Space-separated list of additional flags to pass to iree-compile (e.g., "--iree-codegen-tuning-spec-path=/path/to/spec.mlir --iree-opt-level=O3") |
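
When launching the benchmark driver from a script, these variables can be set on a copy of the environment instead of being exported globally. The sketch below is one possible approach rather than an official helper: `driver_env` is a hypothetical name, and the automatic selection of the CLI backend reflects the current limitation noted earlier, namely that --iree-codegen-tuning-spec-path is not exposed through the C API.

```python
import os

TUNING_SPEC_FLAG = "--iree-codegen-tuning-spec-path"


def driver_env(extra_flags, base_env=None):
    """Return a copy of the environment configured for the given compiler flags."""
    env = dict(os.environ if base_env is None else base_env)
    # FUSILLI_EXTRA_COMPILER_FLAGS is documented as a space-separated list.
    env["FUSILLI_EXTRA_COMPILER_FLAGS"] = " ".join(extra_flags)
    # The tuning spec flag currently requires the CLI backend, so select it
    # whenever that flag is present.
    if any(flag.startswith(TUNING_SPEC_FLAG) for flag in extra_flags):
        env["FUSILLI_COMPILE_BACKEND_USE_CLI"] = "1"
    return env


if __name__ == "__main__":
    env = driver_env([f"{TUNING_SPEC_FLAG}=/path/to/tuning_spec.mlir"], base_env={})
    for key in sorted(env):
        print(f"{key}={env[key]}")
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when invoking the driver.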
