Fusilli

Fusilli is a C++ Graph API and JIT Frontend for IREE that leverages just-in-time compiled and code-generated kernels to accelerate training and inference workloads. Inspired by cuDNN's graph API, it exposes cuDNN-like primitives but is backed by the power of the IREE compiler and runtime stack.

We believe hand-authored GPU kernel libraries are great for highly tuned performance, but they are difficult to scale to different models or target architectures and painful to package and release efficiently. This project's overarching goal is to complement the ecosystem of ML frameworks and libraries with a JIT solution that is competitive with hand-authored kernel libraries. Beyond the core benefit of a compiler-backed JIT engine that gets progressively and pervasively better, a systemic benefit is reduced build times and binary sizes, which makes software easier to ship.

Warning

🚧 Fusilli is in early stages of development. The operator coverage is limited but growing. APIs may change. 🚧

Note

The name 'Fusilli' is inspired by the term 'fusion' - a bread-and-butter compiler optimization for improving performance.

Developer Guide

Setup

Although optional, we recommend docker as the canonical development setup for a no-fuss quick start, hermetic and reproducible builds, and consistency with CI. Follow these steps to launch an interactive docker container with the required dependencies pre-installed (and skip to the Build and Test section below).

If you prefer a custom setup instead, the following dependencies need to be brought in to build/test Fusilli:

Build Requirements: cmake, ninja-build, clang, IREE

Test Requirements: catch2, lit, FileCheck, iree-opt, iree-compile

Fusilli interfaces with the IREE compiler through either its CLI or its C API, and with the IREE runtime through its C API. The compiler interface is selected via an environment variable: setting FUSILLI_COMPILE_BACKEND_USE_CLI uses the CLI, otherwise the C API is used. The IREE compiler is a heavy dependency to build (due to MLIR/LLVM), so we recommend using a prebuilt release, either from a python nightly package or a shared library distribution. The IREE runtime, on the other hand, is much more lightweight and is designed to be built from source and statically linked in. IREE does not export a shared runtime library; this allows maximum flexibility with low-level and toolchain-specific (LTO-style) optimizations.

The easiest way to get lit and the iree-* CLI tools is through pip install. FileCheck comes packaged with clang / llvm distributions. Everything else should be available via apt.

Build and Test

Build and test Fusilli as follows:

cmake -GNinja -S. -Bbuild \
    -DCMAKE_C_COMPILER=clang \
    -DCMAKE_CXX_COMPILER=clang++ \
    -DCMAKE_BUILD_TYPE=<Debug|Release|RelWithDebInfo> \
    -DIREE_SOURCE_DIR=</path/to/iree/source>
cmake --build build --target all
ctest --test-dir build

When building on an AMD GPU system, specify -DFUSILLI_SYSTEMS_AMDGPU=ON to enable the AMDGPU build.

To re-run failed tests verbosely:

ctest --test-dir build --rerun-failed --output-on-failure --verbose

To run tests in parallel (concurrently):

ctest --test-dir build --output-on-failure -j $(nproc)

Tests and samples are also built as standalone binary targets (in the build/bin directory) to make debugging isolated failures easier.

To skip building tests and samples, specify the cmake flag -DFUSILLI_BUILD_TESTS=OFF.

Benchmarks

The benchmark driver is a command line tool that takes a set of args and sub-command args to run operation specific benchmarks:

build/bin/benchmarks/fusilli_benchmark_driver <ARGS> <SUB-COMMAND> <SUB-ARGS>

To dump compilation artifacts to disk (${HOME}/.cache/fusilli by default), specify the --dump flag on the main driver (not the subcommand). The location to dump to can be configured by setting the FUSILLI_CACHE_DIR environment variable.

build/bin/benchmarks/fusilli_benchmark_driver --dump <ARGS> <SUB-COMMAND> <SUB-ARGS>

To benchmark on a specific GPU when multiple AMD GPUs are present, specify the --device <int> flag with the device number reported by rocm-smi. For example, this runs the benchmark on device 7 (when there are 8 GPUs):

build/bin/benchmarks/fusilli_benchmark_driver --device 7 <ARGS> <SUB-COMMAND> <SUB-ARGS>

An invalid device number should result in a runtime error like so:

RUNTIME_FAILURE: iree/runtime/src/iree/hal/drivers/hip/hip_device.c:499: FAILED_PRECONDITION; HIP driver error 'hipErrorInvalidDevice' (101): invalid device ordinal

The easiest way to benchmark on AMD GPU systems is using the rocprofv3 tool (included in the docker image). Here's a sample command to dump a *.pftrace file that may be opened using Perfetto for further analysis.

rocprofv3 --output-format pftrace -r -- build/bin/benchmarks/fusilli_benchmark_driver --iter 10 conv -F 1 --bf16 -n 16 -c 288 --in_d 2 -H 48 -W 32 -k 288 --fil_d 2 -y 1 -x 1 --pad_d 0 -p 0 -q 0 --conv_stride_d 2 -u 1 -v 1 --dilation_d 1 -l 1 -j 1 --in_layout "NDHWC" --out_layout "NDHWC" --fil_layout "NDHWC" --spatial_dim 3

To save the benchmark results as csv, specify --output-format csv instead.

To skip building benchmarks, specify the cmake flag -DFUSILLI_BUILD_BENCHMARKS=OFF.

Python Benchmark Wrapper

The Python benchmark wrapper (benchmarks/run_benchmark.py) provides a convenient way to run multiple benchmarks from a commands file and collect results:

python benchmarks/run_benchmark.py \
  -f <commands_file> \
  -o <output_csv> \
  [--driver <path_to_driver>] \
  [--Xiree-compile=<flag>]

Basic usage example:

# Create a commands file
cat > commands.txt <<EOF
--device 0 --iter 100 matmul -M 1024 -N 1024 -K 1024 --a_type bf16 --b_type bf16 --out_type bf16
--device 0 --iter 100 matmul -M 2048 -N 2048 -K 2048 --a_type bf16 --b_type bf16 --out_type bf16
EOF

# Run benchmarks and save results to CSV
python benchmarks/run_benchmark.py -f commands.txt -o results.csv
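
For larger sweeps, the commands file can be generated rather than written by hand. The following is a minimal sketch that uses only the matmul flags shown above; the helper names `matmul_command` and `write_commands_file` and the list of sizes are illustrative and not part of Fusilli.

```python
from pathlib import Path


def matmul_command(m, n, k, dtype="bf16", device=0, iters=100):
    """Format one benchmark-driver command line (hypothetical helper)."""
    return (
        f"--device {device} --iter {iters} matmul "
        f"-M {m} -N {n} -K {k} "
        f"--a_type {dtype} --b_type {dtype} --out_type {dtype}"
    )


def write_commands_file(path, sizes):
    """Write one square-matmul command per line and return the lines."""
    lines = [matmul_command(s, s, s) for s in sizes]
    Path(path).write_text("\n".join(lines) + "\n")
    return lines


if __name__ == "__main__":
    # Example sweep; the sizes are arbitrary.
    write_commands_file("commands.txt", [1024, 2048, 4096])
```

The resulting file can then be passed to the wrapper with -f commands.txt as in the example above.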

The wrapper automatically:

Parses each command from the file (one per line)
Runs each benchmark through the C++ driver
Collects timing statistics using rocprofv3 (min, max, mean, stddev)
Aggregates results into a CSV file
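
The statistics named in the list above can be illustrated with the Python standard library. This is a sketch of the arithmetic only, not the wrapper's implementation: the function name `summarize` and the sample durations are made up, and the sample (n - 1) form of the standard deviation is an assumption, since the form the wrapper reports is not stated here.

```python
import statistics


def summarize(durations):
    """Return min, max, mean and sample stddev for a list of timings."""
    return {
        "min": min(durations),
        "max": max(durations),
        "mean": statistics.mean(durations),
        # statistics.stdev requires at least two samples.
        "stddev": statistics.stdev(durations) if len(durations) > 1 else 0.0,
    }


if __name__ == "__main__":
    # Made-up kernel durations, for illustration only.
    print(summarize([10.0, 12.0, 14.0]))
```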

Key flags:

-f, --commands-file: File containing benchmark commands (one per line)
-o, --csv: Output CSV file for results (default: benchmark_results.csv)
--driver: Path to benchmark driver (default: auto-detected)
--Xiree-compile: Pass additional flags to iree-compile (repeatable, see next section)
-d, --output-dir: Directory to save artifacts (default: temporary)
--verbose: Enable verbose output
-t, --timeout: Timeout in seconds per command (default: 30)

Custom Compiler Flags

You can pass custom IREE compiler flags using the FUSILLI_EXTRA_COMPILER_FLAGS environment variable or the --Xiree-compile flag with the Python benchmark wrapper.

Using the C++ benchmark driver with environment variable:

FUSILLI_EXTRA_COMPILER_FLAGS="--iree-opt-level=O3" \
  build/bin/benchmarks/fusilli_benchmark_driver --iter 100 \
  matmul -M 8192 -N 2048 -K 4096 --transA \
  --a_type bf16 --b_type bf16 --out_type bf16

Using the Python benchmark wrapper:

python benchmarks/run_benchmark.py \
  --Xiree-compile="--iree-opt-level=O3" \
  -o results.csv \
  -f commands.txt

Passing multiple compiler flags:

Using environment variable:

FUSILLI_EXTRA_COMPILER_FLAGS="--iree-opt-level=O3 --iree-hal-dump-executable-files-to=/tmp/dump" \
  build/bin/benchmarks/fusilli_benchmark_driver ...

Using Python wrapper:

python benchmarks/run_benchmark.py \
  --Xiree-compile="--iree-opt-level=O3" \
  --Xiree-compile="--iree-hal-dump-executable-files-to=/tmp/dump" \
  -f commands.txt -o results.csv
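
Both forms carry the same flags: the environment variable takes one space-separated string, while the wrapper takes one --Xiree-compile argument per flag. Below is a minimal sketch that derives both from a single list. The helper names are illustrative, and the sketch assumes no flag contains a space, since quoting rules for FUSILLI_EXTRA_COMPILER_FLAGS are not described here.

```python
def env_value(flags):
    """Join flags into the space-separated form used by the env variable."""
    for flag in flags:
        if " " in flag:
            # Assumption: values with spaces cannot be expressed safely.
            raise ValueError(f"flag contains a space: {flag!r}")
    return " ".join(flags)


def wrapper_args(flags):
    """Expand flags into repeated --Xiree-compile=<flag> arguments."""
    return [f"--Xiree-compile={flag}" for flag in flags]


if __name__ == "__main__":
    flags = [
        "--iree-opt-level=O3",
        "--iree-hal-dump-executable-files-to=/tmp/dump",
    ]
    print(env_value(flags))
    print(wrapper_args(flags))
```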

Note

If an extra compiler flag is exposed via CLI but not the C API, please select the CLI backend (set FUSILLI_COMPILE_BACKEND_USE_CLI=1). Currently, --iree-codegen-tuning-spec-path requires this since it is not exposed through the C API. This limitation is being addressed and will be lifted shortly.

Tuning Specs

IREE tuning specs (transform dialect libraries) specify optimal compiler code generation parameters such as workgroup sizes, tile sizes, MMA intrinsics, and shared memory allocation suited for specific workloads. You can pass tuning specs using the custom compiler flags feature described above.

Example with C++ benchmark driver:

FUSILLI_COMPILE_BACKEND_USE_CLI=1 \
FUSILLI_EXTRA_COMPILER_FLAGS="--iree-codegen-tuning-spec-path=/path/to/tuning_spec.mlir" \
  build/bin/benchmarks/fusilli_benchmark_driver --iter 100 \
  matmul -M 8192 -N 2048 -K 4096 --transA \
  --a_type bf16 --b_type bf16 --out_type bf16

Example with Python benchmark wrapper:

FUSILLI_COMPILE_BACKEND_USE_CLI=1 \
  python benchmarks/run_benchmark.py \
  --Xiree-compile="--iree-codegen-tuning-spec-path=/path/to/tuning_spec.mlir" \
  -o results.csv \
  -f commands.txt
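
When the driver is launched from a script, the same two environment variables can be set programmatically. The sketch below only builds the environment and does not run anything; `tuning_env` is a hypothetical helper, and appending to any existing FUSILLI_EXTRA_COMPILER_FLAGS value is a design choice of this sketch rather than documented Fusilli behavior.

```python
import os


def tuning_env(spec_path, base_env=None):
    """Return a copy of the environment configured for a tuning spec."""
    env = dict(os.environ if base_env is None else base_env)
    # The tuning spec flag currently requires the CLI compile backend.
    env["FUSILLI_COMPILE_BACKEND_USE_CLI"] = "1"
    flag = f"--iree-codegen-tuning-spec-path={spec_path}"
    existing = env.get("FUSILLI_EXTRA_COMPILER_FLAGS", "")
    env["FUSILLI_EXTRA_COMPILER_FLAGS"] = f"{existing} {flag}".strip()
    return env


if __name__ == "__main__":
    env = tuning_env("/path/to/tuning_spec.mlir", base_env={})
    print(env)
```

The returned dictionary can be passed as the env argument of subprocess.run when invoking the benchmark driver or the Python wrapper.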

Code Coverage (using gcov + lcov)

This works with gcc builds (code coverage with clang instrumentation is future work).

To generate code coverage metrics:

cmake -GNinja -S. -Bbuild \
    -DCMAKE_C_COMPILER=gcc \
    -DCMAKE_CXX_COMPILER=g++ \
    -DFUSILLI_CODE_COVERAGE=ON \
    -DIREE_SOURCE_DIR=</path/to/iree/source>
cmake --build build --target all
ctest --test-dir build -T test -T coverage

This generates the *.gcda and *.gcno files with coverage info. At this point one may use an IDE to visualize the coverage info overlaid on the source code. If using VSCode's gcov-viewer extension: Hit Cmd+Shift+P -> Gcov Viewer: Reload (Import gcda files) to load coverage info and Cmd+Shift+P -> Gcov Viewer: Reset (Delete gcda files) to reset it.

To generate an HTML (interactive) coverage report:

lcov --capture --directory build --output-file build/coverage.info
# Exclude external sources from being reported in code coverage
# For example:
#   /usr/include/c++/13/*
#   /usr/include/x86_64-linux-gnu/c++/*
#   /usr/local/include/catch2/*
lcov --remove build/coverage.info '/usr/*' '*/iree/*' --output-file build/coverage.info
genhtml build/coverage.info --output-directory coverage_report

Lint

This project is set up to use pre-commit hooks for lint checks (such as clang-format for C++ and black for python sources). To install the hooks in your local clone, run pre-commit install. After this, the hooks run automatically when making commits locally.

To manually run pre-commit on all files:

pre-commit run --all-files

To run clang-format standalone:

find . -path ./build -prune -o \( -type f \( -name "*.cpp" -o -name "*.h" \) -print \) | xargs clang-format -i

We also use clang-tidy for static analysis. To run clang-tidy during compilation, specify the cmake flag -DFUSILLI_ENABLE_CLANG_TIDY=ON when building Fusilli.

Logging

Fusilli records execution flow through the logging interface. This is disabled by default but can be enabled for debugging.

To configure logging behavior using environment variables:

| Set output stream \ Enable logging | `FUSILLI_LOG_INFO` = 0 | `FUSILLI_LOG_INFO` = 1 |
| --- | --- | --- |
| `FUSILLI_LOG_FILE` not set | no logging | no logging |
| `FUSILLI_LOG_FILE` set to `stdout` or `stderr` | no logging | logging to cout / cerr |
| `FUSILLI_LOG_FILE` set to `/path/to/file.txt` | no logging | logging to file.txt |
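
The table can be read as a small decision function: logging happens only when both variables are set appropriately. The sketch below models the table in Python for illustration; it is not Fusilli's implementation, the function name is hypothetical, and treating any FUSILLI_LOG_INFO value other than "1" as disabled is an assumption.

```python
def logging_destination(env):
    """Return None for no logging, else 'stdout', 'stderr', or a file path."""
    # Assumption: only the exact value "1" enables logging.
    if env.get("FUSILLI_LOG_INFO") != "1":
        return None
    # An unset (or empty) FUSILLI_LOG_FILE means no logging even when enabled.
    return env.get("FUSILLI_LOG_FILE") or None


if __name__ == "__main__":
    print(logging_destination({"FUSILLI_LOG_INFO": "1", "FUSILLI_LOG_FILE": "stdout"}))
```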

Tests and samples that are built with the cmake flag -DFUSILLI_ENABLE_LOGGING=ON have their environment variables automatically configured for logging to cout.

Alternatively, one may call the logging API directly as needed:

Calling fusilli::isLoggingEnabled() = <true|false> has the same effect as setting FUSILLI_LOG_INFO = 1|0.
Calling fusilli::getStream() = <stream_name> has the same effect as setting the output stream using FUSILLI_LOG_FILE.

Environment Variables

| Environment Variable | Description |
| --- | --- |
| `FUSILLI_COMPILE_BACKEND_USE_CLI` | Enables the use of the CLI tool to invoke compilation, otherwise uses the C API |
| `FUSILLI_EXTERNAL_IREE_COMPILE` | Path to `iree-compile` binary |
| `FUSILLI_EXTERNAL_IREE_COMPILER_LIB` | Path to the IREE compiler dynamic library |
| `FUSILLI_EXTERNAL_ROCM_AGENT_ENUMERATOR` | Path to `rocm_agent_enumerator` binary |
| `FUSILLI_EXTERNAL_AMD_SMI` | Path to `amd-smi` binary (used for GPU SKU detection) |
| `FUSILLI_EXTRA_COMPILER_FLAGS` | Space-separated list of additional flags to pass to `iree-compile` (e.g., `"--iree-codegen-tuning-spec-path=/path/to/spec.mlir --iree-opt-level=O3"`) |
