---
title: "Post from Oct 10, 2025"
date: 2025-10-10T09:35:45
slug: "1760088945"
tags:
- easydiffusion
- sdkit
- compilers
---

Some notes on machine-learning compilers, gathered while researching tech for Easy Diffusion's next engine (i.e. sdkit v3). For context, see the [design constraints](https://cmdr2.github.io/notes/2025/10/1760085894/) of the new engine.

## tl;dr summary

Caveat: This analysis could change in the future.

The current state is:
1. Vendor-specific compilers are the only performant options on consumer GPUs, e.g. [TensorRT-RTX](https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/index.html) for NVIDIA, [MiGraphX](https://rocm.docs.amd.com/projects/AMDMIGraphX/en/latest/) for AMD, and [OpenVINO](https://github.com/openvinotoolkit/openvino) for Intel.
2. Cross-vendor compilers such as [TVM](https://tvm.apache.org/), [IREE](https://iree.dev/) and [XLA](https://openxla.org/xla) are not yet performant enough for Stable Diffusion-class workloads on consumer GPUs.

Cross-vendor compilers seem to focus on either datacenter hardware or embedded devices, and their performance on desktops and laptops is pretty poor. Mojo doesn't target this category (and doesn't support Windows). That's probably because datacenters and embedded devices are where the attention (and money) currently is.

The idea of a cross-vendor ML compiler is clearly awesome, and I think it's the way things should go. But in terms of runtime performance, we're not there yet on desktops/laptops.

## What's an ML compiler?

The basic idea of an ML compiler is to treat an ML model's execution graph as a program to compile, and to produce an optimized set of GPU-specific instructions from it. The compiler can optimize the execution graph by fusing operations together, parallelizing operations where possible, and even mapping groups of operators to GPU-specific instructions. It can also use its knowledge of the target GPU to optimize the memory layout and parallelism of operations. Basically, what compilers already do for CPUs today, but for GPUs.
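
To make "fusing operations" concrete, here's a rough illustrative sketch (mine, not from any of the compilers below) of the tanh-approximated GELU activation: first as a chain of separate element-wise ops that each materialize an intermediate tensor, then as the single pass over the data that a fused kernel effectively becomes:

```python
import numpy as np

SQRT_2_OVER_PI = 0.7978845608028654

def gelu_unfused(x):
    # Four separate "kernels": each line reads and writes a full array,
    # so every intermediate result round-trips through memory.
    a = SQRT_2_OVER_PI * (x + 0.044715 * x**3)
    b = np.tanh(a)
    c = 1.0 + b
    return 0.5 * x * c

def gelu_fused(x):
    # What a fusing compiler aims for: one traversal of the data, computing
    # the whole expression per element, with no intermediates stored.
    out = np.empty_like(x)
    for i, v in enumerate(x.flat):
        out.flat[i] = 0.5 * v * (1.0 + np.tanh(SQRT_2_OVER_PI * (v + 0.044715 * v**3)))
    return out

x = np.random.randn(8)
assert np.allclose(gelu_unfused(x), gelu_fused(x))
```

On a GPU the same idea applies per kernel launch rather than per Python loop, and the compiler also picks memory layouts and launch parameters based on the target hardware.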

We already have a decent graph format: ONNX. Every model that I intend to support has ONNX exports available (and it's easy to export one for new models).
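
As a reference point, exporting a PyTorch module to ONNX is a single `torch.onnx.export` call. The toy module, tensor names and shapes below are made up purely for illustration (they're not from sdkit):

```python
import torch
import torch.nn as nn

# Stand-in module, just to keep the example self-contained.
class TinyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(4, 4, kernel_size=3, padding=1)

    def forward(self, x):
        return torch.relu(self.conv(x))

model = TinyBlock().eval()
example_input = torch.randn(1, 4, 64, 64)

torch.onnx.export(
    model,
    (example_input,),
    "tiny_block.onnx",
    input_names=["latents"],
    output_names=["out"],
    dynamic_axes={"latents": {0: "batch"}, "out": {0: "batch"}},
    opset_version=17,
)
```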

## Good links for reading more about ML compilers

* [https://huyenchip.com/2021/09/07/a-friendly-introduction-to-machine-learning-compilers-and-optimizers.html](https://huyenchip.com/2021/09/07/a-friendly-introduction-to-machine-learning-compilers-and-optimizers.html)
* [https://unify.ai/blog/deep-learning-compilers](https://unify.ai/blog/deep-learning-compilers)
* [https://spj.science.org/doi/10.34133/icomputing.0040](https://spj.science.org/doi/10.34133/icomputing.0040)
* [https://www.modular.com/blog/democratizing-ai-compute-part-6-what-about-ai-compilers](https://www.modular.com/blog/democratizing-ai-compute-part-6-what-about-ai-compilers)

## ML compiler projects

Cross-vendor ML compilers:
- [XLA](https://openxla.org/xla), 2017 (the first major ML compiler)
- [Apache TVM](https://tvm.apache.org/), 2019
- [IREE](https://iree.dev/), 2023

Vendor-specific ML compilers:
- [TensorRT-RTX](https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/index.html) (NVIDIA-only, Windows and Linux)
- [MiGraphX](https://rocm.docs.amd.com/projects/AMDMIGraphX/en/latest/) (AMD-only, Linux)
- [OpenVINO](https://github.com/openvinotoolkit/openvino) (Intel-only, Windows and Linux)

## Testing compilers

On a Windows 11 desktop with an NVIDIA 3060 12 GB (CUDA backend):

* TensorRT-RTX: the fastest of the lot. Supports weight-stripped engines (for custom model weights) and LoRA. (A build sketch follows this list.)
* IREE: 30x slower than PyTorch on the SD 1.5 VAE (130 MB), and comparable for tiny models (13 MB). So it doesn't look promising for larger models like the SD 1.5 UNet (1.7 GB) or Flux (6-12 GB). (A minimal timing harness is sketched after this list.)
* TVM: I wasn't able to get it working. I managed to compile TVM for CUDA, but couldn't compile an ONNX graph with it. They're rewriting major parts of the codebase, and the docs and code are out of date. I'm sure this could have been figured out, but I don't feel confident building a new engine on top of a shifting codebase, for unclear performance on desktops. Maybe once it has stabilized, for a future engine.
* `torch.compile` (with WSL) still requires torch, which doesn't fit the "< 200 MB" installation size target of the new engine.
* ExecuTorch isn't focused on desktops/laptops.
* XLA is pretty confusing. It apparently ends up using cuDNN/cuBLAS anyway (which exceeds the "< 200 MB" installation size target of the new engine).
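
For the TensorRT-RTX note above: as I understand it, its API closely mirrors the classic TensorRT builder/parser flow, so here's a rough sketch of that flow using the classic `tensorrt` Python package to build a weight-strippable FP16 engine from an ONNX file. Treat it as an outline (module name, flag names and file names may differ for TensorRT-RTX), not as verified code:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# "unet.onnx" is a placeholder path, not a file from this post.
with open("unet.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
# Weight-stripped engines keep the optimized graph but not the weights,
# so one shipped engine can later be refit with each checkpoint's weights.
config.set_flag(trt.BuilderFlag.STRIP_PLAN)

engine_bytes = builder.build_serialized_network(network, config)
with open("unet.engine", "wb") as f:
    f.write(engine_bytes)
```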
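
If you want to reproduce this kind of comparison, a minimal GPU timing harness looks roughly like this (illustrative sketch; `fn` stands in for whichever backend is being measured, and the warmup/synchronize steps matter because CUDA launches are asynchronous):

```python
import time
import torch

def benchmark(fn, x, warmup=5, iters=20):
    # Warmup lets lazy initialization and any JIT/compile work finish
    # before the timed runs.
    for _ in range(warmup):
        fn(x)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    # Wait for the GPU to finish before reading the clock.
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# Hypothetical usage with an eager PyTorch baseline:
# latents = torch.randn(1, 4, 64, 64, device="cuda", dtype=torch.float16)
# print(f"{benchmark(vae_decoder, latents) * 1000:.1f} ms/iter")
```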

I don't have AMD or Intel GPUs to test MiGraphX or OpenVINO, but I plan to compile with them anyway and ask for testing help on Easy Diffusion's [Discord server](https://discord.com/invite/u9yhsFmEkB). From what I've read, their features fit my needs, and I don't doubt their performance numbers (since it's their own hardware).
