Commit 36ffb9c
Automated commit from Lambda
1 parent 235d366
17 files changed, +549 -0 lines changed

---
layout: post
title: "Post from Sep 04, 2024"
date: 2024-09-04 15:20:49 +0000
slug: 1725463249
tags: [easydiffusion, ai, lab, performance, featured]
---

**tl;dr**: Explored a possible optimization for Flux with `diffusers` when using `enable_sequential_cpu_offload()`. It did not work.

While trying to run Flux (nearly 22 GB of weights) with `diffusers` on a 12 GB graphics card, I noticed that it barely used any GPU memory with `enable_sequential_cpu_offload()`, and it was super slow. It turns out that the largest module in Flux's transformer model is around 108 MB, so because diffusers streams modules one at a time, peak VRAM usage never rose above a few hundred MB.
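
For reference, the setup looked roughly like this (a minimal sketch; I'm assuming the `FLUX.1-schnell` checkpoint here, and the exact arguments in my runs may have differed):

```python
# Minimal sketch of the setup described above (assumes the FLUX.1-schnell
# checkpoint; the exact model and arguments may have differed in my runs).
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)

# Stream modules to the GPU one at a time. Peak VRAM stays at a few hundred MB,
# but every denoising step transfers the weights from CPU to GPU again.
pipe.enable_sequential_cpu_offload()

image = pipe(
    "a photo of a forest at dawn",
    height=512,
    width=512,
    num_inference_steps=4,
).images[0]
image.save("out.png")
```
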

And that felt odd - a few hundred MB in use on a 12 GB graphics card. Wouldn't it be faster if it always kept 8-9 GB of the model weights on the GPU, and streamed only the rest? Less data to stream == less time wasted on memory transfers == better rendering speed?

The summary is that, strangely enough, this optimization did not result in a real improvement. Sometimes it was barely any faster. So, IMO, not worth the added complexity. Quantization probably has better ROI.

## Idea:

The way a diffusion pipeline usually works is: it first runs the `text encoder` module(s) once, then runs the `vae` module once (for encoding), then loops over the `unet`/`transformer` module several times (i.e. the `inference steps`), and finally runs the `vae` module once again (for decoding).
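
In rough pseudocode, the call order looks like this (a schematic only, with placeholder names, not diffusers' actual internals):

```python
# Schematic of a typical diffusion pipeline's call order (placeholder names,
# not diffusers' actual internals).
def generate(prompt, num_inference_steps):
    cond = text_encoder(prompt)              # runs once
    latents = vae_encode(initial_image())    # runs once (encoding)

    for step in range(num_inference_steps):  # the expensive loop
        # with sequential offload, every pass streams the transformer's
        # ~22 GB of weights from the CPU to the GPU all over again
        noise_pred = transformer(latents, cond, step)
        latents = scheduler_step(noise_pred, latents, step)

    return vae_decode(latents)               # runs once (decoding)
```
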

So the idea was to keep a large fraction of the `unet`/`transformer` sub-modules on the GPU (instead of offloading them back to the CPU). This way, the second/third/fourth/etc. loop of the `unet`/`transformer` would need to transfer less data per iteration, and therefore incur less GPU-to-CPU-and-back transfer overhead. 14 GB transferred per loop, instead of 22 GB.

## Which modules did I pin?

For deciding which modules to "pin" to the GPU, I tried both orders: sorting by the smallest modules and pinning those first, as well as sorting by the largest modules and pinning those first. The first approach intended to reduce the I/O wait time during computation (waiting for lots of small modules), while the second approach intended to keep the big modules resident on the GPU, to avoid large transfers.
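
The selection logic was essentially this (a rough sketch rather than my actual code; the VRAM budget and the `largest_first` flag are the two knobs I experimented with):

```python
# Rough sketch of the "pinning" idea (not my actual code): pick sub-modules of
# the transformer up to a VRAM budget and keep those on the GPU, letting the
# sequential-offload machinery stream only the rest.
def pick_modules_to_pin(transformer, vram_budget_bytes, largest_first=True):
    sized = []
    for name, module in transformer.named_children():
        size = sum(p.numel() * p.element_size() for p in module.parameters())
        sized.append((size, name, module))

    sized.sort(key=lambda item: item[0], reverse=largest_first)

    pinned, used = [], 0
    for size, name, module in sized:
        if used + size > vram_budget_bytes:
            continue
        pinned.append((name, module))
        used += size
    return pinned

# e.g. pin roughly 8 GB worth of sub-modules, largest first:
#   for name, module in pick_modules_to_pin(pipe.transformer, 8 * 1024**3):
#       module.to("cuda")  # and exclude these modules from the offload hooks
```
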

Neither approach seemed to change the result.

## Results:

Unfortunately, the performance gain ranged from non-existent to very marginal. I ran each test twice, to ensure that the OS would have its page files warmed up equally.

For a 512x512 image with 4 steps:
* With this optimization: with 8 GB "pinned" (and the rest streamed), the overall image generation took 124 seconds.
* Without this optimization: with everything streamed, the overall image generation took 122 seconds.

In other runs with 4 steps, the optimization was sometimes faster by 5-10 seconds, and sometimes about the same.

With more steps (e.g. 10 steps), the optimization is usually better by 15-20 seconds (i.e. 160 seconds total vs 180 seconds).

## Possible Explanation (of why it didn't work):

OS paging, driver caching, or PyTorch caching. The first loop iteration would obviously be very slow, since it would read everything (including the "pinned" modules) from the CPU to the GPU.

The subsequent inference iterations were actually faster with the optimization (2.5s vs 4s). But since the first iteration constituted nearly 95% of the total time, any savings from this optimization only affected the remaining 5%.

And I think paging or driver/PyTorch caching is making the corresponding I/O transfer times very similar after the 2nd iteration. While 2.5 sec (optimized) is faster than 4 sec (unoptimized) in subsequent iterations, the improvement is not very impactful - 4 seconds for transferring 22 GB is already pretty comparable to the optimized version, presumably due to heavy OS paging.

So the basic premise of this experiment turned out to be wrong. Subsequent iterations of the `unet`/`transformer` **do not** incur a heavy I/O overhead while streaming modules to the GPU. Therefore pinning a large chunk of the model on the GPU didn't save much time - not enough to make a difference to the overall render time.

---
layout: post
title: "Post from Oct 16, 2024"
date: 2024-10-16 18:10:25 +0000
slug: 1729102225
tags: [stable-diffusion, c++, cuda, easydiffusion, lab, performance, featured]
---

**tl;dr** - *Today, I worked on using stable-diffusion.cpp from a simple C++ program: as a linked library, as well as compiling sd.cpp from scratch (with and without CUDA). The intent was to get a tiny and fast-starting executable UI for Stable Diffusion working. Also, ChatGPT is very helpful!*

## Part 1: Using sd.cpp as a library

First, I tried calling the [stable-diffusion.cpp](https://github.com/leejet/stable-diffusion.cpp) library from a simple C++ program (which just loads the model and renders an image), via dynamic linking. That worked, its performance was the same as the example `sd.exe` CLI, and it detected and used the GPU correctly.

The basic commands for this were (using MinGW64):

```shell
# generate a .def file listing the DLL's exported symbols
gendef stable-diffusion.dll

# create a MinGW import library from the .def file
dlltool --dllname stable-diffusion.dll --output-lib libstable-diffusion.a --input-def stable-diffusion.def

# compile the program and link it against the import library
g++ -o your_program your_program.cpp -L. -lstable-diffusion
```

And I had to set a `CMAKE_GENERATOR="MinGW Makefiles"` environment variable. The steps will be different if using MSVC's `cl.exe`.

I figured that I could write a simple HTTP server in C++ that wraps sd.cpp. Using a different language would involve keeping the language binding up-to-date with sd.cpp's header file. For example, the [Go-lang wrapper](https://github.com/seasonjs/stable-diffusion) is currently out-of-date with sd.cpp's latest header.

This thin-wrapper C++ server wouldn't be too complex; it would just act as a rendering backend process for a more complex Go-lang based server (which would implement other user-facing features like model management, task queue management etc).

Here's a simple C++ example:

```cpp
#include "stable-diffusion.h"
#include <iostream>

int main() {
    // Create the Stable Diffusion context
    sd_ctx_t* ctx = new_sd_ctx("F:\\path\\to\\sd-v1-5.safetensors", "", "", "", "", "", "", "", "", "", "",
                               false, false, false, -1, SD_TYPE_F16, STD_DEFAULT_RNG, DEFAULT, false, false, false);

    if (ctx == NULL) {
        std::cerr << "Failed to create Stable Diffusion context." << std::endl;
        return -1;
    }

    // Generate image using txt2img
    sd_image_t* image = txt2img(ctx, "A beautiful landscape painting", "", 0, 7.5f, 1.0f, 512, 512,
                                EULER_A, 25, 42, 1, NULL, 0.0f, 0.0f, false, "");

    if (image == NULL) {
        std::cerr << "txt2img failed." << std::endl;
        free_sd_ctx(ctx);
        return -1;
    }

    // Output image details
    std::cout << "Generated image: " << image->width << "x" << image->height << std::endl;

    // Cleanup
    free_sd_ctx(ctx);

    return 0;
}
```

## Part 2: Compiling sd.cpp from scratch (as a sub-folder in my project)

*Update: This code is now available in [a github repo](https://github.com/cmdr2/sd.cpp).*

The next experiment was to compile sd.cpp from scratch on my PC (using the MinGW compiler as well as Microsoft's VS compiler). I used sd.cpp as a git submodule in my project, and linked to it statically.

I needed this initially to investigate a segfault inside a function of `stable-diffusion.dll`, which I wasn't able to trace (even with `gdb`). Plus it was fun to compile the entire thing and see the entire Stable Diffusion implementation fit into a tiny binary that starts up really quickly - a few megabytes for the CPU-only build.

My folder tree was:

```
- stable-diffusion.cpp # sub-module dir
- src/main.cpp
- CMakeLists.txt
```

`src/main.cpp` is the same as before, except for this change at the start of `int main()` (in order to capture the logs):

```cpp
void sd_log_cb(enum sd_log_level_t level, const char* log, void* data) {
    std::cout << log;
}

int main(int argc, char* argv[]) {
    sd_set_log_callback(sd_log_cb, NULL);

    // ... rest of the code is the same
}
```

And `CMakeLists.txt` is:

```cmake
cmake_minimum_required(VERSION 3.13)
project(sd2)

# Set C++ standard
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Add submodule directory for stable-diffusion
add_subdirectory(stable-diffusion.cpp)

# Include directories for stable-diffusion and its dependencies
include_directories(stable-diffusion.cpp src)

# Create executable from your main.cpp
add_executable(sd2 src/main.cpp)

# Link with the stable-diffusion library
target_link_libraries(sd2 stable-diffusion)
```

Compiled using:

```shell
# configure (run from the project root), then build
cmake .
cmake --build . --config Release
```

This ran on the CPU, and was obviously slow. But it was good to see it running!

**Tiny note:** I noticed that compiling with `g++` (mingw64) resulted in faster iteration times compared to MSVC. For example, `3.5 sec/it` vs `4.5 sec/it` for SD 1.5 (euler_a, 256x256, fp32). Not sure why.

## Part 3: Compiling the CUDA version of sd.cpp

Just for the heck of it, I also installed the [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) and compiled the CUDA version of my example project. That took some fiddling. I had to [copy some files around to make it work](https://github.com/NVlabs/tiny-cuda-nn/issues/164#issuecomment-1280749170), and point the `CUDAToolkit_ROOT` environment variable to where the CUDA toolkit was installed (e.g. `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6`).

Compiled using:

```shell
# configure with CUDA (cuBLAS) support, then build
cmake . -DSD_CUBLAS=ON
cmake --build . --config Release
```

The compilation took a *long* time, since it compiled all the CUDA kernels inside ggml. But it worked, and was as fast as the official `sd.exe` build for CUDA (which confirmed that nothing was misconfigured).

It resulted in a 347 MB binary (which compresses to a 71 MB .7z file for download). That's really good, compared to the 6 GB+ (uncompressed) behemoths in python-land for Stable Diffusion. Even including the CUDA DLLs (which are needed separately), that's "only" another 600 MB uncompressed (300 MB .7z compressed), which is still better.

## Conclusions

The binary size (a single static binary) and the startup time are hands-down excellent. So that's pretty promising.

But in terms of performance, sd.cpp seems to be significantly slower for SD 1.5 than Forge WebUI (or even a basic diffusers pipeline): `3 it/sec` vs `7.5 it/sec` for an SD 1.5 image (euler_a, 512x512, fp16) on my NVIDIA 3060 12GB. I tested with the official `sd.exe` build. I don't know if this is just my PC, but [another user](https://github.com/leejet/stable-diffusion.cpp/discussions/29#discussioncomment-10246618) reported something similar.

Interestingly, the implementation of the `Flux` model in sd.cpp runs as fast as Forge WebUI, and is pretty efficient with memory.

Also, I don't think it's really practical or necessary to compile sd.cpp from scratch, but I wanted the freedom to use things like the CLIP implementation inside sd.cpp, which isn't exposed via the DLL. That could also be achieved by submitting a PR to the sd.cpp project, and maybe they'd be okay with exposing the useful inner models in the main DLL as well.

But it'll be interesting to link this with the fast-starting Go frontend (from yesterday), or maybe even run it as a fast-starting standalone C++ server. Projects like Jellybox exist (a Go-lang frontend with an sd.cpp backend), but it's interesting to play with this anyway, to see how small and fast an SD UI can be made.

---
layout: post
title: "Post from Nov 19, 2024"
date: 2024-11-19 19:18:15 +0000
slug: 1732043895
tags: [easydiffusion, stable-diffusion]
---

Spent a few days getting a C++ based version of Easy Diffusion working, using stable-diffusion.cpp. I'm working with a fork of stable-diffusion.cpp [here](https://github.com/cmdr2/stable-diffusion.cpp), to add a few changes like per-step callbacks, live image previews etc.

It doesn't have a UI yet, and it currently hardcodes a model path. It exposes a RESTful API server (written using the `Crow` C++ library), and uses a simple task manager that runs image generation tasks on a thread. The generated images are available at an API endpoint, which returns the binary JPEG/PNG image (instead of base64-encoding it).
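
To illustrate the shape of the API, here's a hypothetical client-side sketch (the routes and JSON fields below are placeholders, not the actual endpoints):

```python
# Hypothetical client-side sketch of the kind of API described above.
# The routes and JSON fields are placeholders, not the server's real endpoints.
import requests

BASE = "http://localhost:8080"

# submit an image generation task to the task queue
task = requests.post(f"{BASE}/render", json={
    "prompt": "a watercolor painting of a lighthouse",
    "width": 512,
    "height": 512,
    "steps": 25,
}).json()

# poll for completion, then fetch the finished image as raw binary (not base64)
status = requests.get(f"{BASE}/task/{task['id']}").json()
if status.get("done"):
    image_bytes = requests.get(f"{BASE}/image/{task['id']}").content
    with open("result.png", "wb") as f:
        f.write(image_bytes)
```
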

The general intent of this project is to play with ideas for version 4 of [Easy Diffusion](https://github.com/easydiffusion/easydiffusion), and make it as lightweight, easy to install, and fast as possible, cutting out as much unnecessary bloat as possible.

---
layout: post
title: "Post from Nov 21, 2024"
date: 2024-11-21 15:17:56 +0000
slug: 1732202276
tags: [easydiffusion, stable-diffusion, c++]
---

Spent some more time on the [v4 experiments](https://github.com/cmdr2/easy-diffusion4) for Easy Diffusion (i.e. C++ based, fast-startup, lightweight). `stable-diffusion.cpp` is missing a few features which will be necessary for Easy Diffusion's typical workflow. I wasn't keen on forking stable-diffusion.cpp, but it's probably faster to work on [a fork](https://github.com/cmdr2/stable-diffusion.cpp) for now.

For now, I've added live preview and per-step progress callbacks (based on a few pending pull-requests on sd.cpp), and protection from `GGML_ASSERT` killing the entire process. I've also been looking at the ability to load individual models (like the vae) without needing to reload the entire SD model.

### sd.cpp for Flux in ED 3.5?

As a side-idea, I could use sd.cpp as the Flux and SD3 backend for the current version of Easy Diffusion. The Forge backend in ED 3.5 beta is a bit prone to crashing, and I'm only using it for Flux.

Long-term considerations aside, it might be an interesting experiment to try sd.cpp in ED 3.5 and see if it's more stable than Forge for that purpose.

I could write a simple web server with a similar API to Forge's and ship it.

---
layout: post
title: "Post from Dec 14, 2024"
date: 2024-12-14 19:47:38 +0000
slug: 1734205658
tags: [easydiffusion, ui, design, v4]
---

Worked on a few UI design ideas for Easy Diffusion v4. I've uploaded the work-in-progress mockups at [https://github.com/easydiffusion/files](https://github.com/easydiffusion/files).

So far, I've mocked out the design for the outer skeleton. That is, the new tabbed interface, the status bar, and the unified main menu. I also worked on how they would look on mobile devices.

It gives me a rough idea of the `Vue` components that would need to be written, and the surface area that plugins can impact. For example, plugins can add a new menu entry only in the `Plugins` sub-menu.

The mockups draw inspiration from earlier versions of Easy Diffusion (obviously), Google Chrome, Firefox, and VS Code.

---
layout: post
title: "Post from Dec 17, 2024"
date: 2024-12-17 11:03:10 +0000
slug: 1734433390
tags: [easydiffusion, v4, ui]
---

Notes on two directions for ED4's UI that I'm unlikely to continue on.

One is to start a desktop app with a full-screen webview (for the app UI). The other is to write the tabbed browser-like shell of ED4 in a compiled language (like Go or C++) and load the contents of the tabs as regular webpages (using webviews). So it would load URLs like `http://localhost:9000/ui/image_editor` and `http://localhost:9000/ui/settings` etc.

In the first approach, we would start an empty full-screen webview and let the webpage draw the entire UI, including the tabbed shell. The only purpose of this would be to start a desktop app instead of opening a browser tab, while being very lightweight (compared to Electron/Tauri style implementations).

In the second approach, the shell would essentially be like 2008-era Google Chrome [1]: super lightweight and fast. The purpose of this would be to have a fast-starting UI, and to provide scaffolding for other apps like this that need tabbed interfaces.

Realistically, neither of the two approaches is actually *really* necessary for ED4's goals [2]. It's already really fast to open a browser tab, and I don't see a strong justification for the added project complexity of compiling webviews and maintaining a native-language shell. For example, I use a custom locally-hosted diary app, which also opens a browser tab for its UI, and I've never once felt that its startup time was too slow for my taste. On the contrary, I'm always pleased by how quickly it starts up.

I don't really care whether ED starts in a browser tab or runs as a dedicated desktop app. I just want ED4's UI to be interactable within a few hundred milliseconds of launching it. That's the goal.

---

[1] To be honest, the second approach is an old pet idea of mine (from 2010), of writing things like IDEs in a fast, lightweight tabbed UI (like 2008-era Chrome), back when IDEs used to be massive trucks that took forever to load (Eclipse, Netbeans, Visual Studio etc). Chrome was also very novel in writing the rest of its user interface in HTML (e.g. Settings, Bookmarks, Downloads etc). So this is more of a pet itch, rather than something that came out of ED4's project needs. I might explore this again one day, but it doesn't really matter that much to me right now.

[2] Another downside of the second approach is that it prevents ED from being used remotely from other computers via a web browser.

---
layout: post
title: "Post from Jan 03, 2025"
date: 2025-01-03 15:38:31 +0000
slug: 1735918711
tags: [easydiffusion, ui, v4]
---

Spent a few days prototyping a UI for Easy Diffusion v4. Files are at [this repo](https://github.com/easydiffusion/files/blob/main/ED4-ui-design/prototype).

The main focus was to get a simple but pluggable UI, backed by a reactive data model, and to allow splitting the codebase into individual components (each in their own file). And to require only a text editor and a browser to develop with, i.e. no compilation or nodejs-based developer experiences.

I really want something that is easy to understand - for an outside developer and for myself (e.g. if I'm returning to a portion of the codebase after a while). And with very little friction to start developing for it.

It uses Vue, but directly in the browser. I use [vue3-sfc-loader](https://github.com/FranckFreiburger/vue3-sfc-loader) to allow the UI to be divided into separate component files, without requiring compilation.

I got a basic tabbed interface shell working, laid out the foundational data structures, and tested that plugins could add new tabs as well.

Next, I'm going to experiment with [PrimeVue](https://primevue.org/) for fleshing out a simple UI. I looked at quite a few UI libraries (including classic Bootstrap), and PrimeVue seems closest to my own mindset - if I designed a UI library, it would look a lot like PrimeVue. And it seems to have most of the components that I require.

---
layout: post
title: "Post from Jan 04, 2025"
date: 2025-01-04 19:57:06 +0000
slug: 1736020626
tags: [easydiffusion, amd, directml]
---

Spent most of the day doing some support work for Easy Diffusion, and experimenting with [torch-directml](https://pypi.org/project/torch-directml/) for AMD support on Windows.

From the initial experiments, torch-directml seems to work properly with Easy Diffusion. I ran it on my NVIDIA card, and another user ran it on their AMD Radeon RX 7700 XT.
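
For context, wiring `torch-directml` into a plain diffusers pipeline looks roughly like this (a minimal sketch; the model path is a placeholder, and Easy Diffusion's actual integration involves more plumbing):

```python
# Minimal sketch of running a plain diffusers pipeline on torch-directml.
# The model path is a placeholder; ED's actual integration has more plumbing.
import torch_directml
from diffusers import StableDiffusionPipeline

dml = torch_directml.device()  # the DirectML device (AMD/Intel/NVIDIA on Windows)

pipe = StableDiffusionPipeline.from_pretrained("path/to/stable-diffusion-v1-5")
pipe = pipe.to(dml)

image = pipe("a cabin in the mountains, oil painting", num_inference_steps=25).images[0]
image.save("out.png")
```
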
12+
13+
It's 7-10x faster than the CPU, so looks promising. It's 2x slower than CUDA on my NVIDIA card, but users with NVIDIA cards are not the target audience of this change.
14+
15+
I still need to run the full set of automated tests, so there's a chance of some corner scenario breaking.
