---
layout: post
title: "Post from Oct 16, 2024"
date: 2024-10-16 18:10:25 +0000
slug: 1729102225
tags: [stable-diffusion, c++, cuda, easydiffusion, lab, performance, featured]
---

**tl;dr** - *Today, I worked on using stable-diffusion.cpp from a simple C++ program: first as a dynamically linked library, then by compiling sd.cpp from scratch (with and without CUDA). The intent was to get a tiny, fast-starting executable UI for Stable Diffusion working. Also, ChatGPT is very helpful!*

## Part 1: Using sd.cpp as a library

First, I tried calling the [stable-diffusion.cpp](https://github.com/leejet/stable-diffusion.cpp) library from a simple C++ program (which just loads the model and renders an image), via dynamic linking. That worked: the performance was the same as the example `sd.exe` CLI, and it detected and used the GPU correctly.

The basic commands for this were (using MinGW64):
```shell
# generate a .def file listing the DLL's exported symbols
gendef stable-diffusion.dll
# create a MinGW import library from the .def file
dlltool --dllname stable-diffusion.dll --output-lib libstable-diffusion.a --input-def stable-diffusion.def
# compile and link against the import library
g++ -o your_program your_program.cpp -L. -lstable-diffusion
```

And I had to set a `CMAKE_GENERATOR="MinGW Makefiles"` environment variable. The steps will be different if using MSVC's `cl.exe`.

I figured that I could write a simple HTTP server in C++ that wraps sd.cpp. Using a different language would involve keeping the language binding up-to-date with sd.cpp's header file. For example, the [Go-lang wrapper](https://github.com/seasonjs/stable-diffusion) is currently out-of-date with sd.cpp's latest header.

This thin-wrapper C++ server wouldn't be too complex; it would just act as a rendering backend process for a more complex Go-lang based server (which would implement other user-facing features like model management, task queue management etc).

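To make the thin-wrapper idea a bit more concrete, here's a rough sketch of what that backend process could look like. This is just my sketch (not code from sd.cpp), and it assumes the [cpp-httplib](https://github.com/yhirose/cpp-httplib) single-header library for the HTTP part; the actual sd.cpp calls are stubbed out here and shown in the example below:
```cpp
#include "httplib.h"  // cpp-httplib, single-header HTTP server (assumed dependency)
#include <string>

// Placeholder: in the real wrapper this would call new_sd_ctx() / txt2img()
// (as in the example below) and return the encoded image bytes.
static std::string render_image(const std::string& prompt) {
    return "TODO: image bytes for: " + prompt;
}

int main() {
    httplib::Server server;

    // POST /render with the prompt in the request body; responds with raw image bytes.
    server.Post("/render", [](const httplib::Request& req, httplib::Response& res) {
        res.set_content(render_image(req.body), "application/octet-stream");
    });

    server.listen("127.0.0.1", 8080);
    return 0;
}
```
Keeping the HTTP surface this small means the Go server would own all the user-facing logic, and this process would only need to know how to render.
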
Here's a simple C++ example of calling the library directly:
```cpp
#include "stable-diffusion.h"
#include <iostream>

int main() {
    // Create the Stable Diffusion context
    sd_ctx_t* ctx = new_sd_ctx("F:\\path\\to\\sd-v1-5.safetensors", "", "", "", "", "", "", "", "", "", "",
                               false, false, false, -1, SD_TYPE_F16, STD_DEFAULT_RNG, DEFAULT, false, false, false);

    if (ctx == NULL) {
        std::cerr << "Failed to create Stable Diffusion context." << std::endl;
        return -1;
    }

    // Generate image using txt2img
    sd_image_t* image = txt2img(ctx, "A beautiful landscape painting", "", 0, 7.5f, 1.0f, 512, 512,
                                EULER_A, 25, 42, 1, NULL, 0.0f, 0.0f, false, "");

    if (image == NULL) {
        std::cerr << "txt2img failed." << std::endl;
        free_sd_ctx(ctx);
        return -1;
    }

    // Output image details
    std::cout << "Generated image: " << image->width << "x" << image->height << std::endl;

    // Cleanup (freeing the image buffer returned by txt2img is omitted here for brevity)
    free_sd_ctx(ctx);

    return 0;
}
```

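The example above only prints the image's dimensions. To actually write the pixels to disk, a single-header image writer can be used. Here's a small sketch assuming [stb_image_write](https://github.com/nothings/stb) is on the include path (an assumption on my part, not something the example above requires); the field names follow sd.cpp's `sd_image_t` struct:
```cpp
// Sketch: save a generated sd_image_t as a PNG using stb_image_write (assumed dependency).
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"

#include "stable-diffusion.h"

// The buffer from sd.cpp is packed 8-bit RGB with no row padding,
// so the stride is width * channel bytes.
bool save_image_png(const sd_image_t* image, const char* path) {
    return stbi_write_png(path, image->width, image->height, image->channel,
                          image->data, image->width * image->channel) != 0;
}
```
So the example above could call something like `save_image_png(image, "output.png")` right after printing the dimensions.
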
## Part 2: Compiling sd.cpp from scratch (as a sub-folder in my project)

*Update: This code is now available in [a github repo](https://github.com/cmdr2/sd.cpp).*

The next experiment was to compile sd.cpp from scratch on my PC (using the MinGW compiler as well as Microsoft's VS compiler). I used sd.cpp as a git submodule in my project, and linked to it statically.

I needed this initially to investigate a segfault inside a function of `stable-diffusion.dll`, which I wasn't able to trace (even with `gdb`). Plus, it was fun to compile the whole thing and see the entire Stable Diffusion implementation fit into a tiny binary that starts up really quickly: just a few megabytes for the CPU-only build.

My folder tree was:
```
- stable-diffusion.cpp # sub-module dir
- src/main.cpp
- CMakeLists.txt
```

`src/main.cpp` is the same as before, except for this addition (a log callback, registered at the start of `int main()`) in order to capture the logs:
```cpp
void sd_log_cb(enum sd_log_level_t level, const char* log, void* data) {
    std::cout << log;
}

int main(int argc, char* argv[]) {
    sd_set_log_callback(sd_log_cb, NULL);

    // ... rest of the code is the same
}
```

And `CMakeLists.txt` is:
```cmake
cmake_minimum_required(VERSION 3.13)
project(sd2)

# Set C++ standard
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Add submodule directory for stable-diffusion
add_subdirectory(stable-diffusion.cpp)

# Include directories for stable-diffusion and its dependencies
include_directories(stable-diffusion.cpp src)

# Create executable from your main.cpp
add_executable(sd2 src/main.cpp)

# Link with the stable-diffusion library
target_link_libraries(sd2 stable-diffusion)
```

Compiled using:
```shell
cmake .
cmake --build . --config Release
```

This ran on the CPU, and was obviously slow. But good to see it running!

**Tiny note:** I noticed that compiling with `g++` (mingw64) resulted in faster sampling than MSVC. For example, `3.5 sec/it` vs `4.5 sec/it` for SD 1.5 (euler_a, 256x256, fp32). Not sure why.

## Part 3: Compiling the CUDA version of sd.cpp

Just for the heck of it, I also installed the [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) and compiled the CUDA version of my example project. That took some fiddling. I had to [copy some files around to make it work](https://github.com/NVlabs/tiny-cuda-nn/issues/164#issuecomment-1280749170), and point the `CUDAToolkit_ROOT` environment variable to where the CUDA toolkit was installed (e.g. `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6`).

Compiled using:
```shell
cmake . -DSD_CUBLAS=ON
cmake --build . --config Release
```

The compilation took a *long* time, since it compiled all the CUDA kernels inside ggml. But it worked, and it was as fast as the official `sd.exe` build for CUDA (which confirmed that nothing was misconfigured).

It resulted in a 347 MB binary (which compresses to a 71 MB .7z file for download). That's really good, compared to the 6 GB+ (uncompressed) behemoths in python-land for Stable Diffusion. Even including the CUDA DLLs (which are needed separately), that's "only" another 600 MB uncompressed (300 MB .7z compressed), which is still better.

## Conclusions

The binary size (and being a single static binary) and the startup time are hands-down excellent. So that's pretty promising.

But in terms of performance, sd.cpp seems to be significantly slower for SD 1.5 than Forge WebUI (or even a basic diffusers pipeline): `3 it/sec` vs `7.5 it/sec` for an SD 1.5 image (euler_a, 512x512, fp16) on my NVIDIA 3060 12GB. I tested with the official `sd.exe` build. I don't know if this is just my PC, but [another user](https://github.com/leejet/stable-diffusion.cpp/discussions/29#discussioncomment-10246618) reported something similar.

Interestingly, the implementation for the `Flux` model in sd.cpp runs as fast as Forge WebUI, and is pretty efficient with memory.

Also, I don't think it's really practical or necessary to compile sd.cpp from scratch, but I wanted to have the freedom to use things like the CLIP implementation inside sd.cpp, which isn't exposed via the DLL. But that could also be achieved by submitting a PR to the sd.cpp project, and maybe they'd be okay with exposing the useful inner models in the main DLL as well.

But it'll be interesting to link this with the fast-starting Go frontend (from yesterday), or maybe even to run it as a fast-starting standalone C++ server. Projects like Jellybox already exist (Go-lang frontend and sd.cpp backend), but it's interesting to play with this anyway, to see how small and fast an SD UI can be made.