-
I managed to accelerate model loading by using multithreading, achieving up to 3× faster loading compared to master: https://github.com/rmatif/stable-diffusion.cpp/blob/ref-tensor-loading/model.cpp @leejet, I can open a PR if you think this can be merged.

Regarding the SDXL use case: what the civitai folks are doing is keeping the model in RAM and loading it directly from RAM → VRAM, and what I noticed is that we are severely CPU-bound in this process. I tried using mmap from ggml here: https://github.com/rmatif/stable-diffusion.cpp/tree/add-mmap Storing the model in a ramdisk and loading it into RAM takes ~4.3 s, while a warm load with mmap takes ~1.1 s. The warm load is essentially memory-bound, which means the CPU overhead is actually huge. From my measurements, multithreading can bring the ramdisk → RAM load down to ~1.3 s, leaving a CPU overhead of around ~200 ms. With further optimizations, such as avoiding regex, I think we could reduce this even further.
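For reference, the general shape of the idea is splitting the tensor list across a few worker threads, each issuing its own pread() calls against a shared file descriptor. This is only a minimal sketch under that assumption, not the code in the linked branch; TensorInfo and its fields are placeholders:

```cpp
// Minimal sketch of multithreaded tensor reads (not the actual branch code).
// Assumes POSIX pread(); TensorInfo and its fields are hypothetical placeholders.
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <thread>
#include <vector>

struct TensorInfo {
    uint64_t file_offset;  // byte offset of the tensor data in the model file
    size_t   n_bytes;      // size of the (possibly quantized) data
    void*    dst;          // destination buffer in RAM
};

static void load_tensors_parallel(int fd, std::vector<TensorInfo>& tensors,
                                  unsigned n_threads) {
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t]() {
            // Each worker handles a strided slice of the tensor list;
            // pread() is safe to call concurrently on the same fd.
            for (size_t i = t; i < tensors.size(); i += n_threads) {
                const TensorInfo& ti = tensors[i];
                ssize_t n = pread(fd, ti.dst, ti.n_bytes, (off_t) ti.file_offset);
                if (n != (ssize_t) ti.n_bytes) {
                    // real code should handle short reads / errors here
                }
            }
        });
    }
    for (auto& w : workers) w.join();
}
```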
CUDA does indeed support asynchronous loading, and it works well: once the CPU overhead is reduced, the RAM → VRAM copy is not a bottleneck, imo.
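For illustration, the usual pattern for overlapping those copies with CPU work is a pinned staging buffer plus cudaMemcpyAsync on a stream. A minimal sketch with the plain CUDA runtime API (not sd.cpp/ggml code; error handling omitted):

```cpp
// Minimal sketch of an asynchronous RAM -> VRAM upload with the CUDA runtime
// API. Not sd.cpp/ggml code; error checking omitted for brevity.
#include <cuda_runtime.h>
#include <cstring>

void upload_tensor(void* d_dst, const void* h_src, size_t n_bytes, cudaStream_t stream) {
    // Stage through pinned (page-locked) memory so the copy can run
    // asynchronously while the CPU reads/converts the next tensor.
    void* h_pinned = nullptr;
    cudaMallocHost(&h_pinned, n_bytes);
    std::memcpy(h_pinned, h_src, n_bytes);
    cudaMemcpyAsync(d_dst, h_pinned, n_bytes, cudaMemcpyHostToDevice, stream);
    // A real loader would double-buffer instead of synchronizing per tensor.
    cudaStreamSynchronize(stream);
    cudaFreeHost(h_pinned);
}
```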
-
Following up on the discussion in #772, I took a look at the code and ran a few tests to identify potential optimization opportunities.
All tests were done on an AMD Ryzen 5 3400G, RX 7600 XT, and SSD storage, running Linux, with display output routed to the iGPU. Mostly on Vulkan, though ROCm behaves similarly.
During model loading, each tensor is processed sequentially in three steps (a rough sketch of the loop follows the list):
a) read from the model file;
b) converted (dequantized), depending on the tensor’s weight type;
c) loaded into VRAM (on devices where VRAM is separate from system RAM).
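A rough sketch of that loop (hypothetical helper names; this is not the actual model.cpp code):

```cpp
// Rough sketch of the current per-tensor loading loop: the three steps run
// strictly one after another, one tensor at a time.
#include <cstdint>
#include <utility>
#include <vector>

struct TensorEntry { /* offset, type, shape, destination, ... */ };

// Placeholders standing in for the real implementations:
std::vector<uint8_t> read_tensor_data(const TensorEntry&) { return {}; }                      // (a)
bool needs_conversion(const TensorEntry&) { return false; }
std::vector<uint8_t> dequantize(const TensorEntry&, std::vector<uint8_t> raw) { return raw; } // (b)
void upload_to_vram(const TensorEntry&, const std::vector<uint8_t>&) {}                       // (c)

void load_model(std::vector<TensorEntry>& tensors) {
    for (TensorEntry& t : tensors) {
        std::vector<uint8_t> data = read_tensor_data(t);   // (a) I/O-bound: file -> RAM
        if (needs_conversion(t)) {
            data = dequantize(t, std::move(data));         // (b) CPU-bound, single-threaded
        }
        upload_to_vram(t, data);                           // (c) RAM -> VRAM, GPU/driver dependent
    }
}
```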
Step (a) is, as expected, I/O-bound and varies significantly between cold and hot cache scenarios. The file cache appears effective at minimizing this cost for hot cache.
Step (b) is CPU-bound and currently single-threaded. The overhead depends on the model’s structure: in the SDXL .safetensors file I tested, many small tensors required conversion (about half of the tensors by count, but only around 10% of the bytes); the rest were loaded as-is.
Step (c) is more unpredictable. Its duration can vary widely depending on what else is happening on the GPU. For example, in one instance, simply loading a small LLM model before sd.cpp caused image model loading to take 2 seconds longer (for a total of 12 seconds). In another test, after a system sleep and resume cycle, loading time was roughly halved. I ruled out thermal throttling, though I can’t be 100% certain no other external factor was involved.

For #772, a low-hanging fruit could be decoupling (a+b) from (c): we could add support to the model context to persist the model in RAM, and only load/unload it into VRAM when needed. This would allow a single process to cache a few models and switch between them at just the cost of (c).
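To make the idea concrete, here is a hypothetical shape for such a cache (all names are made up; this is not the actual sd.cpp model context API):

```cpp
// Hypothetical sketch of a RAM-resident model cache that only pays cost (c)
// when switching models; none of these names exist in sd.cpp today.
#include <map>
#include <string>
#include <vector>
#include <cstdint>

struct CachedModel {
    std::vector<uint8_t> weights_ram;   // (a)+(b) already done: converted weights kept in RAM
    bool in_vram = false;
};

class ModelCache {
public:
    // (a)+(b): read + convert once, keep the result in RAM.
    CachedModel& get(const std::string& path) {
        auto it = models.find(path);
        if (it == models.end()) {
            it = models.emplace(path, load_and_convert(path)).first;
        }
        return it->second;
    }

    // (c): only this step runs when switching between cached models.
    // Real code would also evict the previously active model from VRAM.
    void activate(const std::string& path) {
        CachedModel& m = get(path);
        if (!m.in_vram) {
            upload_to_vram(m.weights_ram);
            m.in_vram = true;
        }
    }

private:
    // Placeholders: real code would parse the file, convert weights,
    // and talk to the ggml backend here.
    CachedModel load_and_convert(const std::string& /*path*/) { return {}; }
    void upload_to_vram(const std::vector<uint8_t>& /*data*/) {}
    std::map<std::string, CachedModel> models;
};
```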
I ran a quick test converting the models fully into RAM before loading them into VRAM:
wbruna@579972e. @JustMaier, could you please run a few tests in your setup with this change? The "phase 2" logs should give us a good estimate of the potential performance gains from keeping models cached in a persistent sd.cpp process.
Another option could be allowing (a), (b) and (c) to overlap across different tensors. I noticed ggml has support for loading weights asynchronously into VRAM, although it doesn't look like my hardware supports it.
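Even without backend support for asynchronous uploads, (a)+(b) could overlap with (c) using a plain producer/consumer pipeline on the CPU side. A rough sketch with standard C++ threads (Chunk and the upload call are placeholders, not sd.cpp or ggml code):

```cpp
// Rough producer/consumer sketch for overlapping read+convert with upload.
// Chunk and the commented-out upload call are hypothetical placeholders.
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Chunk { std::vector<uint8_t> data; };

std::queue<Chunk> q;
std::mutex m;
std::condition_variable cv;
bool done = false;

void producer() {
    // (a)+(b): read and convert tensors, hand them off as they become ready.
    for (int i = 0; i < 100; ++i) {              // placeholder loop over tensors
        Chunk c{std::vector<uint8_t>(1 << 20)};  // pretend this was read + converted
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(c)); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lk(m); done = true; }
    cv.notify_one();
}

void consumer() {
    // (c): upload chunks to VRAM while the producer keeps reading/converting.
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return !q.empty() || done; });
        if (q.empty() && done) break;
        Chunk c = std::move(q.front()); q.pop();
        lk.unlock();
        // upload_to_vram(c.data);   // hypothetical backend call
    }
}

int main() {
    std::thread p(producer), c(consumer);
    p.join(); c.join();
}
```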