-
I managed to accelerate model loading by using multithreading, achieving up to 3× faster loading compared to master: https://github.com/rmatif/stable-diffusion.cpp/blob/ref-tensor-loading/model.cpp @leejet, I can open a PR if you think this can be merged.

Regarding the SDXL use case: what the civitai folks are doing is keeping the model in RAM and loading it directly from RAM → VRAM, and what I noticed is that we are severely CPU-bound in this process. I tried using mmap from ggml here: https://github.com/rmatif/stable-diffusion.cpp/tree/add-mmap Storing the model in a ramdisk and loading it into RAM takes ~4.3 s, while a warm load with mmap takes ~1.1 s. The warm load is essentially memory-bound, which means the CPU overhead is actually huge. From my measurements, multithreading can bring the ramdisk → RAM load down to ~1.3 s, leaving a CPU overhead of around ~200 ms. With further optimizations, such as avoiding regex, I think we could reduce this even further.
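For reference, the general shape of the idea is splitting the tensor list across a few worker threads, each issuing its own pread() calls against a shared file descriptor. This is only a minimal sketch under that assumption, not the code in the linked branch; TensorInfo and its fields are placeholders:

```cpp
// Minimal sketch of multithreaded tensor reads (not the actual branch code).
// Assumes POSIX pread(); TensorInfo and its fields are hypothetical placeholders.
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <thread>
#include <vector>

struct TensorInfo {
    uint64_t file_offset;  // byte offset of the tensor data in the model file
    size_t   n_bytes;      // size of the (possibly quantized) data
    void*    dst;          // destination buffer in RAM
};

static void load_tensors_parallel(int fd, std::vector<TensorInfo>& tensors,
                                  unsigned n_threads) {
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t]() {
            // Each worker handles a strided slice of the tensor list;
            // pread() is safe to call concurrently on the same fd.
            for (size_t i = t; i < tensors.size(); i += n_threads) {
                const TensorInfo& ti = tensors[i];
                ssize_t n = pread(fd, ti.dst, ti.n_bytes, (off_t) ti.file_offset);
                if (n != (ssize_t) ti.n_bytes) {
                    // real code should handle short reads / errors here
                }
            }
        });
    }
    for (auto& w : workers) w.join();
}
```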
CUDA does indeed support asynchronous loading, and it works well: once the CPU overhead is reduced, the RAM → VRAM copy is not a bottleneck, imo.
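For illustration, the usual pattern for overlapping those copies with CPU work is a pinned staging buffer plus cudaMemcpyAsync on a stream. A minimal sketch with the plain CUDA runtime API (not sd.cpp/ggml code; error handling omitted):

```cpp
// Minimal sketch of an asynchronous RAM -> VRAM upload with the CUDA runtime
// API. Not sd.cpp/ggml code; error checking omitted for brevity.
#include <cuda_runtime.h>
#include <cstring>

void upload_tensor(void* d_dst, const void* h_src, size_t n_bytes, cudaStream_t stream) {
    // Stage through pinned (page-locked) memory so the copy can run
    // asynchronously while the CPU reads/converts the next tensor.
    void* h_pinned = nullptr;
    cudaMallocHost(&h_pinned, n_bytes);
    std::memcpy(h_pinned, h_src, n_bytes);
    cudaMemcpyAsync(d_dst, h_pinned, n_bytes, cudaMemcpyHostToDevice, stream);
    // A real loader would double-buffer instead of synchronizing per tensor.
    cudaStreamSynchronize(stream);
    cudaFreeHost(h_pinned);
}
```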
-
Following up on the discussion in #772, I took a look at the code and ran a few tests to identify potential optimization opportunities.
All tests were done on an AMD Ryzen 5 3400G, RX 7600 XT, and SSD storage, running Linux, with display output routed to the iGPU. Mostly on Vulkan, though ROCm behaves similarly.
During model loading, each tensor is processed sequentially in three steps (a rough sketch of the loop follows the list):
a) read from the model file;
b) converted (dequantized), depending on the tensor’s weight type;
c) loaded into VRAM (on devices where VRAM is separate from system RAM).
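A rough sketch of that loop (hypothetical helper names; this is not the actual model.cpp code):

```cpp
// Rough sketch of the current per-tensor loading loop: the three steps run
// strictly one after another, one tensor at a time.
#include <cstdint>
#include <utility>
#include <vector>

struct TensorEntry { /* offset, type, shape, destination, ... */ };

// Placeholders standing in for the real implementations:
std::vector<uint8_t> read_tensor_data(const TensorEntry&) { return {}; }                      // (a)
bool needs_conversion(const TensorEntry&) { return false; }
std::vector<uint8_t> dequantize(const TensorEntry&, std::vector<uint8_t> raw) { return raw; } // (b)
void upload_to_vram(const TensorEntry&, const std::vector<uint8_t>&) {}                       // (c)

void load_model(std::vector<TensorEntry>& tensors) {
    for (TensorEntry& t : tensors) {
        std::vector<uint8_t> data = read_tensor_data(t);   // (a) I/O-bound: file -> RAM
        if (needs_conversion(t)) {
            data = dequantize(t, std::move(data));         // (b) CPU-bound, single-threaded
        }
        upload_to_vram(t, data);                           // (c) RAM -> VRAM, GPU/driver dependent
    }
}
```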
Step (a) is, as expected, I/O-bound and varies significantly between cold and hot cache scenarios. The file cache appears effective at minimizing this cost for hot cache.
Step (b) is CPU-bound and currently single-threaded. The overhead depends on the model’s structure: in the SDXL .safetensors file I tested, many small tensors required conversion (about half of the tensors by count, but only around 10% of the bytes); the rest were loaded as-is.
Step (c) is more unpredictable. Its duration can vary widely depending on what else is happening on the GPU. For example, in one instance, simply loading a small LLM model before sd.cpp caused image model loading to take 2 seconds longer (for a total of 12 seconds). In another test, after a system sleep and resume cycle, loading time was roughly halved. I ruled out thermal throttling, though I can’t be 100% certain no other external factor was involved.

For #772, a low-hanging fruit could be decoupling (a+b) from (c): we could add support to the model context to persist the model in RAM, and only load/unload it into VRAM when needed. This would allow a single process to cache a few models and switch between them at just the cost of (c).
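To make the idea concrete, here is a hypothetical shape for such a cache (all names are made up; this is not the actual sd.cpp model context API):

```cpp
// Hypothetical sketch of a RAM-resident model cache that only pays cost (c)
// when switching models; none of these names exist in sd.cpp today.
#include <map>
#include <string>
#include <vector>
#include <cstdint>

struct CachedModel {
    std::vector<uint8_t> weights_ram;   // (a)+(b) already done: converted weights kept in RAM
    bool in_vram = false;
};

class ModelCache {
public:
    // (a)+(b): read + convert once, keep the result in RAM.
    CachedModel& get(const std::string& path) {
        auto it = models.find(path);
        if (it == models.end()) {
            it = models.emplace(path, load_and_convert(path)).first;
        }
        return it->second;
    }

    // (c): only this step runs when switching between cached models.
    // Real code would also evict the previously active model from VRAM.
    void activate(const std::string& path) {
        CachedModel& m = get(path);
        if (!m.in_vram) {
            upload_to_vram(m.weights_ram);
            m.in_vram = true;
        }
    }

private:
    // Placeholders: real code would parse the file, convert weights,
    // and talk to the ggml backend here.
    CachedModel load_and_convert(const std::string& /*path*/) { return {}; }
    void upload_to_vram(const std::vector<uint8_t>& /*data*/) {}
    std::map<std::string, CachedModel> models;
};
```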
I ran a quick test converting the models fully into RAM before loading them into VRAM:
wbruna@579972e. @JustMaier, could you please run a few tests in your setup with this change? The "phase 2" logs should give us a good estimate of the potential performance gains from keeping models cached in a persistent sd.cpp process.
Another option could be allowing (a), (b) and (c) to overlap across different tensors. I noticed ggml has support for loading weights asynchronously into VRAM, although it doesn't look like my hardware supports it.
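Even without backend support for asynchronous uploads, (a)+(b) could overlap with (c) using a plain producer/consumer pipeline on the CPU side. A rough sketch with standard C++ threads (Chunk and the upload call are placeholders, not sd.cpp or ggml code):

```cpp
// Rough producer/consumer sketch for overlapping read+convert with upload.
// Chunk and the commented-out upload call are hypothetical placeholders.
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Chunk { std::vector<uint8_t> data; };

std::queue<Chunk> q;
std::mutex m;
std::condition_variable cv;
bool done = false;

void producer() {
    // (a)+(b): read and convert tensors, hand them off as they become ready.
    for (int i = 0; i < 100; ++i) {              // placeholder loop over tensors
        Chunk c{std::vector<uint8_t>(1 << 20)};  // pretend this was read + converted
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(c)); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lk(m); done = true; }
    cv.notify_one();
}

void consumer() {
    // (c): upload chunks to VRAM while the producer keeps reading/converting.
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return !q.empty() || done; });
        if (q.empty() && done) break;
        Chunk c = std::move(q.front()); q.pop();
        lk.unlock();
        // upload_to_vram(c.data);   // hypothetical backend call
    }
}

int main() {
    std::thread p(producer), c(consumer);
    p.join(); c.join();
}
```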