
Conversation

@wbruna

@wbruna wbruna commented Sep 9, 2025

This updates the embedded sd.cpp code to master-306-2abe945 (originally master-c648001 plus a few pending sd.cpp fixes; the target then moved through master-52a97b3, master-0ebe6fe, and master-301-fd693ac as upstream progressed).

For anyone wondering, it does not bring in any new functionality; I'll be happy if everything is still working as before 🙂 The diff is massive, and at several points I had to basically rewrite the needed changes.

I did the sync through the following branches:

The pending PRs below got merged upstream:

For sdtype_adapter.cpp, instead of mirroring each new supported parameter, I decided to clean everything up, keeping only the parameters that can be changed through the Koboldcpp API. The others are either fixed in the code (like batch_count) or left at the defaults set by sd.cpp. Most changes are unrelated to the sd.cpp update, so I could submit them separately if you wish.
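
To make the intent concrete, here's a minimal sketch of that approach; the struct and function names below are illustrative placeholders, not the actual sdtype_adapter.cpp or sd.cpp identifiers:

#include <string>

// Hypothetical sketch only: kcpp_api_request / gen_params / make_gen_params are
// placeholder names, not the real adapter or sd.cpp types.
struct kcpp_api_request {          // fields the Koboldcpp API can actually change
    std::string prompt;
    int steps = 20;
    float cfg_scale = 7.0f;
};

struct gen_params {                // stand-in for sd.cpp's generation parameters
    std::string prompt;
    int steps = 20;
    float cfg_scale = 7.0f;
    int batch_count = 1;
};

static gen_params make_gen_params(const kcpp_api_request &req) {
    gen_params p;                  // anything not set here keeps its sd.cpp default
    p.prompt      = req.prompt;    // only API-exposed fields are copied over
    p.steps       = req.steps;
    p.cfg_scale   = req.cfg_scale;
    p.batch_count = 1;             // fixed in code, never exposed through the API
    return p;
}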

For the other changes, it may be easier to compare the previous diff chunks (added in 6cf0a96df1a8) against the new ones by diffing between the branches above.

The PR is ready in the I'm-not-aware-of-anything-missing sense, but I've only tested it lightly, so I'll leave it as draft for now. I'll be reviewing and testing it for a few days. Feel free to bug me with anything you notice.

@LostRuins
Owner

LostRuins commented Sep 13, 2025

I'll be away for about a week, will try to look at this after. Once you've synced and tested, you can unmark it as draft first.

@LostRuins
Owner

Hmm seems like your previous PR conflicts with this?

@wbruna
Author

wbruna commented Sep 13, 2025

Yeah, I didn't touch the stb headers with this one, so we'll have at least a small conflict on main. I'll fix that one shortly.

Looks like leejet/stable-diffusion.cpp#484 is about to be included upstream, too! I should at least grab the current PR branch.

Edit: all four fixes got merged 🎉 I'll probably just reconstruct the branches.

@wbruna wbruna force-pushed the kcpp_update_sdcpp_2509 branch from 1e596c5 to c99ac4d on September 13, 2025 11:39
@wbruna wbruna force-pushed the kcpp_update_sdcpp_2509 branch from c99ac4d to 9c61e8d on September 14, 2025 12:45
@wbruna wbruna changed the title from "update stable-diffusion.cpp to master-c648001 (+fixes)" to "update stable-diffusion.cpp to master-52a97b3" on Sep 14, 2025
@wbruna
Author

wbruna commented Sep 14, 2025

Removed local changes due to merged upstream PRs:

  • do not force CLIP to F32 (automatic now, since Koboldcpp doesn't enable textual inversions)
  • VAE tiling fixes
  • do not force VAE to F32 on SDXL
  • NULL handling for sd_img_gen_params_t::input_id_images_path
  • model loading progress display throttling (now using upstream version)

Also removed:

And included:

  • forcing T5 to run on CPU (upstream now relies on the clip-on-cpu flag)
    The flag isn't enough for us, because we may not know we need T5 until after the model is loaded (it could be a single 'fat' model file)

@wbruna wbruna force-pushed the kcpp_update_sdcpp_2509 branch from 9c61e8d to 1fdea05 on September 14, 2025 15:04
@wbruna wbruna changed the title from "update stable-diffusion.cpp to master-52a97b3" to "update stable-diffusion.cpp to master-0ebe6fe" on Sep 15, 2025
@wbruna
Author

wbruna commented Sep 15, 2025

master-0ebe6fe changed the Photomaker interface to receive a list of images instead of an image directory, which was exactly what our kcpp_img_gen_params_t did; so we don't need that diff anymore 🎉 I've removed the extra parameter and related processing code from the image generation functions. The only remaining Photomaker-related diff is the code that disables it for non-SDXL models.

// Local (KoboldCpp-side) change: if the conditioner weights are not F32,
// keep the text encoder on the CPU backend; quantized T5 can crash or
// produce black images on some GPU backends (see discussion below).
if (conditioner_wtype != GGML_TYPE_F32) {
    LOG_INFO("CLIP: Forcing CPU backend for T5");
    clip_on_cpu = true;
}
Author

@wbruna wbruna Sep 18, 2025


Not really sure about this. For instance, t5xxl_q8_0.gguf makes black images on ROCm, but works fine on Vulkan. Forcing it to CPU in this case adds around 45s to generation time... OTOH, a Q4_K quant crashes on both.

@wbruna wbruna changed the title from "update stable-diffusion.cpp to master-0ebe6fe" to "update stable-diffusion.cpp to master-301-fd693ac" on Sep 18, 2025
@LostRuins
Owner

I'm back from my trip, so I'll take a look at this over the next few days. Kobold is overdue for a release though, so it'll probably have to wait for the next one.

@wbruna
Author

wbruna commented Sep 21, 2025

Welcome back!

Yeah, the changes are extensive, and there is at least one problematic upstream issue (844) still unsolved. But take a look at the three commits before the sd.cpp updates; they may be simple enough to include right now (I can put them in a separate PR if that helps).

@LostRuins
Owner

I can put them in a separate PR if that helps

Yes please, that would be good

@wbruna wbruna force-pushed the kcpp_update_sdcpp_2509 branch from a48487c to 0bdc8e9 on September 21, 2025 10:09
@wbruna
Author

wbruna commented Sep 21, 2025

Image generation is kinda broken with the ggml version on the 1.99 release 😕 I'll open an upstream issue shortly.

Edit: leejet/stable-diffusion.cpp#847

@LostRuins
Owner

Thanks, I will temporarily revert ggml-org#16056 for a stable patch while this is investigated upstream.

Btw your other PR #1746 caused a merge conflict

@wbruna wbruna force-pushed the kcpp_update_sdcpp_2509 branch from 0bdc8e9 to 0eaf498 on September 21, 2025 15:37
@LostRuins
Owner

@wbruna can you see if prompt adherence works fine in 1.99.1? I reverted the commit you listed.

@wbruna
Author

wbruna commented Sep 21, 2025

@LostRuins , 1.99.1 seems to be working fine 👍

0cc4m's PR (16151) seems to work, too, but I need to run more tests. Edit: test results are good, so I'll go back to the sdtype_adapter refactoring.

@LostRuins
Owner

merged 16151 for 1.99.2

@henk717
Collaborator

henk717 commented Sep 22, 2025

Will this be updated to support the new Qwen model once that PR lands? That's the main thing our users are waiting for.

@wbruna wbruna force-pushed the kcpp_update_sdcpp_2509 branch from 0eaf498 to ff08284 on September 22, 2025 21:33
@wbruna
Author

wbruna commented Sep 22, 2025

Will this be updated to support the new Qwen model once that PR lands? That's the main thing our users are waiting for.

Since there are a lot of changes, I'd be more comfortable if we gave the update a bit of testing before adding support for specific new models on top. Of course, if Qwen-Image doesn't need any other special parameters or treatment, it may be just a matter of merging in the new revisions.

@wbruna wbruna force-pushed the kcpp_update_sdcpp_2509 branch from ff08284 to 8374423 on September 22, 2025 21:59
@wbruna wbruna mentioned this pull request Sep 22, 2025
@wbruna
Author

wbruna commented Sep 22, 2025

@LostRuins , I moved the sdtype_adapter pre-update refactoring to #1753. I'll then focus this PR on the parts which need to be changed for the sd.cpp update.

Things we'll need to consider at some point:

  • we're still waiting for a fix to the SDXL black output after PR 55c2e05 (leejet/stable-diffusion.cpp#838). It doesn't actually block this update, though, since we could apply a small diff to force a single thread for model loading
  • the keep_clip_on_cpu parameter. It'd be really nice if we could run T5 on the GPU when possible, but that depends on the backend and the model quants. So the sensible approach would be yet another config parameter... maybe a multi-state one like the conv2d setting: CLIP on CPU, VAE on CPU, both, or neither? (a rough sketch of such an option follows this list)
  • leejet used flow-shift in the Qwen Image example, and that'd be another loading-time parameter
  • possible new parameters for image generation: scheduler (the new smoothstep scheduler really helps with Chroma generation, and I've always missed the ability to change it for specific gens), timestep-shift (the Nitro few-step models would be really nice for in-chat image generation), guidance, img-cfg-scale...
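
For illustration, the multi-state option in the second bullet could look roughly like the sketch below; the enum, function, and flag names are made up for the example (keep_clip_on_cpu is the only name taken from the sd.cpp parameter discussed above), so treat it as a shape rather than an implementation:

// Hypothetical sketch, not code from this PR: a single offload-mode setting,
// analogous to the existing conv2d option, mapped onto two boolean flags.
enum class SdCpuOffload {
    None,        // text encoders and VAE both stay on the GPU backend
    ClipOnCpu,   // keep CLIP/T5 on the CPU (safest for quantized T5)
    VaeOnCpu,    // keep the VAE on the CPU
    Both         // keep both on the CPU
};

static void apply_cpu_offload(SdCpuOffload mode, bool &keep_clip_on_cpu, bool &keep_vae_on_cpu) {
    keep_clip_on_cpu = (mode == SdCpuOffload::ClipOnCpu || mode == SdCpuOffload::Both);
    keep_vae_on_cpu  = (mode == SdCpuOffload::VaeOnCpu  || mode == SdCpuOffload::Both);
}

That would keep the UI to a single dropdown while still covering the combinations above.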

Marking the PR as 'ready' since I've been testing it for several days.

@wbruna wbruna marked this pull request as ready for review September 22, 2025 22:40
@wbruna wbruna force-pushed the kcpp_update_sdcpp_2509 branch from 8374423 to 0eb04d9 on September 24, 2025 14:36
@wbruna wbruna changed the title from "update stable-diffusion.cpp to master-301-fd693ac" to "update stable-diffusion.cpp to master-306-2abe945" on Sep 24, 2025
@wbruna
Author

wbruna commented Sep 24, 2025

Updated to master-306-2abe945, which includes the fixes for leejet/stable-diffusion.cpp#838 and leejet/stable-diffusion.cpp#837. Just a minor API change, and I don't think there is any major bug left, so it should be ready to go.

Oh, and I forgot about another nice load-time config parameter: offload_params_to_cpu. We need a bigger config screen 😅

@LostRuins
Owner

Cool, is this ready for review? Still kinda busy, but I'm hoping to get Qwen Image and WAN into the next release.

@wbruna
Author

wbruna commented Sep 25, 2025

It's ready for review. I still need to do a full round of testing to catch anything I've missed, but all the code should already be in place.

As I mentioned before, the diff here is hard to follow, especially for the first updates. But the branches koboldcpp_sd_base and koboldcpp_sd_changes are up to date, so:

The koboldcpp_sd_changes branch builds and runs, so it can be used for testing too.

@wbruna
Author

wbruna commented Sep 25, 2025

Tested: SD1.5, SDXL, LoRAs, SD3.5M, img2img, inpaint, Chroma, Flux Schnell, Photomaker, Kontext, TAESD (fixed a bug).

@LostRuins
Owner

Remind me again - have we always been running the T5 on CPU previously? And what backend issues are there with using it on GPU?

@LostRuins
Owner

Alright, did some basic testing and everything seems mostly fine. Will merge shortly.

@LostRuins LostRuins merged commit 42087c3 into LostRuins:concedo_experimental Sep 27, 2025
@LostRuins
Owner

Merged. I'll download and check out WAN sometime, but for now please report any issues found

@wbruna
Author

wbruna commented Sep 27, 2025

Remind me again - have we always been running the T5 on CPU previously? And what backend issues are there with using it on GPU?

Yes; that behavior was changed as part of the Wan changes (hard to see on the web because of the size of the diff):

leejet/stable-diffusion.cpp@cb1d975#diff-a8e875e8f7684a2f9e35212d88ee9574001952eb2c049385e73d5e24fd6e2a83L323


But it's not always needed: it depends on the T5 quant and the backend. For me, on Vulkan it works for legacy quants and crashes with k-quants (that may have changed in the latest versions); ROCm either crashes or makes all-black images. So we should probably add a loading-time control to turn on keep_clip_on_cpu...
