Skip to content

Conversation

@wbruna
Copy link

@wbruna wbruna commented Aug 17, 2025

q4_0 may degrade quality significantly, especially for smaller models like SD 1.5 and SDXL. q8_0 provides a middle-ground, giving half the memory savings of q4_0 but with less loading time and quality loss.

Depends on #1678 for the combobox auxiliary function.

@LostRuins
Copy link
Owner

Please unmark from draft when ready for review

@wbruna wbruna marked this pull request as ready for review August 20, 2025 11:51
@wbruna
Copy link
Author

wbruna commented Aug 20, 2025

Also tested other quants. q6_K could be nice to have: loading time seems more or less the same as q4_0, and it makes a lot of difference for SDXL. With the current code, it's just a matter of adding the types and saving values to the lists; but I'm not sure if it wouldn't clutter the interface.

@LostRuins
Copy link
Owner

It would definitely be kind of clutter. For power users who want a specific quant, it's far better for them to do the pre-quantization themselves and then load the GGUF. This option is more for those who want to keep safetensors and have a convenient option.

Additionally, k-quants present some challenges in that they have a block size requirement (128) that's larger than a classic quant (32) and can fail on some architectures.

So if you had to put 3 quants, I'd rather they be q4_0, q5_1 and q8_0. I think 3 quant types are the max we should have for this dynamic quant option.

@wbruna
Copy link
Author

wbruna commented Aug 20, 2025

It would definitely be kind of clutter. For power users who want a specific quant, it's far better for them to do the pre-quantization themselves and then load the GGUF. This option is more for those who want to keep safetensors and have a convenient option.

I agree; I do that myself, especially to be able to apply different quants to different parts of the model. It's just q8_0 that's really convenient to have.

Additionally, k-quants present some challenges in that they have a block size requirement (128) that's larger than a classic quant (32) and can fail on some architectures.

Ah, that's unfortunate.

So if you had to put 3 quants, I'd rather they be q4_0, q5_1 and q8_0. I think 3 quant types are the max we should have for this dynamic quant option.

I don't know if it's just my CPU showing its age, but the _1 conversions seem too slow to be used on-demand:

quant loading time
original 20s
q8_1 crashes
q8_0 94s
q6_K 860s
q5_1 2280s
q5_0 817s
q4_1 2017s
q4_0 826s

So maybe it would be better to leave it with just q8_0 and q4_0 for now, and see if other users have a need for an additional option.

@wbruna wbruna force-pushed the koboldcpp_quant_sd branch from 2a4fd75 to c0e4404 Compare August 20, 2025 16:15
@wbruna
Copy link
Author

wbruna commented Aug 20, 2025

Just squashed the help and cleanup changes, and rebased. Should be good to go, unless you prefer adding that extra quant.

@LostRuins
Copy link
Owner

Just q8_0 should be fine. I'll review later

@LostRuins
Copy link
Owner

Okay took a look through, and while I do want the q8_0 quant added, I think we should follow the same approach as the existing --quantkv flag for consistency and backwards compatibility.

For reference, take a look at the current current --quantkv flag.

So we do the same thing, 0=no quant, 1=q8, 2=q4, same as we already have in quantkv.
image
Also for quantkv I used a slider but I agree a dropdown is probably better for sdquant, to save space, since the SD panel is nearly full.

For --quantkv it mandates 1 parameter, but since --sdquant didn't enforce parameters in the past, we could make it optional and default to q4 if used without a parameter to prevent breaking old scripts.

Thus, how about this?

sdparsergrouplora.add_argument("--sdquant",  metavar=('[quantization level 0/1/2]'), help="If specified, loads the model quantized to save memory. 0=off, 1=q8, 2=q4", type=int, choices=[0,1,2], nargs="?", const=2, default=0)

Unspecified = no quant
--sdquant without params = q4
--sdquant [level] = selected level where 0=off,1=q8, 2=q4

Also, we need to handle correctly migrating past kcpps save files. This can be done inside convert_invalid_args, but I think I can handle that part so don't worry about it.

@wbruna
Copy link
Author

wbruna commented Aug 21, 2025

Yeah, the optional parameter seems sensible. I'll give it a try.

I also see a benefit in keeping the parameter different from a quantization name: being able to change its inner workings without changing its purpose - in the same way we don't care, say, how exactly a -3 level is different than -6 on gzip, we just know one is faster, the other saves more space. For instance, a 'light' compression level could mean 'all q8_0 except the VAE at f16' at first, and later be switched to 'all q8_0' under the hood, if we find out that works better.

But since the '1' in current .kcpps files means 'q4' right now, it may be challenging to switch it to mean 'q8' without some other change to differentiate new and old files. I only managed to keep its meaning in the new code by never writing that value back to the new files.

@LostRuins
Copy link
Owner

LostRuins commented Aug 21, 2025

no, True in kcpps means q4.
1 shouldn't exist. The convert_invalid_args should be able to tell the difference.

Didn't test this but it might work

def strict_equal(a, b):
    return type(a) is type(b) and a == b

print(strict_equal(1, True))   # False
print(strict_equal(1, 1))      # True
print(strict_equal(True, True))# True

dont really need a fully function but just an example

wbruna added 3 commits August 21, 2025 19:17
q4_0 may degrade quality significantly, especially for smaller
models like SD 1.5 and SDXL. q8_0 provides a middle-ground,
giving half the memory savings of q4_0 but loading faster and
with less quality loss.
@wbruna wbruna force-pushed the koboldcpp_quant_sd branch from c0e4404 to 9450531 Compare August 21, 2025 22:22
@wbruna
Copy link
Author

wbruna commented Aug 21, 2025

Done; please take a look.

It seemed a bit unhelpful to add a "0=off,1=q8, 2=q4" to the tooltip, so I abused the dropdown strings to add that information. But I'm sure there's a less hackish way to do that...

@LostRuins
Copy link
Owner

image

Alright seems good enough to merge, i'll probably resize that dropdown a bit as its really janky when scrolled horizontally now (see image).

I'll also edit this to be a little more defensive, so it doesnt break completely should there be some unknown value in future.

 sd_quant_types = {0: -1, 1: 8, 2: 2} # enum sd_type_t
    inputs.quant = sd_quant_types[args.sdquant]

@LostRuins LostRuins merged commit 2f8b0ec into LostRuins:concedo_experimental Aug 22, 2025
@LostRuins
Copy link
Owner

minor adjustments for sdquant: allow backend to do the translation for the type more defensively, adjust the UI dropdown for clarity.

80dabbb

@wbruna
Copy link
Author

wbruna commented Aug 22, 2025

Looks good to me. IMHO, it'd be a little less confusing if the values in the interface matched the values on the command line, but no strong feelings about it :-)

@wbruna wbruna deleted the koboldcpp_quant_sd branch August 22, 2025 16:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants