
Conversation

@zoq commented Oct 9, 2025

Rebased #22 on temp-latest.

@olyasir left a comment

It adds support for TQ2_0, which uses the range (-1, 0, 1, 2), rather than TQ1_0, which uses (-1, 0, 1).
So if we quantize BitNet using TQ2_0, the value 2 would never actually be used, meaning we'd be using about 25% more memory when storing in this format? Why not use TQ1_0? It seems to be better aligned with BitNet quantization.

@infinitalo

> So if we quantize BitNet using TQ2_0, the value 2 would never actually be used, meaning we'd be using about 25% more memory when storing in this format? Why not use TQ1_0? It seems to be better aligned with BitNet quantization.

@olyasir that's a good point you bring up regarding TQ2_0 vs TQ1_0. You're right that it uses more memory, and if you want us to implement TQ1_0 instead, we have enough time in the SLM project to do it. I just wanted to clarify the difference between the two types first:

Model size:

  • Both formats use blocks of 256 weights plus a 16-bit scale. TQ2_0 works out to 2.0625 bits per weight and TQ1_0 to 1.6875 bits per weight, so TQ2_0 takes ~22.2% more space than TQ1_0 (see the quick check after the table below).
  • Not all tensors in a model are stored as TQ2_0, so in practice the difference will be somewhat less than ~22.2%.
  • The original TQ1_0/TQ2_0 PR in llama.cpp has numbers for the impact of TQ1_0 vs TQ2_0 on ternary model sizes:
| Model | F16 | TQ1_0 | TQ2_0 |
| --- | --- | --- | --- |
| https://huggingface.co/1bitLLM/bitnet_b1_58-large | 1391.26 MiB | 176.65 MiB | 207.03 MiB |
| https://huggingface.co/SpectraSuite/TriLM_390M_Unpacked | 750.39 MiB | 128.04 MiB | 140.98 MiB |
| https://huggingface.co/SpectraSuite/TriLM_1.5B_Unpacked | 2892.09 MiB | 401.54 MiB | 460.04 MiB |
| https://huggingface.co/SpectraSuite/TriLM_2.4B_Unpacked | 4696.86 MiB | 603.59 MiB | 703.26 MiB |
| https://huggingface.co/SpectraSuite/TriLM_3.9B_Unpacked | 7616.43 MiB | 948.16 MiB | 1112.70 MiB |
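
For reference, here's a quick back-of-the-envelope check of the bits-per-weight figures above. This is just the arithmetic implied by "256 weights per block plus a 16-bit scale", not the exact block structs from llama.cpp:

```python
# Sanity check of the bits-per-weight numbers quoted above (arithmetic only,
# not the exact llama.cpp block structs).
BLOCK_WEIGHTS = 256   # weights per quantization block
SCALE_BITS = 16       # one fp16 scale per block

# TQ2_0: every weight takes 2 bits (4 trits per byte).
tq2_bpw = (BLOCK_WEIGHTS * 2 + SCALE_BITS) / BLOCK_WEIGHTS   # 2.0625

# TQ1_0: trits are packed (mostly) 5 per byte, which works out to
# 52 payload bytes per 256-weight block, plus the 2-byte scale.
tq1_bpw = (52 * 8 + SCALE_BITS) / BLOCK_WEIGHTS              # 1.6875

print(tq2_bpw, tq1_bpw)                # 2.0625 1.6875
print(f"{tq2_bpw / tq1_bpw - 1:.1%}")  # ~22.2% more space for TQ2_0
```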

Inference speed:

  • TQ1_0 packs 5 trits (ternary digits) per byte, as opposed to 4 trits per byte for TQ2_0. The trits are therefore not aligned to a power-of-two boundary in TQ1_0, which makes packing and unpacking more expensive (see the sketch after the table below).
  • The original TQ1_0/TQ2_0 PR in llama.cpp has numbers for the impact of TQ1_0 vs TQ2_0 on speed:
| CPU | F16 | Q8_0 | Q4_K | Q2_K | TQ1_0 | TQ2_0 |
| --- | --- | --- | --- | --- | --- | --- |
| Intel Core m3-8100Y (AVX2) | 30.60 GB/s | 67.03 GB/s | 64.17 GB/s | 81.73 GB/s | 70.31 GB/s | 141.83 GB/s |
| Arm Cortex A72 (NEON) | 3.84 GB/s | 9.51 GB/s | 9.26 GB/s | 9.79 GB/s | 11.81 GB/s | 15.78 GB/s |
| Arm Cortex A53 (NEON) | 4.30 GB/s | 5.87 GB/s | 5.76 GB/s | 5.84 GB/s | 8.97 GB/s | 10.29 GB/s |
| AWS t4g (NEON) | 8.69 GB/s | 22.35 GB/s | 25.34 GB/s | 22.84 GB/s | 33.34 GB/s | 44.80 GB/s |
| AWS t4g (DOTPROD) | 49.17 GB/s | 42.63 GB/s | 45.40 GB/s | 29.84 GB/s | 40.44 GB/s | 65.76 GB/s |

Note: These numbers are for CPU because the original PR doesn't implement support for TQ2_0 in any GPU backend.
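
To make the pack/unpack cost difference concrete, here's a minimal sketch of the two packing ideas. This is plain Python for illustration only, not the actual llama.cpp kernels, and the TQ1_0 side uses straightforward base-3 packing rather than llama.cpp's exact layout; the point is that 2-bit trits unpack with shifts and masks, while 5-trits-per-byte needs divisions/modulo by 3:

```python
# Illustrative sketch of the two packing schemes (NOT the real llama.cpp layouts).
# Trits {-1, 0, +1} are stored offset by +1 as {0, 1, 2}.

def pack_2bit(trits):
    """TQ2_0-style idea: 4 trits per byte, 2 bits each."""
    out = []
    for i in range(0, len(trits), 4):
        b = 0
        for j, t in enumerate(trits[i:i + 4]):
            b |= (t + 1) << (2 * j)
        out.append(b)
    return bytes(out)

def unpack_2bit(data, n):
    # Power-of-two alignment: unpacking is just shifts and masks.
    return [((b >> (2 * j)) & 0x3) - 1 for b in data for j in range(4)][:n]

def pack_base3(trits):
    """TQ1_0-style idea: 5 trits per byte via base 3 (3**5 = 243 values fit in a byte)."""
    out = []
    for i in range(0, len(trits), 5):
        b = 0
        for t in reversed(trits[i:i + 5]):
            b = b * 3 + (t + 1)
        out.append(b)
    return bytes(out)

def unpack_base3(data, n):
    # No power-of-two alignment: unpacking needs div/mod by 3, which is costlier.
    trits = []
    for b in data:
        for _ in range(5):
            trits.append(b % 3 - 1)
            b //= 3
    return trits[:n]

if __name__ == "__main__":
    w = [-1, 0, 1, 1, 0, -1, 1, 0, 0, 1]             # some ternary weights
    assert unpack_2bit(pack_2bit(w), len(w)) == w    # 3 bytes for 10 trits
    assert unpack_base3(pack_base3(w), len(w)) == w  # 2 bytes for 10 trits
```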

@olyasir To reiterate, we can implement TQ1_0 support and it shouldn't take too long; I just wanted to show the trade-offs first so you can decide whether you think it's worth it.

What do you think? Should we implement it?

@zoq zoq changed the base branch from temp-latest to temp-latest-finetuning October 16, 2025 20:06
@zoq zoq changed the base branch from temp-latest-finetuning to temp-latest October 16, 2025 20:25