
Conversation

@zoq commented Oct 9, 2025

Rebased #22 on temp-latest.

@olyasir left a comment

It adds support for TQ2_0, which uses the range (-1, 0, 1, 2), rather than TQ1_0, which uses (-1, 0, 1).
So if we quantize BitNet using TQ2_0, the value 2 would never actually be used, meaning we'd be using about 25% more memory when storing in this format? Why not use TQ1_0? It seems to be better aligned with BitNet quantization.

@infinitalo

> So if we quantize BitNet using TQ2_0, the value 2 would never actually be used, meaning we'd be using about 25% more memory when storing in this format? Why not use TQ1_0? It seems to be better aligned with BitNet quantization.

@olyasir that's a good point you bring up regarding TQ2_0 vs TQ1_0. You're right that it uses more memory, and if you want us to implement TQ1_0 instead, we have enough time in the SLM project to do it. I just wanted to clarify the difference between the two types first:

Model size:

  • Both formats use blocks of 256 weights plus a 16-bit scale. TQ2_0 works out to 2.0625 bits per weight and TQ1_0 to 1.6875 bits per weight, so TQ2_0 takes ~22.2% more space than TQ1_0 (see the quick check after the table below).
  • Not all tensors in a model are stored as TQ2_0, so in practice the difference will be somewhat less than ~22.2%.
  • The original TQ1_0/TQ2_0 PR in llama.cpp has numbers for the impact of TQ1_0 vs TQ2_0 on ternary model sizes:
| Model | F16 | TQ1_0 | TQ2_0 |
| --- | --- | --- | --- |
| https://huggingface.co/1bitLLM/bitnet_b1_58-large | 1391.26 MiB | 176.65 MiB | 207.03 MiB |
| https://huggingface.co/SpectraSuite/TriLM_390M_Unpacked | 750.39 MiB | 128.04 MiB | 140.98 MiB |
| https://huggingface.co/SpectraSuite/TriLM_1.5B_Unpacked | 2892.09 MiB | 401.54 MiB | 460.04 MiB |
| https://huggingface.co/SpectraSuite/TriLM_2.4B_Unpacked | 4696.86 MiB | 603.59 MiB | 703.26 MiB |
| https://huggingface.co/SpectraSuite/TriLM_3.9B_Unpacked | 7616.43 MiB | 948.16 MiB | 1112.70 MiB |
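
For reference, here's a quick back-of-the-envelope check of the bits-per-weight figures above. This is just the arithmetic implied by "256 weights per block plus a 16-bit scale", not the exact block structs from llama.cpp:

```python
# Sanity check of the bits-per-weight numbers quoted above (arithmetic only,
# not the exact llama.cpp block structs).
BLOCK_WEIGHTS = 256   # weights per quantization block
SCALE_BITS = 16       # one fp16 scale per block

# TQ2_0: every weight takes 2 bits (4 trits per byte).
tq2_bpw = (BLOCK_WEIGHTS * 2 + SCALE_BITS) / BLOCK_WEIGHTS   # 2.0625

# TQ1_0: trits are packed (mostly) 5 per byte, which works out to
# 52 payload bytes per 256-weight block, plus the 2-byte scale.
tq1_bpw = (52 * 8 + SCALE_BITS) / BLOCK_WEIGHTS              # 1.6875

print(tq2_bpw, tq1_bpw)                # 2.0625 1.6875
print(f"{tq2_bpw / tq1_bpw - 1:.1%}")  # ~22.2% more space for TQ2_0
```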

Inference speed:

  • TQ1_0 packs 5 trits (ternary digits) per byte, as opposed to 4 trits per byte for TQ2_0. The trits are therefore not aligned to a power-of-two boundary in TQ1_0, which makes packing and unpacking more expensive (see the sketch after the table below).
  • The original TQ1_0/TQ2_0 PR in llama.cpp has numbers for the impact of TQ1_0 vs TQ2_0 on speed:
| CPU | F16 | Q8_0 | Q4_K | Q2_K | TQ1_0 | TQ2_0 |
| --- | --- | --- | --- | --- | --- | --- |
| Intel Core m3-8100Y (AVX2) | 30.60 GB/s | 67.03 GB/s | 64.17 GB/s | 81.73 GB/s | 70.31 GB/s | 141.83 GB/s |
| Arm Cortex A72 (NEON) | 3.84 GB/s | 9.51 GB/s | 9.26 GB/s | 9.79 GB/s | 11.81 GB/s | 15.78 GB/s |
| Arm Cortex A53 (NEON) | 4.30 GB/s | 5.87 GB/s | 5.76 GB/s | 5.84 GB/s | 8.97 GB/s | 10.29 GB/s |
| AWS t4g (NEON) | 8.69 GB/s | 22.35 GB/s | 25.34 GB/s | 22.84 GB/s | 33.34 GB/s | 44.80 GB/s |
| AWS t4g (DOTPROD) | 49.17 GB/s | 42.63 GB/s | 45.40 GB/s | 29.84 GB/s | 40.44 GB/s | 65.76 GB/s |

Note: These numbers are for CPU because the original PR doesn't implement support for TQ2_0 in any GPU backend.
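
To make the pack/unpack cost difference concrete, here's a minimal sketch of the two packing ideas. This is plain Python for illustration only, not the actual llama.cpp kernels, and the TQ1_0 side uses straightforward base-3 packing rather than llama.cpp's exact layout; the point is that 2-bit trits unpack with shifts and masks, while 5-trits-per-byte needs divisions/modulo by 3:

```python
# Illustrative sketch of the two packing schemes (NOT the real llama.cpp layouts).
# Trits {-1, 0, +1} are stored offset by +1 as {0, 1, 2}.

def pack_2bit(trits):
    """TQ2_0-style idea: 4 trits per byte, 2 bits each."""
    out = []
    for i in range(0, len(trits), 4):
        b = 0
        for j, t in enumerate(trits[i:i + 4]):
            b |= (t + 1) << (2 * j)
        out.append(b)
    return bytes(out)

def unpack_2bit(data, n):
    # Power-of-two alignment: unpacking is just shifts and masks.
    return [((b >> (2 * j)) & 0x3) - 1 for b in data for j in range(4)][:n]

def pack_base3(trits):
    """TQ1_0-style idea: 5 trits per byte via base 3 (3**5 = 243 values fit in a byte)."""
    out = []
    for i in range(0, len(trits), 5):
        b = 0
        for t in reversed(trits[i:i + 5]):
            b = b * 3 + (t + 1)
        out.append(b)
    return bytes(out)

def unpack_base3(data, n):
    # No power-of-two alignment: unpacking needs div/mod by 3, which is costlier.
    trits = []
    for b in data:
        for _ in range(5):
            trits.append(b % 3 - 1)
            b //= 3
    return trits[:n]

if __name__ == "__main__":
    w = [-1, 0, 1, 1, 0, -1, 1, 0, 0, 1]             # some ternary weights
    assert unpack_2bit(pack_2bit(w), len(w)) == w    # 3 bytes for 10 trits
    assert unpack_base3(pack_base3(w), len(w)) == w  # 2 bytes for 10 trits
```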

@olyasir To reiterate, we can implement TQ1_0 support and it shouldn't take too long; I just wanted to show the trade-offs first so you can decide whether you think it's worth it.

What do you think? Should we implement it?

@zoq zoq changed the base branch from temp-latest to temp-latest-finetuning October 16, 2025 20:06
@zoq zoq changed the base branch from temp-latest-finetuning to temp-latest October 16, 2025 20:25