Skip to content

Conversation

@ikawrakow
Copy link
Owner

@ikawrakow ikawrakow commented Oct 24, 2024

  • Change IQ1_BN and IQ2_BN to have per row scales. In that way we can handle Bitnet models with and without separate tensor scales
  • Remove IQ1_TN and IQ2_TN. With the above change these are now redundant. IQ1_BN and IQ2_BN are also faster, so no reason to keep these around
  • Change build_bitnet() to use the standard llm_build_kv() function for the self attention portion. I was hoping this would also allow to use FA, but nope, the Bitnet models have a strange head size of 100 that is not supported by the FA implementations.

Everything works except - can you guess? - Metal. There is something wrong with the dot product kernels and I simply don't see what. I have to fix Metal before merging.

On CUDA (RTX-4080) we now get 368 t/s for TG-128 with the 3.3B Bitnet model (IQ2_BN). When I first added Bitnet support we were at ~320 t/s, so quite an improvement since then.

Update

I wasted quite some time trying to figure out why the Bitnet changes don't work on Metal. At the end it turned out that it is PR #98 that breaks the Metal back-end. So, this PR reverts #98.

@agray3 Do you have the ability to investigate why #98 breaks the Metal back-end?

Iwan Kawrakow added 12 commits October 24, 2024 12:48
Why? It is becoming burdensome to maintain the special Bitnet
conversion in convert_hf_to_gguf.py, so I thnk it is better
to make iq1_bn and iq2_bn just work with the mainline
conversion script (which does not generate scales).
Dequantize works, but there is still something wrong
with the dot products.
WIP
Absoolutely don't see what is wrong with the iq1_bn and iq2_bn
vector dot product kernels.
Now that iq1_bn and iq2_bn have per row scales, there is no
reason to also have iq1_tn and iq2_tn.
My main motivation was to enable FA. But FA does not work anyway
because head size is 100 for the Botnet ternary models
(and I had forgotten this little detail).
This reverts commit f2d315b.
As far as I can tell, the commit breaks Metal TG.
@ikawrakow ikawrakow merged commit 6b968f3 into main Oct 25, 2024
Nexesenex pushed a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Nov 11, 2024
* Adapting iq2_bn to work without separate scale tensors

Why? It is becoming burdensome to maintain the special Bitnet
conversion in convert_hf_to_gguf.py, so I thnk it is better
to make iq1_bn and iq2_bn just work with the mainline
conversion script (which does not generate scales).

* Adapting iq1_bn to work without separate scale tensors

* Adapting iq2_bn: CUDA dequantize

* Adapting iq2_bn: CUDA works

* Adapting iq1_bn: CUDA works

* Adapting iq1_bn, iq2_bn: NEON

* Adapting iq1_bn, iq2_bn: Metal

Dequantize works, but there is still something wrong
with the dot products.

* WIP

Absoolutely don't see what is wrong with the iq1_bn and iq2_bn
vector dot product kernels.

* Remove iq1_tn and iq2_tn - Part 1

Now that iq1_bn and iq2_bn have per row scales, there is no
reason to also have iq1_tn and iq2_tn.

* Remove iq1_tn and iq2_tn - Part 2

* Bitnet: use the standard llm_build_kv to build self attention

My main motivation was to enable FA. But FA does not work anyway
because head size is 100 for the Botnet ternary models
(and I had forgotten this little detail).

* Revert "Avoid rebuild of GGML graph for each token (ikawrakow#98)"

This reverts commit f2d315b.
As far as I can tell, the commit breaks Metal TG.

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant