CUDA: don't convert BF16 weights to FP32 #1174
Conversation
Did you ask I. Kawrakow for permission to upstream this code? I'm specifically asking because there was previously a conflict over attribution.
If so, I guess he changed his mind:
Even so, attribution is simple in git: just add another author.
It would be nice if the FP16 and BF16 code in ggml_cuda_op_mul_mat were deduplicated, but I won't block merging the PR if you don't do it. In that case, please add a corresponding // TODO comment, though.
I'm not sure I follow; deduplicated how?
You could write a template with a typename that is either half or nv_bfloat16, use it as the type for the memory pool, and conditionally set the parameters for cuBLAS.
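For illustration, a minimal sketch of that idea (not the actual ggml code; the helper name, signature, and GEMM parameters here are assumptions): one function templated on the 16-bit element type, which selects the matching cuBLAS data type at compile time so the FP16 and BF16 paths share a single code path.

```cpp
// Hypothetical sketch only, assuming the shapes and leading dimensions used here.
#include <cuda_fp16.h>
#include <cuda_bf16.h>
#include <cublas_v2.h>
#include <type_traits>

// T is either half or nv_bfloat16; in the real code the same parameter would
// presumably also type the memory-pool buffer that src1 is converted into.
template <typename T>
static void mul_mat_16bit_cublas(cublasHandle_t handle,
                                 const T * src0, const T * src1, float * dst,
                                 int m, int n, int k) {
    // Pick the cuBLAS data type that matches T at compile time; everything
    // else (the cublasGemmEx call, strides, compute type) stays identical.
    constexpr cudaDataType_t cu_type =
        std::is_same_v<T, half> ? CUDA_R_16F : CUDA_R_16BF;

    const float alpha = 1.0f;
    const float beta  = 0.0f;

    cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 src0, cu_type, k,
                 src1, cu_type, k,
                 &beta,
                 dst, CUDA_R_32F, m,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```

The point of the template is that only the data-type constant differs between the two branches, so the surrounding allocation and dispatch logic would no longer need to be written twice.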
I just noticed that the IK implementation dates back to September 2024. At that point in time the llama.cpp upstream repository had no CUDA BF16 support whatsoever. In January 2025 I added BF16 support in ggml-org/llama.cpp#11093. Did you confirm that this PR improves performance vs. the current llama.cpp master branch?
I did not benchmark it, but I can do that tonight.
Here are some numbers that speak for themselves (TG is unchanged):
JohannesGaessler left a comment
I think the PR would be good to merge as-is, unless there are more things you still want to add to it.
That's all for now; I will probably upstream some more in other PRs, though.
I changed the title of the PR and the commit message to better reflect what the changes ended up being.
Upstreamed from ikawrakow/ik_llama.cpp#40