
Conversation

@CISC (Contributor) commented Apr 2, 2025:

Upstreamed from ikawrakow/ik_llama.cpp#40

@JohannesGaessler (Collaborator) commented:

Did you ask I. Kawrakow for permission to upstream this code? I'm specifically asking because there previously was conflict over attribution.

@CISC (Contributor, Author) commented Apr 3, 2025:

> Did you ask I. Kawrakow for permission to upstream this code? I'm specifically asking because there previously was conflict over attribution.

If so I guess he changed his mind:
ikawrakow/ik_llama.cpp#256 (comment)

@Green-Sky (Contributor) commented:

Even so, attribution is simple in git: just add another author.

@JohannesGaessler (Collaborator) commented on the code:

It would be nice if the FP16 and BF16 code in ggml_cuda_op_mul_mat were deduplicated, but I won't block merging the PR if you don't. In that case, please add a corresponding // TODO comment though.

@CISC (Contributor, Author) replied:

I'm not sure I follow; deduplicated how?

@JohannesGaessler (Collaborator) replied:

You could write a template with a typename that is either half or nv_bfloat16, use it as the type for the memory pool, and conditionally set the parameters for cuBLAS.
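
For illustration, here is a minimal sketch of what such a template could look like. This is an assumption-laden sketch, not the actual ggml_cuda_op_mul_mat code: the helper names, signature, and GEMM shape below are hypothetical; only the cuBLAS calls and type constants are real API.

```cpp
#include <type_traits>
#include <cuda_fp16.h>
#include <cuda_bf16.h>
#include <cublas_v2.h>

// Map the storage type to the matching cuBLAS data type at compile time.
template <typename T>
static constexpr cudaDataType_t to_cublas_type() {
    if constexpr (std::is_same_v<T, half>) {
        return CUDA_R_16F;
    } else {
        static_assert(std::is_same_v<T, nv_bfloat16>, "T must be half or nv_bfloat16");
        return CUDA_R_16BF;
    }
}

// Hypothetical shared GEMM path: the same body serves both FP16 and BF16,
// only the cuBLAS type parameters differ.
template <typename T> // T = half or nv_bfloat16
static void mul_mat_cublas_typed(cublasHandle_t handle, cudaStream_t stream,
        const T * src0_d, const T * src1_d, float * dst_d,
        int m, int n, int k) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;
    cublasSetStream(handle, stream);
    // Weights (src0) are transposed; accumulation happens in FP32.
    cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N,
        m, n, k,
        &alpha, src0_d, to_cublas_type<T>(), k,
                src1_d, to_cublas_type<T>(), k,
        &beta,  dst_d,  CUDA_R_32F,          m,
        CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```

With something along these lines, the memory-pool buffer holding the converted src1 data could use the same T, and the FP16 and BF16 branches would collapse into two instantiations of a single function template.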

@JohannesGaessler (Collaborator) commented:

I just noticed that the IK implementation dates back to September of 2024. At that point in time the llama.cpp upstream repository had no CUDA BF16 support whatsoever. In January of 2025 I added BF16 support in ggml-org/llama.cpp#11093. Did you confirm that this PR improves performance vs. the current llama.cpp master branch?

@CISC (Contributor, Author) commented Apr 3, 2025:

> I just noticed that the IK implementation dates back to September of 2024. At that point in time the llama.cpp upstream repository had no CUDA BF16 support whatsoever. In January of 2025 I added BF16 support in ggml-org/llama.cpp#11093. Did you confirm that this PR improves performance vs. the current llama.cpp master branch?

I did not benchmark it, but I can do that tonight.

@CISC (Contributor, Author) commented Apr 3, 2025:

Here are some numbers that speak for themselves (TG is unchanged):

| Model | CPU | GPU | n_batch | test | t/s (master) | t/s (cuda-bf16-support) |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1B BF16 | Core i7-9700K | RTX 3090Ti | 128 | pp1024 | 8060.09 ± 11.45 | 20496.05 ± 21.41 |
| qwen2 1B BF16 | Core i7-9700K | RTX 3090Ti | 256 | pp1024 | 13309.36 ± 4.19 | 25874.25 ± 30.06 |
| qwen2 1B BF16 | Core i7-9700K | RTX 3090Ti | 512 | pp1024 | 18651.31 ± 9.07 | 28498.72 ± 74.78 |
| qwen2 1B BF16 | Core i7-9700K | RTX 3090Ti | 1024 | pp1024 | 18848.49 ± 12.21 | 28934.40 ± 34.94 |

@CISC requested a review from JohannesGaessler on April 4, 2025 at 14:20.
@JohannesGaessler (Collaborator) left a comment:

I think the PR would be good to merge as-is. Unless there are more things you still want to add to it.

@CISC (Contributor, Author) commented Apr 4, 2025:

> I think the PR would be good to merge as-is. Unless there are more things you still want to add to it.

That's all for now; I will probably upstream some more in other PRs, though.

@JohannesGaessler merged commit ab9ed73 into ggml-org:master on Apr 4, 2025.
3 checks passed
@CISC deleted the cuda-bf16-support branch on April 4, 2025 at 19:05.
@JohannesGaessler changed the title from "cuda : add bf16 support" to "CUDA: don't convert BF16 weights to FP32" on Apr 4, 2025.
@JohannesGaessler (Collaborator) commented:

I changed the title of the PR and the commit message to better reflect what the changes ended up being.
