CUDA: replace GGML_CUDA_F16 with CUDA arch checks #15433

JohannesGaessler · 2025-08-19T20:51:10Z

This PR removes GGML_CUDA_F16 and replaces it with runtime checks, which simplifies the code. The vector dot products use FP16 instructions where available. The dequantization uses FP32 unconditionally since it is going to be severely I/O bound anyway. I benchmarked the performance changes with GGML_CUDA_FORCE_CUBLAS and GGML_CUDA_F16 to get the maximum possible impact; even then, the difference is negligible on the hardware I tested.
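As a rough illustration of the idea (not the code from this PR), the selection could be sketched like this. The function names, the encoding of the compute capability as an integer, and the exact threshold are assumptions made for this sketch, and it only covers NVIDIA devices:

```cpp
#include <string>

// Hypothetical encoding for this sketch: compute capability as major*100 + minor*10,
// e.g. 860 for an RTX 3090 (8.6) and 890 for an RTX 4090 (8.9).
constexpr int CC_PASCAL = 600;

// Assumption: FP16 arithmetic is worth using from Pascal onward, except on
// compute capability 6.1, where FP16 throughput is very low.
static bool fp16_is_fast(int cc) {
    return cc >= CC_PASCAL && cc != 610;
}

// Vector dot products: FP16 if the device handles it well, FP32 otherwise.
static std::string vec_dot_type(int cc) {
    return fp16_is_fast(cc) ? "f16" : "f32";
}

// Dequantization: FP32 regardless of the device, since it is I/O bound.
static std::string dequantize_type(int /*cc*/) {
    return "f32";
}
```

With a check like this the decision depends on the device the code runs on instead of a build flag, so a single binary can serve GPUs with and without fast FP16.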

I also updated the description of GGML_CUDA_FORCE_CUBLAS to make it clear that it can cause numerical issues.

Performance changes

| GPU | Model | Test | t/s master | t/s PR | Speedup |
| --- | --- | --- | ---: | ---: | ---: |
| RTX 3090 | llama 8B IQ1_S - 1.5625 bpw | pp512 | 4516.69 | 4517.07 | 1.00 |
| RTX 3090 | llama 8B IQ2_S - 2.5 bpw | pp512 | 4225.26 | 4218.92 | 1.00 |
| RTX 3090 | llama 8B IQ2_XS - 2.3125 bpw | pp512 | 4404.14 | 4385.29 | 1.00 |
| RTX 3090 | llama 8B IQ2_XXS - 2.0625 bpw | pp512 | 4485.57 | 4445.61 | 0.99 |
| RTX 3090 | llama 8B IQ3_S - 3.4375 bpw | pp512 | 4273.64 | 4269.88 | 1.00 |
| RTX 3090 | llama 8B IQ3_S mix - 3.66 bpw | pp512 | 4267.31 | 4266.22 | 1.00 |
| RTX 3090 | llama 8B IQ3_XS - 3.3 bpw | pp512 | 4288.24 | 4284.21 | 1.00 |
| RTX 3090 | llama 8B IQ3_XXS - 3.0625 bpw | pp512 | 4296.56 | 4258.87 | 0.99 |
| RTX 3090 | llama 8B IQ4_NL - 4.5 bpw | pp512 | 4375.14 | 4368.41 | 1.00 |
| RTX 3090 | llama 8B IQ4_XS - 4.25 bpw | pp512 | 4396.06 | 4416.11 | 1.00 |
| RTX 3090 | llama 8B Q2_K_M | pp512 | 4460.51 | 4410.20 | 0.99 |
| RTX 3090 | llama 8B Q3_K_S | pp512 | 4310.35 | 4316.67 | 1.00 |
| RTX 3090 | llama 8B Q4_0 | pp512 | 4552.34 | 4538.60 | 1.00 |
| RTX 3090 | llama 8B Q4_1 | pp512 | 4527.40 | 4503.60 | 0.99 |
| RTX 3090 | llama 8B Q4_K_S | pp512 | 4491.89 | 4475.83 | 1.00 |
| RTX 3090 | llama 8B Q5_0 | pp512 | 4191.83 | 4157.96 | 0.99 |
| RTX 3090 | llama 8B Q5_1 | pp512 | 4171.84 | 4134.00 | 0.99 |
| RTX 3090 | llama 8B Q5_K_S | pp512 | 4451.02 | 4417.23 | 0.99 |
| RTX 3090 | llama 8B Q6_K | pp512 | 4394.49 | 4390.02 | 1.00 |
| RTX 3090 | llama 8B Q8_0 | pp512 | 4428.68 | 4425.86 | 1.00 |
| RTX 4090 | llama 8B IQ1_S - 1.5625 bpw | pp512 | 7947.82 | 7971.10 | 1.00 |
| RTX 4090 | llama 8B IQ2_S - 2.5 bpw | pp512 | 7162.13 | 7155.12 | 1.00 |
| RTX 4090 | llama 8B IQ2_XS - 2.3125 bpw | pp512 | 7879.92 | 7930.62 | 1.01 |
| RTX 4090 | llama 8B IQ2_XXS - 2.0625 bpw | pp512 | 7893.50 | 8016.31 | 1.02 |
| RTX 4090 | llama 8B IQ3_S - 3.4375 bpw | pp512 | 7139.30 | 7168.06 | 1.00 |
| RTX 4090 | llama 8B IQ3_S mix - 3.66 bpw | pp512 | 7079.53 | 7179.97 | 1.01 |
| RTX 4090 | llama 8B IQ3_XS - 3.3 bpw | pp512 | 7187.43 | 7175.11 | 1.00 |
| RTX 4090 | llama 8B IQ3_XXS - 3.0625 bpw | pp512 | 7176.19 | 7179.57 | 1.00 |
| RTX 4090 | llama 8B IQ4_NL - 4.5 bpw | pp512 | 7602.11 | 7626.61 | 1.00 |
| RTX 4090 | llama 8B IQ4_XS - 4.25 bpw | pp512 | 7750.87 | 7753.12 | 1.00 |
| RTX 4090 | llama 8B Q2_K_M | pp512 | 7832.69 | 7947.18 | 1.01 |
| RTX 4090 | llama 8B Q3_K_S | pp512 | 8048.60 | 8029.50 | 1.00 |
| RTX 4090 | llama 8B Q4_0 | pp512 | 8041.64 | 8046.61 | 1.00 |
| RTX 4090 | llama 8B Q4_1 | pp512 | 7941.27 | 7934.84 | 1.00 |
| RTX 4090 | llama 8B Q4_K_S | pp512 | 7845.53 | 7952.08 | 1.01 |
| RTX 4090 | llama 8B Q5_0 | pp512 | 7534.07 | 7554.32 | 1.00 |
| RTX 4090 | llama 8B Q5_1 | pp512 | 7468.80 | 7502.81 | 1.00 |
| RTX 4090 | llama 8B Q5_K_S | pp512 | 7861.12 | 7864.96 | 1.00 |
| RTX 4090 | llama 8B Q6_K | pp512 | 7556.17 | 7565.11 | 1.00 |
| RTX 4090 | llama 8B Q8_0 | pp512 | 7598.39 | 7606.20 | 1.00 |
| RX 6800 | llama 8B IQ1_S - 1.5625 bpw | pp512 | 534.34 | 536.70 | 1.00 |
| RX 6800 | llama 8B IQ2_S - 2.5 bpw | pp512 | 532.94 | 532.46 | 1.00 |
| RX 6800 | llama 8B IQ2_XS - 2.3125 bpw | pp512 | 534.46 | 534.69 | 1.00 |
| RX 6800 | llama 8B IQ2_XXS - 2.0625 bpw | pp512 | 535.33 | 536.33 | 1.00 |
| RX 6800 | llama 8B IQ3_S - 3.4375 bpw | pp512 | 533.06 | 533.16 | 1.00 |
| RX 6800 | llama 8B IQ3_S mix - 3.66 bpw | pp512 | 532.78 | 531.49 | 1.00 |
| RX 6800 | llama 8B IQ3_XS - 3.3 bpw | pp512 | 531.87 | 533.05 | 1.00 |
| RX 6800 | llama 8B IQ3_XXS - 3.0625 bpw | pp512 | 533.43 | 533.02 | 1.00 |
| RX 6800 | llama 8B IQ4_NL - 4.5 bpw | pp512 | 527.24 | 527.23 | 1.00 |
| RX 6800 | llama 8B IQ4_XS - 4.25 bpw | pp512 | 527.76 | 527.26 | 1.00 |
| RX 6800 | llama 8B Q2_K_M | pp512 | 524.23 | 524.16 | 1.00 |
| RX 6800 | llama 8B Q3_K_S | pp512 | 510.42 | 512.24 | 1.00 |
| RX 6800 | llama 8B Q4_0 | pp512 | 531.68 | 531.70 | 1.00 |
| RX 6800 | llama 8B Q4_1 | pp512 | 530.83 | 531.24 | 1.00 |
| RX 6800 | llama 8B Q4_K_S | pp512 | 528.70 | 530.09 | 1.00 |
| RX 6800 | llama 8B Q5_0 | pp512 | 506.11 | 505.82 | 1.00 |
| RX 6800 | llama 8B Q5_1 | pp512 | 509.87 | 511.09 | 1.00 |
| RX 6800 | llama 8B Q5_K_S | pp512 | 518.27 | 519.38 | 1.00 |
| RX 6800 | llama 8B Q6_K | pp512 | 531.59 | 531.37 | 1.00 |
| RX 6800 | llama 8B Q8_0 | pp512 | 531.39 | 527.88 | 0.99 |
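
For reading the table: the Speedup column appears to be the PR throughput divided by the master throughput, rounded to two decimals. That definition is inferred from the numbers rather than stated in the PR. A minimal sketch with hypothetical helper names:

```cpp
#include <cmath>

// Assumed definition: speedup = (t/s PR) / (t/s master).
// Values above 1 mean the PR is faster, values below 1 mean it is slower.
static double speedup(double tps_master, double tps_pr) {
    return tps_pr / tps_master;
}

// Round to two decimals, matching the precision shown in the table.
static double round2(double x) {
    return std::round(x * 100.0) / 100.0;
}
```

For example, the RTX 3090 IQ2_XXS row gives 4445.61 / 4485.57 ≈ 0.991, shown as 0.99, and the RTX 4090 IQ2_XXS row gives 8016.31 / 7893.50 ≈ 1.016, shown as 1.02. All rows stay within about 2% of 1.00.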

IMbackK

No performance changes on CDNA either, as expected. Looks good to me from static analysis.

IMbackK · 2025-08-20T11:31:38Z

docs/build.md

Perhaps mention that CDNA and RDNA4+ do high-precision accumulation in the cuBLAS path.
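
To illustrate why accumulation precision matters, here is a small host-side sketch. It assumes the numerical issues come from accumulating in FP16, which this thread does not spell out. FP16 is simulated by rounding a float to an 11-bit significand; exponent range and subnormals are ignored, and the helper names are hypothetical:

```cpp
#include <cmath>

// Round x to an 11-bit significand, the precision of FP16.
// Simplification: exponent range limits and subnormals are not modeled.
static float round_to_fp16_precision(float x) {
    if (x == 0.0f) {
        return x;
    }
    int exp = 0;
    const float m = std::frexp(x, &exp);                       // x = m * 2^exp, 0.5 <= |m| < 1
    const float kept = std::nearbyint(std::ldexp(m, 11));      // keep 11 significant bits
    return std::ldexp(kept, exp - 11);
}

// Sum n ones, rounding the accumulator to FP16 precision after every add.
static float sum_ones_low_precision(int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc = round_to_fp16_precision(acc + 1.0f);
    }
    return acc;
}

// Same sum with a plain FP32 accumulator.
static float sum_ones_fp32(int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc += 1.0f;
    }
    return acc;
}
```

Summing 4096 ones stalls at 2048 with the low-precision accumulator, because 2049 is not representable with 11 significant bits and rounds back to 2048, while the FP32 accumulator reaches 4096.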


JohannesGaessler requested a review from IMbackK August 19, 2025 20:51

github-actions bot added the documentation, Nvidia GPU, and ggml labels Aug 19, 2025

IMbackK approved these changes Aug 20, 2025


CUDA: replace GGML_CUDA_F16 with CUDA arch checks

6b2e61c

JohannesGaessler force-pushed the cuda-remove-f16 branch from 0234f3d to 6b2e61c Compare August 20, 2025 13:39

JohannesGaessler merged commit 7a6e91a into ggml-org:master Aug 20, 2025
47 checks passed

ggerganov mentioned this pull request Aug 21, 2025

Misc. bug: Long-prompt decode crash with MoE #15481

Closed

qnixsynapse pushed a commit to janhq/llama.cpp that referenced this pull request Aug 22, 2025

CUDA: replace GGML_CUDA_F16 with CUDA arch checks (ggml-org#15433)

2b76bf5

Minh141120 pushed a commit to janhq/llama.cpp that referenced this pull request Aug 22, 2025

CUDA: replace GGML_CUDA_F16 with CUDA arch checks (ggml-org#15433)

75acdb2

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 6, 2025

Revert "CUDA: replace GGML_CUDA_F16 with CUDA arch checks (ggml-org#1…

e6bfc3f

…5433)"
