Conversation

@JohannesGaessler
Collaborator

This PR removes GGML_CUDA_F16 and replaces it with runtime checks, simplifying the code. For the vector dot products, FP16 instructions are used if available. For the dequantization, FP32 is used unconditionally, since that path is severely I/O bound anyway. I benchmarked the performance changes with GGML_CUDA_FORCE_CUBLAS and GGML_CUDA_F16 enabled to capture the maximum possible impact; even then, the difference is negligible on the hardware I tested.

I also updated the description of GGML_CUDA_FORCE_CUBLAS to make it clear that it can cause numerical issues.

Performance changes

| GPU | Model | Test | t/s master | t/s PR | Speedup |
| --- | --- | --- | ---: | ---: | ---: |
| RTX 3090 | llama 8B IQ1_S - 1.5625 bpw | pp512 | 4516.69 | 4517.07 | 1.00 |
| RTX 3090 | llama 8B IQ2_S - 2.5 bpw | pp512 | 4225.26 | 4218.92 | 1.00 |
| RTX 3090 | llama 8B IQ2_XS - 2.3125 bpw | pp512 | 4404.14 | 4385.29 | 1.00 |
| RTX 3090 | llama 8B IQ2_XXS - 2.0625 bpw | pp512 | 4485.57 | 4445.61 | 0.99 |
| RTX 3090 | llama 8B IQ3_S - 3.4375 bpw | pp512 | 4273.64 | 4269.88 | 1.00 |
| RTX 3090 | llama 8B IQ3_S mix - 3.66 bpw | pp512 | 4267.31 | 4266.22 | 1.00 |
| RTX 3090 | llama 8B IQ3_XS - 3.3 bpw | pp512 | 4288.24 | 4284.21 | 1.00 |
| RTX 3090 | llama 8B IQ3_XXS - 3.0625 bpw | pp512 | 4296.56 | 4258.87 | 0.99 |
| RTX 3090 | llama 8B IQ4_NL - 4.5 bpw | pp512 | 4375.14 | 4368.41 | 1.00 |
| RTX 3090 | llama 8B IQ4_XS - 4.25 bpw | pp512 | 4396.06 | 4416.11 | 1.00 |
| RTX 3090 | llama 8B Q2_K_M | pp512 | 4460.51 | 4410.20 | 0.99 |
| RTX 3090 | llama 8B Q3_K_S | pp512 | 4310.35 | 4316.67 | 1.00 |
| RTX 3090 | llama 8B Q4_0 | pp512 | 4552.34 | 4538.60 | 1.00 |
| RTX 3090 | llama 8B Q4_1 | pp512 | 4527.40 | 4503.60 | 0.99 |
| RTX 3090 | llama 8B Q4_K_S | pp512 | 4491.89 | 4475.83 | 1.00 |
| RTX 3090 | llama 8B Q5_0 | pp512 | 4191.83 | 4157.96 | 0.99 |
| RTX 3090 | llama 8B Q5_1 | pp512 | 4171.84 | 4134.00 | 0.99 |
| RTX 3090 | llama 8B Q5_K_S | pp512 | 4451.02 | 4417.23 | 0.99 |
| RTX 3090 | llama 8B Q6_K | pp512 | 4394.49 | 4390.02 | 1.00 |
| RTX 3090 | llama 8B Q8_0 | pp512 | 4428.68 | 4425.86 | 1.00 |
| RTX 4090 | llama 8B IQ1_S - 1.5625 bpw | pp512 | 7947.82 | 7971.10 | 1.00 |
| RTX 4090 | llama 8B IQ2_S - 2.5 bpw | pp512 | 7162.13 | 7155.12 | 1.00 |
| RTX 4090 | llama 8B IQ2_XS - 2.3125 bpw | pp512 | 7879.92 | 7930.62 | 1.01 |
| RTX 4090 | llama 8B IQ2_XXS - 2.0625 bpw | pp512 | 7893.50 | 8016.31 | 1.02 |
| RTX 4090 | llama 8B IQ3_S - 3.4375 bpw | pp512 | 7139.30 | 7168.06 | 1.00 |
| RTX 4090 | llama 8B IQ3_S mix - 3.66 bpw | pp512 | 7079.53 | 7179.97 | 1.01 |
| RTX 4090 | llama 8B IQ3_XS - 3.3 bpw | pp512 | 7187.43 | 7175.11 | 1.00 |
| RTX 4090 | llama 8B IQ3_XXS - 3.0625 bpw | pp512 | 7176.19 | 7179.57 | 1.00 |
| RTX 4090 | llama 8B IQ4_NL - 4.5 bpw | pp512 | 7602.11 | 7626.61 | 1.00 |
| RTX 4090 | llama 8B IQ4_XS - 4.25 bpw | pp512 | 7750.87 | 7753.12 | 1.00 |
| RTX 4090 | llama 8B Q2_K_M | pp512 | 7832.69 | 7947.18 | 1.01 |
| RTX 4090 | llama 8B Q3_K_S | pp512 | 8048.60 | 8029.50 | 1.00 |
| RTX 4090 | llama 8B Q4_0 | pp512 | 8041.64 | 8046.61 | 1.00 |
| RTX 4090 | llama 8B Q4_1 | pp512 | 7941.27 | 7934.84 | 1.00 |
| RTX 4090 | llama 8B Q4_K_S | pp512 | 7845.53 | 7952.08 | 1.01 |
| RTX 4090 | llama 8B Q5_0 | pp512 | 7534.07 | 7554.32 | 1.00 |
| RTX 4090 | llama 8B Q5_1 | pp512 | 7468.80 | 7502.81 | 1.00 |
| RTX 4090 | llama 8B Q5_K_S | pp512 | 7861.12 | 7864.96 | 1.00 |
| RTX 4090 | llama 8B Q6_K | pp512 | 7556.17 | 7565.11 | 1.00 |
| RTX 4090 | llama 8B Q8_0 | pp512 | 7598.39 | 7606.20 | 1.00 |
| RX 6800 | llama 8B IQ1_S - 1.5625 bpw | pp512 | 534.34 | 536.70 | 1.00 |
| RX 6800 | llama 8B IQ2_S - 2.5 bpw | pp512 | 532.94 | 532.46 | 1.00 |
| RX 6800 | llama 8B IQ2_XS - 2.3125 bpw | pp512 | 534.46 | 534.69 | 1.00 |
| RX 6800 | llama 8B IQ2_XXS - 2.0625 bpw | pp512 | 535.33 | 536.33 | 1.00 |
| RX 6800 | llama 8B IQ3_S - 3.4375 bpw | pp512 | 533.06 | 533.16 | 1.00 |
| RX 6800 | llama 8B IQ3_S mix - 3.66 bpw | pp512 | 532.78 | 531.49 | 1.00 |
| RX 6800 | llama 8B IQ3_XS - 3.3 bpw | pp512 | 531.87 | 533.05 | 1.00 |
| RX 6800 | llama 8B IQ3_XXS - 3.0625 bpw | pp512 | 533.43 | 533.02 | 1.00 |
| RX 6800 | llama 8B IQ4_NL - 4.5 bpw | pp512 | 527.24 | 527.23 | 1.00 |
| RX 6800 | llama 8B IQ4_XS - 4.25 bpw | pp512 | 527.76 | 527.26 | 1.00 |
| RX 6800 | llama 8B Q2_K_M | pp512 | 524.23 | 524.16 | 1.00 |
| RX 6800 | llama 8B Q3_K_S | pp512 | 510.42 | 512.24 | 1.00 |
| RX 6800 | llama 8B Q4_0 | pp512 | 531.68 | 531.70 | 1.00 |
| RX 6800 | llama 8B Q4_1 | pp512 | 530.83 | 531.24 | 1.00 |
| RX 6800 | llama 8B Q4_K_S | pp512 | 528.70 | 530.09 | 1.00 |
| RX 6800 | llama 8B Q5_0 | pp512 | 506.11 | 505.82 | 1.00 |
| RX 6800 | llama 8B Q5_1 | pp512 | 509.87 | 511.09 | 1.00 |
| RX 6800 | llama 8B Q5_K_S | pp512 | 518.27 | 519.38 | 1.00 |
| RX 6800 | llama 8B Q6_K | pp512 | 531.59 | 531.37 | 1.00 |
| RX 6800 | llama 8B Q8_0 | pp512 | 531.39 | 527.88 | 0.99 |

@github-actions bot added the **documentation**, **Nvidia GPU**, and **ggml** labels on Aug 19, 2025
@IMbackK (Collaborator) left a comment

No performance changes on CDNA either, as expected. Looks good to me from static analysis.

Review comment on docs/build.md (outdated):
Perhaps mention that CDNA and RDNA4+ do high-precision accumulation in the cuBLAS path.

@JohannesGaessler merged commit 7a6e91a into ggml-org:master on Aug 20, 2025 (47 checks passed)
qnixsynapse pushed a commit to janhq/llama.cpp that referenced this pull request Aug 22, 2025
Minh141120 pushed a commit to janhq/llama.cpp that referenced this pull request Aug 22, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 6, 2025