vulkan: Add ACC_TYPE_VEC2 implementation #16203

Conversation
Here are performance results from my devices. It's very good for Nvidia Ampere (which won't be using this code in practice, due to coopmat), but neutral or negative on AMD. Not sure why this is.

- RTX 3090 without coopmat or integer dot
- AMD Radeon Pro VII without integer dot
- Intel A770 without integer dot
So it should be:

```cpp
// Old
const uint sums_idx = (cr * WNITER + wsic) * (WMITER * TN) + cc * TN + wsir;
// New
const uint sums_idx = (cr * WNITER + wsic) * (WMITER * TN) + cc * WMITER + wsir;
```

But for some reason these two give completely different performance.
No, that doesn't fix the issue for me. Can you reproduce it on your end? All the
For me everything is passing, including
Now the tests are passing, but I'm seeing a major regression with mul_mat_id on Nvidia. Do you also see that? Intel is also not looking good, but I'm not sure why it affects only the FA run.

- RTX 3090 without coopmat or integer dot
- AMD Radeon Pro VII without integer dot
- Intel A770 without integer dot
I'm still waiting to hear whether you also see the Nvidia regression.
Performance Comparison (Without `coopmat` and `coopmat2`)

| Kernel | Before (us/run) | After (us/run) | Δ % |
| --- | --- | --- | --- |
| MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 5792.89 | 5763.41 | +0.51% |
| MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 5381.34 | 5394.30 | -0.24% |
| MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 5257.40 | 5242.01 | +0.29% |
| MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 2723.99 | 2730.32 | -0.23% |
| MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 2753.02 | 2778.81 | -0.94% |
| MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 2855.15 | 2856.38 | -0.04% |
| MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 2843.80 | 2844.25 | -0.02% |
| MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 2867.90 | 2863.70 | +0.15% |
| MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 4949.09 | 4326.33 | +12.58% |
| MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 4870.27 | 4313.31 | +11.44% |
| MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 5247.09 | 4759.04 | +9.30% |
| MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 5375.22 | 4748.48 | +11.66% |
| MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 5743.62 | 5168.56 | +10.01% |
| MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 5503.70 | 4901.42 | +10.94% |
| MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 4988.72 | 4357.20 | +12.66% |
| MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 5158.45 | 4028.82 | +21.90% |
| MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 4958.32 | 4393.31 | +11.40% |
| MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 5027.39 | 4315.70 | +14.16% |
| MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 6110.26 | 5009.41 | +18.02% |
| MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 5718.64 | 5030.39 | +12.04% |
| MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 4870.25 | 4238.55 | +12.97% |
| MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 4950.08 | 4366.56 | +11.79% |
| MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 5310.09 | 4723.28 | +11.05% |
Performance Comparison (Without `coopmat` and `coopmat2`) AMD Radeon RX 7800 XT

| Kernel | Before (us/run) | After (us/run) | Δ % |
| --- | --- | --- | --- |
| MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 8877.93 | 8830.00 | +0.54% |
| MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 6353.04 | 6526.51 | -2.73% |
| MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 7147.87 | 7219.50 | -1.00% |
| MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 3327.31 | 3350.99 | -0.71% |
| MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 3540.99 | 3556.59 | -0.44% |
| MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 3452.07 | 3483.37 | -0.91% |
| MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 3738.56 | 3779.61 | -1.10% |
| MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 3771.78 | 3799.56 | -0.74% |
| MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 5477.72 | 5822.40 | -6.29% |
| MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 5977.67 | 6078.14 | -1.68% |
| MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 7511.03 | 7829.84 | -4.24% |
| MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 8022.06 | 8015.15 | +0.09% |
| MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 7063.25 | 7409.50 | -4.90% |
| MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 7477.86 | 7793.75 | -4.22% |
| MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 5475.90 | 5892.24 | -7.60% |
| MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 5515.12 | 5853.49 | -6.14% |
| MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 5620.81 | 5733.57 | -2.01% |
| MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 5732.37 | 6135.72 | -7.04% |
| MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 5404.80 | 5637.84 | -4.31% |
| MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 5527.86 | 5741.80 | -3.87% |
| MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 5360.15 | 6463.96 | -20.59% |
| MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 5796.81 | 6032.64 | -4.07% |
| MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 8312.47 | 7739.69 | +6.89% |
Force-pushed the branch from 76ad361 to 5d009b8. Signed-off-by: Stefan Savic <[email protected]>
Yeah, that fixed the regression, nice job! I'm wondering why it doesn't make a positive difference for AMD or Intel. Intel especially doesn't like it that much, but I don't think that's a big enough problem to stop this PR, especially since Intel doesn't work that well anyways without integer dot. Thank you for the contribution and patience!

- RTX 3090 without coopmat or integer dot
- AMD Radeon Pro VII without integer dot
- Intel A770 without integer dot
* cuda : remove legacy copy-op pointer indirection code (ggml-org#16485)
* CUDA: add fp kernel for larger batch size MoE (ggml-org#16512)
* CUDA: use fastdiv + ggml_cuda_mad for mmvf (ggml-org#16557)
* CUDA: enable FA for FP32 KV cache (ggml-org#16546)
* vulkan: Improve build time for MSVC (ggml-org#16545) — enable CMP0147 so custom build steps (invoking vulkan-shader-gen) run in parallel, and /MP so source files compile in parallel
* vulkan: Support FA with K/V in F32 (ggml-org#16543)
* CUDA + openCL: fix bug in accessing rms_norm->src while doing fusion (ggml-org#16577)
* vulkan: Add ACC_TYPE_VEC2 implementation (ggml-org#16203)
* metal : avoid using Metal's gpuAddress property (ggml-org#16576)

Signed-off-by: Stefan Savic <[email protected]>
Co-authored-by: Anav Prasad <[email protected]>
Co-authored-by: Aman Gupta <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
Co-authored-by: Jeff Bolz <[email protected]>
Co-authored-by: SavicStefan <[email protected]>
Co-authored-by: Stefan Savic <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
This PR adds the implementation for `ACC_TYPE_VEC2`. In non-coopmat shaders, using `ACC_TYPE_VEC2` improves caching behavior, as accessing 32-bit values is generally more efficient than accessing 16-bit values.

- Performance Comparison (Without `coopmat` and `coopmat2`), NVIDIA GeForce RTX 4060 Ti
- Performance before (Without `coopmat` and `coopmat2`), NVIDIA GeForce RTX 4060 Ti
- Performance after (Without `coopmat` and `coopmat2`), NVIDIA GeForce RTX 4060 Ti