
Conversation

@mnehete32 (Contributor)

Added Tensor Core support to the code from #16088, with modifications tuned to give the best results on tensor cores. The results below are from an RTX 2070 GPU.

FP16 Tensor Core perf

```
CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):      55 runs - 18401.09 us/run - 137.42 GFLOP/run -   7.47 TFLOPS
CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):        28424 runs -    35.24 us/run - 133.69 MFLOP/run -   3.79 TFLOPS
CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):        19899 runs -    50.62 us/run - 135.78 MFLOP/run -   2.68 TFLOPS
CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):         122880 runs -     8.58 us/run - 642.82 kFLOP/run -  74.95 GFLOPS
CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):         38288 runs -    28.19 us/run -  20.90 MFLOP/run - 741.40 GFLOPS
CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):         57344 runs -    18.43 us/run -   2.78 MFLOP/run - 151.07 GFLOPS
CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):          8978 runs -   134.73 us/run -  22.28 MFLOP/run - 165.35 GFLOPS
CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):        28611 runs -    34.96 us/run - 115.40 MFLOP/run -   3.30 TFLOPS
CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):         4251 runs -   235.69 us/run - 923.24 MFLOP/run -   3.92 TFLOPS
CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):      3465 runs -   293.17 us/run -   1.85 GFLOP/run -   6.31 TFLOPS
```
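
For readers unfamiliar with the tensor core path: a conv2d kernel like this is ultimately lowered to warp-level FP16 matrix-multiply-accumulate tiles. The sketch below is not this PR's kernel (the PR builds on ggml's `mma.cuh` primitives, per the includes in the diff); it is a minimal `nvcuda::wmma` illustration of the underlying 16x16x16 MMA operation, assuming sm_70+ and a launch with at least one full warp:

```cpp
// Minimal illustration of the FP16 tensor-core mechanism, NOT the PR's
// kernel. One warp computes a single 16x16 output tile C = A * B.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_f16_tile(const half * A, const half * B, float * C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float>              c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);            // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // the tensor core op
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

In a conv2d-as-GEMM kernel, A would be a tile of the (implicitly im2col'ed) input and B a tile of the kernel weights, with the accumulator written back to the output feature map.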

@etasnadi @Green-Sky @JohannesGaessler

On Oct 28, 2025, the github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels.
@mnehete32 (Contributor, Author)

Keeping this as a draft until the implicit or Vulkan changes are merged. I’ll integrate the tensor core kernel with that code.

@mnehete32 (Contributor, Author)

Hey @Green-Sky, could we also get an sd.cpp perf analysis for this draft?

I’ve exposed the tensor core kernel through conv2d_direct.

@Green-Sky (Collaborator) commented on Nov 1, 2025

Ran a bench on this PR and added it here: #15805 (comment).

Looks like this is now the fastest version!

VAE decoding is also slightly faster than im2col+matmul (maybe, might be within error).


sd1 fp16 512x768

| method | sampling time | sampling memory | decoding time | decoding memory |
| --- | --- | --- | --- | --- |
| CUDA imcol+mul | 0.21s | 189.38 MB | 0.75s | 2496.09 MB |
| CUDA direct (master) | 2.96s | 132.71 MB | 16.79s | 1056.09 MB |
| CUDA direct (bssrdf pr c1f67c1) | 0.37s | 132.71 MB | 1.00s | 1056.09 MB |
| CUDA direct (mnehete32_tensor pr e3f94c6) | 0.30s | 132.71 MB | 0.74s | 1056.09 MB |

sd1 fp16 768x1024 (like the old table)

| method | sampling time | sampling memory | decoding time | decoding memory |
| --- | --- | --- | --- | --- |
| CUDA imcol+mul | 0.58s | 373.64 MB | 1.55s | 4992.19 MB |
| CUDA direct (master) | 6.29s | 260.30 MB | 34.94s | 2112.19 MB |
| CUDA direct (bssrdf pr c1f67c1) | 0.85s | 260.30 MB | 2.03s | 2112.19 MB |
| CUDA direct (mnehete32_tensor pr e3f94c6) | 0.73s | 260.30 MB | 1.52s | 2112.19 MB |

sdxl fp16/q8_0 1024x1280

Diffusion model is q8_0 and VAE is fp16.

| method | sampling time | sampling memory | decoding time | decoding memory |
| --- | --- | --- | --- | --- |
| CUDA imcol+mul | 0.79s | 614.83 MB | OOM | 9600.31 MiB (alloc error) |
| CUDA direct (master) | 11.05s | 288.43 MB | 60.57s | 4800.31 MB |
| CUDA direct (bssrdf pr c1f67c1) | 1.15s | 288.43 MB | 3.72s | 4800.31 MB |
| CUDA direct (mnehete32_tensor pr e3f94c6) | 1.00s | 288.43 MB | 2.92s | 4800.31 MB |

```cpp
__constant__ __device__ Params P;

// see init_fastdiv_values in ggml-vulkan.cpp
__inline__ __device__ uint fastdiv(uint n, uint mp, uint L) {
```
Collaborator review comment:
Already exists in common.

```cpp
static __device__ __forceinline__ uint32_t fastdiv(uint32_t n, const uint3 fastdiv_values) {
```
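
For reference, the scheme behind `fastdiv` (see `init_fastdiv_values` in ggml-vulkan.cpp) turns division by a runtime-constant divisor d into one high multiply and a shift. Below is a rough sketch of the idea; the exact signatures and constant packing in ggml's common header may differ:

```cpp
// Sketch of division by a precomputed "magic" constant (Granlund/Montgomery
// style). Not the exact ggml code.
#include <cstdint>

// Host side: for divisor d >= 1, choose L = ceil(log2(d)) and the magic
// multiplier mp = floor(2^32 * (2^L - d) / d) + 1.
static void init_fastdiv_values(uint32_t d, uint32_t & mp, uint32_t & L) {
    L = 0;
    while (L < 32 && (uint32_t{1} << L) < d) {
        L++;
    }
    mp = (uint32_t)(((uint64_t{1} << 32) * ((uint64_t{1} << L) - d)) / d + 1);
}

// Device side: n / d == (umulhi(mp, n) + n) >> L for the mp/L above
// (assumes n is small enough that the 32-bit add cannot overflow, which
// holds for the tensor-dimension ranges this is used for).
static __device__ __forceinline__ uint32_t fastdiv(uint32_t n, uint32_t mp, uint32_t L) {
    return (__umulhi(mp, n) + n) >> L;
}
```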


```cpp
#define CEIL_DIV(M, N) (((M) + (N) - 1) / (N))

static uint32_t ceil_div(uint32_t M, uint32_t N);
```
Collaborator review comment:
```cpp
constexpr size_t ceil_div(const size_t m, const size_t n) {
```

#include "convert.cuh"
#include "mma.cuh"

#define CEIL_DIV(M, N) (((M) + (N) - 1) / (N))
Collaborator review comment:
Remove the macro and use a function instead.
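
For concreteness, a minimal sketch of the suggested replacement, mirroring the `constexpr` helper referenced above:

```cpp
#include <cstddef>

// Typed, constexpr replacement for the CEIL_DIV macro: rounds the
// quotient up, e.g. ceil_div(10, 4) == 3, matching CEIL_DIV(10, 4).
constexpr size_t ceil_div(const size_t m, const size_t n) {
    return (m + n - 1) / n;
}
```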
