CUDA: Conv2d tensor core #16828
base: master
Conversation
* removed flash-attention definition
* …conv2d_tensor_core
* CUDA: uint to int and added assertion
* Extra: reduces bank conflicts
* …conv2d_tensor_core
Keeping this as a draft until the implicit or Vulkan changes are merged. I'll integrate the tensor core kernel with that code.
Hey @Green-Sky, could we also get an sd.cpp perf analysis for this draft? I've exposed the tensor core kernel through conv2d_direct.
Ran a bench on this PR and added it here: #15805 (comment). Looks like this is now the fastest version! VAE decoding is also slightly faster than im2col+matmul (maybe, might be within error). Benchmarked configurations:

* sd1 fp16 512x768
* sd1 fp16 768x1024 (like the old table)
* sdxl fp16/q8_0 1024x1280 (diffusion model is q8_0, VAE is fp16)
Review comment on this diff hunk:

__constant__ __device__ Params P;

// see init_fastdiv_values in ggml-vulkan.cpp
__inline__ __device__ uint fastdiv(uint n, uint mp, uint L) {
Already exists in common.
llama.cpp/ggml/src/ggml-cuda/common.cuh, line 653 in 1ae7488:

static __device__ __forceinline__ uint32_t fastdiv(uint32_t n, const uint3 fastdiv_values) {
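For context, here is a minimal sketch of the multiply-and-shift division trick that both helpers implement. The constants mp and L are precomputed on the host, as init_fastdiv_values does in ggml-vulkan.cpp; the helper and kernel names below are illustrative only, and the actual common.cuh helper packs the precomputed values into a uint3 instead of passing them separately.

```cuda
#include <cstdint>
#include <cstdio>

// Host side: precompute "magic" constants mp and L for a fixed divisor d so that
// n / d == (umulhi(n, mp) + n) >> L for 32-bit n (Granlund/Montgomery style,
// following init_fastdiv_values in ggml-vulkan.cpp).
static void precompute_fastdiv(uint32_t d, uint32_t & mp, uint32_t & L) {
    L = 0;                                           // L = ceil(log2(d))
    while (L < 32 && (uint32_t{1} << L) < d) {
        L++;
    }
    mp = (uint32_t) (((uint64_t{1} << 32) * ((uint64_t{1} << L) - d)) / d + 1);
}

// Device side: one multiply-high, one add, one shift instead of an integer division.
// Assumes n stays well below 2^31 (true for typical tensor index ranges),
// so the intermediate sum does not overflow 32 bits.
__device__ __forceinline__ uint32_t fastdiv(uint32_t n, uint32_t mp, uint32_t L) {
    return (__umulhi(n, mp) + n) >> L;
}

// Check kernel: compare fastdiv against hardware division for all n < n_max.
__global__ void check_fastdiv(uint32_t mp, uint32_t L, uint32_t d, uint32_t n_max, int * n_bad) {
    for (uint32_t n = blockIdx.x*blockDim.x + threadIdx.x; n < n_max; n += gridDim.x*blockDim.x) {
        if (fastdiv(n, mp, L) != n / d) {
            atomicAdd(n_bad, 1);
        }
    }
}

int main() {
    const uint32_t d = 7;                            // example divisor, e.g. a spatial stride
    uint32_t mp, L;
    precompute_fastdiv(d, mp, L);

    int * n_bad;
    cudaMallocManaged(&n_bad, sizeof(int));
    *n_bad = 0;
    check_fastdiv<<<256, 256>>>(mp, L, d, 1u << 24, n_bad);
    cudaDeviceSynchronize();
    printf("mismatches: %d\n", *n_bad);              // expect 0
    cudaFree(n_bad);
    return 0;
}
```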
Review comment on this diff hunk:

#define CEIL_DIV(M, N) (((M) + (N) - 1) / (N))

static uint32_t ceil_div(uint32_t M, uint32_t N);
A ceil_div helper already exists:

llama.cpp/ggml/src/ggml-sycl/common.hpp, line 532 in 1ae7488:

constexpr size_t ceil_div(const size_t m, const size_t n) {
| #include "convert.cuh" | ||
| #include "mma.cuh" | ||
|
|
||
| #define CEIL_DIV(M, N) (((M) + (N) - 1) / (N)) |
Remove the macro and use a function instead.
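For illustration, a possible function replacement for the macro, modeled on the ceil_div helper quoted above; the exact name and qualifiers chosen in the PR may differ.

```cuda
#include <cstdint>

// Type-safe replacement for the CEIL_DIV macro: evaluates its arguments exactly once
// and is usable from both host and device code.
__host__ __device__ constexpr uint32_t ceil_div(uint32_t m, uint32_t n) {
    return (m + n - 1) / n;
}

// Example use when sizing a launch grid:
// const uint32_t blocks = ceil_div(n_elements, block_size);
```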
Added tensor core support to the code from #16088 and modified it so that it gives the best results on tensor cores. The results below are from an RTX 2070 GPU.
FP16 Tensor Core perf
@etasnadi @Green-Sky @JohannesGaessler
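For readers unfamiliar with the tensor core path: the sketch below only illustrates the kind of warp-level MMA primitive such a kernel is built on, using the standard nvcuda::wmma API (16x16x16 half fragments accumulating into float, supported on Volta/Turing GPUs such as the RTX 2070 above). It is not the PR's actual kernel, which uses ggml's own mma.cuh wrapper and additionally handles the conv2d index mapping, shared-memory staging, and bank conflicts.

```cuda
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// Single-warp WMMA tile: C(16x16) = A(16x16) * B(16x16), fp16 inputs, fp32 accumulator.
// Launch as wmma_tile_16x16x16<<<1, 32>>>(A, B, C) with A and B in half and C in float,
// each holding one contiguous 16x16 tile.
__global__ void wmma_tile_16x16x16(const half * A, const half * B, float * C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float>              c_frag;

    wmma::fill_fragment(c_frag, 0.0f);

    // Load one 16x16 tile of A and B (leading dimension 16) and issue the tensor core MMA.
    wmma::load_matrix_sync(a_frag, A, 16);
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    // Write the accumulator tile back in row-major order.
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```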