cuBLAS and cuDNN. cuBLAS provides an interface for leveraging Tensor
Cores to accelerate GEMM operations, while cuDNN offers an interface to
accelerate neural network operations. To utilize Tensor Cores for GEMM
via cuBLAS, we can use the function `cublasGemmEx`, whose signature is
shown in Code :numref:`lst:cublasGemmEx`.

```
cublasStatus_t cublasGemmEx(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const void *alpha, const void *A, cudaDataType_t Atype, int lda, const void *B, cudaDataType_t Btype, int ldb, const void *beta, void *C, cudaDataType_t Ctype, int ldc, cublasComputeType_t computeType, cublasGemmAlgo_t algo)
```
:label:`lst:cublasGemmEx`

`handle` is the cuBLAS handle, which is created using the `cublasCreate`
function. `transa` denotes whether the matrix $\bf{A}$ is transposed,
while `transb` denotes whether the matrix $\bf{B}$ is transposed. `m`,
`n`, and `k` describe the shapes of the matrices: after any requested
transposition, $\bf{A}$ is $m\times k$, $\bf{B}$ is $k\times n$, and
$\bf{C}$ is $m\times n$. `alpha` and `beta` scale the matrix product and
the existing values of $\bf{C}$, respectively, so that the operation
computes $\bf{C} = \alpha\bf{A}\bf{B} + \beta\bf{C}$. `A`, `B`, and `C`
are pointers to the starting addresses of the matrices. `Atype`,
`Btype`, and `Ctype` describe the data type of the matrices. For
example, `CUDA_R_16F` indicates that the data is stored in the real
16-bit floating-point type. `lda`, `ldb`, and `ldc` are the leading
dimensions of the matrices. `computeType` is the data type used in
computation. For instance, `CUBLAS_COMPUTE_16F` implies the use of
Tensor Cores for computation in 16-bit floating point. Notably, if the
input data type is 32-bit float, we can use
`CUBLAS_COMPUTE_32F_FAST_16F` to perform the computation in 16-bit
floating point and achieve acceleration using Tensor Cores. `algo` is
the algorithm used in computation, and `CUBLAS_GEMM_DEFAULT` is commonly
used to select the default algorithm.

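As a concrete illustration, the following sketch calls `cublasGemmEx` on
two FP16 matrices while accumulating in FP32 so that Tensor Cores can be
used. The function name, the column-major layout without transposition,
and the choice of `CUBLAS_COMPUTE_32F` are illustrative assumptions
rather than requirements; `CUBLAS_COMPUTE_16F` (with half-precision
scalars) or `CUBLAS_COMPUTE_32F_FAST_16F` could be substituted as
described above.

```
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Sketch: C (m x n, FP32) = A (m x k, FP16) * B (k x n, FP16), with all
// matrices stored column-major and no transposition. Error checks on the
// returned cublasStatus_t are omitted for brevity.
void gemm_fp16_tensor_core(const half *dA, const half *dB, float *dC,
                           int m, int n, int k) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f;
    const float beta  = 0.0f;

    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &alpha,
                 dA, CUDA_R_16F, m,   // A, its data type, and leading dimension
                 dB, CUDA_R_16F, k,   // B, its data type, and leading dimension
                 &beta,
                 dC, CUDA_R_32F, m,   // C, its data type, and leading dimension
                 CUBLAS_COMPUTE_32F,  // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT);

    cublasDestroy(handle);
}
```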
### Primitives for Hardware Units

The second approach to accelerator programming involves the use of
programming primitives, such as invoking the CUDA Warp Matrix
Multiply-Accumulate (WMMA) API on a device. This approach hinges on the
collaborative design of software and hardware, meaning that the design
of programming APIs at this level is architecture-dependent. For
instance, in the Volta architecture, the unit that WMMA operates on is
a $16\times16$ matrix tile, processed by two Tensor Cores at a time.
This notion is tightly linked to the way Tensor Cores are integrated
into an SM.

In the Volta architecture, NVIDIA offers three distinct shapes of WMMA
multiply-accumulate interfaces for FP16 input data:
$16\times16\times16$, $32\times8\times16$, and $8\times32\times16$.

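As a rough sketch of how these interfaces are used, the kernel below
computes a single $16\times16\times16$ tile with the `nvcuda::wmma` API:
one warp loads FP16 tiles of $\bf{A}$ and $\bf{B}$ into fragments,
performs the multiply-accumulate on Tensor Cores, and stores the FP32
result. The kernel name, row-major layouts, and leading dimensions are
illustrative assumptions; a full GEMM would loop over tiles of the
complete matrices.

```
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16 output tile: C = A * B, with A and B
// stored in FP16 and the accumulator kept in FP32.
__global__ void wmma_16x16x16_tile(const half *a, const half *b, float *c) {
    // Fragments holding one 16x16x16 multiply-accumulate tile per warp.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    // Start from a zeroed accumulator.
    wmma::fill_fragment(c_frag, 0.0f);

    // Load the 16x16 input tiles (leading dimension 16).
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);

    // Warp-wide multiply-accumulate executed on Tensor Cores.
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    // Write the 16x16 result tile back in row-major order.
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

Launched with a single warp, for example
`wmma_16x16x16_tile<<<1, 32>>>(dA, dB, dC)`, this covers one tile; the
$32\times8\times16$ and $8\times32\times16$ shapes are selected by
changing the fragment dimensions.
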
[^1]: available at