
Commit 497b8c3

feat: add torch.compile blogs (#350)

* fix comments
* fix comments
* fix comments
* fix comments
* add torch.compile blogs
* add torch.compile blogs
* add torch.compile blogs

1 parent 0ba5390 commit 497b8c3

27 files changed: +46 -115 lines

.pre-commit-config.yaml
Lines changed: 0 additions & 2 deletions

@@ -10,8 +10,6 @@ repos:
       - id: end-of-file-fixer
       - id: check-yaml
         args: [--allow-multiple-documents]
-      - id: check-toml
-      - id: check-ast
       - id: check-added-large-files
       - id: check-merge-conflict
       - id: check-shebang-scripts-are-executable

README.md
Lines changed: 13 additions & 41 deletions

@@ -1,15 +1,3 @@
-<!---
-<img src='https://github.com/user-attachments/assets/9306862b-2a30-4a87-bb33-0fde9e9d7cea' width=250 >
-<a href="#cuda-kernel">📚200+ CUDA Kernels</a> | <a href="#my-blogs-part-1"> 📚100+ LLM/CUDA Blogs</a> | <a href="#HGEMM-bench"> ⚡️HGEMM MMA</a> | <a href="#fa-mma-bench"> ⚡️FA-2 MMA </a> <p>
-<img src='https://github.com/user-attachments/assets/b2578723-b7a7-4d8f-bcd1-5008947b808a' >
-<div align="center">
-  <p align="center">
-    <a href="#contribute">愿以青衿涉险苦,为君先踏棘荆途。他年若览通衢阔,莫忘初逢问路吾。</a>
-  </p>
-</div>
--->
-
-
 <div align="center">
   <p align="center">
     <h2>📚 LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners 🐑</h2>

@@ -24,14 +12,14 @@
   <img src=https://img.shields.io/github/stars/xlite-dev/LeetCUDA.svg?style=social >
   <img src=https://img.shields.io/badge/Release-v3.0.6-brightgreen.svg >
   <img src=https://img.shields.io/badge/License-GPLv3.0-turquoise.svg >
-</div>
+</div>
 </div>

 📚 **LeetCUDA**: It includes **Tensor/CUDA Cores, TF32/F16/BF16/F8**, [📖200+ CUDA Kernels🔥](#cuda-kernel) with PyTorch, [📖100+ LLM/CUDA🔥](#my-blogs-part-1) blogs, [📖HGEMM⚡️](./kernels/hgemm) which can achieve `98%~100%` TFLOPS of **cuBLAS**, and [📖flash-attn⚡️](./kernels/flash-attn) using Tensor Cores with pure MMA PTX. ♥️ Please consider to leave a ⭐️ Star to support me, my bro ~ ♥️

 <div align="center">
   <p align="center">
-    <a href="#contribute">🔥🔥 PR Welcome: Add Your Kernel to LeetCUDA! Let's make it Awesome together! 🎉🎉</a>
+    <a href="#contribute">🔥🔥 PR Welcome: Add Your Kernel to LeetCUDA! Let's make it Awesome together! 🎉🎉</a> <br>
     <a href=https://github.com/xlite-dev/LeetCUDA/graphs/contributors > <img src=https://opencollective.com/leetcuda/contributors.svg height=40px > </a>
   </p>
 </div>

@@ -52,7 +40,7 @@
 ## 📖 News 🔥🔥
 <div id="news"></div>

-- [2025-06-16]: [🤗CacheDiT](https://github.com/vipshop/cache-dit) is release! A **Training-free** and **Easy-to-use** Cache Acceleration Toolbox for Diffusion Transformers (**DBCache**, **DBPrune**, **FBCache**, etc)🔥. Feel free to take a try!
+- [2025-06-16]: [🤗CacheDiT](https://github.com/vipshop/cache-dit) is release! A **Training-free** and **Easy-to-use** Cache Acceleration Toolbox for Diffusion Transformers (**DBCache**, **DBPrune**, **FBCache**, etc)🔥. Feel free to take a try!

 <div align='center'>
   <img src='https://github.com/user-attachments/assets/a5ec4320-d2f9-4254-888a-170b2d9e3784' height=170px>

@@ -77,31 +65,6 @@

 ## 📖 Contents
 <div id="contents"></div>
-<!---
-- [📖 HGEMM-MMA 🎉🎉](#HGEMM-bench)
-  - [📚 CUDA/Tensor Cores](#HGEMM-bench)
-  - [📚 Tile Block(Br, Bc)](#HGEMM-bench)
-  - [📚 Tile MMAs/Warps](#HGEMM-bench)
-  - [📚 Pack LDST(128 bits)](#HGEMM-bench)
-  - [📚 Multi Stages(2~4)](#HGEMM-bench)
-  - [📚 Block/Warp Swizzle](#HGEMM-bench)
-  - [📚 SMEM Swizzle](#HGEMM-bench)
-  - [📚 Register Double Buffers](#HGEMM-bench)
-  - [📚 Collective Store(Shfl)](#HGEMM-bench)
-  - [📚 Layout NN/TN](#HGEMM-bench)
-- [📖 FlashAttention-MMA 🎉🎉](#fa-mma-bench)
-- [📖 200+ CUDA Kernels 🔥🔥](#cuda-kernel)
-- [📖 100+ 高性能计算文章 💡💡](#my-blogs-part-1)
-  - [📚 大模型推理优化原理](#my-blogs-part-1)
-  - [📚 大模型分布式训推原理](#my-blogs-part-1)
-  - [📚 CV/C++/模型部署优化](#my-blogs-part-1)
-  - [📚 CUDA优化入门与实践](#other-blogs)
-  - [📚 Tensor Cores入门教程](#other-blogs)
-  - [📚 CuTe系列详解与实践](#other-blogs)
-  - [📚 GPU指令集架构精解](#other-blogs)
-  - [📚 GPU通信架构精解](#other-blogs)
-- [📖 How to Contribute 👀👇](#contribute)
--->

 - [📖 HGEMM-MMA 🎉🎉](#HGEMM-bench)
 - [📖 FlashAttention-MMA 🎉🎉](#fa-mma-bench)

@@ -521,7 +484,7 @@ The kernels listed here will guide you through a step-by-step progression, rangi

 |📖 类型-标题|📖 作者| 📖 推荐 |
 |:---|:---|:---|
-| [[Diffusion推理]📖DiT推理加速综述: Caching](https://zhuanlan.zhihu.com/p/711223667)|@DefTruth|⭐️⭐️⭐|
+| [[Diffusion推理]📖DiT推理加速综述: Caching](https://zhuanlan.zhihu.com/p/711223667)|@DefTruth|⭐️⭐️⭐|
 | [[Triton编程][基础]📖Triton极简入门: Triton Vector Add](https://zhuanlan.zhihu.com/p/1902778199261291694)|@DefTruth|⭐️⭐️⭐|
 | [[Triton编程][基础]📖Triton Fused Softmax Kernel详解: 从Python源码到PTX](https://zhuanlan.zhihu.com/p/1899562146477609112)|@DefTruth|⭐️⭐️⭐|
 | [[Triton编程][基础]📖vLLM Triton Merge Attention States Kernel详解](https://zhuanlan.zhihu.com/p/1904937907703243110)|@DefTruth|⭐️⭐️⭐|

@@ -665,6 +628,15 @@ The kernels listed here will guide you through a step-by-step progression, rangi
 | [[Tensor Cores]📖Nvidia Tensor Core-MMA PTX编程入门](https://zhuanlan.zhihu.com/p/621855199)|@木子知|⭐️⭐️⭐️|
 | [[Tensor Cores]📖CUDA Ampere Tensor Core HGEMM 矩阵乘法优化](https://zhuanlan.zhihu.com/p/555339335)|@nicholaswilde|⭐️⭐️⭐️|
 | [[GPU通信架构][精解]📖NVIDIA GPGPU(四)- 通信架构](https://zhuanlan.zhihu.com/p/680262016)|@Bruce|⭐️⭐️⭐️|
+| [[torch.compile][原理]📖Torch.compile流程解析: 介绍](https://zhuanlan.zhihu.com/p/9418379234)|@StarCap|⭐️⭐️⭐️|
+| [[torch.compile][原理]📖Torch.compile流程解析: TorchDynamo](https://zhuanlan.zhihu.com/p/9640728231)|@StarCap|⭐️⭐️⭐️|
+| [[torch.compile][原理]📖Torch.compile流程解析: AOTAutograd](https://zhuanlan.zhihu.com/p/9997263922)|@StarCap|⭐️⭐️⭐️|
+| [[torch.compile][原理]📖Torch.compile流程解析: TorchInductor](https://zhuanlan.zhihu.com/p/11224299472)|@StarCap|⭐️⭐️⭐️|
+| [[torch.compile][原理]📖Torch.compile流程解析: 算子融合](https://zhuanlan.zhihu.com/p/21053905491)|@StarCap|⭐️⭐️⭐️|
+| [[torch.compile][实践]📖Torch.compile使用指南](https://zhuanlan.zhihu.com/p/620163218)|@jhang|⭐️⭐️⭐️|
+| [[torch.compile][实践]📖Torch.compile详细示例解析教程](https://zhuanlan.zhihu.com/p/855291863)|@Bbuf|⭐️⭐️⭐️|
+| [[torch.compile][原理]📖一文搞懂TorchDynamo原理](https://zhuanlan.zhihu.com/p/630933479)|@吾乃阿尔法|⭐️⭐️⭐️|
+| [[torch.compile][原理]📖理解torch.compile基本原理和使用方式](https://zhuanlan.zhihu.com/p/12712224407)|@俯仰|⭐️⭐️⭐️|

 ## ©️License ([©️back👆🏻](#contents))
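The nine rows added above cover the torch.compile stack: TorchDynamo graph capture, AOTAutograd, TorchInductor code generation, and operator fusion, plus two usage guides. For orientation, here is a minimal sketch of the standard torch.compile API that those posts walk through (plain PyTorch 2.x usage, not code from this commit; the example function and shapes are illustrative):

```python
# Minimal torch.compile sketch (standard PyTorch 2.x API; not part of this commit).
# TorchDynamo captures the Python function as an FX graph, AOTAutograd builds the
# forward/backward graphs, and TorchInductor emits fused Triton/C++ kernels.
import torch

def gelu_mul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.gelu(x) * y

compiled = torch.compile(gelu_mul, mode="reduce-overhead")

x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
y = torch.randn_like(x)

out = compiled(x, y)  # first call triggers compilation; later calls reuse the cached kernels
torch.testing.assert_close(out, gelu_mul(x, y), rtol=1e-3, atol=1e-3)
```

On recent PyTorch releases, running such a script with `TORCH_LOGS="output_code"` dumps the Triton kernels that TorchInductor generated, which pairs well with the TorchInductor and operator-fusion posts listed above.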

kernels/dot-product/dot_product.cu
Lines changed: 4 additions & 6 deletions

@@ -17,8 +17,8 @@
 #define BFLOAT2(value) (reinterpret_cast<__nv_bfloat162 *>(&(value))[0])
 #define LDST128BITS(value) (reinterpret_cast<float4 *>(&(value))[0])

-// -------------------------------------- FP32
-// -------------------------------------- Warp Reduce Sum
+// FP32
+// Warp Reduce Sum
 template <const int kWarpSize = WARP_SIZE>
 __device__ __forceinline__ float warp_reduce_sum_f32(float val) {
 #pragma unroll

@@ -87,8 +87,8 @@ __global__ void dot_prod_f32x4_f32_kernel(float *a, float *b, float *y, int N) {
   atomicAdd(y, prod);
 }

-// -------------------------------------- FP16
-// -------------------------------------- Warp Reduce Sum: Half
+// FP16
+// Warp Reduce Sum: Half
 template <const int kWarpSize = WARP_SIZE>
 __device__ __forceinline__ half warp_reduce_sum_f16_f16(half val) {
 #pragma unroll

@@ -199,8 +199,6 @@ __global__ void dot_prod_f16x8_pack_f32_kernel(half *a, half *b, float *y,
   atomicAdd(y, prod);
 }

-// --------------------- PyTorch bindings for custom kernel
-// -----------------------
 #define STRINGFY(str) #str
 #define TORCH_BINDING_COMMON_EXTENSION(func) \
   m.def(STRINGFY(func), &func, STRINGFY(func));
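The `STRINGFY` / `TORCH_BINDING_COMMON_EXTENSION` macros at the end of this file (and of the hgemm kernels below) are the repository's PyTorch C++ extension registration helpers. As a rough sketch (an assumption, not code from this commit), such a `.cu` file can be JIT-built and inspected from Python like this; `torch.utils.cpp_extension.load` is the standard PyTorch API, and the extension name below is made up:

```python
# Hedged sketch: JIT-build dot_product.cu as a PyTorch extension and list the
# functions that its TORCH_BINDING_COMMON_EXTENSION(...) calls registered.
# torch.utils.cpp_extension.load is standard PyTorch; the extension name is
# arbitrary, and the source path matches the file shown in this diff.
import torch
from torch.utils.cpp_extension import load

lib = load(
    name="dot_product_lib",                        # arbitrary module name
    sources=["kernels/dot-product/dot_product.cu"],
    extra_cuda_cflags=["-O3"],
    verbose=True,
)

# Each m.def(...) expanded from TORCH_BINDING_COMMON_EXTENSION shows up as an
# attribute on the loaded module; listing them avoids guessing exact signatures.
print([name for name in dir(lib) if not name.startswith("_")])
```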

kernels/hardswish/hardswish.cu
Lines changed: 4 additions & 8 deletions

@@ -19,20 +19,17 @@
 #define BFLOAT2(value) (reinterpret_cast<__nv_bfloat162 *>(&(value))[0])
 #define LDST128BITS(value) (reinterpret_cast<float4 *>(&(value))[0])

-// 定义 CHECK_TORCH_TENSOR_DTYPE 宏
 #define CHECK_TORCH_TENSOR_DTYPE(T, th_type) \
   if (((T).options().dtype() != (th_type))) { \
     std::cout << "Tensor Info:" << (T).options() << std::endl; \
     throw std::runtime_error("Tensor dtype must be " #th_type); \
   }

-// 定义 TORCH_BINDING_COMMON_EXTENSION 宏
 #define STRINGFY(str) #str
 #define TORCH_BINDING_COMMON_EXTENSION(func) \
   m.def(STRINGFY(func), &func, STRINGFY(func));

-// HARDSWISH 计算函数
-// FP32
+// FP32
 __device__ __forceinline__ float hardswish(float x) {
   if (x >= THRESHOLD_A) {
     return x;

@@ -43,7 +40,7 @@ __device__ __forceinline__ float hardswish(float x) {
   }
 }

-// FP16
+// FP16
 __device__ __forceinline__ half hardswish_half(half x) {
   if (x > __float2half(THRESHOLD_A)) {
     return x;

@@ -54,8 +51,7 @@ __device__ __forceinline__ half hardswish_half(half x) {
   }
 }

-// CUDA 核函数
-// FP32
+// FP32
 __global__ void hardswish_f32_kernel(float *x, float *y, int N) {
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   if (idx < N)

@@ -75,7 +71,7 @@ __global__ void hardswish_f32x4_kernel(float *x, float *y, int N) {
   }
 }

-// FP16
+// FP16
 __global__ void hardswish_f16_kernel(half *x, half *y, int N) {
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   if (idx < N)
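The hardswish device functions above branch on `THRESHOLD_A`, consistent with the standard definition hardswish(x) = x * relu6(x + 3) / 6 and thresholds 3 and -3; the exact constants in `hardswish.cu` are not visible in this diff, so treat them as an assumption. A small host-side reference one might use to sanity-check such kernels:

```python
# Hedged reference for the hardswish kernels (assumes the standard thresholds
# 3 and -3; the actual constants in hardswish.cu are not shown in this diff).
import torch
import torch.nn.functional as F

def hardswish_ref(x: torch.Tensor) -> torch.Tensor:
    # Piecewise form: x for x >= 3, 0 for x <= -3, otherwise x * (x + 3) / 6.
    return x * F.relu6(x + 3.0) / 6.0

x = torch.linspace(-6.0, 6.0, steps=25)
torch.testing.assert_close(hardswish_ref(x), F.hardswish(x))
```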

kernels/hgemm/cublas/hgemm_cublas.cu
Lines changed: 1 addition & 2 deletions

@@ -173,8 +173,7 @@ int main(int argc, char *argv[]) {
 }
 // build torch python binding
 #else
-// --------------------- PyTorch bindings for custom kernel
-// -----------------------
+
 #include <torch/extension.h>
 #include <torch/types.h>
kernels/hgemm/cutlass/hgemm_mma_stage_tn_cute.cu
Lines changed: 1 addition & 2 deletions

@@ -469,8 +469,7 @@ int main() {

 #include <torch/extension.h>
 #include <torch/types.h>
-// --------------------- PyTorch bindings for custom kernel
-// -----------------------
+
 #define STRINGFY(str) #str
 #define TORCH_BINDING_COMMON_EXTENSION(func) \
   m.def(STRINGFY(func), &func, STRINGFY(func));

kernels/hgemm/mma/basic/hgemm_mma.cu
Lines changed: 0 additions & 2 deletions

@@ -288,8 +288,6 @@ __global__ void __launch_bounds__(256)
   }
 }

-// --------------------- PyTorch bindings for custom kernel
-// -----------------------
 #define STRINGFY(str) #str
 #define TORCH_BINDING_COMMON_EXTENSION(func) \
   m.def(STRINGFY(func), &func, STRINGFY(func));

kernels/hgemm/mma/basic/hgemm_mma_stage.cu
Lines changed: 0 additions & 2 deletions

@@ -2039,8 +2039,6 @@ int main(int argc, char *argv[]) {

 #else

-// --------------------- PyTorch bindings for custom kernel
-// -----------------------
 #include <torch/extension.h>
 #include <torch/types.h>
 #define STRINGFY(str) #str

kernels/hgemm/mma/basic/hgemm_mma_stage_tn.cu
Lines changed: 0 additions & 2 deletions

@@ -492,8 +492,6 @@ int main(int argc, char *argv[]) {

 #else

-// --------------------- PyTorch bindings for custom kernel
-// -----------------------
 #include <torch/extension.h>
 #include <torch/types.h>
 #define STRINGFY(str) #str

kernels/hgemm/mma/swizzle/hgemm_mma_stage_swizzle.cu
Lines changed: 0 additions & 2 deletions

@@ -738,8 +738,6 @@ int main(int argc, char *argv[]) {

 #else

-// --------------------- PyTorch bindings for custom kernel
-// -----------------------
 #include <torch/extension.h>
 #include <torch/types.h>
 #define STRINGFY(str) #str
