| | FP32 | Compile | **10.107** | **1.1x** | `torch.compile` provides minor gains |
*Runtimes are for a single forward pass. Speedups are calculated relative to the FP32 Eager implementation.*
**Key Insights:**
- 🚀 **Optimized Kernels**: Our custom CUDA kernels for KNN and Chamfer are already highly optimized, so `torch.compile` may add overhead in some cases.
- ⚡ **Half-Precision (FP16)**: For KNN, half-precision provides a solid **1.4x speedup** in eager mode, making it ideal for memory-constrained and performance-critical applications.
- 🎯 **EMD**: EMD sees a minor benefit from `torch.compile`, while half-precision is not recommended due to numerical stability issues.
- 🔥 **GPU Scaling**: All operations show significant performance gains on larger inputs.
- 📊 **Efficiency**: Optimized CUDA kernels deliver maximum hardware utilization, often outperforming generalized compilation approaches.
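The FP16 path from the insights above can be sketched with autocast. Note that `knn_naive` is a hypothetical brute-force stand-in written for this illustration, not this library's optimized KNN kernel, and the tensor sizes are scaled down from the benchmark configuration:

```python
import torch

def knn_naive(query, ref, k=8):
    # Brute-force KNN via pairwise distances; a hypothetical stand-in
    # for the library's custom CUDA KNN kernel.
    d = torch.cdist(query, ref)               # (B, N, M) pairwise distances
    return d.topk(k, dim=-1, largest=False)   # k smallest distances + indices

device = "cuda" if torch.cuda.is_available() else "cpu"
query = torch.randn(2, 512, 3, device=device)   # scaled down for illustration
ref = torch.randn(2, 512, 3, device=device)

# Autocast runs eligible ops in half precision: FP16 on GPU,
# bfloat16 on CPU (FP16 autocast is CUDA-only).
dtype = torch.float16 if device == "cuda" else torch.bfloat16
with torch.autocast(device_type=device, dtype=dtype):
    dists, idx = knn_naive(query, ref)
```

On GPU, the half-precision distance computation is where the measured 1.4x eager-mode speedup comes from; results should be validated against FP32 for accuracy-sensitive workloads.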
The following table shows the performance for the `B16_N2048_M2048` configuration on an NVIDIA GeForce RTX 3090. `torch.compile` with `reduce-overhead` or `max-autotune` modes can provide significant speedups, especially for EMD.