Commit 21244e9

fixed some dtype issues with knn
1 parent 5d1d3bf commit 21244e9

File tree

6 files changed: +228 −204 lines changed


.vscode/extensions.json

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+{
+  "recommendations": [
+    "ms-python.python"
+  ]
+}

README.md

Lines changed: 26 additions & 21 deletions
@@ -218,27 +218,32 @@ python benchmark_flops.py
 
 ### 📈 Performance Highlights
 
-Based on benchmarking with an NVIDIA RTX 3090, here’s how `torch-point-ops` performs on a 1024x1024 point cloud configuration:
-
-| Operation      | Precision | Mode    | Runtime (ms) | Speedup vs FP32 Eager | Notes                               |
-|----------------|-----------|---------|--------------|-----------------------|-------------------------------------|
-| **KNN (K=16)** | FP32      | Eager   | 3.377        | 1.0x                  | Baseline performance                |
-|                | **FP16**  | Eager   | **2.415**    | **1.4x**              | Faster with half precision          |
-|                | **FP16**  | Compile | **4.582**    | **~0.7x**             | `torch.compile` overhead observed   |
-| **Chamfer**    | FP32      | Eager   | 0.250        | 1.0x                  | Baseline performance                |
-|                | **FP16**  | Eager   | **0.941**    | **~0.3x**             | Slower with half precision          |
-|                | **FP16**  | Compile | **0.873**    | **~0.3x**             | `torch.compile` provides no benefit |
-| **EMD**        | FP32      | Eager   | 11.126       | 1.0x                  | Baseline, FP16 not recommended      |
-|                | FP32      | Compile | **10.107**   | **1.1x**              | `torch.compile` provides minor gains |
-
-*Runtimes are for a single forward pass. Speedups are calculated relative to the FP32 Eager implementation.*
-
-**Key Insights:**
-- 🚀 **Optimized Kernels**: Our custom CUDA kernels for KNN and Chamfer are already highly optimized, showing that `torch.compile` may add overhead in some cases.
-- ⚡ **Half-Precision (FP16)**: For KNN, half precision provides a solid **1.4x speedup** in eager mode, making it ideal for memory-constrained and performance-critical applications.
-- 🎯 **EMD**: EMD sees a minor benefit from `torch.compile`, while half precision is not recommended due to numerical stability.
-- 🔥 **GPU Scaling**: All operations show significant performance gains on larger inputs.
-- 📊 **Efficiency**: Optimized CUDA kernels deliver maximum hardware utilization, often outperforming generalized compilation approaches.
+The following table shows performance for the `B16_N2048_M2048` configuration on an NVIDIA GeForce RTX 3090. `torch.compile` with the `reduce-overhead` or `max-autotune` modes can provide significant speedups, especially for EMD.
+
+| Operation      | Precision | Mode                      | Runtime (ms) | Speedup vs Eager |
+|----------------|-----------|---------------------------|--------------|------------------|
+| **KNN (K=16)** | FP16      | Compile (default)         | 1.173        | 0.94x            |
+| **KNN (K=16)** | FP16      | Compile (max-autotune)    | 1.184        | 0.93x            |
+| **KNN (K=16)** | FP16      | Compile (reduce-overhead) | 1.173        | 0.94x            |
+| **KNN (K=16)** | FP16      | Eager                     | 1.106        | 1.00x            |
+| **KNN (K=16)** | FP32      | Compile (default)         | 0.995        | 0.95x            |
+| **KNN (K=16)** | FP32      | Compile (max-autotune)    | 0.972        | 0.97x            |
+| **KNN (K=16)** | FP32      | Compile (reduce-overhead) | 1.021        | 0.92x            |
+| **KNN (K=16)** | FP32      | Eager                     | 0.943        | 1.00x            |
+| **Chamfer**    | FP16      | Compile (default)         | 0.558        | 1.00x            |
+| **Chamfer**    | FP16      | Compile (max-autotune)    | 0.557        | 1.00x            |
+| **Chamfer**    | FP16      | Compile (reduce-overhead) | 0.558        | 1.00x            |
+| **Chamfer**    | FP16      | Eager                     | 0.556        | 1.00x            |
+| **Chamfer**    | FP32      | Compile (default)         | 0.233        | 0.98x            |
+| **Chamfer**    | FP32      | Compile (max-autotune)    | 0.230        | 0.99x            |
+| **Chamfer**    | FP32      | Compile (reduce-overhead) | 0.133        | 1.71x            |
+| **Chamfer**    | FP32      | Eager                     | 0.228        | 1.00x            |
+| **EMD**        | FP32      | Compile (default)         | 32.378       | 1.00x            |
+| **EMD**        | FP32      | Compile (max-autotune)    | 0.326        | 99.28x           |
+| **EMD**        | FP32      | Compile (reduce-overhead) | 0.328        | 98.68x           |
+| **EMD**        | FP32      | Eager                     | 32.366       | 1.00x            |
+
+*Runtimes are for a single forward pass on an NVIDIA GPU. Speedup is relative to the Eager mode of the same precision.*
 
 *The benchmark script tests various configurations and provides detailed timing statistics, theoretical FLOP counts, and performance analysis.*
 
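The commit message mentions dtype fixes in KNN, and the README diff discusses benchmarking KNN under FP16 and FP32 with `torch.compile`. As a hedged illustration only (the `knn_points` name, shapes, and internals below are assumptions, not the library's actual CUDA implementation), here is a minimal pure-PyTorch sketch of a brute-force KNN that accumulates distances in FP32 even for FP16 inputs, which is one common way to sidestep half-precision dtype/overflow issues:

```python
import torch

def knn_points(p1: torch.Tensor, p2: torch.Tensor, k: int = 16):
    """Brute-force KNN between point clouds p1 (B, N, 3) and p2 (B, M, 3).

    Distances are computed in FP32 regardless of input dtype to avoid
    half-precision overflow/accuracy issues; indices refer to points in p2.
    """
    # (B, N, M) pairwise Euclidean distances, upcast to FP32.
    d = torch.cdist(p1.float(), p2.float())
    # k smallest distances per query point, sorted ascending.
    dists, idx = d.topk(k, dim=-1, largest=False)
    return dists, idx

# FP16 inputs still yield stable FP32 distances.
p1 = torch.randn(2, 128, 3, dtype=torch.float16)
p2 = torch.randn(2, 256, 3, dtype=torch.float16)
dists, idx = knn_points(p1, p2, k=16)

# The same function can be wrapped with torch.compile; modes such as
# "max-autotune" or "reduce-overhead" (as benchmarked above) trade longer
# compile time for potentially faster kernels. Wrapping is lazy, so no
# compilation happens until the first call.
knn_compiled = torch.compile(knn_points, mode="max-autotune")
```

This is only a reference-semantics sketch; the benchmarked kernels are custom CUDA implementations, and a real FP16 path might instead compute distances natively in half precision where the value range allows it.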
