Description
Hi, thanks for open-sourcing this great CUDA voxelizer.
I tried to reproduce the performance numbers from the README and noticed that my results on a newer GPU are significantly slower than what is reported for a GTX 1050 Ti.
Environment
- GPU: NVIDIA GeForce RTX 4060
- CUDA: 12.1
- OS: Windows 10
- Driver version: [e.g. 555.xx]
- Build: [prebuilt binary / built from source with CMake, Release configuration]
What I did
I voxelized a 256³ grid as in the README.
On my setup, the measured time is about 4.7 ms for resolution 256, while the README mentions about 0.6 ms on a GTX 1050 Ti for the same resolution (excluding file I/O).
Because my card should be significantly faster than a 1050 Ti, I’m wondering if I am misunderstanding the benchmark conditions or missing some important build/runtime settings.
Questions
- Are the numbers in the README pure kernel time (excluding file I/O and host↔device transfers), or something else?
- Do you have recommended CMake / nvcc flags or `CUDAARCHS` settings for newer GPUs like RTX 40-series (e.g. sm_89)?
- Are there any known issues or performance pitfalls when running this code with CUDA 12.x or on Ada / RTX 40-series GPUs?
- Is there anything in the sample configuration (e.g. solid vs non-solid mode, specific test mesh) that I should be careful to match exactly?
If it helps, I can share more detailed logs, profiler output, or a minimal repro of how I am timing the kernel.
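To make the comparison concrete, here is a minimal host-side sketch of a timing harness that separates the first (cold) call from the steady-state average. It is not the repository's own timing code. `TimingResult` and `time_cold_and_steady` are hypothetical names invented for illustration, and the callable passed in is assumed to block until the GPU work has finished (for example by calling `cudaDeviceSynchronize()` before returning).

```cpp
#include <chrono>
#include <cstddef>
#include <functional>

// Hypothetical helper type: timings in milliseconds.
struct TimingResult {
    double first_call_ms;  // cold call; may include one-time costs such as context setup or JIT
    double steady_avg_ms;  // mean over the repeated calls that follow the cold call
};

// Times `work` once (cold), then `iterations` more times and averages those.
// Assumption: `work` blocks until it is done. For a CUDA kernel launch this
// means synchronizing with the device inside `work` before it returns.
inline TimingResult time_cold_and_steady(const std::function<void()>& work,
                                         std::size_t iterations) {
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;

    const auto t0 = clock::now();
    work();  // cold call
    const auto t1 = clock::now();

    for (std::size_t i = 0; i < iterations; ++i) {
        work();  // warm calls
    }
    const auto t2 = clock::now();

    TimingResult r;
    r.first_call_ms = ms(t1 - t0).count();
    r.steady_avg_ms = iterations ? ms(t2 - t1).count() / iterations : 0.0;
    return r;
}
```

If the cold call is much slower than the steady-state average, the gap could come from one-time startup costs rather than from the voxelization kernel itself, which would be worth ruling out before comparing against the README figure.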
Thanks in advance for any hints!