Unexpectedly slower performance on RTX 4060 vs README's GTX 1050 Ti numbers #82

Hi, thanks for open-sourcing this great CUDA voxelizer.

I tried to reproduce the performance numbers from the README and noticed that my results on a newer GPU are significantly slower than what is reported for a GTX 1050 Ti.

**Environment**

- GPU: NVIDIA GeForce RTX 4060
- CUDA: 12.1
- OS: Windows 10
- Driver version: [e.g. 555.xx]
- Build: [prebuilt binary / built from source with CMake, Release configuration]

**What I did**

I ran the voxelizer at a grid resolution of 256³, as in the README.
On my setup the measured time is about **4.7 ms**, whereas the README reports about **0.6 ms** on a GTX 1050 Ti at the same resolution (excluding file I/O).

Because my card should be significantly faster than a 1050 Ti, I’m wondering if I am misunderstanding the benchmark conditions or missing some important build/runtime settings.

**Questions**

1. Are the numbers in the README pure kernel time (excluding file I/O and host↔device transfers), or something else?
2. Do you have recommended CMake / nvcc flags or `CUDAARCHS` settings for newer GPUs like RTX 40-series (e.g. sm_89)?
3. Are there any known issues or performance pitfalls when running this code with CUDA 12.x or on Ada / RTX 40-series GPUs?
4. Is there anything in the sample configuration (e.g. solid vs non-solid mode, specific test mesh) that I should be careful to match exactly?

If it helps, I can share more detailed logs, profiler output, or a minimal repro of how I am timing the kernel.
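
As a starting point for such a repro, here is a minimal sketch of timing that isolates kernel time with CUDA events. It is not the voxelizer's code: `placeholder_kernel` and `time_kernel_ms` are hypothetical names, and the kernel is a trivial stand-in. The point is the structure: one untimed warm-up launch (so that one-time costs such as context setup or PTX JIT compilation are not counted), then an average over several launches, with no file I/O or host↔device copies inside the timed region.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Hypothetical stand-in for the voxelization kernel (assumption, not project code).
__global__ void placeholder_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;
}

// Returns the average time in milliseconds of `iters` kernel launches,
// measured with CUDA events after one untimed warm-up launch.
float time_kernel_ms(int n, int iters) {
    float* d_data = nullptr;
    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d_data, 0, n * sizeof(float)));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Warm-up launch: absorbs one-time costs so they are not timed.
    placeholder_kernel<<<blocks, threads>>>(d_data, n);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    // Timed region: kernel launches only.
    CUDA_CHECK(cudaEventRecord(start));
    for (int it = 0; it < iters; ++it) {
        placeholder_kernel<<<blocks, threads>>>(d_data, n);
    }
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));

    float total_ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&total_ms, start, stop));

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(d_data));
    return total_ms / static_cast<float>(iters);
}

int main() {
    const int n = 1 << 20;   // arbitrary problem size for the stand-in kernel
    const int iters = 100;   // number of timed launches to average over
    std::printf("average kernel time: %.4f ms\n", time_kernel_ms(n, iters));
    return 0;
}
```

If my 4.7 ms figure turns out to include the first launch or any transfers, a harness shaped like this should make that visible, since the warm-up and the timed region are separated explicitly.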

Thanks in advance for any hints!
