@stephenswat
Member

A recent CUDA update adds the `__grid_constant__` qualifier, which can be applied to `const` kernel arguments and guarantees that they remain in constant memory rather than being copied into local memory.
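For illustration, here is a minimal sketch of how the qualifier might be used. The struct `scale_params`, the kernel `scale_kernel`, and the host helper `run_scale_demo` are hypothetical names invented for this example and are not taken from this PR. It assumes CUDA 11.7 or newer and a target of compute capability 7.0 or higher (e.g. `nvcc -arch=sm_70`).

```cuda
#include <cuda_runtime.h>

// Hypothetical parameter bundle (an assumption, not a type from this PR).
struct scale_params {
    float factor;
    float offset;
};

// Device helper that receives the parameters by pointer. Taking the address
// of an ordinary by-value kernel parameter may cause the compiler to create
// a per-thread copy in local memory; with __grid_constant__ the address
// refers to the kernel parameter itself.
__device__ float apply(const scale_params* p, float x) {
    return p->factor * x + p->offset;
}

// The annotated parameter must be const-qualified and of non-reference type.
__global__ void scale_kernel(const __grid_constant__ scale_params params,
                             float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = apply(&params, data[i]);
    }
}

// Host-side driver: returns true if every element was transformed correctly.
bool run_scale_demo() {
    constexpr int n = 256;
    float host[n];
    for (int i = 0; i < n; ++i) {
        host[i] = static_cast<float>(i);
    }

    float* dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) {
        return false;
    }
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    scale_kernel<<<(n + 127) / 128, 128>>>(scale_params{2.0f, 1.0f}, dev, n);

    // The blocking copy back also surfaces any error from the launch.
    cudaError_t err =
        cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) {
        return false;
    }

    for (int i = 0; i < n; ++i) {
        if (host[i] != 2.0f * static_cast<float>(i) + 1.0f) {
            return false;
        }
    }
    return true;
}
```

Calling `run_scale_demo()` from host code exercises the kernel. Whether the qualifier changes performance depends on whether the compiler was creating local copies in the first place, which is what the benchmark in this PR is meant to measure.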

@stephenswat added the `performance` (Performance-relevant changes) label on Sep 29, 2025
@stephenswat
Member Author

Performance summary

Here is a summary of the performance effects of this PR:

Graphical

Tabular

| Kernel | Reciprocal throughput (b80f811) | Reciprocal throughput (6089e80) | Delta | Parallelism (b80f811) | Parallelism (6089e80) |
|---|---|---|---|---|---|
| propagate_to_next_surface | 11.29 ms | 11.29 ms | 0.0% | 2.76 | 2.76 |
| fit_forward | 4.63 ms | 4.78 ms | 3.3% | 3.71 | 3.69 |
| fit_backward | 2.65 ms | 2.66 ms | 0.5% | 2.78 | 2.78 |
| find_tracks | 1.31 ms | 1.30 ms | -0.6% | 1.79 | 1.79 |
| ccl_kernel | 827.47 μs | 825.60 μs | -0.2% | 1.37 | 1.37 |
| count_doublets | 636.58 μs | 635.59 μs | -0.2% | 1.61 | 1.61 |
| count_triplets | 589.61 μs | 590.70 μs | 0.2% | 1.02 | 1.02 |
| find_doublets | 447.03 μs | 449.96 μs | 0.7% | 3.08 | 3.08 |
| Thrust::sort | 394.78 μs | 395.07 μs | 0.1% | 5.96 | 5.96 |
| find_triplets | 172.68 μs | 173.69 μs | 0.6% | 1.31 | 1.31 |
| select_seeds | 53.72 μs | 53.27 μs | -0.8% | 1.34 | 1.34 |
| remove_duplicates | 45.37 μs | 45.82 μs | 1.0% | 15.09 | 14.96 |
| fit_prelude | 26.45 μs | 26.44 μs | -0.1% | 11.05 | 11.04 |
| populate_grid | 23.36 μs | 23.35 μs | -0.0% | 1.22 | 1.22 |
| count_grid_capacities | 22.08 μs | 22.13 μs | 0.2% | 1.22 | 1.22 |
| unknown | 20.61 μs | 20.56 μs | -0.2% | 2.25 | 2.25 |
| apply_interaction | 17.74 μs | 17.71 μs | -0.2% | 5.59 | 5.60 |
| update_triplet_weights | 15.21 μs | 15.05 μs | -1.1% | 1.27 | 1.27 |
| estimate_track_params | 14.39 μs | 14.36 μs | -0.2% | 2.15 | 2.15 |
| form_spacepoints | 12.30 μs | 12.24 μs | -0.5% | 1.48 | 1.48 |
| fill_finding_propagation_sort_keys | 11.55 μs | 11.56 μs | 0.0% | 6.08 | 6.08 |
| build_tracks | 9.34 μs | 9.34 μs | 0.1% | 7.49 | 7.50 |
| reduce_triplet_counts | 6.30 μs | 6.30 μs | 0.1% | 3.08 | 3.08 |
| fill_finding_duplicate_removal_sort_keys | 3.82 μs | 3.82 μs | -0.0% | 19.73 | 19.71 |
| make_barcode_sequence | 1.01 μs | 1.02 μs | 0.4% | 3.83 | 3.83 |
| fill_fitting_sort_keys | 317.96 ns | 319.68 ns | 0.5% | 11.25 | 11.23 |
| fill_prefix_sum | 171.90 ns | 171.92 ns | 0.0% | 341.30 | 341.30 |
| Total | 23.23 ms | 23.39 ms | 0.7% | 2.86 | 2.86 |

Important

All metrics in this report are given as reciprocal throughput, not as wallclock runtime.

Note

This is an automated message produced upon the explicit request of a human being.

@stephenswat marked this pull request as draft on October 1, 2025, 14:57