@beomki-yeo
Contributor

@beomki-yeo beomki-yeo commented Jul 2, 2025

This PR depends on #1057.

It adds the greedy ambiguity solver to the CUDA full chain.

@beomki-yeo beomki-yeo force-pushed the add-ambiguity-solver-to-chain branch 2 times, most recently from 022693e to 126a871 Compare July 2, 2025 22:10
@beomki-yeo beomki-yeo marked this pull request as draft July 2, 2025 23:06
@beomki-yeo beomki-yeo force-pushed the add-ambiguity-solver-to-chain branch from 126a871 to d61322d Compare July 3, 2025 19:36
@beomki-yeo beomki-yeo marked this pull request as ready for review July 3, 2025 19:38
@beomki-yeo beomki-yeo force-pushed the add-ambiguity-solver-to-chain branch 2 times, most recently from 4c9698f to adcb664 Compare July 3, 2025 19:42
@beomki-yeo
Contributor Author

beomki-yeo commented Jul 4, 2025

Maybe adding a new algorithm to the chain corrupts the comparison?
BTW, we might need to use different marker types in the graph, as the colors seem to overlap.

I knew that rearrange_tracks was also a bottleneck, but I didn't expect it to be this significant. Good to know.

@beomki-yeo beomki-yeo force-pushed the add-ambiguity-solver-to-chain branch 2 times, most recently from 62adba1 to 88347ea Compare July 9, 2025 23:59
@beomki-yeo
Contributor Author

@stephenswat Could you generate the plot again now that the covfie hash issue has been resolved?

@stephenswat
Member

Running now! Mind you that due to the large number of kernels the ambiguity resolution needs to run, the benchmark takes about 3 hours per commit.

@stephenswat

This comment was marked as outdated.

@stephenswat
Member

Hi @beomki-yeo, I tried running the multi-threaded CUDA throughput example with this PR and it gives me segmentation faults. Have you seen this?

@beomki-yeo
Contributor Author

I will check

@beomki-yeo
Contributor Author

beomki-yeo commented Jul 10, 2025

It's running normally for me. Could you share your command?

gcc version: 31.1
CUDA Version: 12.4

byeo@hermes:/mnt/nvme0n1/byeo/projects/traccc/traccc_build$    ./bin/traccc_throughput_mt_cuda \
   --digitization-file=geometries/odd/odd-digi-geometric-config.json \
   --detector-file=geometries/odd/odd-detray_geometry_detray.json \
   --material-file=geometries/odd/odd-detray_material_detray.json \
   --grid-file=geometries/odd/odd-detray_surface_grids_detray.json \
   --use-detray-detector=true \
   --use-acts-geom-source=true \
   --input-directory=odd/odd-simulations-20240509/geant4_ttbar_mu200/ \
   --input-events=10
11:23:54    ThroughputExampleOptions      INFO      
11:23:54    ThroughputExampleOptions      INFO      Running Multi-threaded CUDA GPU throughput tests
11:23:54    ThroughputExampleOptions      INFO      
11:23:54    ThroughputExampleOptions      INFO      Detector Options:
11:23:54    ThroughputExampleOptions      INFO      ├ Detector file:                          geometries/odd/odd-detray_geometry_detray.json
11:23:54    ThroughputExampleOptions      INFO      ├ Material file:                          geometries/odd/odd-detray_material_detray.json
11:23:54    ThroughputExampleOptions      INFO      ├ Surface grid file:                      geometries/odd/odd-detray_surface_grids_detray.json
11:23:54    ThroughputExampleOptions      INFO      ├ Use detray detector:                    true
11:23:54    ThroughputExampleOptions      INFO      └ Digitization file:                      geometries/odd/odd-digi-geometric-config.json
11:23:54    ThroughputExampleOptions      INFO      Magnetic Field Options:
11:23:54    ThroughputExampleOptions      INFO      ├ Read magnetic field from file:          false
11:23:54    ThroughputExampleOptions      INFO      ├ Magnetic field file:                    geometries/odd/odd-bfield.cvf
11:23:54    ThroughputExampleOptions      INFO      ├ Magnetic field file format:             binary
11:23:54    ThroughputExampleOptions      INFO      └ Magnetic field value:                   2 T
11:23:54    ThroughputExampleOptions      INFO      Input Data Options:
11:23:54    ThroughputExampleOptions      INFO      ├ Use ACTS geometry source:               true
11:23:54    ThroughputExampleOptions      INFO      ├ Input data format:                      csv
11:23:54    ThroughputExampleOptions      INFO      ├ Input directory:                        odd/odd-simulations-20240509/geant4_ttbar_mu200/
11:23:54    ThroughputExampleOptions      INFO      ├ Number of input events:                 10
11:23:54    ThroughputExampleOptions      INFO      └ Number of skipped events:               0
11:23:54    ThroughputExampleOptions      INFO      Clusterization Options:
11:23:54    ThroughputExampleOptions      INFO      ├ Threads per partition:                  256
11:23:54    ThroughputExampleOptions      INFO      ├ Target cells per thread:                8
11:23:54    ThroughputExampleOptions      INFO      ├ Max cells per thread:                   16
11:23:54    ThroughputExampleOptions      INFO      └ Scratch space multiplier:               256
11:23:54    ThroughputExampleOptions      INFO      Track Seeding Options:
11:23:54    ThroughputExampleOptions      INFO      Track Finding Options:
11:23:54    ThroughputExampleOptions      INFO      ├ Max branches per seed:                  10
11:23:54    ThroughputExampleOptions      INFO      ├ Max branches at surface:                2
11:23:54    ThroughputExampleOptions      INFO      ├ Track candidate range:                  3:100
11:23:54    ThroughputExampleOptions      INFO      ├ Min step length to next surface:        1.200000 mm
11:23:54    ThroughputExampleOptions      INFO      ├ Max step count to next surface:         100
11:23:54    ThroughputExampleOptions      INFO      ├ Max Chi2:                               10.000000
11:23:54    ThroughputExampleOptions      INFO      ├ Max holes per candidate:                3
11:23:54    ThroughputExampleOptions      INFO      ├ PDG number:                             13
11:23:54    ThroughputExampleOptions      INFO      └ Minimum total track momentum:           0.100000
11:23:54    ThroughputExampleOptions      INFO      Track Propagation Options:
11:23:54    ThroughputExampleOptions      INFO      ├ Navigation:
11:23:54    ThroughputExampleOptions      INFO      │ ├ Min mask tolerance:                   0.000010 mm
11:23:54    ThroughputExampleOptions      INFO      │ ├ Max mask tolerance:                   3.000000 mm
11:23:54    ThroughputExampleOptions      INFO      │ ├ Mask tolerance scalar:                0.050000
11:23:54    ThroughputExampleOptions      INFO      │ ├ Path tolerance:                       1.000000 um
11:23:54    ThroughputExampleOptions      INFO      │ ├ Overstep tolerance:                   -999.999939 um
11:23:54    ThroughputExampleOptions      INFO      │ └ Search window:                        0 x 0
11:23:54    ThroughputExampleOptions      INFO      ├ Transport:
11:23:54    ThroughputExampleOptions      INFO      │ ├ Min step size:                        0.000100 mm
11:23:54    ThroughputExampleOptions      INFO      │ ├ Runge-Kutta tolerance:                0.000100 mm
11:23:54    ThroughputExampleOptions      INFO      │ ├ Max step updates:                     10000
11:23:54    ThroughputExampleOptions      INFO      │ ├ Step size constraint:                 340282346638528859811704183484516925440.000000 mm
11:23:54    ThroughputExampleOptions      INFO      │ ├ Path limit:                           5.000000 m
11:23:54    ThroughputExampleOptions      INFO      │ ├ Min step size:                        0.000100 mm
11:23:54    ThroughputExampleOptions      INFO      │ ├ Enable Bethe energy loss:             true
11:23:54    ThroughputExampleOptions      INFO      │ ├ Enable covariance transport:          true
11:23:54    ThroughputExampleOptions      INFO      │ └ Covariance transport:
11:23:54    ThroughputExampleOptions      INFO      │   ├ Enable energy loss gradient:        false
11:23:54    ThroughputExampleOptions      INFO      │   └ Enable B-field gradient:            false
11:23:54    ThroughputExampleOptions      INFO      └ Geometry context:
11:23:54    ThroughputExampleOptions      INFO      Track Fitting Options:
11:23:54    ThroughputExampleOptions      INFO      ├ Number of iterations:                   1
11:23:54    ThroughputExampleOptions      INFO      ├ Particle hypothesis PDG:                13
11:23:54    ThroughputExampleOptions      INFO      ├ Covariance inflation factor:            1000.000000
11:23:54    ThroughputExampleOptions      INFO      ├ Barcode sequence size factor:           5
11:23:54    ThroughputExampleOptions      INFO      ├ Minimum capacity of barcode sequence:   100
11:23:54    ThroughputExampleOptions      INFO      └ Mask tolerance for the backward filter: 5.000000
11:23:54    ThroughputExampleOptions      INFO      Throughput Measurement Options:
11:23:54    ThroughputExampleOptions      INFO      ├ Cold run events:                        10
11:23:54    ThroughputExampleOptions      INFO      ├ Processed events:                       100
11:23:54    ThroughputExampleOptions      INFO      ├ Log file:                               
11:23:54    ThroughputExampleOptions      INFO      ├ Deterministic ordering:                 false
11:23:54    ThroughputExampleOptions      INFO      └ Random seed:                            time-based
11:23:54    ThroughputExampleOptions      INFO      Multi-Threading Options:
11:23:54    ThroughputExampleOptions      INFO      └ Number of CPU thread:                   1
11:23:54    ThroughputExampleOptions      INFO      
INFO: Reading detector files... Done
INFO: Building detector: Cylindrical detector from DD4hep blueprint... Done
INFO: Checking detector consistency...
WARNING: No material in detector
WARNING: No entries in volume finder
INFO: Consistency check: OK
INFO: Host detector construction complete
INFO: Reading detector files... Done
INFO: Building detector: Cylindrical detector from DD4hep blueprint... Done
INFO: Checking detector consistency...
WARNING: No entries in volume finder
INFO: Consistency check: OK
INFO: Host detector construction complete
11:24:03    ThroughputExample             WARNING   17082 duplicate cells found in /mnt/nvme0n1/byeo/projects/traccc/traccc/data/odd/odd-simulations-20240509/geant4_ttbar_mu200/event000000005-cells.csv
11:24:03    ThroughputExample             WARNING   16264 duplicate cells found in /mnt/nvme0n1/byeo/projects/traccc/traccc/data/odd/odd-simulations-20240509/geant4_ttbar_mu200/event000000001-cells.csv
11:24:03    ThroughputExample             WARNING   18654 duplicate cells found in /mnt/nvme0n1/byeo/projects/traccc/traccc/data/odd/odd-simulations-20240509/geant4_ttbar_mu200/event000000003-cells.csv
11:24:03    ThroughputExample             WARNING   18367 duplicate cells found in /mnt/nvme0n1/byeo/projects/traccc/traccc/data/odd/odd-simulations-20240509/geant4_ttbar_mu200/event000000006-cells.csv
11:24:03    ThroughputExample             WARNING   16623 duplicate cells found in /mnt/nvme0n1/byeo/projects/traccc/traccc/data/odd/odd-simulations-20240509/geant4_ttbar_mu200/event000000008-cells.csv
11:24:03    ThroughputExample             WARNING   17965 duplicate cells found in /mnt/nvme0n1/byeo/projects/traccc/traccc/data/odd/odd-simulations-20240509/geant4_ttbar_mu200/event000000007-cells.csv
11:24:03    ThroughputExample             WARNING   16881 duplicate cells found in /mnt/nvme0n1/byeo/projects/traccc/traccc/data/odd/odd-simulations-20240509/geant4_ttbar_mu200/event000000002-cells.csv
11:24:03    ThroughputExample             WARNING   18943 duplicate cells found in /mnt/nvme0n1/byeo/projects/traccc/traccc/data/odd/odd-simulations-20240509/geant4_ttbar_mu200/event000000000-cells.csv
11:24:03    ThroughputExample             WARNING   19031 duplicate cells found in /mnt/nvme0n1/byeo/projects/traccc/traccc/data/odd/odd-simulations-20240509/geant4_ttbar_mu200/event000000004-cells.csv
11:24:03    ThroughputExample             WARNING   19889 duplicate cells found in /mnt/nvme0n1/byeo/projects/traccc/traccc/data/odd/odd-simulations-20240509/geant4_ttbar_mu200/event000000009-cells.csv
Using CUDA device: NVIDIA A30 [id: 0, bus: 71, device: 0]
Using CUDA device: NVIDIA A30 [id: 0, bus: 71, device: 0]
Warm-up processing [==================================================] 100% [00m:00s]                                                                                                                               
Event processing   [==================================================] 100% [00m:00s]                                                                                                                               
11:24:35 AM ThroughputExample             INFO      Reconstructed track parameters: 510430
11:24:35 AM ThroughputExample             INFO      Time totals:                   File reading  1012 ms
11:24:35 AM ThroughputExample             INFO                  Warm-up processing  3049 ms
11:24:35 AM ThroughputExample             INFO                    Event processing  28801 ms
11:24:35 AM ThroughputExample             INFO      Throughput:            Warm-up processing  304.953 ms/event, 3.2792 events/s
11:24:35 AM ThroughputExample             INFO                    Event processing  288.011 ms/event, 3.47209 events/s

@beomki-yeo beomki-yeo force-pushed the add-ambiguity-solver-to-chain branch 2 times, most recently from b4437b2 to 154f3c7 Compare July 10, 2025 19:30
@beomki-yeo
Contributor Author

beomki-yeo commented Jul 24, 2025

The above benchmark may not be correct due to the bugs reported in #1083.
From eyeballing the profiling results, count_removable_tracks, rearrange_tracks, and remove_tracks are the main bottlenecks.

image

@beomki-yeo beomki-yeo marked this pull request as draft August 23, 2025 06:08
@beomki-yeo beomki-yeo marked this pull request as ready for review August 29, 2025 03:00
@beomki-yeo beomki-yeo force-pushed the add-ambiguity-solver-to-chain branch from 5e61612 to 4d31318 Compare August 29, 2025 23:09
@beomki-yeo beomki-yeo force-pushed the add-ambiguity-solver-to-chain branch from 4d31318 to 0817b70 Compare August 29, 2025 23:32
@beomki-yeo
Contributor Author

This is the output of the single-threaded chain:

This PR

Using CUDA device: NVIDIA A30 [id: 0, bus: 71, device: 0]
Warm-up processing [==================================================] 100% [00m:00s]                                                                                                                         
Event processing   [==================================================] 100% [00m:00s]                                                                                                                         
Reconstructed track parameters: 1469663
Time totals:
                  File reading  7548 ms
            Warm-up processing  32495 ms
              Event processing  9614 ms
Throughput:
            Warm-up processing  3249.58 ms/event, 0.307732 events/s
              Event processing  96.1429 ms/event, 10.4012 events/s

Main

Using CUDA device: NVIDIA A30 [id: 0, bus: 71, device: 0]
Warm-up processing [==================================================] 100% [00m:00s]                                                                                                                         
Event processing   [==================================================] 100% [00m:00s]                                                                                                                         
Reconstructed track parameters: 1919511
Time totals:
                  File reading  7586 ms
            Warm-up processing  958 ms
              Event processing  8343 ms
Throughput:
            Warm-up processing  95.8378 ms/event, 10.4343 events/s
              Event processing  83.4309 ms/event, 11.986 events/s

The ambiguity solver still increases the event processing time by about 15%, but I think this can be tolerated considering the importance of the ambiguity resolver. There is still room for improvement in the ambiguity solver as well.
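
The overhead figure can be checked directly against the two "Event processing" lines above. A minimal Python sketch, using only the ms/event values quoted in those outputs (the helper name `relative_increase` is purely illustrative):

```python
def relative_increase(new: float, old: float) -> float:
    """Fractional change of `new` relative to `old`."""
    return (new - old) / old

# ms/event values copied from the "Event processing" throughput lines above.
pr_ms_per_event = 96.1429
main_ms_per_event = 83.4309

overhead = relative_increase(pr_ms_per_event, main_ms_per_event)
print(f"Event processing overhead: {overhead:.1%}")

# events/s is the reciprocal of ms/event, scaled by 1000 ms per second.
print(f"This PR: {1000.0 / pr_ms_per_event:.4f} events/s")
print(f"Main:    {1000.0 / main_ms_per_event:.4f} events/s")
```

With these inputs the overhead comes out slightly above 15%, and the reciprocal conversion matches the events/s values printed by the throughput example.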

@beomki-yeo
Contributor Author

@stephenswat If you don't mind, could you also generate the plots for throughput and physical performance?

@stephenswat
Member

Running the compute plots for you now. The physics plots are based on the CUDA seeding example, so I cannot make those from this commit unfortunately; the ambiguity resolution would need to be added to the seeding example.

@stephenswat
Member

By the way, on the RTX A5000 we have at CERN the ambiguity resolution tests fail:

[----------] 2 tests from CUDALong/CUDAGreedyResolutionCompareToCPU
[ RUN      ] CUDALong/CUDAGreedyResolutionCompareToCPU.Comparison/0
Event: 0 Seed: 42
 Time for the cpu method 951 ms
 Time for the cuda method 53 ms
/mnt/ssd1/sswatman/traccc/tests/cuda/test_ambiguity_resolution.cpp:922: Failure
Expected equality of these values:
  n_tracks_cpu
    Which is: 1
  res_trk_cands_cuda.capacity()
    Which is: 9851

[  FAILED  ] CUDALong/CUDAGreedyResolutionCompareToCPU.Comparison/0, where GetParam() = (3, 10000, { 3, 500 }, 10000, true) (1235 ms)
[ RUN      ] CUDALong/CUDAGreedyResolutionCompareToCPU.Comparison/1
Event: 0 Seed: 42
 Time for the cpu method 976 ms
 Time for the cuda method 53 ms
/mnt/ssd1/sswatman/traccc/tests/cuda/test_ambiguity_resolution.cpp:922: Failure
Expected equality of these values:
  n_tracks_cpu
    Which is: 1
  res_trk_cands_cuda.capacity()
    Which is: 9861

[  FAILED  ] CUDALong/CUDAGreedyResolutionCompareToCPU.Comparison/1, where GetParam() = (3, 10000, { 3, 500 }, 10000, false) (1361 ms)

Is this something you have seen?

@stephenswat
Member

Performance summary

Here is a summary of the performance effects of this PR:

Graphical

Tabular

| Kernel | Reciprocal Throughput (1241c09) | Reciprocal Throughput (0817b70) | Delta | Parallelism (1241c09) | Parallelism (0817b70) |
| --- | --- | --- | --- | --- | --- |
| propagate_to_next_surface | 12.57 ms | 12.56 ms | -0.0% | 2.67 | 2.67 |
| fit_forward | 5.99 ms | 4.63 ms | -22.6% | 3.13 | 3.79 |
| rearrange_tracks | | 3.02 ms | nan | | 1.88 |
| fit_backward | 3.32 ms | 2.51 ms | -24.4% | 2.35 | 2.84 |
| sort_tracks_per_measurement | | 1.55 ms | nan | | 1.00 |
| find_tracks | 1.37 ms | 1.37 ms | 0.5% | 1.88 | 1.88 |
| ccl_kernel | 826.82 μs | 828.99 μs | 0.3% | 1.37 | 1.37 |
| count_doublets | 637.37 μs | 633.11 μs | -0.7% | 1.61 | 1.61 |
| count_triplets | 589.33 μs | 590.66 μs | 0.2% | 1.02 | 1.02 |
| find_doublets | 448.55 μs | 451.17 μs | 0.6% | 3.08 | 3.08 |
| Thrust::sort | 416.69 μs | 440.80 μs | 5.8% | 7.30 | 7.94 |
| update_status | | 340.72 μs | nan | | 3.83 |
| fill_inverted_ids | | 287.32 μs | nan | | 3.81 |
| find_triplets | 172.38 μs | 173.25 μs | 0.5% | 1.31 | 1.31 |
| block_inclusive_scan | | 123.00 μs | nan | | 11.41 |
| add_block_offset | | 90.86 μs | nan | | 11.45 |
| remove_duplicates | 56.18 μs | 56.31 μs | 0.2% | 17.33 | 17.34 |
| select_seeds | 53.31 μs | 53.18 μs | -0.2% | 1.34 | 1.34 |
| remove_tracks | | 45.17 μs | nan | | 192.00 |
| unknown | 20.52 μs | 40.61 μs | 97.9% | 2.25 | 2.87 |
| fit_prelude | 35.75 μs | 26.25 μs | -26.6% | 9.38 | 11.04 |
| populate_grid | 23.41 μs | 23.37 μs | -0.1% | 1.22 | 1.22 |
| estimate_track_params | 22.75 μs | 22.77 μs | 0.1% | 2.15 | 2.15 |
| fill_tracks_per_measurement | | 22.59 μs | nan | | 6.52 |
| count_grid_capacities | 22.14 μs | 22.09 μs | -0.3% | 1.22 | 1.22 |
| apply_interaction | 19.16 μs | 19.03 μs | -0.7% | 7.34 | 7.32 |
| update_triplet_weights | 15.01 μs | 15.12 μs | 0.7% | 1.27 | 1.27 |
| build_tracks | 12.80 μs | 12.81 μs | 0.1% | 6.34 | 6.34 |
| fill_finding_propagation_sort_keys | 12.65 μs | 12.73 μs | 0.6% | 7.99 | 8.00 |
| form_spacepoints | 12.35 μs | 12.28 μs | -0.5% | 1.48 | 1.49 |
| sort_updated_tracks | | 7.42 μs | nan | | 192.00 |
| fill_track_candidates | | 6.50 μs | nan | | 7.76 |
| reduce_triplet_counts | 6.31 μs | 6.32 μs | 0.1% | 3.08 | 3.08 |
| fill_finding_duplicate_removal_sort_keys | 5.35 μs | 5.36 μs | 0.3% | 22.03 | 22.09 |
| fill_vectors | | 5.31 μs | nan | | 6.39 |
| count_shared_measurements | | 4.60 μs | nan | | 6.42 |
| fill_unique_meas_id_map | | 2.44 μs | nan | | 1.55 |
| make_barcode_sequence | 1.01 μs | 1.01 μs | 0.2% | 3.83 | 3.83 |
| scan_block_offsets | | 870.26 ns | nan | | 1240.28 |
| DeviceReduceKernel | | 563.17 ns | nan | | 16.51 |
| fill_fitting_sort_keys | 386.81 ns | 372.41 ns | -3.7% | 9.48 | 11.80 |
| fill_prefix_sum | 171.92 ns | 172.02 ns | 0.1% | 341.30 | 341.30 |
| DeviceSelectSweepKernel | | 91.95 ns | nan | | 67.25 |
| DeviceReduceSingleTileKernel | | 27.08 ns | nan | | 256.01 |
| DeviceCompactInitKernel | | 3.25 ns | nan | | 768.00 |
| Total | 26.65 ms | 30.02 ms | 12.7% | 2.71 | 3.14 |
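
As a rough cross-check of the Delta column, here is a minimal Python sketch. It assumes that Delta is the relative change of the 0817b70 value with respect to the 1241c09 value; the report does not state the formula, and the helper name `delta_percent` is hypothetical. Because the table prints rounded values, recomputed percentages can differ from the printed ones in the last digit.

```python
def delta_percent(old: float, new: float) -> float:
    """Relative change from `old` to `new`, in percent (assumed formula)."""
    return 100.0 * (new - old) / old

# (kernel, 1241c09 value, 0817b70 value), copied from the table.
# Both values in a row share the same unit, so the unit cancels out.
rows = [
    ("fit_forward", 5.99, 4.63),   # ms
    ("fit_backward", 3.32, 2.51),  # ms
    ("unknown", 20.52, 40.61),     # microseconds
    ("Total", 26.65, 30.02),       # ms
]

for kernel, old, new in rows:
    print(f"{kernel:<14} {delta_percent(old, new):+.1f}%")
```

Kernels that appear in only one of the two commits have no baseline to compare against, which is consistent with the nan entries in the Delta column.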

Important

All metrics in this report are given as reciprocal throughput, not as wallclock runtime.

Warning

At least one kernel incurred a significant performance regression.

Note

This is an automated message produced upon the explicit request of a human being.

@beomki-yeo
Contributor Author

beomki-yeo commented Sep 8, 2025

> Is this something you have seen?

This type of failure in CUDALong usually comes from a synchronization issue or an uninitialized vector. So that I can reproduce the error, which CUDA and NVIDIA driver versions did you use for the test?

@beomki-yeo
Contributor Author

Just as an update, I found that the tests pass on an RTX A6000 with CUDA 12.4 and 12.6.

@beomki-yeo
Contributor Author

@stephenswat Would you be able to check whether your failure is related to #1159?
