@krasznaa
Member

Following #1112, here I try to restore the overall throughput of the GPU applications. To cut to the chase, I see the following throughput with the code as it was just before #1112 was merged:

./build-old/bin/traccc_throughput_mt_cuda --input-directory /data/Acts/odd-simulations-20240509/geant4_ttbar_mu200/ --input-events=100 --deterministic --cpu-threads=4
...
Warm-up processing [==================================================] 100% [00m:00s]
Event processing   [==================================================] 100% [00m:00s]
05:20:04 PM ThroughputExample             INFO      Reconstructed track parameters: 2727261
05:20:04 PM ThroughputExample             INFO      Time totals:                   File reading  4601 ms
05:20:04 PM ThroughputExample             INFO                  Warm-up processing  987 ms
05:20:04 PM ThroughputExample             INFO                    Event processing  9778 ms
05:20:04 PM ThroughputExample             INFO      Throughput:            Warm-up processing  98.7184 ms/event, 10.1298 events/s
05:20:04 PM ThroughputExample             INFO                    Event processing  97.7832 ms/event, 10.2267 events/s

And with this PR's code I see:

./build-new/bin/traccc_throughput_mt_cuda --input-directory /data/Acts/odd-simulations-20240509/geant4_ttbar_mu200/ --input-events=100 --deterministic --cpu-threads=4
...
Warm-up processing [==================================================] 100% [00m:00s]
Event processing   [==================================================] 100% [00m:00s]
05:20:33 PM ThroughputExample             INFO      Reconstructed track parameters: 2727308
05:20:33 PM ThroughputExample             INFO      Time totals:                   File reading  4242 ms
05:20:33 PM ThroughputExample             INFO                  Warm-up processing  1013 ms
05:20:33 PM ThroughputExample             INFO                    Event processing  9917 ms
05:20:33 PM ThroughputExample             INFO      Throughput:            Warm-up processing  101.331 ms/event, 9.8686 events/s
05:20:33 PM ThroughputExample             INFO                    Event processing  99.1775 ms/event, 10.0829 events/s

There is unfortunately still a slight drop, which I intend to look into a bit more. However, the new version creates a more representative description of the reconstructed tracks in host code than was available before #1112. (The track-to-state jagged indices are copied back to the host in the new version, while in the old version all of that information was left on the device.)
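
To put a number on "slight", the drop can be worked out from the two "Event processing" lines quoted above (10.2267 events/s before, 10.0829 events/s with this PR). A minimal sketch of that arithmetic follows; the helper names are made up for illustration and are not part of traccc:

```cpp
#include <cassert>

// Convert a per-event processing time in milliseconds into a rate in events/s.
double events_per_second(double ms_per_event) {
    assert(ms_per_event > 0.);
    return 1000. / ms_per_event;
}

// Fractional throughput change when going from `before` to `after`
// (both in events/s). A negative value means a slowdown.
double relative_change(double before, double after) {
    assert(before > 0.);
    return (after - before) / before;
}
```

With the values from the logs, `relative_change(10.2267, 10.0829)` comes out at roughly -0.014, so the remaining drop is about 1.4% on this 100-event sample.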

Finally, about the PR:

  • Simplified the common code of the throughput applications so that they only use vecmem::host_memory_resource, leaving anything more specific to the full chain algorithm classes.
  • Modified the full chain algorithms to:
    • Create pinned host memory resources, with their own caching, internally;
    • Pass the cached host and device memory resources to all of their sub-algorithms for the intermediate object creation;
    • Copy the final objects first into a buffer in cached and pinned host memory, and then copy that with host-to-host transfers into "host containers" that use regular host memory. (This is the part that should be responsible for the remaining performance difference.)
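
To illustrate why putting a cache in front of the pinned host memory resource matters, here is a minimal sketch written against the standard `std::pmr::memory_resource` interface. It is not the vecmem or traccc implementation: `counting_resource` is a hypothetical stand-in for an expensive upstream (such as pinned host memory), `caching_resource` is a deliberately naive cache, and `upstream_hits_for` is a made-up helper for the demonstration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory_resource>
#include <utility>
#include <vector>

// Hypothetical stand-in for a pinned host memory resource. It forwards to
// new/delete and only counts how often the "expensive" allocation is hit.
class counting_resource : public std::pmr::memory_resource {
public:
    std::size_t n_allocations = 0;

private:
    void* do_allocate(std::size_t bytes, std::size_t align) override {
        ++n_allocations;
        return std::pmr::new_delete_resource()->allocate(bytes, align);
    }
    void do_deallocate(void* p, std::size_t bytes, std::size_t align) override {
        std::pmr::new_delete_resource()->deallocate(p, bytes, align);
    }
    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }
};

// Naive caching layer: freed blocks are kept and handed out again for later
// requests with the same size and alignment, instead of going upstream.
class caching_resource : public std::pmr::memory_resource {
public:
    explicit caching_resource(std::pmr::memory_resource& upstream)
        : m_upstream(upstream) {}
    ~caching_resource() override {
        // Return every cached block to the upstream resource.
        for (auto& [key, blocks] : m_free) {
            for (void* p : blocks) {
                m_upstream.deallocate(p, key.first, key.second);
            }
        }
    }

private:
    using key_t = std::pair<std::size_t, std::size_t>;  // (size, alignment)

    void* do_allocate(std::size_t bytes, std::size_t align) override {
        auto it = m_free.find(key_t{bytes, align});
        if (it != m_free.end() && !it->second.empty()) {
            void* p = it->second.back();
            it->second.pop_back();
            return p;  // Served from the cache, no upstream call.
        }
        return m_upstream.allocate(bytes, align);
    }
    void do_deallocate(void* p, std::size_t bytes, std::size_t align) override {
        m_free[key_t{bytes, align}].push_back(p);
    }
    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }

    std::pmr::memory_resource& m_upstream;
    std::map<key_t, std::vector<void*>> m_free;
};

// Allocate and free one buffer of `bytes` per "event" through the cache and
// report how many allocations actually reached the upstream resource.
std::size_t upstream_hits_for(std::size_t n_events, std::size_t bytes) {
    counting_resource pinned;
    {
        caching_resource cached(pinned);
        for (std::size_t i = 0; i < n_events; ++i) {
            void* p = cached.allocate(bytes);
            assert(p != nullptr);
            cached.deallocate(p, bytes);
        }
    }  // The cache is destroyed here, before the upstream resource.
    return pinned.n_allocations;
}
```

In this toy setup `upstream_hits_for(100, 1024)` reaches the upstream resource only once, since every later "event" reuses the cached block. Real caching resources handle varying sizes and thread safety, which this sketch leaves out.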

But as I said at the start, I'll still look a bit more at this, to see if it could be made a little faster / more efficient.

So that it would be left up to the individual full-chain algorithms
to do with their host memory handling as they wished.
@krasznaa krasznaa added improvement Improve an existing feature examples Changes to the examples labels Aug 18, 2025
@krasznaa
Member Author

To add, the current main branch (or rather the version that this PR's branch is based on, since the branch has become out of date in the meantime) produces the following:

./build-current/bin/traccc_throughput_mt_cuda --input-directory /data/Acts/odd-simulations-20240509/geant4_ttbar_mu200/ --input-events=100 --deterministic --cpu-threads=4
...
Warm-up processing [==================================================] 100% [00m:00s]
Event processing   [==================================================] 100% [00m:00s]
05:45:13 PM ThroughputExample             INFO      Reconstructed track parameters: 2727265
05:45:13 PM ThroughputExample             INFO      Time totals:                   File reading  4411 ms
05:45:13 PM ThroughputExample             INFO                  Warm-up processing  1866 ms
05:45:13 PM ThroughputExample             INFO                    Event processing  16169 ms
05:45:13 PM ThroughputExample             INFO      Throughput:            Warm-up processing  186.646 ms/event, 5.35775 events/s
05:45:13 PM ThroughputExample             INFO                    Event processing  161.693 ms/event, 6.18455 events/s

As discussed in #1112 earlier. 🤔

Member

@stephenswat stephenswat left a comment

Good!

@stephenswat stephenswat enabled auto-merge (squash) August 18, 2025 17:18
@stephenswat stephenswat merged commit 2490295 into acts-project:main Aug 18, 2025
25 of 29 checks passed
@krasznaa krasznaa deleted the ThroughputUpdates-main-20250818 branch August 19, 2025 07:04