Improve TLSPH performance on GPUs #968

efaulhaber · 2025-10-30T09:52:25Z

This PR implements many performance optimizations for TLSPH plus a few missing @inbounds for WCSPH.

Here is a benchmark with 1M particles in a perfect grid (representing a simulation with an unpacked RectangularShape) on an Nvidia RTX A4500 (basically a 3080/3070 Ti with double the VRAM). Note that I used a smoothing lenght factor of 2 instead of a realistic factor of sqrt(2). This is just the tlsph-tlsph interaction kernel with penalty force and viscosity.
All benchmarks use the PrecomputedNeighborhoodSearch (made GPU-compatible in trixi-framework/PointNeighbors.jl#10). For the FP64 coordinates, I used #956 and trixi-framework/PointNeighbors.jl#132.

Extra optimizations that are not cleanly implemented yet:

"transposed NL": transposing the neighbor lists in the PrecomputedNeighborhoodSearch to a SoA layout (first all first neighbors, then all second neighbors, etc.)
"sorted NL": sorting the neighbors of each particle in the neighbor lists by particle index.
"transposed arrays": transposing all relevant arrays (dv, current and initial coordinates, PK1).
"manual foreach_neighbor: instead of using foreach_point_neighbor, we can write @threaded for particle ... foreach_neighbor(...). This allows us to pull all reads for particle a (m_a, rho_a, etc.) out of the neighbor loop, and we can accumulate dv_particle over all neighbors and only write once per particle to dv.

The optimizations "transposed NL" and "sorted NL" together allows the GPU to coalesce the load of each n-th neighbor of particles in a warp.

times in ms	`interact!` 2D FP32	`interact!` 3D FP32	`interact!` 2D FP64 coords	`interact!` 3D FP64 coords	Notes
`main`	6.634	129.498	7.590	146.006
This PR + transposed NL + sorted NL	1.888	25.257	2.701	28.276	This is the target that I would like to implement. These optimizations provide the most benefit at the lowest cost.
This PR + transposed NL	4.147	63.825	4.749	78.639	To demonstrate that "sorted NL" is necessary
This PR + sorted NL	6.162	73.511	8.483	79.475	To demonstrate that "transposed NL" is necessary
This PR + transposed NL + sorted NL + transposed arrays	1.896	24.411	2.512	30.630	"transposed arrays" greatly reduces the "uncoalesced access" warnings in Nsight Compute, but does not provide a performance benefit
This PR + transposed NL + sorted NL + manual `foreach_neighbor`	1.669	14.563	2.221	23.914	"manual `foreach_neighbor`" provides a relatively small speedup, but significantly changes the current codebase.
This PR new + transposed NL + sorted NL	1.520	14.724			After running all the benchmarks above, I optimized the penalty force and viscosity on top of the second row above

Copilot

Pull Request Overview

This PR optimizes the Total Lagrangian SPH (TLSPH) implementation by precomputing PK1 / rho^2 to avoid repeated divisions in the interaction loop. The field pk1_corrected is renamed to pk1_rho2 to reflect this change, and several performance improvements are made throughout the codebase.

Renamed pk1_corrected to pk1_rho2 to store precomputed PK1 / rho^2 values
Changed @inline to @propagate_inbounds for bounds checking optimization in accessor functions
Replaced norm() with sqrt(dot()) and precomputed reciprocals to avoid repeated divisions

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
src/schemes/structure/total_lagrangian_sph/system.jl	Renamed field from `pk1_corrected` to `pk1_rho2`, updated accessor functions to use `@propagate_inbounds`, and modified stress computation to reconstruct PK1 from stored precomputed values
src/schemes/structure/total_lagrangian_sph/rhs.jl	Updated interaction loop to use precomputed `pk1_rho2` values, replaced `norm()` with `sqrt(dot())`, and added `@inbounds` annotations
src/schemes/structure/total_lagrangian_sph/penalty_force.jl	Changed `@inline` to `@propagate_inbounds` and precomputed `inv_current_distance` to avoid repeated divisions
src/schemes/fluid/weakly_compressible_sph/system.jl	Changed `@inline` to `@propagate_inbounds` for velocity and density accessor functions
src/general/abstract_system.jl	Updated `extract_svector` and `extract_smatrix` to use variadic arguments and added bounds checking
test/systems/tlsph_system.jl	Updated test to initialize `pk1_rho2` instead of `pk1_corrected` with correct precomputed values
test/schemes/structure/total_lagrangian_sph/rhs.jl	Updated mock system to use `pk1_rho2` with precomputed values matching the new implementation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/general/abstract_system.jl

src/schemes/structure/total_lagrangian_sph/system.jl

efaulhaber · 2025-10-31T17:31:15Z

/run-gpu-tests

src/general/abstract_system.jl

efaulhaber · 2025-11-03T10:30:34Z

/run-gpu-tests

Improve TLSPH performance on GPUs

a510ff6

efaulhaber force-pushed the tlsph-performance-gpu branch from 45c4e03 to a510ff6 Compare October 30, 2025 10:27

efaulhaber self-assigned this Oct 31, 2025

efaulhaber added performance gpu labels Oct 31, 2025

Merge branch 'main' into tlsph-performance-gpu

953f42d

efaulhaber requested a review from Copilot October 31, 2025 17:26

Copilot AI reviewed Oct 31, 2025

View reviewed changes

efaulhaber marked this pull request as ready for review October 31, 2025 17:31

efaulhaber requested review from LasNikas and svchb October 31, 2025 17:31

LasNikas reviewed Nov 1, 2025

View reviewed changes

src/general/abstract_system.jl Outdated Show resolved Hide resolved

svchb previously approved these changes Nov 1, 2025

View reviewed changes

Undo varargs change

dd28efc

efaulhaber dismissed svchb’s stale review via dd28efc November 3, 2025 10:13

Merge branch 'main' into tlsph-performance-gpu

cc4deaf

efaulhaber requested a review from LasNikas November 3, 2025 10:14

LasNikas approved these changes Nov 3, 2025

View reviewed changes

LasNikas enabled auto-merge (squash) November 3, 2025 10:28

LasNikas merged commit dff4fa6 into trixi-framework:main Nov 3, 2025
16 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve TLSPH performance on GPUs #968

Improve TLSPH performance on GPUs #968

Uh oh!

efaulhaber commented Oct 30, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

efaulhaber commented Oct 31, 2025

Uh oh!

Uh oh!

efaulhaber commented Nov 3, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Improve TLSPH performance on GPUs #968

Improve TLSPH performance on GPUs #968

Uh oh!

Conversation

efaulhaber commented Oct 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

efaulhaber commented Oct 31, 2025

Uh oh!

Uh oh!

efaulhaber commented Nov 3, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

efaulhaber commented Oct 30, 2025 •

edited

Loading