Skip to content

Conversation

@efaulhaber
Copy link
Member

@efaulhaber efaulhaber commented Oct 30, 2025

This PR implements many performance optimizations for TLSPH plus a few missing @inbounds for WCSPH.

Here is a benchmark with 1M particles in a perfect grid (representing a simulation with an unpacked RectangularShape) on an Nvidia RTX A4500 (basically a 3080/3070 Ti with double the VRAM). Note that I used a smoothing lenght factor of 2 instead of a realistic factor of sqrt(2). This is just the tlsph-tlsph interaction kernel with penalty force and viscosity.
All benchmarks use the PrecomputedNeighborhoodSearch (made GPU-compatible in trixi-framework/PointNeighbors.jl#10). For the FP64 coordinates, I used #956 and trixi-framework/PointNeighbors.jl#132.

Extra optimizations that are not cleanly implemented yet:

  • "transposed NL": transposing the neighbor lists in the PrecomputedNeighborhoodSearch to a SoA layout (first all first neighbors, then all second neighbors, etc.)
  • "sorted NL": sorting the neighbors of each particle in the neighbor lists by particle index.
  • "transposed arrays": transposing all relevant arrays (dv, current and initial coordinates, PK1).
  • "manual foreach_neighbor: instead of using foreach_point_neighbor, we can write @threaded for particle ... foreach_neighbor(...). This allows us to pull all reads for particle a (m_a, rho_a, etc.) out of the neighbor loop, and we can accumulate dv_particle over all neighbors and only write once per particle to dv.

The optimizations "transposed NL" and "sorted NL" together allows the GPU to coalesce the load of each n-th neighbor of particles in a warp.

times in ms interact! 2D FP32 interact! 3D FP32 interact! 2D FP64 coords interact! 3D FP64 coords Notes
main 6.634 129.498 7.590 146.006
This PR + transposed NL + sorted NL 1.888 25.257 2.701 28.276 This is the target that I would like to implement. These optimizations provide the most benefit at the lowest cost.
This PR + transposed NL 4.147 63.825 4.749 78.639 To demonstrate that "sorted NL" is necessary
This PR + sorted NL 6.162 73.511 8.483 79.475 To demonstrate that "transposed NL" is necessary
This PR + transposed NL + sorted NL + transposed arrays 1.896 24.411 2.512 30.630 "transposed arrays" greatly reduces the "uncoalesced access" warnings in Nsight Compute, but does not provide a performance benefit
This PR + transposed NL + sorted NL + manual foreach_neighbor 1.669 14.563 2.221 23.914 "manual foreach_neighbor" provides a relatively small speedup, but significantly changes the current codebase.
This PR new + transposed NL + sorted NL 1.520 14.724 After running all the benchmarks above, I optimized the penalty force and viscosity on top of the second row above

@efaulhaber efaulhaber requested a review from Copilot October 31, 2025 17:26
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR optimizes the Total Lagrangian SPH (TLSPH) implementation by precomputing PK1 / rho^2 to avoid repeated divisions in the interaction loop. The field pk1_corrected is renamed to pk1_rho2 to reflect this change, and several performance improvements are made throughout the codebase.

  • Renamed pk1_corrected to pk1_rho2 to store precomputed PK1 / rho^2 values
  • Changed @inline to @propagate_inbounds for bounds checking optimization in accessor functions
  • Replaced norm() with sqrt(dot()) and precomputed reciprocals to avoid repeated divisions

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/schemes/structure/total_lagrangian_sph/system.jl Renamed field from pk1_corrected to pk1_rho2, updated accessor functions to use @propagate_inbounds, and modified stress computation to reconstruct PK1 from stored precomputed values
src/schemes/structure/total_lagrangian_sph/rhs.jl Updated interaction loop to use precomputed pk1_rho2 values, replaced norm() with sqrt(dot()), and added @inbounds annotations
src/schemes/structure/total_lagrangian_sph/penalty_force.jl Changed @inline to @propagate_inbounds and precomputed inv_current_distance to avoid repeated divisions
src/schemes/fluid/weakly_compressible_sph/system.jl Changed @inline to @propagate_inbounds for velocity and density accessor functions
src/general/abstract_system.jl Updated extract_svector and extract_smatrix to use variadic arguments and added bounds checking
test/systems/tlsph_system.jl Updated test to initialize pk1_rho2 instead of pk1_corrected with correct precomputed values
test/schemes/structure/total_lagrangian_sph/rhs.jl Updated mock system to use pk1_rho2 with precomputed values matching the new implementation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@efaulhaber efaulhaber marked this pull request as ready for review October 31, 2025 17:31
@efaulhaber
Copy link
Member Author

/run-gpu-tests

@efaulhaber efaulhaber requested review from LasNikas and svchb October 31, 2025 17:31
svchb
svchb previously approved these changes Nov 1, 2025
@efaulhaber efaulhaber requested a review from LasNikas November 3, 2025 10:14
@LasNikas LasNikas enabled auto-merge (squash) November 3, 2025 10:28
@efaulhaber
Copy link
Member Author

/run-gpu-tests

@LasNikas LasNikas merged commit dff4fa6 into trixi-framework:main Nov 3, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants