- 
                Notifications
    
You must be signed in to change notification settings  - Fork 15
 
Improve TLSPH performance on GPUs #968
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve TLSPH performance on GPUs #968
Conversation
45c4e03    to
    a510ff6      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR optimizes the Total Lagrangian SPH (TLSPH) implementation by precomputing PK1 / rho^2 to avoid repeated divisions in the interaction loop. The field pk1_corrected is renamed to pk1_rho2 to reflect this change, and several performance improvements are made throughout the codebase.
- Renamed 
pk1_correctedtopk1_rho2to store precomputedPK1 / rho^2values - Changed 
@inlineto@propagate_inboundsfor bounds checking optimization in accessor functions - Replaced 
norm()withsqrt(dot())and precomputed reciprocals to avoid repeated divisions 
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description | 
|---|---|
| src/schemes/structure/total_lagrangian_sph/system.jl | Renamed field from pk1_corrected to pk1_rho2, updated accessor functions to use @propagate_inbounds, and modified stress computation to reconstruct PK1 from stored precomputed values | 
| src/schemes/structure/total_lagrangian_sph/rhs.jl | Updated interaction loop to use precomputed pk1_rho2 values, replaced norm() with sqrt(dot()), and added @inbounds annotations | 
| src/schemes/structure/total_lagrangian_sph/penalty_force.jl | Changed @inline to @propagate_inbounds and precomputed inv_current_distance to avoid repeated divisions | 
| src/schemes/fluid/weakly_compressible_sph/system.jl | Changed @inline to @propagate_inbounds for velocity and density accessor functions | 
| src/general/abstract_system.jl | Updated extract_svector and extract_smatrix to use variadic arguments and added bounds checking | 
| test/systems/tlsph_system.jl | Updated test to initialize pk1_rho2 instead of pk1_corrected with correct precomputed values | 
| test/schemes/structure/total_lagrangian_sph/rhs.jl | Updated mock system to use pk1_rho2 with precomputed values matching the new implementation | 
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| 
           /run-gpu-tests  | 
    
| 
           /run-gpu-tests  | 
    
This PR implements many performance optimizations for TLSPH plus a few missing
@inboundsfor WCSPH.Here is a benchmark with 1M particles in a perfect grid (representing a simulation with an unpacked
RectangularShape) on an Nvidia RTX A4500 (basically a 3080/3070 Ti with double the VRAM). Note that I used a smoothing lenght factor of 2 instead of a realistic factor ofsqrt(2). This is just the tlsph-tlsph interaction kernel with penalty force and viscosity.All benchmarks use the
PrecomputedNeighborhoodSearch(made GPU-compatible in trixi-framework/PointNeighbors.jl#10). For the FP64 coordinates, I used #956 and trixi-framework/PointNeighbors.jl#132.Extra optimizations that are not cleanly implemented yet:
PrecomputedNeighborhoodSearchto a SoA layout (first all first neighbors, then all second neighbors, etc.)dv, current and initial coordinates, PK1).foreach_neighbor: instead of usingforeach_point_neighbor, we can write@threaded for particle ... foreach_neighbor(...). This allows us to pull all reads for particlea(m_a,rho_a, etc.) out of the neighbor loop, and we can accumulatedv_particleover all neighbors and only write once per particle todv.The optimizations "transposed NL" and "sorted NL" together allows the GPU to coalesce the load of each n-th neighbor of particles in a warp.
interact!2D FP32interact!3D FP32interact!2D FP64 coordsinteract!3D FP64 coordsmainforeach_neighborforeach_neighbor" provides a relatively small speedup, but significantly changes the current codebase.