Optimization Opportunity: Performance Bottleneck in MPI Synchronization Functions
Description
The sync_step function appears to exhibit performance inefficiencies when using MPI with a 3D mesh. The primary issue is the excessive time spent in MPI gather operations, which accounts for a significant portion of the simulation runtime (15% to 30%, depending on the simulation setup).
The same behavior can be observed in the sync_prepare_next function, although with a smaller impact.
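For context, a gather-based synchronization step typically looks like the minimal sketch below. This is an illustration only, not the project's actual sync_step; the function and buffer names (sync_step_gather, local_contribs, gathered) are hypothetical, and it assumes per-rank contributions are collected on rank 0 once per time step.

```cpp
// Hypothetical sketch of a gather-based synchronization step; NOT the
// project's actual sync_step. Rank 0 collects variable-sized per-rank
// contribution buffers once per time step, so the blocking collective
// below is where this kind of cost accumulates.
#include <mpi.h>
#include <vector>

void sync_step_gather(const std::vector<double>& local_contribs,
                      std::vector<double>& gathered, MPI_Comm comm)
{
  int rank = 0, size = 0;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  // Each rank may hold a different number of contributions.
  const int local_count = static_cast<int>(local_contribs.size());
  std::vector<int> counts(size), displs(size);
  MPI_Gather(&local_count, 1, MPI_INT, counts.data(), 1, MPI_INT, 0, comm);

  if (rank == 0)
  {
    int offset = 0;
    for (int i = 0; i < size; ++i) { displs[i] = offset; offset += counts[i]; }
    gathered.resize(offset);
  }

  // Blocking collective executed at every time step (30,001 steps in the
  // reference run); all ranks wait here until the gather completes.
  MPI_Gatherv(local_contribs.data(), local_count, MPI_DOUBLE,
              gathered.data(), counts.data(), displs.data(), MPI_DOUBLE,
              0, comm);
}
```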
Observed Behavior
- Relative Impact: As the number of particles increases, the relative impact of this bottleneck decreases. However, the absolute time spent in MPI gather operations scales linearly with the number of time steps.
- Bottleneck Context: This issue is currently the main bottleneck in simulations using MPI.
Simulation Reference Details
- Model: Monod
- Mesh Size: 432 compartments
- Particles: 500,000
- Time Steps: 30,001
No feeding; 5 g/L glucose liquid concentration everywhere.
See the Data section below.
Expected Behavior
Optimization of the sync_step function should reduce the time spent in MPI gather operations, thereby improving overall simulation performance.
Steps to Reproduce
- Run a simulation with the following configuration:
- Model: Monod
- 3D Mesh: 432 compartments
- Particles: 500,000
- Time Steps: 30,001
- Monitor the time spent on MPI gather operations (a minimal timing sketch is given after this list).
- Observe the relative and absolute time spent in sync_step and the scaling behavior with the number of time steps.
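If a full profiler is not available, the gather time can be monitored with plain MPI_Wtime brackets around the call and a reduction at the end of the run. A minimal sketch with hypothetical names (timed_gather, report_gather_time):

```cpp
// Minimal timing sketch (hypothetical names): accumulate the time spent
// in the gather across all time steps, then report the maximum over ranks.
#include <mpi.h>
#include <cstdio>

static double gather_seconds = 0.0;  // accumulated over the whole run

void timed_gather(const double* sendbuf, int count, double* recvbuf,
                  MPI_Comm comm)
{
  const double t0 = MPI_Wtime();
  MPI_Gather(sendbuf, count, MPI_DOUBLE, recvbuf, count, MPI_DOUBLE, 0, comm);
  gather_seconds += MPI_Wtime() - t0;
}

void report_gather_time(MPI_Comm comm)
{
  double max_seconds = 0.0;  // the slowest rank dominates a blocking collective
  MPI_Reduce(&gather_seconds, &max_seconds, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
  int rank = 0;
  MPI_Comm_rank(comm, &rank);
  if (rank == 0)
    std::printf("total MPI_Gather time (max over ranks): %f s\n", max_seconds);
}
```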
How to test
If new MPI logic (e.g., round-robin receiving) is implemented, a unit test must be set up; one possible shape of that logic is sketched below.
To validate the main logic, run the same simulation with the shared-memory implementation; the results should be identical.
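As a starting point for such a test, below is a minimal sketch of one possible reading of round-robin receiving: rank 0 receives one buffer per worker rank in a fixed cyclic order instead of relying on a single collective. All names (round_robin_receive, per_rank) are hypothetical, and the fixed count per rank is an assumption made to keep the example short.

```cpp
// Hypothetical sketch of round-robin receiving: rank 0 receives one
// fixed-size buffer from each worker rank in a deterministic cyclic
// order (1, 2, ..., size-1). A unit test can assert that the assembled
// result matches the output of the equivalent MPI_Gather call.
#include <mpi.h>
#include <vector>

void round_robin_receive(std::vector<std::vector<double>>& per_rank,
                         int count, MPI_Comm comm)
{
  int rank = 0, size = 0;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  if (rank == 0)
  {
    per_rank.assign(size, std::vector<double>(count));
    for (int src = 1; src < size; ++src)  // fixed order: easy to test
      MPI_Recv(per_rank[src].data(), count, MPI_DOUBLE, src, /*tag=*/0,
               comm, MPI_STATUS_IGNORE);
  }
  else
  {
    // Example payload: each worker sends `count` copies of its rank id.
    std::vector<double> local(count, static_cast<double>(rank));
    MPI_Send(local.data(), count, MPI_DOUBLE, 0, /*tag=*/0, comm);
  }
}
```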
Suggested Improvements
Investigate and implement possible optimizations for the sync_step and sync_prepare_next functions, specifically targeting the MPI gather operations; one candidate direction is sketched below.
With the current (v0.2) implementation, the bottleneck is not located in the Simulation library but in the Core.
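One candidate direction, sketched below under the assumption that part of the per-step work is independent of the gathered data, is to switch to the non-blocking MPI_Igatherv (MPI-3) and overlap communication with computation. The function name sync_step_overlapped is hypothetical.

```cpp
// Hypothetical sketch: overlap the gather with independent per-step work
// by using the non-blocking MPI_Igatherv instead of a blocking MPI_Gatherv.
#include <mpi.h>
#include <vector>

void sync_step_overlapped(const std::vector<double>& local,
                          std::vector<double>& gathered,
                          const std::vector<int>& counts,
                          const std::vector<int>& displs, MPI_Comm comm)
{
  MPI_Request req;
  MPI_Igatherv(local.data(), static_cast<int>(local.size()), MPI_DOUBLE,
               gathered.data(), counts.data(), displs.data(), MPI_DOUBLE,
               0, comm, &req);

  // ... per-rank work that does not depend on `gathered` runs here ...

  MPI_Wait(&req, MPI_STATUS_IGNORE);  // complete before reading `gathered`
}
```

Whether this helps depends on how much per-step work is actually independent of the gathered data; if the contributions are ultimately summed, replacing the gather with a reduction (MPI_Reduce or MPI_Iallreduce) could also cut the message volume.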
Additional Information
This bottleneck is critical for simulations involving large-scale particle systems and long time series, as it directly impacts scalability and efficiency.
Data
Observations for v0.2.1
Columns: Total Time (s), Call Count, Avg. Time per Call (s), % Total Time in Kernels, % Total Program Time
Regions:
- cycleProcess                  (REGION)  334.920459   30001  0.011164  100.238762  67.777228
- sync_step                     (REGION)   65.529022   30001  0.002184   19.612263  13.260986
- host:sync_update              (REGION)   57.156265   30001  0.001905   17.106370  11.566607
- performStep                   (REGION)   49.486652   60002  0.000825   14.810922  10.014521
- host:handle_export            (REGION)    2.475897   30001  0.000083    0.741014   0.501043
- get_particle_properties_opti  (REGION)    1.379025      50  0.027580    0.412730   0.279071
- write_particle_data           (REGION)    1.079375      50  0.021587    0.323047   0.218431
- sync_prepare_next             (REGION)    0.871335   30001  0.000029    0.260783   0.176330
- host:update_flow::advance     (REGION)    0.374768   30001  0.000012    0.112165   0.075841
- host:reduceContribs           (REGION)    0.068611   30001  0.000002    0.020535   0.013885
Kernels:
- mc_cycle_process              (ParFor)  331.776612   30001  0.011059   99.297836  67.141013
- get_particle_properties       (ParFor)    1.354280      50  0.027086    0.405324   0.274063
Same simulation with 10x the number of time steps (300,001):
Regions:
- cycleProcess                  (REGION) 2956.599333  300001  0.009855  100.553325  73.198290
- host:sync_update              (REGION)  540.623581  300001  0.001802   18.386495  13.384540
- performStep                   (REGION)  469.075315  600002  0.000782   15.953153  11.613177
- sync_step                     (REGION)  259.134276  300001  0.000864    8.813103   6.415542
- sync_prepare_next             (REGION)    6.804463  300001  0.000023    0.231418   0.168462
- host:update_flow::advance     (REGION)    3.197432  300001  0.000011    0.108744   0.079161
- host:handle_export            (REGION)    2.373725  300001  0.000008    0.080730   0.058768
- get_particle_properties_opti  (REGION)    1.271187      50  0.025424    0.043233   0.031472
- write_particle_data           (REGION)    1.040060      50  0.020801    0.035372   0.025749
- host:reduceContribs           (REGION)    0.570312  300001  0.000002    0.019396   0.014120
Kernels:
- mc_cycle_process                                     (ParFor) 2931.099973  300001  0.009770  99.686097  72.566987
- Kokkos::ScatterView::ReduceDuplicates [duplicated_]  (ParFor)    1.791561  300001  0.000006   0.060931   0.044355
- get_particle_properties                              (ParFor)    1.247184      50  0.024944   0.042416   0.030877
- Kokkos::ScatterView::ResetDuplicates [duplicated_]   (ParFor)    1.178054  300001  0.000004   0.040065   0.029166
- InsertNew                                            (ParFor)    1.120450  600002  0.000002   0.038106   0.027740