Optimization Opportunity: Performance Bottleneck in MPI Synchronization Functions
Description
The sync_step function appears to exhibit performance inefficiencies when using MPI with a 3D mesh. The primary issue is the excessive time spent in MPI gather operations, which accounts for a significant portion of the simulation runtime (15% to 30%, depending on the simulation setup).
The same behavior can be observed in the sync_prepare_next function, although with a smaller impact.
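For context, a gather-based synchronization step typically looks like the minimal sketch below. This is an illustration only, not the project's actual sync_step; the function and buffer names (sync_step_gather, local_contribs, gathered) are hypothetical, and it assumes per-rank contributions are collected on rank 0 once per time step.

```cpp
// Hypothetical sketch of a gather-based synchronization step; NOT the
// project's actual sync_step. Rank 0 collects variable-sized per-rank
// contribution buffers once per time step, so the blocking collective
// below is where this kind of cost accumulates.
#include <mpi.h>
#include <vector>

void sync_step_gather(const std::vector<double>& local_contribs,
                      std::vector<double>& gathered, MPI_Comm comm)
{
  int rank = 0, size = 0;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  // Each rank may hold a different number of contributions.
  const int local_count = static_cast<int>(local_contribs.size());
  std::vector<int> counts(size), displs(size);
  MPI_Gather(&local_count, 1, MPI_INT, counts.data(), 1, MPI_INT, 0, comm);

  if (rank == 0)
  {
    int offset = 0;
    for (int i = 0; i < size; ++i) { displs[i] = offset; offset += counts[i]; }
    gathered.resize(offset);
  }

  // Blocking collective executed at every time step (30,001 steps in the
  // reference run); all ranks wait here until the gather completes.
  MPI_Gatherv(local_contribs.data(), local_count, MPI_DOUBLE,
              gathered.data(), counts.data(), displs.data(), MPI_DOUBLE,
              0, comm);
}
```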
Observed Behavior
- Relative Impact: As the number of particles increases, the relative impact of this bottleneck decreases. However, the absolute time spent in MPI gather operations scales linearly with the number of time steps.
- Bottleneck Context: This issue is currently the main bottleneck in simulations using MPI.
Simulation Reference Details
- Model: Monod
- Mesh Size: 432 compartments
- Particles: 500,000
- Time Steps: 30,001
No feeding; 5 g/L glucose liquid concentration everywhere.
See the Data section below.
Expected Behavior
Optimization of the sync_step function should reduce the time spent in MPI gather operations, thereby improving overall simulation performance.
Steps to Reproduce
- Run a simulation with the following configuration:
- Model: Monod
- 3D Mesh: 432 compartments
- Particles: 500,000
- Time Steps: 30,001
- Monitor the time spent on MPI gather operations (a minimal timing sketch is given after this list).
- Observe the relative and absolute time spent in sync_step and the scaling behavior with the number of time steps.
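If a full profiler is not available, the gather time can be monitored with plain MPI_Wtime brackets around the call and a reduction at the end of the run. A minimal sketch with hypothetical names (timed_gather, report_gather_time):

```cpp
// Minimal timing sketch (hypothetical names): accumulate the time spent
// in the gather across all time steps, then report the maximum over ranks.
#include <mpi.h>
#include <cstdio>

static double gather_seconds = 0.0;  // accumulated over the whole run

void timed_gather(const double* sendbuf, int count, double* recvbuf,
                  MPI_Comm comm)
{
  const double t0 = MPI_Wtime();
  MPI_Gather(sendbuf, count, MPI_DOUBLE, recvbuf, count, MPI_DOUBLE, 0, comm);
  gather_seconds += MPI_Wtime() - t0;
}

void report_gather_time(MPI_Comm comm)
{
  double max_seconds = 0.0;  // the slowest rank dominates a blocking collective
  MPI_Reduce(&gather_seconds, &max_seconds, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
  int rank = 0;
  MPI_Comm_rank(comm, &rank);
  if (rank == 0)
    std::printf("total MPI_Gather time (max over ranks): %f s\n", max_seconds);
}
```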
How to test
If new MPI logic (e.g., round-robin receiving) is implemented, a unit test must be set up; one possible shape of that logic is sketched below.
To validate the main logic, run the same simulation with the shared-memory implementation; the results should be identical.
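As a starting point for such a test, below is a minimal sketch of one possible reading of round-robin receiving: rank 0 receives one buffer per worker rank in a fixed cyclic order instead of relying on a single collective. All names (round_robin_receive, per_rank) are hypothetical, and the fixed count per rank is an assumption made to keep the example short.

```cpp
// Hypothetical sketch of round-robin receiving: rank 0 receives one
// fixed-size buffer from each worker rank in a deterministic cyclic
// order (1, 2, ..., size-1). A unit test can assert that the assembled
// result matches the output of the equivalent MPI_Gather call.
#include <mpi.h>
#include <vector>

void round_robin_receive(std::vector<std::vector<double>>& per_rank,
                         int count, MPI_Comm comm)
{
  int rank = 0, size = 0;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  if (rank == 0)
  {
    per_rank.assign(size, std::vector<double>(count));
    for (int src = 1; src < size; ++src)  // fixed order: easy to test
      MPI_Recv(per_rank[src].data(), count, MPI_DOUBLE, src, /*tag=*/0,
               comm, MPI_STATUS_IGNORE);
  }
  else
  {
    // Example payload: each worker sends `count` copies of its rank id.
    std::vector<double> local(count, static_cast<double>(rank));
    MPI_Send(local.data(), count, MPI_DOUBLE, 0, /*tag=*/0, comm);
  }
}
```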
Suggested Improvements
Investigate and implement possible optimizations for the sync_step and sync_prepare_next functions, specifically targeting the MPI gather operations; one candidate direction is sketched below.
With the current (v0.2) implementation, the bottleneck is not located in the Simulation library but in the Core.
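One candidate direction, sketched below under the assumption that part of the per-step work is independent of the gathered data, is to switch to the non-blocking MPI_Igatherv (MPI-3) and overlap communication with computation. The function name sync_step_overlapped is hypothetical.

```cpp
// Hypothetical sketch: overlap the gather with independent per-step work
// by using the non-blocking MPI_Igatherv instead of a blocking MPI_Gatherv.
#include <mpi.h>
#include <vector>

void sync_step_overlapped(const std::vector<double>& local,
                          std::vector<double>& gathered,
                          const std::vector<int>& counts,
                          const std::vector<int>& displs, MPI_Comm comm)
{
  MPI_Request req;
  MPI_Igatherv(local.data(), static_cast<int>(local.size()), MPI_DOUBLE,
               gathered.data(), counts.data(), displs.data(), MPI_DOUBLE,
               0, comm, &req);

  // ... per-rank work that does not depend on `gathered` runs here ...

  MPI_Wait(&req, MPI_STATUS_IGNORE);  // complete before reading `gathered`
}
```

Whether this helps depends on how much per-step work is actually independent of the gathered data; if the contributions are ultimately summed, replacing the gather with a reduction (MPI_Reduce or MPI_Iallreduce) could also cut the message volume.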
Additional Information
This bottleneck is critical for simulations involving large-scale particle systems and long time series, as it directly impacts scalability and efficiency.
Data
Observations for v0.2.1
Columns: Total Time (s), Call Count, Avg. Time per Call (s), % Total Time in Kernels, % Total Program Time
Regions:
- cycleProcess                  (REGION)  334.920459   30001  0.011164  100.238762  67.777228
- sync_step                     (REGION)   65.529022   30001  0.002184   19.612263  13.260986
- host:sync_update              (REGION)   57.156265   30001  0.001905   17.106370  11.566607
- performStep                   (REGION)   49.486652   60002  0.000825   14.810922  10.014521
- host:handle_export            (REGION)    2.475897   30001  0.000083    0.741014   0.501043
- get_particle_properties_opti  (REGION)    1.379025      50  0.027580    0.412730   0.279071
- write_particle_data           (REGION)    1.079375      50  0.021587    0.323047   0.218431
- sync_prepare_next             (REGION)    0.871335   30001  0.000029    0.260783   0.176330
- host:update_flow::advance     (REGION)    0.374768   30001  0.000012    0.112165   0.075841
- host:reduceContribs           (REGION)    0.068611   30001  0.000002    0.020535   0.013885
Kernels:
- mc_cycle_process              (ParFor)  331.776612   30001  0.011059   99.297836  67.141013
- get_particle_properties       (ParFor)    1.354280      50  0.027086    0.405324   0.274063
Same simulation with 10x the number of time steps (300,001):
Regions:
- cycleProcess                  (REGION) 2956.599333  300001  0.009855  100.553325  73.198290
- host:sync_update              (REGION)  540.623581  300001  0.001802   18.386495  13.384540
- performStep                   (REGION)  469.075315  600002  0.000782   15.953153  11.613177
- sync_step                     (REGION)  259.134276  300001  0.000864    8.813103   6.415542
- sync_prepare_next             (REGION)    6.804463  300001  0.000023    0.231418   0.168462
- host:update_flow::advance     (REGION)    3.197432  300001  0.000011    0.108744   0.079161
- host:handle_export            (REGION)    2.373725  300001  0.000008    0.080730   0.058768
- get_particle_properties_opti  (REGION)    1.271187      50  0.025424    0.043233   0.031472
- write_particle_data           (REGION)    1.040060      50  0.020801    0.035372   0.025749
- host:reduceContribs           (REGION)    0.570312  300001  0.000002    0.019396   0.014120
Kernels:
- mc_cycle_process                                     (ParFor) 2931.099973  300001  0.009770  99.686097  72.566987
- Kokkos::ScatterView::ReduceDuplicates [duplicated_]  (ParFor)    1.791561  300001  0.000006   0.060931   0.044355
- get_particle_properties                              (ParFor)    1.247184      50  0.024944   0.042416   0.030877
- Kokkos::ScatterView::ResetDuplicates [duplicated_]   (ParFor)    1.178054  300001  0.000004   0.040065   0.029166
- InsertNew                                            (ParFor)    1.120450  600002  0.000002   0.038106   0.027740