Meeting 15.11: Proper Implementation


Progress

  • Patrick
    • Time the runtime of MPI calls
    • Run multiple iterations per job
    • Research the allgather/allreduce algorithms used by openMPI
    • Synchronize processes with MPI_Barrier before measuring (see the timing sketch after this list)
  • Elwin
    • Use different physical nodes
    • Use new repetition format (multiple runs per job)
    • New Plots (Memory usage, compute ratio, variation across repetitions)
    • Verify results are actually correct
    • Use specific algorithm for allreduce
  • Roy
    • allreduce-rabenseifner
    • rabenseifner-gather
    • allreduce-butterfly --> finished and generalized to non-power-of-two process counts
    • rabenseifner-scatter --> started; divides the final matrix into smaller submatrices (not yet finished)
  • Noe
    • Research the allreduce algorithms used by openMPI and their runtimes as functions of bandwidth, latency, and compute time per byte
    • Research Rabenseifner's algorithm
    • Extend our previous plotting framework, incl. adding scatter plots to the runtime line plots and using different aggregation methods such as median, 99th percentile, and mean
    • LogP model of our initial butterfly algorithm (a rough cost sketch follows this list)
  • Dave
    • Implement allreduce ring
    • Compare with openMPI allreduce ring
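
A minimal sketch of what the barrier-synchronized, repeated timing could look like (the collective, vector length, and iteration count are placeholders, not the project's actual benchmark code):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;      /* placeholder vector length */
    const int ITERATIONS = 40;  /* multiple repetitions within one job */
    double *in  = malloc(N * sizeof(double));
    double *out = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) in[i] = (double)rank;

    for (int it = 0; it < ITERATIONS; it++) {
        /* Synchronize all processes so the timers start together. */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        /* Call under test; the real benchmark times its own allreduce variants. */
        MPI_Allreduce(in, out, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("iteration %d: %f s\n", it, t1 - t0);
    }

    free(in);
    free(out);
    MPI_Finalize();
    return 0;
}
```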
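
A rough sketch of the kind of cost formula the LogP analysis of the butterfly could produce (assuming the LogGP-style per-byte gap G and a per-byte reduction cost γ; the exact parameters and constants used in the project are not recorded here). For p = 2^k processes exchanging m-byte messages in each of the log2(p) butterfly rounds:

```latex
T_{\text{butterfly}}(p, m) \;\approx\; \log_2(p)\,\bigl(L + 2o + (m - 1)\,G + m\,\gamma\bigr)
```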

New Algos

  • allgather-async: All processes send their vector to all other processes. While waiting, the outer products of already-received vectors can be computed and added to the total (see the overlap sketch after this list)
  • allreduce-butterfly: Each process calculates the whole matrix; a butterfly is used to distribute the temporary matrices (each composed by adding the process's own matrix and all received matrices)
  • allreduce-rabenseifner: Each process computes its own submatrix, a butterfly adds up all submatrices, and a second butterfly distributes the final submatrices
  • rabenseifner-gather: Distribute the vectors using an allgather approach, each process calculates a (final) submatrix, and a butterfly aggregates the final submatrices
  • rabenseifner-scatter: Distribute the vectors using a butterfly, each process calculates a (final) submatrix, and a butterfly aggregates all final submatrices
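
A minimal sketch of the allgather-async overlap idea (the target computation M = Σ_p x_p x_pᵀ, the function name, and the buffer layout are illustrative assumptions, not the project's implementation): nonblocking receives are posted for every other rank's vector, and each vector's outer product is folded into the running total as soon as it arrives.

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch (not the project's code): assumes the goal is M = sum over ranks of
 * x_p x_p^T, with M an n x n row-major matrix zero-initialized by the caller. */
void allgather_async_outer(const double *x, int n, double *M, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double *recv_buf = malloc((size_t)size * n * sizeof(double));
    MPI_Request *reqs = malloc(2 * (size_t)size * sizeof(MPI_Request));
    int nreq = 0;

    /* Post a nonblocking receive and send for every other rank. */
    for (int p = 0; p < size; p++) {
        if (p == rank) continue;
        MPI_Irecv(recv_buf + (size_t)p * n, n, MPI_DOUBLE, p, 0, comm, &reqs[nreq++]);
        MPI_Isend(x, n, MPI_DOUBLE, p, 0, comm, &reqs[nreq++]);
    }

    /* Our own contribution needs no communication, so compute it first. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            M[(size_t)i * n + j] += x[i] * x[j];

    /* Fold in each remote vector as soon as its receive completes. */
    int remaining = size - 1;
    while (remaining > 0) {
        int idx;
        MPI_Status status;
        MPI_Waitany(nreq, reqs, &idx, &status);
        if (idx == MPI_UNDEFINED) break;
        if (idx % 2 == 0) {  /* even indices were the receives posted above */
            const double *y = recv_buf + (size_t)status.MPI_SOURCE * n;
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    M[(size_t)i * n + j] += y[i] * y[j];
            remaining--;
        }
    }
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);  /* ensure sends finished too */

    free(recv_buf);
    free(reqs);
}
```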

Presentation

Plots can be found here (scroll down).

  • Only a few selected plots are shown - click the arrow to show the remaining variations

  • There are subfolders containing the same plots restricted to specific algorithms: native compares all allreduce algorithms, especially the MPI-native ones, and the other subfolders cover individual algorithms (whose plots in the parent directory are somehow broken)

  • Elwin: Introduction, high-level overview of what we've worked on

    • Separate physical nodes, repeated runs (40x) → show plots
    • What comes next? LogP, newly implemented algorithms, fixed MPI-native algorithms
  • Noe:

    • Overview of the LogP model
  • Roy (lead), Patrick, Dave:

    • Overview of the new algorithms (presented by whoever implemented each)
    • Show plots, some interpretation
    • Suggestion for further improvements (Roy has an idea)
  • Questions:

    • MPI_Send() can return before delivery for small payloads - how should we deal with this when measuring? (a possible workaround is sketched after this list)
    • Benchmark on more than 48 nodes
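
One possible way to handle the eager-send effect when measuring small payloads (an assumption for the discussion, not a decision from the meeting) is to force synchronous completion with MPI_Ssend, which returns only once the matching receive has started:

```c
#include <mpi.h>
#include <stdio.h>

/* Compares MPI_Send (may return as soon as the payload is buffered) with
 * MPI_Ssend (returns only after the receiver has started the matching
 * receive), so timings of small payloads are not dominated by eager-mode
 * buffering. Message size and tags are placeholders. */
void time_small_send(int rank, MPI_Comm comm) {
    const int LEN = 64;
    char payload[64] = {0};
    if (rank == 0) {
        double t0 = MPI_Wtime();
        MPI_Send(payload, LEN, MPI_CHAR, 1, 0, comm);
        double t1 = MPI_Wtime();
        MPI_Ssend(payload, LEN, MPI_CHAR, 1, 1, comm);
        double t2 = MPI_Wtime();
        printf("MPI_Send: %g s, MPI_Ssend: %g s\n", t1 - t0, t2 - t1);
    } else if (rank == 1) {
        MPI_Recv(payload, LEN, MPI_CHAR, 0, 0, comm, MPI_STATUS_IGNORE);
        MPI_Recv(payload, LEN, MPI_CHAR, 0, 1, comm, MPI_STATUS_IGNORE);
    }
}
```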
