Meeting 15.11: Proper Implementation
- Patrick
  - Time runtime of MPI calls (see the timing sketch after this list)
  - Multiple iterations per job
  - Research allgather/allreduce algorithms used by Open MPI
  - Synchronize processes before running using `MPI_Barrier`
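
A minimal sketch of how the barrier-plus-timing pattern could look; the buffer size, iteration count, and the choice of `MPI_Allreduce` as the timed call are placeholders, not our actual benchmark harness.

```c
#include <mpi.h>
#include <stdio.h>

#define N          (1 << 20)   /* placeholder element count */
#define ITERATIONS 10          /* placeholder repetitions per job */

static double sendbuf[N], recvbuf[N];

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int it = 0; it < ITERATIONS; it++) {
        /* Align start times so a late rank does not inflate the measurement. */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        MPI_Allreduce(sendbuf, recvbuf, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("iteration %d: %f s\n", it, t1 - t0);
    }

    MPI_Finalize();
    return 0;
}
```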
- Elwin
  - Use different physical nodes
  - Use the new repetition format (multiple runs per job)
  - New plots (memory usage, compute ratio, variation across repetitions)
  - Verify that the results are actually correct
  - Use a specific algorithm for `allreduce`
- Roy
  - allreduce-rabenseifner
  - rabenseifner-gather
  - allreduce-butterfly → finished it and generalized it to non-power-of-two process counts
  - Started rabenseifner-scatter → divide the final matrix into smaller submatrices (not yet finished)
- Noe
  - Research allreduce algorithms used by Open MPI and their runtimes as functions of bandwidth, latency, and compute time per byte
  - Research Rabenseifner's algorithm
  - Extend our previous plotting framework, incl. adding scatter plots to the runtime line plots, using different aggregation methods such as median, 99th percentile, and mean
  - LogP model of our initial butterfly algorithm (see the cost sketch after this list)
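
As a rough reference point (a back-of-the-envelope assumption, not a derived result): a butterfly allreduce over P processes takes ⌈log₂ P⌉ rounds, and treating each round as one m-byte exchange plus the local reduction gives, in LogP terms,

```latex
% Assumed symbols: L = latency, o = per-message overhead,
% \gamma = compute time per byte, m = bytes exchanged per round, P = processes.
T_{\text{butterfly}} \approx \lceil \log_2 P \rceil \,\bigl( L + 2o + \gamma m \bigr)
```

The gap g only matters if a process injects several messages per round, and for large m a LogGP-style per-byte bandwidth term would be the more faithful model.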
- Dave
  - Implement allreduce ring (see the sketch after this list)
  - Compare with the Open MPI allreduce ring
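
For orientation, a minimal sketch of a ring allreduce (a reduce-scatter phase followed by an allgather phase), assuming summation of doubles and an element count divisible by the number of processes; names are illustrative and this is neither our nor Open MPI's actual implementation.

```c
#include <mpi.h>
#include <stdlib.h>

/* In-place ring allreduce (sum) over `data` of length n, with n divisible
 * by the communicator size. The P-1 reduce-scatter steps pass one chunk at
 * a time around the ring and accumulate it; the P-1 allgather steps then
 * circulate the fully reduced chunks. */
void ring_allreduce_sum(double *data, int n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int chunk = n / size;
    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;
    double *recvbuf = malloc((size_t)chunk * sizeof *recvbuf);

    /* Reduce-scatter: after size-1 steps, rank r holds the fully reduced
     * chunk with index (r + 1) % size. */
    for (int step = 0; step < size - 1; step++) {
        int send_idx = (rank - step + size) % size;
        int recv_idx = (rank - step - 1 + 2 * size) % size;
        MPI_Sendrecv(data + send_idx * chunk, chunk, MPI_DOUBLE, right, 0,
                     recvbuf, chunk, MPI_DOUBLE, left, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < chunk; i++)
            data[recv_idx * chunk + i] += recvbuf[i];
    }

    /* Allgather: circulate the reduced chunks so every rank ends up with
     * the complete result. */
    for (int step = 0; step < size - 1; step++) {
        int send_idx = (rank - step + 1 + size) % size;
        int recv_idx = (rank - step + size) % size;
        MPI_Sendrecv(data + send_idx * chunk, chunk, MPI_DOUBLE, right, 0,
                     data + recv_idx * chunk, chunk, MPI_DOUBLE, left, 0,
                     comm, MPI_STATUS_IGNORE);
    }
    free(recvbuf);
}
```

Correctness can be checked by comparing the result against a plain `MPI_Allreduce` on the same input.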
- allgather-async: All processes send their vector to all other processes. While waiting, the outer products of the vectors received so far can already be computed and added to the total.
- allreduce-butterfly: Each process calculates the whole matrix; a butterfly is used to distribute the temporary matrices (each composed by adding the process's own matrix and all received matrices). See the sketch after this list.
- allreduce-rabenseifner: Each process computes its own submatrix; one butterfly adds up all submatrices, and a second butterfly distributes the final submatrices.
- rabenseifner-gather: Distribute the vectors using an allgather approach; each process calculates a (final) submatrix; a butterfly aggregates the final submatrices.
- rabenseifner-scatter: Distribute the vectors using a butterfly; each process calculates a (final) submatrix; a butterfly aggregates all final submatrices.
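
A minimal sketch of the butterfly (recursive-doubling) exchange that allreduce-butterfly is based on, assuming a power-of-two process count and a flat `double` buffer; the non-power-of-two generalization and the submatrix partitioning of the Rabenseifner variants are left out.

```c
#include <mpi.h>

/* Butterfly (recursive doubling) allreduce (sum) over `data` of length n,
 * assuming the communicator size is a power of two. `tmp` is a scratch
 * buffer of the same length. In round k, each rank exchanges its full
 * buffer with the partner whose rank differs in bit k and adds the
 * received contribution, so after log2(P) rounds every rank holds the
 * global sum. */
void butterfly_allreduce_sum(double *data, double *tmp, int n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int mask = 1; mask < size; mask <<= 1) {
        int partner = rank ^ mask;
        MPI_Sendrecv(data, n, MPI_DOUBLE, partner, 0,
                     tmp,  n, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < n; i++)
            data[i] += tmp[i];
    }
}
```

allreduce-rabenseifner follows the same pattern but halves the exchanged data each round (reduce-scatter) and then doubles it again (allgather), which is what saves bandwidth for large matrices.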
Plots can be found here (scroll down).
- Only a few selected plots are shown; click the arrow to show the remaining variations.
- There are some subfolders containing the same plots but for specific algorithms: `native` for a comparison of all allreduce algorithms, especially the MPI native ones, and some others for specific algorithms (whose plots in the parent directory are somehow messed up).
- Elwin: Introduction, high-level overview of what we've worked on
  - Separate physical nodes, repeated runs (40x) → show plots
  - What comes next? LogP, newly implemented algorithms, fixed MPI native algorithms
- Noe:
  - Overview of the LogP model
- Roy (lead), Patrick, Dave:
  - Overview of the new algorithms (whoever implemented it)
  - Show plots, some interpretation
  - Suggestions for further improvements (Roy has an idea)
Questions:
- `MPI_Send()` is asynchronous in some cases (small payloads); how should we deal with this when measuring?
- Benchmark on more than 48 nodes