Meeting 15.11: Proper Implementation


Progress

  • Patrick
    • Time the runtime of MPI calls
    • Run multiple iterations per job
    • Research the allgather/allreduce algorithms used by openMPI
    • Synchronize processes with MPI_Barrier before measuring (see the timing sketch after this list)
  • Elwin
    • Use different physical nodes
    • Use new repetition format (multiple runs per job)
    • New Plots (Memory usage, compute ratio, variation across repetitions)
    • Verify results are actually correct
    • Use specific algorithm for allreduce
  • Roy
    • allreduce-rabenseifner
    • rabenseifner-gather
    • allreduce-butterfly --> finished and generalized to non-power-of-two process counts
    • rabenseifner-scatter --> started; divides the final matrix into smaller submatrices (not yet finished)
  • Noe
    • Research the allreduce algorithms used by openMPI and their runtimes as functions of bandwidth, latency, and compute time per byte
    • Research Rabenseifner's algorithm
    • Extend our previous plotting framework, incl. adding scatter plots to the runtime line plots and using different aggregation methods such as median, 99th percentile, and mean
    • LogP model of our initial butterfly algorithm (a rough cost sketch follows this list)
  • Dave
    • Implement allreduce ring
    • Compare with openMPI allreduce ring
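
A minimal sketch of what the barrier-synchronized, repeated timing could look like (the collective, vector length, and iteration count are placeholders, not the project's actual benchmark code):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;      /* placeholder vector length */
    const int ITERATIONS = 40;  /* multiple repetitions within one job */
    double *in  = malloc(N * sizeof(double));
    double *out = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) in[i] = (double)rank;

    for (int it = 0; it < ITERATIONS; it++) {
        /* Synchronize all processes so the timers start together. */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        /* Call under test; the real benchmark times its own allreduce variants. */
        MPI_Allreduce(in, out, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("iteration %d: %f s\n", it, t1 - t0);
    }

    free(in);
    free(out);
    MPI_Finalize();
    return 0;
}
```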
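
A rough sketch of the kind of cost formula the LogP analysis of the butterfly could produce (assuming the LogGP-style per-byte gap G and a per-byte reduction cost γ; the exact parameters and constants used in the project are not recorded here). For p = 2^k processes exchanging m-byte messages in each of the log2(p) butterfly rounds:

```latex
T_{\text{butterfly}}(p, m) \;\approx\; \log_2(p)\,\bigl(L + 2o + (m - 1)\,G + m\,\gamma\bigr)
```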

New Algos

  • allgather-async: All processes send their vector to all other processes. While waiting, the outer products of already-received vectors can be computed and added to the total (see the overlap sketch after this list)
  • allreduce-butterfly: Each process calculates the whole matrix; a butterfly is used to distribute the temporary matrices (each composed by adding the process's own matrix and all received matrices)
  • allreduce-rabenseifner: Each process computes its own submatrix, a butterfly adds up all submatrices, and a second butterfly distributes the final submatrices
  • rabenseifner-gather: Distribute the vectors using an allgather approach, each process calculates a (final) submatrix, and a butterfly aggregates the final submatrices
  • rabenseifner-scatter: Distribute the vectors using a butterfly, each process calculates a (final) submatrix, and a butterfly aggregates all final submatrices
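
A minimal sketch of the allgather-async overlap idea (the target computation M = Σ_p x_p x_pᵀ, the function name, and the buffer layout are illustrative assumptions, not the project's implementation): nonblocking receives are posted for every other rank's vector, and each vector's outer product is folded into the running total as soon as it arrives.

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch (not the project's code): assumes the goal is M = sum over ranks of
 * x_p x_p^T, with M an n x n row-major matrix zero-initialized by the caller. */
void allgather_async_outer(const double *x, int n, double *M, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double *recv_buf = malloc((size_t)size * n * sizeof(double));
    MPI_Request *reqs = malloc(2 * (size_t)size * sizeof(MPI_Request));
    int nreq = 0;

    /* Post a nonblocking receive and send for every other rank. */
    for (int p = 0; p < size; p++) {
        if (p == rank) continue;
        MPI_Irecv(recv_buf + (size_t)p * n, n, MPI_DOUBLE, p, 0, comm, &reqs[nreq++]);
        MPI_Isend(x, n, MPI_DOUBLE, p, 0, comm, &reqs[nreq++]);
    }

    /* Our own contribution needs no communication, so compute it first. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            M[(size_t)i * n + j] += x[i] * x[j];

    /* Fold in each remote vector as soon as its receive completes. */
    int remaining = size - 1;
    while (remaining > 0) {
        int idx;
        MPI_Status status;
        MPI_Waitany(nreq, reqs, &idx, &status);
        if (idx == MPI_UNDEFINED) break;
        if (idx % 2 == 0) {  /* even indices were the receives posted above */
            const double *y = recv_buf + (size_t)status.MPI_SOURCE * n;
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    M[(size_t)i * n + j] += y[i] * y[j];
            remaining--;
        }
    }
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);  /* ensure sends finished too */

    free(recv_buf);
    free(reqs);
}
```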

Presentation

Plots can be found here (scroll down).

  • Only a few selected plots are shown - click the arrow to show the remaining variations

  • There are subfolders containing the same plots restricted to specific algorithms: native compares all allreduce algorithms, especially the MPI-native ones, and the other subfolders cover individual algorithms (whose plots in the parent directory are somehow broken)

  • Elwin: Introduction, high-level overview of what we've worked on

    • Separate physical nodes, repeated runs (40x) → show plots
    • What comes next? LogP, newly implemented algorithms, fixed MPI-native algorithms
  • Noe:

    • Overview of the LogP model
  • Roy (lead), Patrick, Dave:

    • Overview of the new algorithms (presented by whoever implemented each)
    • Show plots, some interpretation
    • Suggestion for further improvements (Roy has an idea)
  • Questions:

    • MPI_Send() can return before delivery for small payloads - how should we deal with this when measuring? (a possible workaround is sketched after this list)
    • Benchmark on more than 48 nodes
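
One possible way to handle the eager-send effect when measuring small payloads (an assumption for the discussion, not a decision from the meeting) is to force synchronous completion with MPI_Ssend, which returns only once the matching receive has started:

```c
#include <mpi.h>
#include <stdio.h>

/* Compares MPI_Send (may return as soon as the payload is buffered) with
 * MPI_Ssend (returns only after the receiver has started the matching
 * receive), so timings of small payloads are not dominated by eager-mode
 * buffering. Message size and tags are placeholders. */
void time_small_send(int rank, MPI_Comm comm) {
    const int LEN = 64;
    char payload[64] = {0};
    if (rank == 0) {
        double t0 = MPI_Wtime();
        MPI_Send(payload, LEN, MPI_CHAR, 1, 0, comm);
        double t1 = MPI_Wtime();
        MPI_Ssend(payload, LEN, MPI_CHAR, 1, 1, comm);
        double t2 = MPI_Wtime();
        printf("MPI_Send: %g s, MPI_Ssend: %g s\n", t1 - t0, t2 - t1);
    } else if (rank == 1) {
        MPI_Recv(payload, LEN, MPI_CHAR, 0, 0, comm, MPI_STATUS_IGNORE);
        MPI_Recv(payload, LEN, MPI_CHAR, 0, 1, comm, MPI_STATUS_IGNORE);
    }
}
```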
