Request refactoring test
Per the discussion on the 2016-07-26 webex, we decided to test several aspects of request refactoring.
Below is a proposal for running various tests / collecting data to evaluate the performance of OMPI with and without threading, and to evaluate the performance after the request code refactoring. The idea is that several organizations would run these tests and collect the data specified.
We suggest that everyone run these tests with the vader BTL:
- Shared memory is the lowest latency, and should easily show any performance differences / problems
- We can all run with vader on all of our different platforms
- There is a big difference between various BTLs and MTLs in v1.10, v2.0.x, and v2.1.x -- making it difficult to get apples-to-apples comparisons.
Other networks can be run, but vader can be the baseline.
The goal of this testing is twofold:
- Tests 1 and 2 verify that all the threading work / request revamp has not harmed performance (and if it has, these tests will help us identify performance issues and fix them).
- Test 3 shows that the request revamp work enables a new mode of writing MPI applications (based on good MPI_THREAD_MULTIPLE support).
Run the osu_mbw_mr benchmark (using the vader BTL) to measure how single-threaded performance has changed relative to before all the threading improvements / request refactor (*).
(*) NOTE: Per https://github.com/open-mpi/ompi/issues/1902, we expect there to be some performance degradation.  Once this issue is fixed, there should be no performance degradation.  If there is, we should investigate/fix.
- 1.10.3
- 2.0.0
- Master, commit before request refactoring (need to find a suitable git hash here, so that we all test the same thing)
  - Make sure to disable debugging! (this is likely from before we switched master to always build optimized)
- Master head (need to agree on a specific git hash to test, so that we all test the same thing)
This test should spawn an even number of processes on a single server (using the vader BTL).  Each thread should do a ping-pong with another thread in the same NUMA domain.
- 16 processes/1 process per core, each process uses MPI_THREAD_SINGLE
  - Use the stock osu_mbw_mr benchmark
  - This is the baseline performance measurement.
- 16 processes/1 process per core, each process uses MPI_THREAD_MULTIPLE
  - Use the stock osu_mbw_mr benchmark, but set OMPI_MPI_THREAD_LEVEL to 3, thereby setting MPI_THREAD_MULTIPLE
  - The intent of this test is to measure the performance delta between this test and the baseline. We expect the performance delta to be nonzero (because we are now using locking/atomics -- especially once https://github.com/open-mpi/ompi/issues/1902 is fixed).
- 1 process/16 threads/1 thread per core (obviously using MPI_THREAD_MULTIPLE).
  - Use Arm's test for this (which essentially runs osu_mbw_mr in each thread).
  - The intent of this test is to measure the performance delta between this test and the baseline. We expect the performance delta to be nonzero (because we are now using locking/atomics -- especially once https://github.com/open-mpi/ompi/issues/1902 is fixed).
If the performance difference between the 2nd and 3rd tests and the baseline is large, we will need to investigate why.
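To help everyone launch the first two configurations the same way, here is a minimal sketch of a helper that builds the mpirun command lines. It makes several assumptions that are not part of the proposal: that the benchmark binary lives at `./osu_mbw_mr`, that `--mca btl vader,self` is how the vader BTL is selected, that `--bind-to core` gives one process per core, and that `-x` exports `OMPI_MPI_THREAD_LEVEL` to the processes. The helper name `mbw_mr_command` is hypothetical. The third configuration is left out because the invocation of Arm's test is not specified here.

```python
import shlex


def mbw_mr_command(benchmark, nprocs=16, thread_multiple=False):
    """Build an mpirun argv for one osu_mbw_mr configuration (sketch)."""
    argv = [
        "mpirun",
        "-np", str(nprocs),
        "--bind-to", "core",           # assumption: one process per core
        "--mca", "btl", "vader,self",  # assumption: selects the vader BTL
    ]
    if thread_multiple:
        # Level 3 corresponds to MPI_THREAD_MULTIPLE, as stated in the list above.
        argv += ["-x", "OMPI_MPI_THREAD_LEVEL=3"]
    argv.append(benchmark)
    return argv


# "./osu_mbw_mr" is a placeholder path; adjust it to your build.
baseline = mbw_mr_command("./osu_mbw_mr")
threaded = mbw_mr_command("./osu_mbw_mr", thread_multiple=True)

print(shlex.join(baseline))
print(shlex.join(threaded))
```

The two command lines differ only in the exported thread level, so any performance delta should come from the threading support and not from the launch options.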
- 2.0.0
- Master, commit before request refactoring (need to find a suitable git hash here, so that we all test the same thing)
  - Make sure to disable debugging! (this is likely from before we switched master to always build optimized)
- Master head (need to agree on a specific git hash to test, so that we all test the same thing)
The goals of the request refactoring were to:
- Decrease lock contention when multiple threads are blocking in MPI_WAIT*
- Decrease CPU cycle / scheduling contention between threads that are blocking in MPI_WAIT* and threads that are not blocking in MPI_WAIT*
Traditional MPI applications are written/executed as one MPI process (and thread) per CPU core.  The request refactoring intended to reduce lock contention between threads, specifically targeted at enabling new forms of MPI_THREAD_MULTIPLE-enabled programming models and applications.  It is unlikely that the request refactoring will show much improvement in "traditional" MPI applications (i.e., one process/thread per CPU core).
With the old request code, if there are N threads blocking in MPI_WAIT* in a single MPI process (that is bound to N cores), each thread will be vying for the lock to enter the progression loop for a single iteration.  Upon exit from the progress loop, if a thread still has requests to wait for, it will repeat the process: vie for the lock, enter for a single progression iteration, and so on.  Meaning: there are N threads all actively contending for a lock, and each of the N threads is continually entering/exiting the progress loop.  This approach incurs a lot of overhead.
With the new request code, the thread that succeeds in entering the progress loop will stay in the progress loop until all of its requests have completed (vs. just performing a single progression iteration). All other threads will remain blocked/asleep. Additionally, the thread in the progress loop will selectively wake individual threads as their requests complete (vs. waking all blocked threads to check and see if their requests have completed). Once the thread in the progress loop completes all of its own requests, it will wake a single thread to take its place inside the progress loop and then exit.
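The hand-off described above can be illustrated with a toy model. This is only a sketch built on Python threading primitives, not the Open MPI implementation: the names `Request` and `ProgressEngine` are hypothetical, network completions are simulated by a fixed list, and each "progress iteration" completes exactly one simulated event.

```python
import threading


class Request:
    """One outstanding request plus a private wake-up channel for its thread."""

    def __init__(self, name):
        self.name = name
        self.done = False
        self.wake = threading.Event()


class ProgressEngine:
    def __init__(self, requests, completion_order):
        self.lock = threading.Lock()
        self.requests = {r.name: r for r in requests}
        self.pending = list(completion_order)   # simulated completion events
        self.in_progress = False                # is a thread in the progress loop?
        self.sleepers = []                      # requests whose threads are asleep
        self.iterations = {r.name: 0 for r in requests}

    def _progress_once(self):
        # Caller holds self.lock. Complete one simulated event.
        req = self.requests[self.pending.pop(0)]
        req.done = True
        if req in self.sleepers:
            self.sleepers.remove(req)
            req.wake.set()                      # wake only the owner of this request

    def wait(self, req):
        # Phase 1: either become the progress thread or go to sleep.
        while True:
            with self.lock:
                if req.done:
                    return
                if not self.in_progress:
                    self.in_progress = True
                    break
                self.sleepers.append(req)
            req.wake.wait()
            req.wake.clear()
        # Phase 2: stay in the progress loop until our own request completes.
        while True:
            with self.lock:
                if req.done:
                    self.in_progress = False
                    if self.sleepers:
                        # Hand the loop to exactly one sleeping thread.
                        self.sleepers.pop(0).wake.set()
                    return
                self._progress_once()
                self.iterations[req.name] += 1


requests = [Request(f"r{i}") for i in range(4)]
engine = ProgressEngine(requests, ["r2", "r0", "r3", "r1"])
threads = [threading.Thread(target=engine.wait, args=(r,)) for r in requests]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(engine.iterations)
```

Which thread ends up driving progress depends on scheduling, but at any moment at most one thread is inside the loop, and sleeping threads are woken individually (either because their request completed or because they are asked to take over).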
Why is this useful?
The goal is to enable MPI_THREAD_MULTIPLE-enabled applications that are bound to multiple cores (e.g., an entire NUMA domain), and that have more threads than cores.  Consider: if an MPI process is bound to N cores, and M threads are blocking in MPI_WAIT*, only one of those threads will be active inside the progress loop.  This means that there are still (N-1) cores available for other threads (regardless of the value of M!): threads that could be computing, or threads that could be performing non-blocking operations in MPI.
Meaning: the request refactor is intended to enable the "Got a long MPI operation to perform? Spawn a thread and let it block while performing that MPI operation (and then let the thread die)" programming model.
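The (N-1) argument above can be written down as a small back-of-the-envelope model. This is a sketch under a simplifying assumption that is ours, not a measurement: with the old request code, every waiting thread is treated as keeping a core busy (up to N of them), whereas with the new code only the single thread in the progress loop does. The function name `cores_left_for_other_work` is hypothetical.

```python
def cores_left_for_other_work(n_cores, m_waiting, refactored):
    """Cores not consumed by threads blocked in MPI_WAIT* (toy model)."""
    if m_waiting == 0:
        return n_cores
    if refactored:
        busy = 1                         # one thread runs the progress loop
    else:
        busy = min(n_cores, m_waiting)   # assumption: every waiter keeps spinning
    return n_cores - busy


# Process bound to 8 cores, with a varying number of threads in MPI_WAIT*.
for m in (1, 8, 32):
    old = cores_left_for_other_work(8, m, refactored=False)
    new = cores_left_for_other_work(8, m, refactored=True)
    print(f"M={m}: old={old} new={new}")
```

In this model the number of free cores after the refactoring does not depend on M, which is the property the "spawn a thread and let it block" style relies on.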
Existing benchmarks will therefore tend to not show any improvement because they do not have code that will execute during MPI_WAIT.  We need to create a new benchmark to show the programming model and performance benefits from this approach.
We need to write a new benchmark that does the following:
- Launches 2 MPI processes on a single server (using the vader BTL between the two)
- Each process is bound to a NUMA domain
- Each process creates 2*(num_cores in the NUMA domain) threads (i.e., twice the number of cores that are in the NUMA domain)
- Have half the threads continually waiting on non-blocking MPI operations, in two ways (i.e., run these as two separate tests -- not both at the same time):
  - Test 1: measure the bandwidth by continually MPI_WAITALLing on a large number of sends/receives of large messages.
  - Test 2: measure the message rate by continually MPI_WAITALLing on a large number of sends/receives of small messages.
- Have the other half of the threads perform CPU-based computations (e.g., DGEMM)
Both types of metrics (MPI performance and non-MPI performance) should improve greatly after the request refactoring.
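The thread layout inside each process could look roughly like the following sketch. It only shows the structure (2 * num_cores threads, half waiting and half computing) and uses stand-ins for the real work: `fake_waitall` sleeps instead of posting non-blocking sends/receives and calling MPI_WAITALL, and `naive_matmul_trace` is a tiny pure-Python multiply in place of DGEMM. All function names are hypothetical. Because of Python's global interpreter lock, the timings it records are not meaningful performance numbers; the real benchmark would need to be written against MPI directly.

```python
import threading
import time


def naive_matmul_trace(n):
    """Stand-in for DGEMM: multiply an n x n matrix by the identity, return the trace."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i == j) for j in range(n)] for i in range(n)]
    c = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    return sum(c[i][i] for i in range(n))


def fake_waitall(duration_s):
    """Stand-in for blocking in MPI_WAITALL on a batch of requests."""
    time.sleep(duration_s)


def run_process(num_cores, wait_fn, compute_fn):
    """Run 2 * num_cores threads: half 'MPI' waiters, half compute workers."""
    results = {"mpi": [], "compute": []}
    lock = threading.Lock()

    def worker(kind, fn):
        start = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - start
        with lock:
            results[kind].append(elapsed)

    threads = []
    for i in range(2 * num_cores):
        kind, fn = ("mpi", wait_fn) if i < num_cores else ("compute", compute_fn)
        threads.append(threading.Thread(target=worker, args=(kind, fn)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = run_process(4, lambda: fake_waitall(0.01), lambda: naive_matmul_trace(16))
print(len(results["mpi"]), "waiting threads,", len(results["compute"]), "compute threads")
```

Keeping the two kinds of work behind separate callables makes it easy to swap Test 1 (large messages) for Test 2 (small messages) without touching the thread layout.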
- 2.0.0
- Master, commit before request refactoring (need to find a suitable git hash here, so that we all test the same thing)
  - Make sure to disable debugging! (this is likely from before we switched master to always build optimized)
- Master head (need to agree on a specific git hash to test, so that we all test the same thing)