Request refactoring test
Per the discussion on the 2016-07-26 webex, we decided to test several aspects of request refactoring.
Below is a proposal for running various tests / collecting data to evaluate the performance of OMPI with and without threading, and to evaluate the performance after the request code refactoring. The idea is that several organizations would run these tests and collect the data specified.
We suggest that everyone run these tests with the vader BTL:
- Shared memory is the lowest latency, and should easily show any performance differences / problems
- We can all run with vader on all of our different platforms
- There is a big difference between various BTLs and MTLs in v1.10, v2.0.x, and v2.1.x -- making it difficult to get apples-to-apples comparisons.
Other networks can be run, but vader can be the baseline.
Run the osu_mbw_mr benchmark (using the vader BTL) to measure how single-threaded performance has changed relative to the code from before all the threading improvements / request refactoring (*).
(*) NOTE: Per https://github.com/open-mpi/ompi/issues/1902, we expect there to be some performance degradation.  Once this issue is fixed, there should be no performance degradation.  If there is, we should investigate/fix.
- 1.10.3
- 2.0.0
- Master, commit before request refactoring (need to find a suitable git hash here, so that we all test the same thing)
  - Make sure to disable debugging! (this is likely from before we switched master to always build optimized)
- Master head (need to agree on a specific git hash to test, so that we all test the same thing)
This test should spawn an even number of processes on a single server (using the vader BTL).  Each thread should do a ping-pong with another thread in the same NUMA domain.
- 16 processes / 1 process per core, each process uses MPI_THREAD_SINGLE
  - Use the stock osu_mbw_mr benchmark
  - This is the baseline performance measurement.
- 16 processes / 1 process per core, each process uses MPI_THREAD_MULTIPLE
  - Use the stock osu_mbw_mr benchmark, but set OMPI_MPI_THREAD_LEVEL to 3, thereby setting MPI_THREAD_MULTIPLE
  - The intent of this test is to measure the performance delta between this test and the baseline. We expect the performance delta to be nonzero (because we are now using locking/atomics -- especially once https://github.com/open-mpi/ompi/issues/1902 is fixed).
- 1 process / 16 threads / 1 thread per core (obviously using MPI_THREAD_MULTIPLE)
  - Use Arm's test for this (which essentially runs osu_mbw_mr in each thread).
  - The intent of this test is to measure the performance delta between this test and the baseline. We expect the performance delta to be nonzero (because we are now using locking/atomics -- especially once https://github.com/open-mpi/ompi/issues/1902 is fixed).
If the performance difference between the 2nd and 3rd tests and the baseline is large, we will need to investigate why.
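When comparing results across organizations, it may help to report each delta the same way. The helper below is only a suggestion: delta_percent() is a hypothetical name, and it assumes the metric is one where higher is better (message rate or bandwidth), so a negative result means a slowdown relative to the baseline.

```c
#include <assert.h>

/* Relative change of `measured` versus `baseline`, in percent.
 * Hypothetical helper: negative means lower than the baseline. */
double delta_percent(double baseline, double measured) {
    assert(baseline > 0.0);
    return 100.0 * (measured - baseline) / baseline;
}
```

What counts as a "large" delta is not defined here and still has to be agreed on.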
- 2.0.0
- Master, commit before request refactoring (need to find a suitable git hash here, so that we all test the same thing)
  - Make sure to disable debugging! (this is likely from before we switched master to always build optimized)
- Master head (need to agree on a specific git hash to test, so that we all test the same thing)
The goals of the request refactoring were to:
- Decrease lock contention when multiple threads are blocking in MPI_WAIT*
- Decrease CPU cycle / scheduling contention between threads that are blocking in MPI_WAIT* and threads that are not blocking in MPI_WAIT*
The traditional way of writing a THREAD_MULTIPLE program, binding 1 thread per core, will not show the performance improvement from the new request code. Here is an example.
Suppose we have 16 threads, bound 1 thread per core, and every thread has reached MPI_Wait*. Every core will be actively trying to get the lock. The winner takes the lock, runs opal_progress() once, releases the lock, and so on. In this case, every core wastes CPU time trying to take the lock, and there is 1 lock/unlock per call to opal_progress().
With the new request code, the winner keeps calling opal_progress() until its own request is fulfilled. The rest wait passively in pthread_cond_wait() and consume little CPU time, which frees that CPU time for something else.
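The new waiting scheme can be sketched with plain pthreads. This is a toy model, not Open MPI code: engine_t, progress_locked(), wait_request(), and run_model() are hypothetical names, the fake "network" simply completes one request per progress call, and a single shared condition variable stands in for whatever synchronization the real implementation uses.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_REQ 64

/* Hypothetical stand-in for the progress engine; this is not Open MPI code. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool driver_active;   /* true while one waiter is driving progress */
    bool done[MAX_REQ];   /* completion flag, one per request */
    int  nreq;            /* number of outstanding requests at start */
    int  next;            /* next request the fake network will complete */
    int  progress_calls;  /* how often the progress stand-in ran */
    int  driver_turns;    /* how often a waiter took over the driver role */
} engine_t;

typedef struct { engine_t *e; int id; } waiter_arg_t;

/* Stand-in for opal_progress(): completes one request per call.
 * The caller must hold e->lock. */
void progress_locked(engine_t *e) {
    if (e->next < e->nreq) {
        e->done[e->next++] = true;
        pthread_cond_broadcast(&e->cond);  /* wake the owner of that request */
    }
    e->progress_calls++;
}

/* Block until request `id` is complete. */
void wait_request(engine_t *e, int id) {
    pthread_mutex_lock(&e->lock);
    while (!e->done[id]) {
        if (e->driver_active) {
            /* Someone else is progressing: sleep instead of spinning. */
            pthread_cond_wait(&e->cond, &e->lock);
        } else {
            /* Become the driver; keep progressing until our request is done. */
            e->driver_active = true;
            e->driver_turns++;
            while (!e->done[id]) {
                pthread_mutex_unlock(&e->lock);  /* real code would poll the network here */
                pthread_mutex_lock(&e->lock);
                progress_locked(e);
            }
            e->driver_active = false;
            pthread_cond_broadcast(&e->cond);    /* let another waiter take over */
        }
    }
    pthread_mutex_unlock(&e->lock);
}

void *waiter_main(void *p) {
    waiter_arg_t *a = p;
    wait_request(a->e, a->id);
    return NULL;
}

/* Run nthreads waiters, one request each; returns the number of progress calls. */
int run_model(int nthreads, int *driver_turns) {
    assert(nthreads > 0 && nthreads <= MAX_REQ);
    engine_t e = { .nreq = nthreads };
    pthread_t tid[MAX_REQ];
    waiter_arg_t arg[MAX_REQ];

    pthread_mutex_init(&e.lock, NULL);
    pthread_cond_init(&e.cond, NULL);
    for (int i = 0; i < nthreads; i++) {
        arg[i] = (waiter_arg_t){ &e, i };
        if (pthread_create(&tid[i], NULL, waiter_main, &arg[i]) != 0) abort();
    }
    for (int i = 0; i < nthreads; i++) pthread_join(tid[i], NULL);
    if (driver_turns) *driver_turns = e.driver_turns;
    pthread_cond_destroy(&e.cond);
    pthread_mutex_destroy(&e.lock);
    return e.progress_calls;
}
```

Calling run_model(16, &turns) models the 16-thread case. All 16 requests complete with exactly 16 progress calls, while the number of driver hand-offs depends on scheduling. Threads that are not driving progress sleep in pthread_cond_wait() rather than competing for the lock.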
Request refactoring allows users to use their CPU time more efficiently, e.g., by running a computation thread while waiting for MPI communication to complete.
A normal benchmark will not show any improvement because nothing in it exploits the extra CPU time that the new request code gives back. So we have to add something to address that.
Each process spawns 2 * (numcore/numa) threads, bound 2 threads per core. Each core will have 2 types of thread bound to it:
- Type A: does some calculation, measuring FLOPS and runtime.
- Type B: does MPI communication with other processes (MPI_Waitall with x requests), measuring message rate and bandwidth.
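The thread layout above can be written down as a small placement helper. This is a sketch under assumptions that the text does not state: numcore is taken to be the number of cores on the node, numa the number of NUMA domains, and each process is assumed to own the cores of one NUMA domain starting at first_core. The names threads_per_process(), place_thread(), role_t, and placement_t are hypothetical. The even/odd split between Type A and Type B is an arbitrary choice.

```c
#include <assert.h>

typedef enum { ROLE_COMPUTE, ROLE_COMM } role_t;  /* Type A, Type B */

typedef struct {
    int    core;  /* core the thread would be bound to */
    role_t role;
} placement_t;

/* 2 * (numcore/numa) threads per process, i.e. 2 threads per core. */
int threads_per_process(int num_cores, int num_numa) {
    assert(num_numa > 0 && num_cores % num_numa == 0);
    return 2 * (num_cores / num_numa);
}

/* Threads 2k and 2k+1 share core first_core + k:
 * even index = Type A (compute), odd index = Type B (communication). */
placement_t place_thread(int thread_idx, int first_core) {
    placement_t p;
    p.core = first_core + thread_idx / 2;
    p.role = (thread_idx % 2 == 0) ? ROLE_COMPUTE : ROLE_COMM;
    return p;
}
```

The returned core index would then be passed to whatever binding mechanism the test uses. The binding call itself is platform specific and is left out here.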
GOAL: We should see better performance with the new request code.
- Master, commit before request refactoring.
- Master head
- 2.0.0