APEX examples for SC15 HPX tutorial
This collection of files contains exercises for the SC15 tutorial: Massively Parallel Task-Based Programming with HPX http://sc15.supercomputing.org/schedule/event_detail?evid=tut139
For any questions or comments, please contact [email protected].
To run the exercises in this tutorial, make sure you have set up your environment on edison.nersc.gov or babbage.nersc.gov to build HPX examples:
# on Babbage ONLY, start a bash shell and load the module environment
bash
source /usr/share/Modules/init/bash
# source the environment
source /project/projectdirs/training/SC15/HPX-SC15/hpx_install/env.sh
Sourcing the environment script should give this output (or something similar):
Loading environment for Babbage
Newly available modules:
---- /chos/global/project/projectdirs/training/SC15/HPX-SC15/hpx_install/../tau2-hpx-babbage/modulefiles -----
tau/host-2.25 tau/mic-2.25
-------------- /chos/global/project/projectdirs/training/SC15/HPX-SC15/hpx_install/modulefiles ---------------
hpx/0.9.11-debug hpx/host-0.9.11-debug hpx/mic-0.9.11-debug
hpx/0.9.11-release hpx/host-0.9.11-release hpx/mic-0.9.11-release
Then load the appropriate module (for the first example, load the Babbage host module):
# to build examples to run on the Babbage host nodes
module load hpx/host-0.9.11-release
# to build examples to run on the Babbage MIC devices
module load hpx/mic-0.9.11-release
# to build examples to run on Edison
module load hpx/0.9.11-release
Download the examples:
cd $HOME
git clone https://github.com/khuck/SC15_APEX_tutorial.git
cd SC15_APEX_tutorial
To build the exercises for the Babbage MIC devices, load the appropriate module and then run the configuration script:
# if no HPX module loaded:
module load hpx/mic-0.9.11-release
# if HPX module already loaded:
module swap hpx hpx/mic-0.9.11-release
# run cmake and make
./scripts/configure-mic.sh
The build will be configured and compiled in the build-mic directory. After compiling, you should have the executables built in the directory build-mic/apex_examples. The output should look something like this:
-- The CXX compiler identification is Intel 16.0.0.20150815
-- Check for working CXX compiler: /opt/intel/compilers_and_libraries_2016/linux/bin/intel64/icpc
-- Check for working CXX compiler: /opt/intel/compilers_and_libraries_2016/linux/bin/intel64/icpc -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- The C compiler identification is Intel 16.0.0.20150815
-- Check for working C compiler: /opt/intel/compilers_and_libraries_2016/linux/bin/intel64/icc
-- Check for working C compiler: /opt/intel/compilers_and_libraries_2016/linux/bin/intel64/icc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Using jemalloc allocator.
-- Configuring done
-- Generating done
-- Build files have been written to: /global/u1/t/train1/SC15_APEX_tutorial/build-mic
CMake configuration done. To build:
cd build-mic
make
Scanning dependencies of target 1d_stencil_4_exe
Scanning dependencies of target 1d_stencil_4_repart_exe
Scanning dependencies of target apex_fibonacci_exe
Scanning dependencies of target 1d_stencil_4_throttle_exe
[ 25%] Building CXX object apex_examples/CMakeFiles/apex_fibonacci_exe.dir/apex_fibonacci.cpp.o
[ 50%] [ 75%] [100%] Building CXX object apex_examples/CMakeFiles/1d_stencil_4_throttle_exe.dir/1d_stencil_4_throttle.cpp.o
Building CXX object apex_examples/CMakeFiles/1d_stencil_4_exe.dir/1d_stencil_4.cpp.o
Building CXX object apex_examples/CMakeFiles/1d_stencil_4_repart_exe.dir/1d_stencil_4_repart.cpp.o
Linking CXX executable apex_fibonacci
[100%] Built target apex_fibonacci_exe
Linking CXX executable 1d_stencil_4
[100%] Built target 1d_stencil_4_exe
Linking CXX executable 1d_stencil_4_throttle
Linking CXX executable 1d_stencil_4_repart
[100%] Built target 1d_stencil_4_throttle_exe
[100%] Built target 1d_stencil_4_repart_exe
To build the exercises for the Babbage host nodes, load the appropriate module and then run the configuration script:
# if no HPX module loaded:
module load hpx/host-0.9.11-release
# if HPX module already loaded:
module swap hpx hpx/host-0.9.11-release
# run cmake and make
./scripts/configure-host.sh
The build will be configured and compiled in the build-host directory. After compiling, you should have the executables built in the directory build-host/apex_examples.
To build the exercises for the Edison CNL nodes, log on to Edison, set up the environment (as described above), load the appropriate module, and then run the configuration script:
source /usr/share/Modules/init/bash
source /project/projectdirs/training/SC15/HPX-SC15/hpx_install/env.sh
# if no HPX module loaded:
module load hpx/0.9.11-release
# if HPX module already loaded:
module swap hpx hpx/0.9.11-release
# run cmake and make
./scripts/configure-cray.sh
The build will be configured and compiled in the build-cray directory. After compiling, you should have the executables built in the directory build-cray/apex_examples.
The first exercise demonstrates the use of the APEX Policy Engine. This example is based on the fibonacci program available in HPX, modified to include policies for different APEX event types. Every time an event passes through APEX, the callback function (defined as a C++ lambda in main()) is executed, printing a message to the screen that identifies the event type:
std::set<apex_event_type> when = {APEX_STARTUP, APEX_SHUTDOWN, APEX_NEW_NODE,
APEX_NEW_THREAD, APEX_START_EVENT, APEX_STOP_EVENT, APEX_SAMPLE_VALUE};
apex::register_policy(when, [](apex_context const& context)->int{
switch(context.event_type) {
case APEX_STARTUP: {
std::cout << "Startup event" << std::endl;
break;
}
case APEX_SHUTDOWN: {
std::cout << "Shutdown event" << std::endl;
break;
}
/* ... many other event types follow ... */
default: {
std::cout << "Unknown event" << std::endl;
}
}
return APEX_NOERROR;
});
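A policy does not have to handle every event type. As a hypothetical variation (not part of the example source), a second policy could be registered for a single event type, reusing the same apex::register_policy call and event types shown above:
std::set<apex_event_type> sampled_only = {APEX_SAMPLE_VALUE};
apex::register_policy(sampled_only, [](apex_context const& context)->int{
    std::cout << "Sampled value observed" << std::endl;
    return APEX_NOERROR;
});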
The example also forces APEX to output its settings at program initialization, and output a summary of all observed events at the end.
// force APEX output
apex::apex_options::use_screen_output(true);
After compilation, the program is executed by starting an interactive session and running the example:
# to run on the Babbage MIC devices
salloc --reservation=SC_Reservation -N 1
# after the allocation is granted:
./scripts/run_apex_fibonacci-mic.sh

# to run on the Babbage host nodes
salloc --reservation=SC_Reservation -N 1
# after the allocation is granted:
./scripts/run_apex_fibonacci-host.sh

# to run on Edison
qsub -I -V -d . -W x=FLAGS:ADVRES:Edison.SC15.376253
# after the allocation is granted:
./scripts/run_apex_fibonacci-cray.sh
The output should look something like this:
./build-mic/apex_examples/apex_fibonacci --hpx:threads 10
Startup event
Start event
0.9.11-4c96a9b-HEAD
Built on: 20:39:55 Nov 12 2015
C++ Language Standard version : 201103
Intel Compiler version : Intel(R) C++ g++ 4.7 mode
APEX_TAU : 0
APEX_POLICY : 1
APEX_MEASURE_CONCURRENCY : 0
APEX_UDP_SINK : 0
APEX_MEASURE_CONCURRENCY_PERIOD : 1000000
APEX_SCREEN_OUTPUT : 1
APEX_PROFILE_OUTPUT : 0
APEX_TASKGRAPH_OUTPUT : 0
APEX_PROC_CPUINFO : 0
APEX_PROC_MEMINFO : 0
APEX_PROC_NET_DEV : 0
APEX_PROC_SELF_STATUS : 0
APEX_PROC_STAT : 1
APEX_THROTTLE_CONCURRENCY : 0
APEX_THROTTLING_MAX_THREADS : 240
APEX_THROTTLING_MIN_THREADS : 1
APEX_THROTTLE_ENERGY : 0
APEX_THROTTLING_MAX_WATTS : 300
APEX_THROTTLING_MIN_WATTS : 150
APEX_PTHREAD_WRAPPER_STACK_SIZE : 0
APEX_UDP_SINK_HOST : localhost
APEX_UDP_SINK_PORT : 5560
APEX_UDP_SINK_CLIENTIP : 127.0.0.1
APEX_PAPI_METRICS :
New thread event
New node event
New thread event
New thread event
New thread event
New thread event
...
Sample value event
Sample value event
Sample value event
Elaspsed time: 7.66746
Cores detected: 240
Worker Threads observed: 10
Available CPU time: 76.6746
Action : #calls | minimum | mean | maximum | total | stddev | % total
------------------------------------------------------------------------------------------------------------
APEX MAIN THREAD : 1 --n/a-- 7.36e+00 --n/a-- 7.36e+00 0.00e+00 9.599
CPU Guest % : 5 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a--
CPU I/O Wait % : 5 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a--
CPU IRQ % : 5 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a--
CPU Idle % : 5 9.47e+01 9.66e+01 9.93e+01 4.83e+02 1.62e+00 --n/a--
CPU Nice % : 5 3.86e-01 3.23e+00 5.12e+00 1.62e+01 1.66e+00 --n/a--
CPU Steal % : 5 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 --n/a--
CPU System % : 5 1.18e-01 1.88e-01 3.22e-01 9.39e-01 8.23e-02 --n/a--
CPU User % : 5 5.60e-03 2.01e-02 7.59e-02 1.00e-01 2.79e-02 --n/a--
CPU soft IRQ % : 5 3.19e-02 4.03e-02 5.98e-02 2.02e-01 1.03e-02 --n/a--
broadcast_call_shutdown_fun... : 2 --n/a-- 5.93e-04 --n/a-- 1.19e-03 0.00e+00 0.002
broadcast_call_startup_func... : 2 --n/a-- 5.78e-04 --n/a-- 1.16e-03 0.00e+00 0.002
call_shutdown_functions_action : 2 --n/a-- 4.23e-03 --n/a-- 8.47e-03 3.76e-03 0.011
call_startup_functions_action : 2 --n/a-- 4.26e-03 --n/a-- 8.51e-03 3.77e-03 0.011
fibonacci_action : 177 --n/a-- 1.50e-03 --n/a-- 2.65e-01 7.39e-04 0.345
hpx_main : 1 --n/a-- 2.48e-03 --n/a-- 2.48e-03 0.00e+00 0.003
load_components_action : 2 --n/a-- 6.48e-02 --n/a-- 1.30e-01 5.25e-02 0.169
pre_main : 1 --n/a-- 3.02e-02 --n/a-- 3.02e-02 0.00e+00 0.039
primary_namespace_bulk_serv... : 40 --n/a-- 6.75e-04 --n/a-- 2.70e-02 4.46e-04 0.035
primary_namespace_service_a... : 4 --n/a-- 3.42e-04 --n/a-- 1.37e-03 1.65e-04 0.002
run_helper : 1 --n/a-- 3.51e-03 --n/a-- 3.51e-03 0.00e+00 0.005
symbol_namespace_service_ac... : 7 --n/a-- 4.58e-04 --n/a-- 3.21e-03 1.07e-04 0.004
APEX Idle : --n/a-- --n/a-- --n/a-- --n/a-- 6.88e+01 --n/a-- 89.773
------------------------------------------------------------------------------------------------------------
Shutdown event
The output on the Babbage host and on Edison is similar. Note that APEX/HPX shutdown is somewhat delayed on Babbage: the more threads are requested, the longer it takes HPX to terminate. This is a known issue and is being investigated.
This exercise shows how to enable TAU profiling with APEX and demonstrates what is generated. The example program is a simple 1D stencil heat diffusion program with 1000 partitions of 100000 cells each. The program is executed on the MIC with 60 threads. This is neither the ideal program decomposition nor the ideal number of threads, as we will see in later examples.
The program is executed by starting (or continuing) an interactive session and running the example:
# to run on the Babbage MIC devices
salloc --reservation=SC_Reservation -N 1
# after the allocation is granted:
./scripts/run_1d_stencil-mic.sh

# to run on the Babbage host nodes
salloc --reservation=SC_Reservation -N 1
# after the allocation is granted:
./scripts/run_1d_stencil-host.sh

# to run on Edison
qsub -I -V -d . -W x=FLAGS:ADVRES:Edison.SC15.376253
# after the allocation is granted:
./scripts/run_1d_stencil-cray.sh
The output should look something like this:
mpirun.mic -n 1 -hostfile micfile.8938 -ppn 1 ./build-mic/apex_examples/1d_stencil_4 --hpx:threads 60 --nx 100000 --np 1000 --nt 45
OS_Threads,Execution_Time_sec,Points_per_Partition,Partitions,Time_Steps
60, 44.961225832, 100000, 1000, 45
The working directory should now contain TAU profile and trace files. To see the TAU summary of the profiles, use the pprof command:
pprof -s
Reading Profile files in profile.*
FUNCTION SUMMARY (total):
---------------------------------------------------------------------------------------
%Time Exclusive Inclusive #Call #Subrs Inclusive Name
msec total msec usec/call
---------------------------------------------------------------------------------------
100.0 1:27:15.217 1:42:02.260 71 76631 86229019 .TAU application
10.3 10:30.319 10:30.319 45009 0 14004 hpx::lcos::local::dataflow::execute
1.4 58,775 1:26.683 1 58 86683381 ProcData::read_proc
1.4 1:26.420 1:26.420 1 0 86420417 APEX MAIN THREAD
1.3 1:17.521 1:17.521 60 0 1292027 hpx_main
0.5 27,833 27,908 58 385 481178 ProcData::read_proc: main loop
0.1 5,938 5,938 31881 0 186 profiler_listener::process_profiles
0.0 117 117 3 0 39012 load_components_action
0.0 73 73 3 0 24360 pre_main
0.0 12 12 31 0 389 primary_namespace_bulk_service_action
0.0 7 7 2 0 3669 call_startup_functions_action
0.0 6 6 2 0 3336 call_shutdown_functions_action
0.0 6 6 9 0 708 symbol_namespace_service_action
0.0 4 4 1 0 4924 run_helper
0.0 2 2 5 0 436 primary_namespace_service_action
0.0 2 2 4 0 520 broadcast_call_startup_functions_action
0.0 1 1 4 0 473 broadcast_call_shutdown_functions_action
FUNCTION SUMMARY (mean):
---------------------------------------------------------------------------------------
%Time Exclusive Inclusive #Call #Subrs Inclusive Name
msec total msec usec/call
---------------------------------------------------------------------------------------
100.0 1:13.735 1:26.229 1 1079.31 86229019 .TAU application
10.3 8,877 8,877 633.93 0 14004 hpx::lcos::local::dataflow::execute
1.4 827 1,220 0.0140845 0.816901 86683381 ProcData::read_proc
1.4 1,217 1,217 0.0140845 0 86420417 APEX MAIN THREAD
1.3 1,091 1,091 0.84507 0 1292027 hpx_main
0.5 392 393 0.816901 5.42254 481178 ProcData::read_proc: main loop
0.1 83 83 449.028 0 186 profiler_listener::process_profiles
0.0 1 1 0.0422535 0 39012 load_components_action
0.0 1 1 0.0422535 0 24360 pre_main
0.0 0.17 0.17 0.43662 0 389 primary_namespace_bulk_service_action
0.0 0.103 0.103 0.028169 0 3669 call_startup_functions_action
0.0 0.094 0.094 0.028169 0 3335 call_shutdown_functions_action
0.0 0.0897 0.0897 0.126761 0 708 symbol_namespace_service_action
0.0 0.0694 0.0694 0.0140845 0 4924 run_helper
0.0 0.0307 0.0307 0.0704225 0 436 primary_namespace_service_action
0.0 0.0293 0.0293 0.056338 0 520 broadcast_call_startup_functions_action
0.0 0.0266 0.0266 0.056338 0 473 broadcast_call_shutdown_functions_action
Running pprof without the -s option will show data for each individual thread. For a visualization of the profile output, use the paraprof program.
Before the trace files can be visualized, they have to be merged. The trace files are merged using the tau_multimerge program, and then the trace is converted to slog2 format using the tau2slog2 program. After those two steps, the trace can be loaded into the jumpshot program:
tau_multimerge
tau2slog2 tau.trc tau.edf -o tau.slog2
jumpshot ./tau.slog2
This program is the same 1D stencil heat diffusion program described in exercise 2, but modified to include an APEX policy that will attempt to adjust the thread concurrency to improve performance. The program is memory-bound beyond a certain number of threads (system-dependent, usually around 8-12) because the memory request traffic far exceeds the amount of computation required to update a single cell. Scaling studies of this test program have shown that the ideal number of threads is that which maximizes concurrency without oversaturating the memory controller. The APEX policy uses ActiveHarmony (http://www.dyninst.org/harmony) to minimize the HPX thread queue length (the number of tasks waiting to execute). This is the function that requests the counter value from HPX and adds it to the APEX profile as a sampled value:
bool test_function(apex_context const& context) {
if (!counters_initialized) return false;
try {
counter_value value1 = performance_counter::get_value(counter_id);
apex::sample_value("thread_queue_length", value1.get_value<int>());
return APEX_NOERROR;
}
catch(hpx::exception const& e) {
std::cerr
<< "apex_policy_engine_active_thread_count: caught exception: "
<< e.what() << std::endl;
return APEX_ERROR;
}
}
This is the APEX API call that sets up concurrency throttling, using the value sampled by that function:
void register_policies() {
apex::register_periodic_policy(100000, test_function);
apex::setup_timer_throttling(std::string("thread_queue_length"),
APEX_MINIMIZE_ACCUMULATED, APEX_ACTIVE_HARMONY, 200000);
}
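For reference, here are the same two calls annotated with the parameter meanings inferred from the description above (the period units are not stated in the text, so treat them as an assumption):
// sample the HPX thread queue length periodically (period of 100000, presumably microseconds)
apex::register_periodic_policy(100000, test_function);
// minimize the accumulated "thread_queue_length" samples using ActiveHarmony,
// re-evaluating the thread cap with a period of 200000 (presumably microseconds)
apex::setup_timer_throttling(std::string("thread_queue_length"),
    APEX_MINIMIZE_ACCUMULATED, APEX_ACTIVE_HARMONY, 200000);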
The policy registration is configured to run as an HPX "startup" function:
hpx::register_startup_function(&register_policies);
The program is executed by starting (or continuing) an interactive session and running the example:
# to run on the Babbage host nodes
salloc --reservation=SC_Reservation -N 1
# after the allocation is granted:
./scripts/run_1d_stencil_throttle-host.sh

# to run on the Babbage MIC devices
salloc --reservation=SC_Reservation -N 1
# after the allocation is granted:
./scripts/run_1d_stencil_throttle-mic.sh

# to run on Edison
qsub -I -V -d . -W x=FLAGS:ADVRES:Edison.SC15.376253
# after the allocation is granted:
./scripts/run_1d_stencil_throttle-cray.sh
The output should look something like this:
./build-host/apex_examples/1d_stencil_4_throttle --hpx:queuing=throttle --hpx:threads 32 --nx 100000 --np 1000 --nt 100 --hpx:bind=balanced
Counters initialized! {00000001de000001, 0000000000001001}
Active threads 0
APEX concurrency throttling enabled, min threads: 8 max threads: 32
Cap: 32 New: 52 Prev: 52
Cap: 24 New: 0 Prev: 52
Cap: 16 New: 0 Prev: 52
Cap: 24 New: 196 Prev: 248
Cap: 20 New: 0 Prev: 248
Cap: 16 New: 0 Prev: 248
Cap: 20 New: 0 Prev: 248
Cap: 22 New: 957 Prev: 1205
Cap: 20 New: 0 Prev: 1205
Cap: 18 New: 667 Prev: 1872
Cap: 20 New: 847 Prev: 2719
Cap: 21 New: 123 Prev: 2842
Cap: 20 New: 0 Prev: 2842
Cap: 19 New: 605 Prev: 3447
Cap: 20 New: 0 Prev: 3447
Cap: 21 New: 607 Prev: 4054
Cap: 20 New: 0 Prev: 4054
Cap: 19 New: 0 Prev: 4054
Cap: 20 New: 126 Prev: 4180
Cap: 20 New: 725 Prev: 4905
Cap: 20 New: 0 Prev: 4905
Thread Cap value optimization has converged.
Thread Cap value : 20
Cap: 20 New: 619 Prev: 5524
Cap: 20 New: 741 Prev: 6265
...
Cap: 20 New: 198 Prev: 11787
Cap: 20 New: 139 Prev: 11926
OS_Threads,Execution_Time_sec,Points_per_Partition,Partitions,Time_Steps
32, 15.151157572, 100000, 1000, 100
While 20 threads is not the optimal solution, it is an improvement over the performance without adaptation, and it uses fewer resources.
This program is the same 1D stencil heat diffusion program described in exercise 2, but modified to include an APEX policy that will attempt to adjust the problem decomposition to improve performance. As described in exercise 2, the program is memory-bound, but the decomposition also has an effect on performance. After some number of iterations, the problem is re-partitioned to try different decompositions and improve performance. The APEX policy uses ActiveHarmony to minimize the time spent in each block of iterations (50 in this example). The block of iterations is timed with an HPX timer (in nanoseconds), and the elapsed time is returned by a method that converts it to seconds:
double get_global_elapsed() {
double seconds = global_elapsed / 1e9;
return seconds;
}
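A minimal sketch of how global_elapsed could be filled in, using std::chrono as a stand-in for the HPX timer mentioned above (only the variable name global_elapsed is taken from the example; the rest is assumed):
#include <chrono>
// time one block of iterations (50 in this example) in nanoseconds
auto block_start = std::chrono::steady_clock::now();
// ... run one block of time steps ...
auto block_end = std::chrono::steady_clock::now();
global_elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(
    block_end - block_start).count();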
This is the APEX API call that sets up the tuning, using get_global_elapsed as the evaluation function:
// Set up APEX tuning
// The tunable parameter -- how many partitions to divide data into
long np_index = 1;
long * tune_params[1] = { 0L };
long num_params = 1;
long mins[1] = { 0 };
long maxs[1] = { (long)divisors.size() };
long steps[1] = { 1 };
tune_params[0] = &np_index;
apex::setup_custom_tuning(get_global_elapsed, end_iteration_event, num_params,
tune_params, mins, maxs, steps);
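As a rough sketch of how the tuned index presumably maps back to a decomposition (divisors and np_index come from the snippet above; everything else, including total_cells, is an assumption based on the output below, where partitions times points-per-partition always equals the total problem size):
// the total number of cells stays fixed; the tuner effectively picks one of its divisors
long new_np = divisors[np_index];      // candidate number of partitions
long new_nx = total_cells / new_np;    // resulting points per partition (total_cells assumed fixed)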
The policy registration is configured to run as an HPX "startup" function:
hpx::register_startup_function(&register_policies);
The repartitioning is triggered by an APEX custom event:
apex::custom_event(end_iteration_event, 0);
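For completeness, here is a sketch of where end_iteration_event might come from; this is assumed from the APEX API rather than copied from the example source. Custom events are registered by name once, and the returned apex_event_type is then passed to both setup_custom_tuning and custom_event:
apex_event_type end_iteration_event = apex::register_custom_event("end iteration");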
The program is executed by starting (or continuing) an interactive session and running the example:
# to run on the Babbage host nodes
salloc --reservation=SC_Reservation -N 1
# after the allocation is granted:
./scripts/run_1d_stencil_repart-host.sh

# to run on the Babbage MIC devices
salloc --reservation=SC_Reservation -N 1
# after the allocation is granted:
./scripts/run_1d_stencil_repart-mic.sh

# to run on Edison
qsub -I -V -d . -W x=FLAGS:ADVRES:Edison.SC15.376253
# after the allocation is granted:
./scripts/run_1d_stencil_repart-cray.sh
The output should look something like this:
./build-host/apex_examples/1d_stencil_4_repart --hpx:print-counter /threadqueue/length --hpx:print-counter-interval 100 --hpx:print-counter-destination /dev/null --hpx:threads 12 --nx 10000000 --nr 50 --nt 50 --hpx:bind=balanced
apex_policy_engine_active_thread_count: caught exception: unknown counter type /threads/idle-rate: HPX(bad_parameter)
Using iteration time and/or APEX idle rate instead.
OS_Threads,Execution_Time_sec,Points_per_Partition,Partitions,Time_Steps
12, 2.271579475, 5000000, 2, 50
12, 8.88190639, 320, 31250, 50
12, 1.850312922, 25000, 400, 50
12, 1.914576961, 3125, 3200, 50
12, 1.292234367, 25000, 400, 50
12, 1.016024995, 250000, 40, 50
12, 1.312500042, 25000, 400, 50
12, 1.026867275, 250000, 40, 50
12, 1.280510026, 80000, 125, 50
12, 1.005137222, 250000, 40, 50
12, 1.306790145, 1250000, 8, 50
12, 1.044563498, 250000, 40, 50
12, 1.014567008, 250000, 40, 50
12, 1.155023633, 500000, 20, 50
12, 1.278969172, 1250000, 8, 50
12, 1.151422426, 500000, 20, 50
12, 1.013433416, 250000, 40, 50
12, 1.321016533, 125000, 80, 50
12, 1.166019871, 500000, 20, 50
12, 1.328789565, 125000, 80, 50
12, 1.067668983, 250000, 40, 50
12, 1.09541764, 400000, 25, 50
12, 1.043721435, 250000, 40, 50
12, 1.200178215, 200000, 50, 50
12, 1.063306877, 250000, 40, 50
12, 1.21445271, 312500, 32, 50
12, 1.014121612, 250000, 40, 50
12, 1.085850936, 250000, 40, 50
12, 1.02991568, 250000, 40, 50
12, 1.225457176, 312500, 32, 50
12, 1.032273038, 250000, 40, 50
12, 1.016382031, 250000, 40, 50
12, 1.028866099, 250000, 40, 50
12, 1.080911062, 250000, 40, 50
12, 1.016519709, 250000, 40, 50
Tuning has converged.
12, 1.022939519, 250000, 40, 50
12, 1.004997524, 250000, 40, 50
12, 1.004145836, 250000, 40, 50
12, 1.02543216, 250000, 40, 50
12, 1.026969246, 250000, 40, 50
12, 1.050520931, 250000, 40, 50
12, 1.029551753, 250000, 40, 50
12, 1.018438387, 250000, 40, 50
12, 1.043886854, 250000, 40, 50
12, 1.034152984, 250000, 40, 50
12, 1.023198584, 250000, 40, 50
12, 1.041405105, 250000, 40, 50
12, 1.027388522, 250000, 40, 50
12, 1.023507802, 250000, 40, 50
12, 1.015685204, 250000, 40, 50
After tuning, it is determined that for 12 threads, 40 partitions of 250000 cells each provide the best performance.
This program is the same 1D stencil heat diffusion program, modified to use thread concurrency throttling to enforce a user-specified power cap. This example is only available on Edison.
Power cap throttling is enabled by setting it up during startup (the power bounds are presumably taken from the APEX_THROTTLING_MAX_WATTS and APEX_THROTTLING_MIN_WATTS settings shown in the earlier settings dump):
apex::setup_power_cap_throttling();
The program is executed by starting (or continuing) an interactive session on Edison and running the example:
qsub -I -V -d . -W x=FLAGS:ADVRES:Edison.SC15.376253
# after the allocation is granted:
./scripts/run_1d_stencil_energy-cray.sh
By setting the APEX_MEASURE_CONCURRENCY=1 environment variable (for example, export APEX_MEASURE_CONCURRENCY=1 before launching the program), APEX generates a gnuplot script that shows the power consumption and thread concurrency over time. To view it, load the gnuplot module and run gnuplot (this may not work in SSH sessions without X11 forwarding):
module load gnuplot
gnuplot -persist concurrency.0.gnuplot