Building madgraph4gpu and measuring throughput
This page describes how to build the madgraph4gpu code and measure its throughput in terms of matrix elements (MEs) per second. It is meant to address issue #249.
For the moment, this is a single page focusing on the latest implementation for CUDA and vectorized C++, for the ggttgg physics process (epochX/cudacpp/ggttgg). Other physics processes are also available and can be built and tested in an analogous way (epochX/cudacpp/eemumu and epochX/cudacpp/ggtt).
This page replaces a previous version of this wiki describing the older epoch1 code for eemumu (epoch1/cuda/eemumu), which is now obsolete but is still available here.
Eventually, alternative implementations to CUDA/C++, based on Kokkos, Alpaka and SYCL, may also be described.
For the impatient: jump to the example which puts it all together.
For the very impatient: jump to the section explaining how to use the throughput12.sh script to get detailed performance comparisons.
In a chosen directory (here /data/valassi) and using your preferred authentication mechanism (here https) download the latest master
cd /data/valassi
git clone https://github.com/madgraph5/madgraph4gpu.git
cd madgraph4gpu
git checkout master
If you want to be sure that you are using the latest stable version of the epochX/cudacpp code, use the following commit:
git reset --hard 26d40755be840a55ef2e357392492546375ee34a
For convenience, the download directory will be referred to as MADGRAPH4GPU_HOME in the following (but this environment variable is not used anywhere inside the code or Makefiles).
export MADGRAPH4GPU_HOME=/data/valassi/madgraph4gpu
In production releases of the MadGraph5_aMC@NLO software, the physics code (Fortran, C++, CUDA...) to calculate a given physics process is automatically generated by a Python code generator.
Our developments in the madgraph4gpu project are based on an iterative process, where fixes and new features are added by modifying an existing auto-generated CUDA/C++ code base, and must then be back-ported to the Python code generator.
Previous versions of our code (namely in epoch1 and epoch2) were based on a one-off code generation, followed by many additions to the existing CUDA/C++. A new epoch was started when the features and fixes were back-ported to the Python code generator, maintained in a repository external to the project.
The new epochX developments follow a different approach, where the Python code generator is also included in the madgraph4gpu repository, and any new fixes and features in CUDA/C++ may be back-ported immediately to the code generator. For the two main physics processes we currently use for development (eemumu and ggttgg), both the manually developed and the auto-generated code are included in the repository. The iterative development process is the following: start from an auto-generated code and from an identical copy in the manually developed directory; add fixes and features to the latter; backport them to the code generator, regenerate the auto-generated directory; modify the code generator until the auto and manual directories are identical again; iterate by adding new fixes and features to the manual directory. More details are available in issue #244, which described how the current epochX structure was achieved from the previous epoch1 and epoch2.
The latest version of the code is in the epochX/cudacpp directory, which has the following contents:
\ls -1F $MADGRAPH4GPU_HOME/epochX/cudacpp
CODEGEN/
ee_mumu/
ee_mumu.auto/
gg_tt.auto/
gg_ttgg/
gg_ttgg.auto/
tput/
In particular:
- CODEGEN contains the Python code generator (as a "plugin" for an official MadGraph5_aMC@NLO software release)
- ee_mumu and ee_mumu.auto contain the manually developed and auto-generated code for the eemumu physics process
- gg_ttgg and gg_ttgg.auto contain the manually developed and auto-generated code for the ggttgg physics process
- gg_tt.auto contains the auto-generated code for the ggtt physics process (where we do no manual developments)
- tput contains a collection of scripts and logfiles for performance measurements
In the "steady state", typically after a major pull request:
- the auto generated code is that coming from the generator in the repository
- the manual code is identical to the auto generated code
- the throughput logs are those obtained with the latest auto and manual codes in the repository
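The "manual code is identical to the auto generated code" condition above can be verified with a simple recursive diff. The following is a minimal sketch (an assumption, not a script shipped with the project) of such a steady-state check:

```shell
# Steady-state check sketch: report whether the manually developed and
# auto-generated directories are identical (as expected after a major PR).
check_sync() {
  if diff -rq "$1" "$2" >/dev/null 2>&1; then
    echo in-sync
  else
    echo differs
  fi
}
# Typical usage with the epochX/cudacpp layout shown above:
# check_sync "$MADGRAPH4GPU_HOME/epochX/cudacpp/gg_ttgg" \
#            "$MADGRAPH4GPU_HOME/epochX/cudacpp/gg_ttgg.auto"
```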
To build and run the code you need
- O/S installation using the tsc clocksource (baseline is CentOS8), see issue #116 for details
- C++ compiler and runtime libraries (baseline is gcc9): the CXX environment variable must be set
- optionally, CUDA compiler and runtime libraries (baseline is nvcc 11.4): CUDA_HOME must be set (or nvcc must be in PATH); if CUDA_HOME points to an invalid path, a C++-only build is performed (using C++ random numbers instead of curand)
- optionally, set up ccache
In addition (if you use the custom profiling scripts such as throughput12.sh):
- optionally, set up the perf profiling tool
- optionally, set up the NVIDIA Nsight profiling tools
- optionally, set up python 3.8 or later
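Since all of the tools above are optional, a quick availability check before running the profiling scripts can be useful. The sketch below is illustrative; the tool names perf, ncu and python3 are assumptions about what the scripts invoke, and a missing tool only means the related profiling output will be absent:

```shell
# Report which optional profiling tools are available in PATH.
have() { command -v "$1" >/dev/null 2>&1; }
for tool in perf ncu python3; do
  if have "$tool"; then
    echo "$tool: found"
  else
    echo "$tool: not found (related profiling output will be missing)"
  fi
done
```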
The following C++ compilers are supported
- gcc9 or later
- clang10 or later (see issue #172)
- icx 202110 or later (icc is no longer supported because it has no support for compiler vector extensions, see issue #220)
At CERN, the baseline configuration with gcc9 is set up using
. /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0/x86_64-centos7/setup.sh
The line above sets up all relevant runtime libraries and also sets the CXX environment variable:
echo $CXX
/cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0-afc57/x86_64-centos7/bin/g++
To enable CUDA builds, you must set CUDA_HOME, or alternatively have nvcc in your PATH. For instance:
export PATH=/usr/local/cuda/bin:${PATH}
Note that a CUDA runtime library (the CURAND random number library) is used not only in the GPU/CUDA application, but also in the CPU/C++ application (in the former case, the device version is used and random numbers are generated on the GPU, while in the latter case the host version is used and random numbers are generated on the CPU). This is meant to ensure that the same physics results (average matrix elements) are obtained in both cases, as the same random number seed is always used.
If nvcc is not in PATH, or if it is but CUDA_HOME is set to an invalid path, then no CUDA runtime libraries are used and the C++ application uses an alternative implementation of random numbers based on the C++ standard library. This changes physics results slightly but has no impact on performance studies about the throughput of the matrix element calculation alone.
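The detection rules above can be summarized as follows. This is a simplified sketch (an assumption, not the actual Makefile logic): a valid CUDA_HOME or an nvcc found in PATH selects a CUDA build, otherwise a C++-only build is performed and C++ standard-library random numbers replace curand.

```shell
# Decide which build mode the rules described above would select.
cuda_mode() {
  if [ -x "${CUDA_HOME:-/nonexistent}/bin/nvcc" ]; then
    echo "cuda (via CUDA_HOME)"
  elif command -v nvcc >/dev/null 2>&1; then
    echo "cuda (via PATH)"
  else
    echo "cpp-only (std C++ random numbers)"
  fi
}
cuda_mode
```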
To use ccache
- you must have ccache in your PATH
- you must set USECCACHE=1 to tell madgraph4gpu to use ccache
- optionally, set CCACHE_DIR to your preferred ccache directory
At CERN, you may use
export PATH=/cvmfs/sft.cern.ch/lcg/releases/ccache/4.3-ed8d3/x86_64-centos7-gcc8-opt/bin:$PATH
Go to the appropriate P1_Sigma subdirectory for the chosen epoch and process. The build is done here.
cd $MADGRAPH4GPU_HOME
cd epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum
The following make variables (which can be set also via environment variables) control how the build is performed
- AVX=[none|sse4|avx2|512y|512z]
- AVX=none: disable C++ vectorization
- AVX=sse4: enable C++ vectorization with SSE4.2 (128 bit registers, i.e. 2 doubles or 4 floats per vector)
- AVX=avx2: enable C++ vectorization with AVX2 (256 bit registers, i.e. 4 doubles or 8 floats per vector)
- AVX=512y (default): enable C++ vectorization with AVX512, limited to 256 bit ymm vector instructions (i.e. 4 doubles or 8 floats per vector)
- AVX=512z: enable C++ vectorization with AVX512, including 512 bit zmm vector instructions (i.e. 8 doubles or 16 floats per vector)
- FPTYPE=[d|f]
- FPTYPE=d (default): use double precision floating-point variables (double)
- FPTYPE=f: use single precision floating-point variables (float)
- HELINL=[0|1]
- HELINL=0 (default): do not use aggressive inlining
- HELINL=1: use aggressive inlining (emulate LTO optimizations)
- USEBUILDDIR=[0|1]
- USEBUILDDIR=0 (default): place binaries (.o, .exe etc) in the P1_Sigma directory itself; if you attempt to recompile using different AVX, FPTYPE or HELINL settings, you will get an error
- USEBUILDDIR=1 (recommended): place binaries (.o, .exe etc) in a subdirectory of the P1_Sigma directory specific to the chosen AVX, FPTYPE or HELINL settings; you may perform several builds in parallel for different AVX, FPTYPE or HELINL settings, using different build directories
For detailed performance comparisons, USEBUILDDIR=1 is recommended to allow simultaneous builds with different FPTYPE's (see PR #213). You can use make cleanall to remove all build subdirectories.
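With per-configuration build directories, launching all five AVX builds side by side becomes straightforward. The sketch below only prints the expected directory names; the build.<avx>_<fptype>_inl<helinl> naming is inferred from the example later on this page and is an assumption about this version of the code:

```shell
# Enumerate the build subdirectories that USEBUILDDIR=1 would use for the
# five AVX modes with FPTYPE=d and HELINL=0.
build_dir() { echo "build.${1}_${2}_inl${3}"; }
for avx in none sse4 avx2 512y 512z; do
  # e.g. run: make USEBUILDDIR=1 AVX=$avx FPTYPE=d HELINL=0
  build_dir "$avx" d 0
done
```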
The AVX settings refer to Intel CPUs, but the code builds and runs with C++ vectorizations on AMD CPUs too (see PR #238).
Aggressive inlining is found to give almost a factor 4 speedup with no vectorization, and almost a factor 2 speedup with the best vectorization (see issue #229). This still needs to be better understood. In particular, note that AVX=none:sse4:avx2 throughputs are more or less in ratios 1:2:4 for double builds (as one would naively expect) when inlining is disabled, but not when inlining is enabled.
Two standalone executables are presently built in parallel in each build:
- the C++ executable check.exe (where the matrix element calculation is performed using vectorized C++ on the CPU)
- the CUDA executable gcheck.exe (where the matrix element calculation is performed using CUDA on the GPU)
Both executables accept the same command line arguments, which were originally designed for CUDA but are also kept for C++. The baseline performance tests are performed using the following arguments:
gcheck.exe -p 2048 256 12
check.exe -p 2048 256 12
In the GPU application, this computes 6M matrix elements, using 2048 blocks per grid, 256 threads per block, over 12 iterations of a full grid. In the CPU application, this also computes 6M matrix elements, in a way that reproduces the random number generation mechanism of the CUDA application (same random number seeds and same mapping of the random number arrays to assign them to different matrix element calculations).
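The arithmetic behind the "6M matrix elements" figure can be checked directly from the three arguments (blocks per grid, threads per block, iterations):

```shell
# 2048 blocks/grid x 256 threads/block x 12 iterations = total MEs (~6M).
echo $((2048 * 256 * 12))
```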
Note that the C++ executable includes OpenMP multithreading, but this is disabled by default (see PR #84). You may enable it by setting OMP_NUM_THREADS explicitly. However, this implementation is presently found to be suboptimal and may soon be replaced by a custom MT implementation (see issue #196).
The relevant lines describing the throughput of the matrix element calculation are those including EvtsPerSec[MECalcOnly] (3a). The previous lines including EvtsPerSec[MatrixElems] (3) show lower throughputs on the GPU, because they also include data copies between the host and device memory.
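A minimal way to keep only the ME-throughput lines is a simple grep on the EvtsPerSec[MECalcOnly] label. The sketch below runs the filter over sample text copied from the throughput12.sh output shown later on this page; in real use, you would pipe the check.exe or gcheck.exe output into the filter:

```shell
# Keep only the MECalcOnly (3a) throughput lines, dropping the
# MatrixElems (3) lines that also include host/device data copies.
me_throughput() { grep 'EvtsPerSec\[MECalcOnly\]'; }
me_throughput <<'EOF'
EvtsPerSec[MatrixElems] (3) = ( 7.204611e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.339841e+09 ) sec^-1
EOF
```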
Build the code using USEBUILDDIR=1 in the baseline configuration based on gcc9 and CUDA 11.4, with AVX=avx2 and the default FPTYPE=d and HELINL=0. Then run the C++ and the CUDA applications.
export MADGRAPH4GPU_HOME=/data/valassi/madgraph4gpu
. /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0/x86_64-centos7/setup.sh
export PATH=/cvmfs/sft.cern.ch/lcg/releases/ccache/4.3-ed8d3/x86_64-centos7-gcc8-opt/bin:$PATH
export PATH=/usr/local/cuda/bin:${PATH}
export USECCACHE=1
export CCACHE_DIR=$MADGRAPH4GPU_HOME/CCACHE_DIR
export USEBUILDDIR=1
cd $MADGRAPH4GPU_HOME
cd epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum
make cleanall
make AVX=avx2 FPTYPE=d HELINL=0
./build.avx2_d_inl0/check.exe -p 2048 256 12
./build.avx2_d_inl0/gcheck.exe -p 2048 256 12
For more detailed performance comparisons across different vectorization scenarios, you may use the throughput12.sh script. This builds the code in all relevant configurations, then runs the applications, selecting only the relevant lines of output and adding extra information from perf and an objdump-based script.
You only need to set up the runtime environment (compilers and tools) prior to running this script. The script internally sets USEBUILDDIR=1 and uses the appropriate AVX, FPTYPE and HELINL settings.
To compare the five AVX scenarios, for the default FPTYPE=d and HELINL=0 settings, just type
./throughput12.sh -avxall
To compare the five AVX scenarios, using both FPTYPE=d and FPTYPE=f, and using both HELINL=0 and HELINL=1, just type
./throughput12.sh -avxall -flt -inl
For instance, this is a typical output:
export MADGRAPH4GPU_HOME=/data/valassi/madgraph4gpu
. /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0/x86_64-centos7/setup.sh
export PATH=/cvmfs/sft.cern.ch/lcg/releases/ccache/4.3-ed8d3/x86_64-centos7-gcc8-opt/bin:$PATH
export PATH=/usr/local/cuda/bin:${PATH}
export USECCACHE=1
export CCACHE_DIR=$MADGRAPH4GPU_HOME/CCACHE_DIR
cd $MADGRAPH4GPU_HOME
cd epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum
make cleanall
./throughput12.sh -avxall
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.4.120 (gcc 9.2.0)] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.204611e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.339841e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.859595 sec
709,240,622 cycles:u # 0.651 GHz
1,758,148,505 instructions:u # 2.48 insn per cycle
1.151621107 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 122
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.318664e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 7.180935 sec
18,923,038,185 cycles:u # 2.630 GHz
48,576,130,976 instructions:u # 2.57 insn per cycle
7.198373449 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 2.547401e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 4.874589 sec
12,769,980,796 cycles:u # 2.613 GHz
29,943,791,884 instructions:u # 2.34 insn per cycle
4.891651807 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.564820e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.725796 sec
9,149,489,724 cycles:u # 2.447 GHz
16,568,031,235 instructions:u # 1.81 insn per cycle
3.742837080 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2746) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.941441e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.636535 sec
8,971,524,906 cycles:u # 2.458 GHz
16,505,115,717 instructions:u # 1.84 insn per cycle
3.653904248 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 3.591695e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 4.112475 sec
8,818,863,751 cycles:u # 2.137 GHz
13,367,219,072 instructions:u # 1.52 insn per cycle
4.129526285 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 1127) (512y: 205) (512z: 2045)
=========================================================================