Building madgraph4gpu and measuring throughput
This page describes how to build the madgraph4gpu code and measure its throughput in terms of matrix elements (MEs) per second. It is meant to address issue #249.
For the moment, this is a single page focusing on the latest implementation for CUDA and vectorized C++, which is only available for the eemumu process (epoch1/cuda/ee_mumu). Eventually, other implementations including later CUDA/C++ versions (epochX/cudacpp) and alternative Kokkos, Alpaka and Sycl implementations will also be described.
In a chosen directory (here /data/valassi) and using your preferred authentication mechanism (here https) download the latest master
cd /data/valassi
git clone https://github.com/madgraph5/madgraph4gpu.git
cd madgraph4gpu
git checkout master
For convenience, the download directory will be referred to as MADGRAPH4GPU_HOME in the following (but this environment variable is not used anywhere inside the code or Makefiles).
export MADGRAPH4GPU_HOME=/data/valassi/madgraph4gpu
To build and run the code you need
- O/S installation using the tsc clocksource (baseline is CentOS8), see issue #116 for details
- C++ compiler and runtime libraries (baseline is gcc9): the CXX environment variable must be set
- optionally, CUDA compiler and runtime libraries (baseline is nvcc 11.4): CUDA_HOME must be set (or nvcc must be in PATH); if CUDA_HOME points to an invalid path, a C++-only build is performed (using C++ random numbers instead of curand)
- optionally, set up ccache
In addition (if you use the custom profiling scripts such as throughput12.sh):
- optionally, set up the perf profiling tool
- optionally, set up the Nvidia nsight profiling tools
- optionally, set up python 3.8 or later
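As a convenience, the optional tools above can be probed up front. A minimal sketch; the check_tool helper is hypothetical and not part of madgraph4gpu:

```shell
# check_tool: hypothetical helper reporting whether an optional tool is in PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing (optional)"
  fi
}

# Example probes (the second name is deliberately nonexistent):
check_tool sh
check_tool no-such-tool-xyz
```

The same helper can be pointed at perf, nsys or python3 as needed.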
The following C++ compilers are supported
- gcc9 or later
- clang10 or later (see issue #172)
- icx 202110 or later (icc is no longer supported because it has no support for compiler vector extensions, see issue #220)
At CERN, the baseline configuration with gcc9 is set up using
. /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0/x86_64-centos7/setup.sh
The line above sets up all relevant runtime libraries and also sets the CXX environment variable:
echo $CXX
/cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0-afc57/x86_64-centos7/bin/g++
To enable CUDA builds, you must set CUDA_HOME, or alternatively have nvcc in your PATH. For instance:
export PATH=/usr/local/cuda/bin:${PATH}
Note that a CUDA runtime library (the CURAND random number library) is used not only in the GPU/CUDA application, but also in the CPU/C++ application (in the former case, the device version is used and random numbers are generated on the GPU, while in the latter case the host version is used and random numbers are generated on the CPU). This is meant to ensure that the same physics results (average matrix elements) are obtained in both cases, as the same random number seed is always used.
If nvcc is not in PATH, or if it is but CUDA_HOME is set to an invalid path, then no CUDA runtime libraries are used and the C++ application uses an alternative implementation of random numbers based on the C++ standard library. This changes the physics results slightly but has no impact on performance studies of the throughput of the matrix element calculation alone.
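The CUDA_HOME fallback described above can be sketched as a tiny shell check; cuda_mode is a hypothetical helper illustrating the behaviour, not code from the Makefiles:

```shell
# cuda_mode: hypothetical sketch of the fallback logic described above.
# A CUDA build is attempted only if $1/bin/nvcc actually exists and is executable.
cuda_mode() {
  if [ -x "$1/bin/nvcc" ]; then
    echo "CUDA+C++ build (curand random numbers)"
  else
    echo "C++-only build (C++ standard library random numbers)"
  fi
}

cuda_mode /no/such/path   # invalid CUDA_HOME: falls back to a C++-only build
```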
To use ccache
- you must have ccache in your PATH
- you must set USECCACHE=1 to tell madgraph4gpu to use ccache
- optionally, set CCACHE_DIR to your preferred ccache directory
At CERN, you may use
export PATH=/cvmfs/sft.cern.ch/lcg/releases/ccache/4.3-ed8d3/x86_64-centos7-gcc8-opt/bin:$PATH
Go to the appropriate P1_Sigma subdirectory for the chosen epoch and process. The build is done here.
cd $MADGRAPH4GPU_HOME
cd epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum
The following make variables (which can be set also via environment variables) control how the build is performed
- AVX=[none|sse4|avx2|512y|512z]
- AVX=none: disable C++ vectorization
- AVX=sse4: enable C++ vectorization with SSE4.2 (128 bit registers, i.e. 2 doubles or 4 floats per vector)
- AVX=avx2: enable C++ vectorization with AVX2 (256 bit registers, i.e. 4 doubles or 8 floats per vector)
- AVX=512y (default): enable C++ vectorization with AVX512, limited to 256 bit ymm vector instructions (i.e. 4 doubles or 8 floats per vector)
- AVX=512z: enable C++ vectorization with AVX512, including 512 bit zmm vector instructions (i.e. 8 doubles or 16 floats per vector)
- FPTYPE=[d|f]
- FPTYPE=d (default): use double precision floating-point variables (double)
- FPTYPE=f: use single precision floating-point variables (float)
- HELINL=[0|1]
- HELINL=0 (default): do not use aggressive inlining
- HELINL=1: use aggressive inlining (emulate LTO optimizations)
- USEBUILDDIR=[0|1]
- USEBUILDDIR=0 (default): place binaries (.o, .exe etc) in the P1_Sigma directory itself; if you attempt to recompile using different AVX, FPTYPE or HELINL settings, you will get an error
- USEBUILDDIR=1 (recommended): place binaries (.o, .exe etc) in a subdirectory of P1_Sigma directory specific to the chosen AVX, FPTYPE or HELINL settings; you may perform several builds in parallel for different AVX, FPTYPE or HELINL settings using different build directories
For detailed performance comparisons, USEBUILDDIR=1 is recommended to allow simultaneous builds with different FPTYPE settings (see PR #213). You can use make cleanall to remove all build subdirectories.
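With USEBUILDDIR=1, each flag combination gets its own build subdirectory. Judging from the build.avx2_d_inl0 directory used in the example at the end of this page, the naming appears to follow a build.&lt;avx&gt;_&lt;fptype&gt;_inl&lt;helinl&gt; pattern; the sketch below assumes that pattern, which is inferred from that single example rather than documented:

```shell
# Sketch: list the build subdirectory names expected for a scan over all AVX
# modes (double precision, no inlining). The naming pattern is inferred, not
# official; verify against the directories that make actually creates.
for avx in none sse4 avx2 512y 512z; do
  echo "build.${avx}_d_inl0"
done
```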
The AVX settings refer to Intel CPUs, but the code builds and runs with C++ vectorizations on AMD CPUs too (see PR #238).
Aggressive inlining is found to give almost a factor 4 speedup with no vectorization, and almost a factor 2 speedup with the best vectorization (see issue #229). This still needs to be better understood. In particular, note that AVX=none:sse4:avx2 throughputs are more or less in ratios 1:2:4 for double builds (as one would naively expect) when inlining is disabled, but not when inlining is enabled.
Two standalone executables are presently built in parallel in each build:
- the C++ executable check.exe (where the matrix element calculation is performed using vectorized C++ on the CPU)
- the CUDA executable gcheck.exe (where the matrix element calculation is performed using CUDA on the GPU)
Both executables accept the same command line arguments, which were originally designed for CUDA but were kept also for C++. The baseline performance tests use the following arguments:
gcheck.exe -p 2048 256 12
check.exe -p 2048 256 12
In the GPU application, this computes 6M matrix elements, using 2048 blocks per grid and 256 threads per block, over 12 iterations of a full grid. In the CPU application, this also computes the same 6M matrix elements.
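The 6M figure follows directly from the grid configuration, as blocks per grid times threads per block times iterations:

```shell
# 2048 blocks/grid * 256 threads/block * 12 iterations = total matrix elements
echo $((2048 * 256 * 12))   # 6291456, i.e. ~6M
```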
Note that the C++ executable includes OpenMP multithreading, but this is disabled by default (see PR #84). You may enable it by setting OMP_NUM_THREADS explicitly. However, this implementation is presently found to be suboptimal and may soon be replaced by a custom MT implementation (see issue #196).
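To enable OpenMP, set OMP_NUM_THREADS before launching check.exe; the thread count below is only an illustrative example:

```shell
# The real invocation would look like (thread count is illustrative):
#   OMP_NUM_THREADS=4 ./build.avx2_d_inl0/check.exe -p 2048 256 12
# Leaving OMP_NUM_THREADS unset keeps OpenMP multithreading disabled (the default).
OMP_NUM_THREADS=4
echo "requested OpenMP threads: $OMP_NUM_THREADS"
```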
Build the code using USEBUILDDIR=1 in the baseline configuration based on gcc9 and CUDA 11.4, using AVX=avx2 together with the default FPTYPE=d and HELINL=0. Then run the C++ and the CUDA applications.
export MADGRAPH4GPU_HOME=/data/valassi/madgraph4gpu
. /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0/x86_64-centos7/setup.sh
export PATH=/cvmfs/sft.cern.ch/lcg/releases/ccache/4.3-ed8d3/x86_64-centos7-gcc8-opt/bin:$PATH
export PATH=/usr/local/cuda/bin:${PATH}
export USECCACHE=1
export CCACHE_DIR=$MADGRAPH4GPU_HOME/CCACHE_DIR
export USEBUILDDIR=1
cd $MADGRAPH4GPU_HOME
cd epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum
make cleanall
make AVX=avx2 FPTYPE=d HELINL=0
./build.avx2_d_inl0/check.exe -p 2048 256 12
./build.avx2_d_inl0/gcheck.exe -p 2048 256 12