Building madgraph4gpu and measuring throughput
This page describes how to build the madgraph4gpu code and measure its throughput in terms of matrix elements (MEs) per second. It is meant to address issue #249.
For the moment, this is a single page focusing on the latest implementation for CUDA and vectorized C++, which is only available for the eemumu process (epoch1/cuda/ee_mumu). Eventually, other implementations including later CUDA/C++ versions (epochX/cudacpp) and alternative Kokkos, Alpaka and Sycl implementations will also be described.
In a chosen directory (here /data/valassi) and using your preferred authentication mechanism (here https) download the latest master
cd /data/valassi
git clone https://github.com/madgraph5/madgraph4gpu.git
cd madgraph4gpu
git checkout master
For convenience, the download directory will be referred to as MADGRAPH4GPU_HOME in the following (but this environment variable is not used anywhere inside the code or Makefiles).
export MADGRAPH4GPU_HOME=/data/valassi/madgraph4gpu
To build and run the code you need
- O/S installation using the tsc clocksource (baseline is CentOS8), see issue #116 for details
- C++ compiler and runtime libraries (baseline is gcc9): the CXX environment variable must be set
- optionally, CUDA compiler and runtime libraries (baseline is nvcc 11.4): CUDA_HOME must be set (or nvcc must be in PATH); if CUDA_HOME points to an invalid path, a C++-only build is performed (using C++ random numbers instead of curand)
- optionally, set up ccache
In addition (if you use the custom profiling scripts such as throughput12.sh):
- optionally, set up the perf profiling tool
- optionally, set up the Nvidia nsight profiling tools
- optionally, set up python 3.8 or later
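As a convenience, the optional tools above can be probed up front. A minimal sketch; the check_tool helper is hypothetical and not part of madgraph4gpu:

```shell
# check_tool: hypothetical helper reporting whether an optional tool is in PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing (optional)"
  fi
}

# Example probes (the second name is deliberately nonexistent):
check_tool sh
check_tool no-such-tool-xyz
```

The same helper can be pointed at perf, nsys or python3 as needed.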
The following C++ compilers are supported
- gcc9 or later
- clang10 or later (see issue #172)
- icx 202110 or later (icc is no longer supported because it has no support for compiler vector extensions, see issue #220)
At CERN, the baseline configuration with gcc9 is set up using
. /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0/x86_64-centos7/setup.sh
The line above sets up all relevant runtime libraries and also sets the CXX environment variable:
echo $CXX
/cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0-afc57/x86_64-centos7/bin/g++
To enable CUDA builds, you must set CUDA_HOME, or alternatively have nvcc in your PATH. For instance:
export PATH=/usr/local/cuda/bin:${PATH}
Note that a CUDA runtime library (the CURAND random number library) is used not only in the GPU/CUDA application, but also in the CPU/C++ application (in the former case, the device version is used and random numbers are generated on the GPU, while in the latter case the host version is used and random numbers are generated on the CPU). This is meant to ensure that the same physics results (average matrix elements) are obtained in both cases, as the same random number seed is always used.
If nvcc is not in PATH, or if it is but CUDA_HOME is set to an invalid path, then no CUDA runtime libraries are used and the C++ application uses an alternative implementation of random numbers based on the C++ standard library. This changes the physics results slightly but has no impact on performance studies of the throughput of the matrix element calculation alone.
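The CUDA_HOME fallback described above can be sketched as a tiny shell check; cuda_mode is a hypothetical helper illustrating the behaviour, not code from the Makefiles:

```shell
# cuda_mode: hypothetical sketch of the fallback logic described above.
# A CUDA build is attempted only if $1/bin/nvcc actually exists and is executable.
cuda_mode() {
  if [ -x "$1/bin/nvcc" ]; then
    echo "CUDA+C++ build (curand random numbers)"
  else
    echo "C++-only build (C++ standard library random numbers)"
  fi
}

cuda_mode /no/such/path   # invalid CUDA_HOME: falls back to a C++-only build
```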
To use ccache
- you must have ccache in your PATH
- you must set USECCACHE=1 to tell madgraph4gpu to use ccache
- optionally, set CCACHE_DIR to your preferred ccache directory
At CERN, you may use
export PATH=/cvmfs/sft.cern.ch/lcg/releases/ccache/4.3-ed8d3/x86_64-centos7-gcc8-opt/bin:$PATH
Go to the appropriate P1_Sigma subdirectory for the chosen epoch and process. The build is done here.
cd $MADGRAPH4GPU_HOME
cd epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum
The following make variables (which can be set also via environment variables) control how the build is performed
- AVX=[none|sse4|avx2|512y|512z]
- AVX=none: disable C++ vectorization
- AVX=sse4: enable C++ vectorization with SSE4.2 (128 bit registers, i.e. 2 doubles or 4 floats per vector)
- AVX=avx2: enable C++ vectorization with AVX2 (256 bit registers, i.e. 4 doubles or 8 floats per vector)
- AVX=512y (default): enable C++ vectorization with AVX512, limited to 256 bit ymm vector instructions (i.e. 4 doubles or 8 floats per vector)
- AVX=512z: enable C++ vectorization with AVX512, including 512 bit zmm vector instructions (i.e. 8 doubles or 16 floats per vector)
- FPTYPE=[d|f]
- FPTYPE=d (default): use double precision floating-point variables (double)
- FPTYPE=f: use single precision floating-point variables (float)
- HELINL=[0|1]
- HELINL=0 (default): do not use aggressive inlining
- HELINL=1: use aggressive inlining (emulate LTO optimizations)
- USEBUILDDIR=[0|1]
- USEBUILDDIR=0 (default): place binaries (.o, .exe etc) in the P1_Sigma directory itself; if you attempt to recompile using different AVX, FPTYPE or HELINL settings, you will get an error
- USEBUILDDIR=1 (recommended): place binaries (.o, .exe etc) in a subdirectory of P1_Sigma directory specific to the chosen AVX, FPTYPE or HELINL settings; you may perform several builds in parallel for different AVX, FPTYPE or HELINL settings using different build directories
For detailed performance comparisons, USEBUILDDIR=1 is recommended to allow simultaneous builds with different FPTYPE settings (see PR #213). You can use make cleanall to remove all build subdirectories.
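With USEBUILDDIR=1, each flag combination gets its own build subdirectory. Judging from the build.avx2_d_inl0 directory used in the example at the end of this page, the naming appears to follow a build.&lt;avx&gt;_&lt;fptype&gt;_inl&lt;helinl&gt; pattern; the sketch below assumes that pattern, which is inferred from that single example rather than documented:

```shell
# Sketch: list the build subdirectory names expected for a scan over all AVX
# modes (double precision, no inlining). The naming pattern is inferred, not
# official; verify against the directories that make actually creates.
for avx in none sse4 avx2 512y 512z; do
  echo "build.${avx}_d_inl0"
done
```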
The AVX settings refer to Intel CPUs, but the code builds and runs with C++ vectorizations on AMD CPUs too (see PR #238).
Aggressive inlining is found to give almost a factor 4 speedup with no vectorization, and almost a factor 2 speedup with the best vectorization (see issue #229). This still needs to be better understood. In particular, note that AVX=none:sse4:avx2 throughputs are more or less in ratios 1:2:4 for double builds (as one would naively expect) when inlining is disabled, but not when inlining is enabled.
Two standalone executables are presently built in parallel in each build:
- the C++ executable check.exe (where the matrix element calculation is performed using vectorized C++ on the CPU)
- the CUDA executable gcheck.exe (where the matrix element calculation is performed using CUDA on the GPU)
Both executables accept the same command line arguments, which were originally designed for CUDA but were kept also for C++. The baseline performance tests use the following arguments:
gcheck.exe -p 2048 256 12
check.exe -p 2048 256 12
In the GPU application, this computes 6M matrix elements, using 2048 blocks per grid and 256 threads per block, over 12 iterations of a full grid. In the CPU application, this also computes the same 6M matrix elements.
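The 6M figure follows directly from the grid configuration, as blocks per grid times threads per block times iterations:

```shell
# 2048 blocks/grid * 256 threads/block * 12 iterations = total matrix elements
echo $((2048 * 256 * 12))   # 6291456, i.e. ~6M
```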
Note that the C++ executable includes OpenMP multithreading, but this is disabled by default (see PR #84). You may enable it by setting OMP_NUM_THREADS explicitly. However, this implementation is presently found to be suboptimal and may soon be replaced by a custom MT implementation (see issue #196).
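To enable OpenMP, set OMP_NUM_THREADS before launching check.exe; the thread count below is only an illustrative example:

```shell
# The real invocation would look like (thread count is illustrative):
#   OMP_NUM_THREADS=4 ./build.avx2_d_inl0/check.exe -p 2048 256 12
# Leaving OMP_NUM_THREADS unset keeps OpenMP multithreading disabled (the default).
OMP_NUM_THREADS=4
echo "requested OpenMP threads: $OMP_NUM_THREADS"
```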
Build the code using USEBUILDDIR=1 in the baseline configuration based on gcc9 and CUDA 11.4, using AVX=avx2 together with the default FPTYPE=d and HELINL=0. Then run the C++ and the CUDA applications.
export MADGRAPH4GPU_HOME=/data/valassi/madgraph4gpu
. /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0/x86_64-centos7/setup.sh
export PATH=/cvmfs/sft.cern.ch/lcg/releases/ccache/4.3-ed8d3/x86_64-centos7-gcc8-opt/bin:$PATH
export PATH=/usr/local/cuda/bin:${PATH}
export USECCACHE=1
export CCACHE_DIR=$MADGRAPH4GPU_HOME/CCACHE_DIR
export USEBUILDDIR=1
cd $MADGRAPH4GPU_HOME
cd epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum
make cleanall
make AVX=avx2 FPTYPE=d HELINL=0
./build.avx2_d_inl0/check.exe -p 2048 256 12
./build.avx2_d_inl0/gcheck.exe -p 2048 256 12