Building madgraph4gpu and measuring throughput
This page describes how to build the madgraph4gpu code and measure its throughput in terms of matrix elements (MEs) per second. It is meant to address issue #249.
For the moment, this is a single page focusing on the latest implementation for CUDA and vectorized C++, for the ggttgg physics process (epochX/cudacpp/ggttgg). Other physics processes are also available and can be built and tested in an analogous way (epochX/cudacpp/eemumu and epochX/cudacpp/ggtt).
This page replaces a previous version of this wiki describing the older epoch1 code for eemumu (epoch1/cuda/eemumu), which is now obsolete but is still available here.
Eventually, alternative implementations to CUDA/C++, based on Kokkos, Alpaka and SYCL, may also be described.
For the impatient: jump to the example which puts it all together.
For the very impatient: jump to the section explaining how to use the throughput12.sh script to get detailed performance comparisons.
In a chosen directory (here /data/valassi) and using your preferred authentication mechanism (here https) download the latest master
cd /data/valassi
git clone https://github.com/madgraph5/madgraph4gpu.git
cd madgraph4gpu
git checkout master
If you want to be sure that you are using the latest stable version of the epochX/cudacpp code, use the following commit:
git reset --hard 26d40755be840a55ef2e357392492546375ee34a
For convenience, the download directory will be referred to as MADGRAPH4GPU_HOME in the following (but this environment variable is not used anywhere inside the code or Makefiles).
export MADGRAPH4GPU_HOME=/data/valassi/madgraph4gpu
In production releases of the MadGraph5_aMC@NLO software, the physics code (Fortran, C++, CUDA...) to calculate a given physics process is automatically generated by a Python code generator.
Our developments in the madgraph4gpu project are based on an iterative process, where fixes and new features are added by modifying an existing auto-generated CUDA/C++ code base, and must then be back-ported to the Python code generator.
Previous versions of our code (namely in epoch1 and epoch2) were based on a one-off code generation, followed by many additions to the existing CUDA/C++. A new epoch was started when the features and fixes were back-ported to the Python code generator, maintained in a repository external to the project.
The new epochX developments follow a different approach, where the Python code generator is also included in the madgraph4gpu repository, and any new fixes and features in CUDA/C++ may be back-ported immediately to the code generator. For the two main physics processes we currently use for development (eemumu and ggttgg), both the manually developed and the auto-generated code are included in the repository. The iterative development process is the following: start from an auto-generated code and from an identical copy in the manually developed directory; add fixes and features to the latter; backport them to the code generator, regenerate the auto-generated directory; modify the code generator until the auto and manual directories are identical again; iterate by adding new fixes and features to the manual directory. More details are available in issue #244, which described how the current epochX structure was achieved from the previous epoch1 and epoch2.
The latest version of the code is in the epochX/cudacpp directory, which has the following contents:
\ls -1F $MADGRAPH4GPU_HOME/epochX/cudacpp
CODEGEN/
ee_mumu/
ee_mumu.auto/
gg_tt.auto/
gg_ttgg/
gg_ttgg.auto/
tput/
In particular:
- CODEGEN contains the Python code generator (as a "plugin" for an official MadGraph5_aMC@NLO software release)
- ee_mumu and ee_mumu.auto contain the manually developed and auto-generated code for the eemumu physics process
- gg_ttgg and gg_ttgg.auto contain the manually developed and auto-generated code for the ggttgg physics process
- gg_tt.auto contains the auto-generated code for the ggtt physics process (where we do no manual developments)
- tput contains a collection of scripts and logfiles for performance measurements
In the "steady state", typically after a major pull request:
- the auto generated code is that coming from the generator in the repository
- the manual code is identical to the auto generated code
- the throughput logs are those obtained with the latest auto and manual codes in the repository
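The "manual code is identical to the auto generated code" condition above can be verified with a simple recursive diff. The following is a minimal sketch (an assumption, not a script shipped with the project) of such a steady-state check:

```shell
# Steady-state check sketch: report whether the manually developed and
# auto-generated directories are identical (as expected after a major PR).
check_sync() {
  if diff -rq "$1" "$2" >/dev/null 2>&1; then
    echo in-sync
  else
    echo differs
  fi
}
# Typical usage with the epochX/cudacpp layout shown above:
# check_sync "$MADGRAPH4GPU_HOME/epochX/cudacpp/gg_ttgg" \
#            "$MADGRAPH4GPU_HOME/epochX/cudacpp/gg_ttgg.auto"
```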
To build and run the code you need
- O/S installation using the tsc clocksource (baseline is CentOS8), see issue #116 for details
- C++ compiler and runtime libraries (baseline is gcc9): the CXX environment variable must be set
- optionally, CUDA compiler and runtime libraries (baseline is nvcc 11.4): CUDA_HOME must be set (or nvcc must be in PATH); if CUDA_HOME points to an invalid path, a C++-only build is performed (using C++ random numbers instead of curand)
- optionally, set up ccache
In addition (if you use the custom profiling scripts such as throughput12.sh):
- optionally, set up the perf profiling tool
- optionally, set up the NVIDIA Nsight profiling tools
- optionally, set up python 3.8 or later
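Since all of the tools above are optional, a quick availability check before running the profiling scripts can be useful. The sketch below is illustrative; the tool names perf, ncu and python3 are assumptions about what the scripts invoke, and a missing tool only means the related profiling output will be absent:

```shell
# Report which optional profiling tools are available in PATH.
have() { command -v "$1" >/dev/null 2>&1; }
for tool in perf ncu python3; do
  if have "$tool"; then
    echo "$tool: found"
  else
    echo "$tool: not found (related profiling output will be missing)"
  fi
done
```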
The following C++ compilers are supported
- gcc9 or later
- clang10 or later (see issue #172)
- icx 202110 or later (icc is no longer supported because it has no support for compiler vector extensions, see issue #220)
At CERN, the baseline configuration with gcc9 is set up using
. /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0/x86_64-centos7/setup.sh
The line above sets up all relevant runtime libraries and also sets the CXX environment variable:
echo $CXX
/cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0-afc57/x86_64-centos7/bin/g++
To enable CUDA builds, you must set CUDA_HOME, or alternatively have nvcc in your PATH. For instance:
export PATH=/usr/local/cuda/bin:${PATH}
Note that a CUDA runtime library (the CURAND random number library) is used not only in the GPU/CUDA application, but also in the CPU/C++ application (in the former case, the device version is used and random numbers are generated on the GPU, while in the latter case the host version is used and random numbers are generated on the CPU). This is meant to ensure that the same physics results (average matrix elements) are obtained in both cases, as the same random number seed is always used.
If nvcc is not in PATH, or if it is but CUDA_HOME is set to an invalid path, then no CUDA runtime libraries are used and the C++ application uses an alternative implementation of random numbers based on the C++ standard library. This changes physics results slightly but has no impact on performance studies about the throughput of the matrix element calculation alone.
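The detection rules above can be summarized as follows. This is a simplified sketch (an assumption, not the actual Makefile logic): a valid CUDA_HOME or an nvcc found in PATH selects a CUDA build, otherwise a C++-only build is performed and C++ standard-library random numbers replace curand.

```shell
# Decide which build mode the rules described above would select.
cuda_mode() {
  if [ -x "${CUDA_HOME:-/nonexistent}/bin/nvcc" ]; then
    echo "cuda (via CUDA_HOME)"
  elif command -v nvcc >/dev/null 2>&1; then
    echo "cuda (via PATH)"
  else
    echo "cpp-only (std C++ random numbers)"
  fi
}
cuda_mode
```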
To use ccache
- you must have ccache in your PATH
- you must set USECCACHE=1 to tell madgraph4gpu to use ccache
- optionally, set CCACHE_DIR to your preferred ccache directory
At CERN, you may use
export PATH=/cvmfs/sft.cern.ch/lcg/releases/ccache/4.3-ed8d3/x86_64-centos7-gcc8-opt/bin:$PATH
Go to the appropriate P1_Sigma subdirectory for the chosen epoch and process. The build is done here.
cd $MADGRAPH4GPU_HOME
cd epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum
The following make variables (which can be set also via environment variables) control how the build is performed
- AVX=[none|sse4|avx2|512y|512z]
- AVX=none: disable C++ vectorization
- AVX=sse4: enable C++ vectorization with SSE4.2 (128 bit registers, i.e. 2 doubles or 4 floats per vector)
- AVX=avx2: enable C++ vectorization with AVX2 (256 bit registers, i.e. 4 doubles or 8 floats per vector)
- AVX=512y (default): enable C++ vectorization with AVX512, limited to 256 bit ymm vector instructions (i.e. 4 doubles or 8 floats per vector)
- AVX=512z: enable C++ vectorization with AVX512, including 512 bit zmm vector instructions (i.e. 8 doubles or 16 floats per vector)
- FPTYPE=[d|f]
- FPTYPE=d (default): use double precision floating-point variables (double)
- FPTYPE=f: use single precision floating-point variables (float)
- HELINL=[0|1]
- HELINL=0 (default): do not use aggressive inlining
- HELINL=1: use aggressive inlining (emulate LTO optimizations)
- USEBUILDDIR=[0|1]
- USEBUILDDIR=0 (default): place binaries (.o, .exe etc) in the P1_Sigma directory itself; if you attempt to recompile using different AVX, FPTYPE or HELINL settings, you will get an error
- USEBUILDDIR=1 (recommended): place binaries (.o, .exe etc) in a subdirectory of the P1_Sigma directory specific to the chosen AVX, FPTYPE or HELINL settings; you may perform several builds in parallel for different AVX, FPTYPE or HELINL settings, using different build directories
For detailed performance comparisons, USEBUILDDIR=1 is recommended to allow simultaneous builds with different FPTYPE's (see PR #213). You can use make cleanall to remove all build subdirectories.
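With per-configuration build directories, launching all five AVX builds side by side becomes straightforward. The sketch below only prints the expected directory names; the build.<avx>_<fptype>_inl<helinl> naming is inferred from the example later on this page and is an assumption about this version of the code:

```shell
# Enumerate the build subdirectories that USEBUILDDIR=1 would use for the
# five AVX modes with FPTYPE=d and HELINL=0.
build_dir() { echo "build.${1}_${2}_inl${3}"; }
for avx in none sse4 avx2 512y 512z; do
  # e.g. run: make USEBUILDDIR=1 AVX=$avx FPTYPE=d HELINL=0
  build_dir "$avx" d 0
done
```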
The AVX settings refer to Intel CPUs, but the code builds and runs with C++ vectorizations on AMD CPUs too (see PR #238).
Aggressive inlining is found to give almost a factor 4 speedup with no vectorization, and almost a factor 2 speedup with the best vectorization (see issue #229). This still needs to be better understood. In particular, note that AVX=none:sse4:avx2 throughputs are more or less in ratios 1:2:4 for double builds (as one would naively expect) when inlining is disabled, but not when inlining is enabled.
Two standalone executables are presently built in parallel in each build:
- the C++ executable check.exe (where the matrix element calculation is performed using vectorized C++ on the CPU)
- the CUDA executable gcheck.exe (where the matrix element calculation is performed using CUDA on the GPU)
Both executables accept the same command line arguments, which were originally designed for CUDA but are also kept for C++. The baseline performance tests are performed using the following arguments:
gcheck.exe -p 2048 256 12
check.exe -p 2048 256 12
In the GPU application, this computes 6M matrix elements, using 2048 blocks per grid, 256 threads per block, over 12 iterations of a full grid. In the CPU application, this also computes 6M matrix elements, in a way that reproduces the random number generation mechanism of the CUDA application (same random number seeds and same mapping of the random number arrays to assign them to different matrix element calculations).
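The arithmetic behind the "6M matrix elements" figure can be checked directly from the three arguments (blocks per grid, threads per block, iterations):

```shell
# 2048 blocks/grid x 256 threads/block x 12 iterations = total MEs (~6M).
echo $((2048 * 256 * 12))
```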
Note that the C++ executable includes OpenMP multithreading, but this is disabled by default (see PR #84). You may enable it by setting OMP_NUM_THREADS explicitly. However, this implementation is presently found to be suboptimal and may soon be replaced by a custom MT implementation (see issue #196).
The relevant lines describing the throughput of the matrix element calculation are those including EvtsPerSec[MECalcOnly] (3a). The previous lines including EvtsPerSec[MatrixElems] (3) show lower throughputs on the GPU, because they also include data copies between the host and device memory.
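A minimal way to keep only the ME-throughput lines is a simple grep on the EvtsPerSec[MECalcOnly] label. The sketch below runs the filter over sample text copied from the throughput12.sh output shown later on this page; in real use, you would pipe the check.exe or gcheck.exe output into the filter:

```shell
# Keep only the MECalcOnly (3a) throughput lines, dropping the
# MatrixElems (3) lines that also include host/device data copies.
me_throughput() { grep 'EvtsPerSec\[MECalcOnly\]'; }
me_throughput <<'EOF'
EvtsPerSec[MatrixElems] (3) = ( 7.204611e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.339841e+09 ) sec^-1
EOF
```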
Build the code using USEBUILDDIR=1 in the baseline configuration based on gcc9 and CUDA 11.4, with AVX=avx2 and the default FPTYPE=d and HELINL=0. Then run the C++ and the CUDA applications.
export MADGRAPH4GPU_HOME=/data/valassi/madgraph4gpu
. /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0/x86_64-centos7/setup.sh
export PATH=/cvmfs/sft.cern.ch/lcg/releases/ccache/4.3-ed8d3/x86_64-centos7-gcc8-opt/bin:$PATH
export PATH=/usr/local/cuda/bin:${PATH}
export USECCACHE=1
export CCACHE_DIR=$MADGRAPH4GPU_HOME/CCACHE_DIR
export USEBUILDDIR=1
cd $MADGRAPH4GPU_HOME
cd epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum
make cleanall
make AVX=avx2 FPTYPE=d HELINL=0
./build.avx2_d_inl0/check.exe -p 2048 256 12
./build.avx2_d_inl0/gcheck.exe -p 2048 256 12
For more detailed performance comparisons across different vectorization scenarios, you may use the throughput12.sh script. This builds the code in all relevant configurations, then runs the applications, selecting only the relevant lines of output and adding extra information from perf and an objdump-based script.
You only need to set up the runtime environment (compilers and tools) prior to running this script. The script internally sets USEBUILDDIR=1 and uses the appropriate AVX, FPTYPE and HELINL settings.
To compare the five AVX scenarios, for the default FPTYPE=d and HELINL=0 settings, just type
./throughput12.sh -avxall
To compare the five AVX scenarios, using both FPTYPE=d and FPTYPE=f, and using both HELINL=0 and HELINL=1, just type
./throughput12.sh -avxall -flt -inl
For instance, this is a typical output:
export MADGRAPH4GPU_HOME=/data/valassi/madgraph4gpu
. /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0/x86_64-centos7/setup.sh
export PATH=/cvmfs/sft.cern.ch/lcg/releases/ccache/4.3-ed8d3/x86_64-centos7-gcc8-opt/bin:$PATH
export PATH=/usr/local/cuda/bin:${PATH}
export USECCACHE=1
export CCACHE_DIR=$MADGRAPH4GPU_HOME/CCACHE_DIR
cd $MADGRAPH4GPU_HOME
cd epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum
make cleanall
./throughput12.sh -avxall
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.4.120 (gcc 9.2.0)] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.204611e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.339841e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.859595 sec
709,240,622 cycles:u # 0.651 GHz
1,758,148,505 instructions:u # 2.48 insn per cycle
1.151621107 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 122
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.318664e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 7.180935 sec
18,923,038,185 cycles:u # 2.630 GHz
48,576,130,976 instructions:u # 2.57 insn per cycle
7.198373449 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 2.547401e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 4.874589 sec
12,769,980,796 cycles:u # 2.613 GHz
29,943,791,884 instructions:u # 2.34 insn per cycle
4.891651807 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.564820e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.725796 sec
9,149,489,724 cycles:u # 2.447 GHz
16,568,031,235 instructions:u # 1.81 insn per cycle
3.742837080 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2746) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.941441e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.636535 sec
8,971,524,906 cycles:u # 2.458 GHz
16,505,115,717 instructions:u # 1.84 insn per cycle
3.653904248 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 3.591695e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 4.112475 sec
8,818,863,751 cycles:u # 2.137 GHz
13,367,219,072 instructions:u # 1.52 insn per cycle
4.129526285 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 1127) (512y: 205) (512z: 2045)
=========================================================================