|
| 1 | +--- |
| 2 | +title: Scaling HPC applications - Azure Virtual Machines | Microsoft Docs |
| 3 | +description: Learn how to scale HPC applications on Azure VMs. |
| 4 | +services: virtual-machines |
| 5 | +documentationcenter: '' |
| 6 | +author: vermagit |
| 7 | +manager: jeconnoc |
| 8 | +editor: '' |
| 9 | +tags: azure-resource-manager |
| 10 | + |
| 11 | +ms.service: virtual-machines |
| 12 | +ms.workload: infrastructure-services |
| 13 | +ms.topic: article |
| 14 | +ms.date: 05/15/2019 |
| 15 | +ms.author: amverma |
| 16 | +--- |
| 17 | + |
| 18 | +# Scaling HPC applications |
| 19 | + |
| 20 | +Optimal scale-up and scale-out performance of HPC applications on Azure requires performance tuning and optimization experiments for the specific workload. This section and the VM series-specific pages offer general guidance for scaling your applications. |
| 21 | + |
| 22 | +## Compiling applications |
| 23 | + |
| 24 | +Though not necessary, compiling applications with appropriate optimization flags provides the best scale-up performance on HB and HC-series VMs. |
| 25 | + |
| 26 | +### AMD Optimizing C/C++ Compiler |
| 27 | + |
| 28 | +The AMD Optimizing C/C++ Compiler (AOCC) offers a high level of advanced optimizations, multi-threading, and processor support, including global optimization, vectorization, inter-procedural analyses, loop transformations, and code generation. AOCC compiler binaries are suitable for Linux systems with GNU C Library (glibc) version 2.17 or later. The compiler suite consists of a C/C++ compiler (Clang), a Fortran compiler (FLANG), and a Fortran front end to Clang (DragonEgg). |
| 29 | + |
| 30 | +### Clang |
| 31 | + |
| 32 | +Clang is a C, C++, and Objective-C compiler handling preprocessing, parsing, optimization, code generation, assembly, and linking. |
| 33 | +Clang supports the `-march=znver1` flag to enable the best code generation and tuning for AMD’s Zen-based x86 architecture. |
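As a minimal sketch of how the flag might be used, the following compiles a throwaway C file with `-march=znver1`. The `-O3` level, the temporary file, and the `hello` name are illustrative assumptions; only `-march=znver1` comes from the text above.

```shell
#!/usr/bin/env bash
# Sketch only: -O3 and the file names are assumptions, not AMD or Azure guidance.
CFLAGS="-O3 -march=znver1"

# Write a tiny test program to a temporary directory (hypothetical example source).
workdir="$(mktemp -d)"
src="$workdir/hello.c"
printf '#include <stdio.h>\nint main(void) { puts("hello"); return 0; }\n' > "$src"

# Compile only if clang is on the PATH; the binary is not run here, because
# code tuned for znver1 may not execute on a non-Zen build host.
if command -v clang >/dev/null 2>&1; then
  clang $CFLAGS -o "$workdir/hello" "$src"
else
  echo "clang not found; skipping compile"
fi
```

Run the resulting binary on the target HB-series VM rather than on the build machine if the two differ.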
| 34 | + |
| 35 | +### FLANG |
| 36 | + |
| 37 | +The FLANG compiler is a recent addition to the AOCC suite (added April 2018) and is currently in pre-release for developers to download and test. It is based on Fortran 2008, and AMD extends the GitHub version of FLANG (https://github.com/flangcompiler/flang). The FLANG compiler supports all Clang compiler options plus a number of FLANG-specific compiler options. |
| 38 | + |
| 39 | +### DragonEgg |
| 40 | + |
| 41 | +DragonEgg is a GCC plugin that replaces GCC’s optimizers and code generators with those from the LLVM project. The DragonEgg that ships with AOCC works with gcc-4.8.x, has been tested for x86-32/x86-64 targets, and has been used successfully on various Linux platforms. |
| 42 | + |
| 43 | +GFortran is the actual front end for Fortran programs; it handles preprocessing, parsing, and semantic analysis, and generates the GCC GIMPLE intermediate representation (IR). DragonEgg implements the GNU plugin API and plugs into the GFortran compilation flow. With the plugin architecture, DragonEgg becomes the compiler driver, driving the different phases of compilation. After following the download and installation instructions, DragonEgg can be invoked using: |
| 44 | + |
| 45 | +```bash |
| 46 | +$ gfortran [gFortran flags] \ |
| 47 | +    -fplugin=/path/AOCC-1.2-Compiler/AOCC-1.2-FortranPlugin/dragonegg.so [plugin optimization flags] -c xyz.f90 |
| 48 | +$ clang -O3 -lgfortran -o xyz xyz.o |
| 49 | +$ ./xyz |
| 50 | +``` |
| 51 | + |
| 52 | +### PGI Compiler |
| 53 | +PGI Community Edition ver. 17 is confirmed to work with AMD EPYC. A PGI-compiled version of STREAM delivers the full memory bandwidth of the platform. The newer Community Edition 18.10 (Nov 2018) should likewise work well. The following sample command line compiles STREAM with the PGI compiler; `$(OPTIMIZATIONS_PGI)` and `$(STACK)` are placeholders to be set for your build: |
| 54 | + |
| 55 | +```bash |
| 56 | +pgcc $(OPTIMIZATIONS_PGI) $(STACK) -DSTREAM_ARRAY_SIZE=800000000 stream.c -o stream.pgi |
| 57 | +``` |
| 58 | + |
| 59 | +### Intel Compiler |
| 60 | +Intel Compiler ver. 18 is confirmed to work with AMD EPYC. The following sample command line compiles STREAM with the Intel Compiler: |
| 61 | + |
| 62 | +```bash |
| 63 | +icc -o stream.intel stream.c -DSTATIC -DSTREAM_ARRAY_SIZE=800000000 -mcmodel=large -shared-intel -Ofast -qopenmp |
| 64 | +``` |
| 65 | + |
| 66 | +### GCC Compiler |
| 67 | +For HPC, AMD recommends GCC 7.3 or newer. Older versions, such as 4.8.5 included with RHEL/CentOS 7.4, are not recommended. GCC 7.3 and newer deliver significantly higher performance on HPL, HPCG, and DGEMM tests. In the following sample command line, the `$(...)` terms are placeholders to be set for your build: |
| 68 | + |
| 69 | +```bash |
| 70 | +gcc $(OPTIMIZATIONS) $(OMP) $(STACK) $(STREAM_PARAMETERS) stream.c -o stream.gcc |
| 71 | +``` |
| 72 | + |
| 73 | +## Scaling applications |
| 74 | + |
| 75 | +The following suggestions apply for optimal application scaling efficiency, performance, and consistency: |
| 76 | + |
| 77 | +* Pin processes to cores 0-59 using a sequential pinning approach (as opposed to an auto-balance approach). |
| 78 | +* Binding by NUMA/Core/HwThread is better than default binding. |
| 79 | +* For hybrid parallel applications (OpenMP+MPI), use 4 threads and 1 MPI rank per CCX. |
| 80 | +* For pure MPI applications, experiment with 1-4 MPI ranks per CCX for optimal performance. |
| 81 | +* Some applications with extreme sensitivity to memory bandwidth may benefit from using a reduced number of cores per CCX. For these applications, using 3 or 2 cores per CCX may reduce memory bandwidth contention and yield higher real-world performance or more consistent scalability. In particular, MPI Allreduce may benefit from this. |
| 82 | +* For significantly larger scale runs, use UD (Unreliable Datagram) or hybrid RC+UD (Reliable Connection plus UD) transports. Many MPI libraries and runtime libraries (such as UCX or MVAPICH2) do this internally. Check your transport configurations for large-scale runs. |
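The hybrid layout suggested above (1 MPI rank and 4 threads per CCX, pinned sequentially across cores 0-59) can be sketched as simple arithmetic. This is an illustration, not a tuned configuration: the value of 4 cores per CCX is an assumption inferred from the bullets, and the OpenMP environment variables are standard OpenMP settings chosen here for the example rather than values taken from the text.

```shell
#!/usr/bin/env bash
# Sketch: derive a hybrid MPI+OpenMP layout for a single VM.
# Assumptions: 60 usable cores (0-59) and 4 cores per CCX.
total_cores=60
cores_per_ccx=4
threads_per_rank=4          # one MPI rank per CCX, 4 OpenMP threads each

ccx_count=$(( total_cores / cores_per_ccx ))
ranks_per_node=$ccx_count   # one rank per CCX

# Sequential pinning: rank r starts at the first core of CCX r.
rank_start_cores=""
for (( r = 0; r < ranks_per_node; r++ )); do
  rank_start_cores+="$(( r * cores_per_ccx )) "
done
rank_start_cores="${rank_start_cores% }"   # trim trailing space

# Standard OpenMP environment variables (illustrative choices).
export OMP_NUM_THREADS=$threads_per_rank
export OMP_PROC_BIND=close
export OMP_PLACES=cores

echo "ranks per node: $ranks_per_node"
echo "threads per rank: $OMP_NUM_THREADS"
echo "first core of each rank: $rank_start_cores"
```

How these values are passed to the launcher depends on the MPI library in use, so check its binding and mapping options before applying them. For bandwidth-sensitive applications, lowering `threads_per_rank` to 3 or 2 models the reduced cores-per-CCX suggestion.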
| 83 | + |
| 84 | +## Next steps |
| 85 | + |
| 86 | +Learn more about [HPC](https://docs.microsoft.com/azure/architecture/topics/high-performance-computing/) on Azure. |