Commit 3bb0a7f

Merge pull request #232422 from mattmcinnes/docs-editor/compiling-scaling-applications-1680025923

[Doc-a-thon] Compiling Scaling Applications

2 parents: 391fcf4 + 7d85ae3
1 file changed: articles/virtual-machines/compiling-scaling-applications.md (+21, −28 lines)
@@ -4,8 +4,8 @@ description: Learn how to scale HPC applications on Azure VMs.
 ms.service: virtual-machines
 ms.subservice: hpc
 ms.topic: article
-ms.date: 03/10/2023
-ms.reviewer: cynthn
+ms.date: 03/28/2023
+ms.reviewer: cynthn, mattmcinnes
 ms.author: mamccrea
 author: mamccrea
 ---
@@ -26,33 +26,25 @@ The [azurehpc repo](https://github.com/Azure/azurehpc) contains many examples of
 
 The following suggestions apply for optimal application scaling efficiency, performance, and consistency:
 
-- For smaller scale jobs (that is, < 256K connections) use the option:
-  ```bash
-  UCX_TLS=rc,sm
-  ```
-
-- For larger scale jobs (that is, > 256K connections) use the option:
-  ```bash
-  UCX_TLS=dc,sm
-  ```
-
-- In the above, to calculate the number of connections for your MPI job, use:
-  ```bash
-  Max Connections = (processes per node) x (number of nodes per job) x (number of nodes per job)
-  ```
-
+- For smaller scale jobs (< 256K connections), use:
+  ```bash
+  UCX_TLS=rc,sm
+  ```
+- For larger scale jobs (> 256K connections), use:
+  ```bash
+  UCX_TLS=dc,sm
+  ```
+- To calculate the number of connections for your MPI job, use:
+  ```bash
+  Max Connections = (processes per node) x (number of nodes per job) x (number of nodes per job)
+  ```
+
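As a quick sanity check, the connection formula and the 256K threshold from the bullets above can be combined into a small helper. This is a hypothetical sketch, not part of the doc change; the function names are illustrative:

```python
# Sketch of the transport-selection rule above: compute the connection
# count for an MPI job and pick the UCX transport list accordingly.
# Function names are illustrative, not a real UCX or Azure API.

def max_connections(procs_per_node, nodes_per_job):
    # Max Connections = (processes per node) x (number of nodes per job) x (number of nodes per job)
    return procs_per_node * nodes_per_job * nodes_per_job

def ucx_tls_for_job(procs_per_node, nodes_per_job):
    # Smaller jobs (< 256K connections): reliably connected transport (rc).
    # Larger jobs: dynamically connected transport (dc).
    if max_connections(procs_per_node, nodes_per_job) < 256 * 1024:
        return "rc,sm"
    return "dc,sm"

print(ucx_tls_for_job(120, 4))   # 120 ranks/node on 4 nodes -> 1,920 connections -> rc,sm
print(ucx_tls_for_job(120, 64))  # 120 ranks/node on 64 nodes -> 491,520 connections -> dc,sm
```

The chosen value would then be exported to the job's environment, for example with `mpirun -x UCX_TLS=rc,sm ...` in Open MPI; the exact mechanism depends on your MPI launcher.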
 ## Adaptive Routing
-Adaptive Routing (AR) allows Azure Virtual Machines (VMs) running EDR and HDR InfiniBand to automatically detect and avoid network congestion by dynamically selecting more optimal network paths. As a result, AR offers improved latency and bandwidth on the InfiniBand network, which in turn drives higher performance and scaling efficiency. For more details, see [TechCommunity article](https://techcommunity.microsoft.com/t5/azure-compute/adaptive-routing-on-azure-hpc/ba-p/1205217).
+Adaptive Routing (AR) allows Azure Virtual Machines (VMs) running EDR and HDR InfiniBand to automatically detect and avoid network congestion by dynamically selecting optimal network paths. As a result, AR offers improved latency and bandwidth on the InfiniBand network, which in turn drives higher performance and scaling efficiency. For more information, see the [TechCommunity article](https://techcommunity.microsoft.com/t5/azure-compute/adaptive-routing-on-azure-hpc/ba-p/1205217).
 
 ## Process pinning
 
 - Pin processes to cores using a sequential pinning approach (as opposed to an autobalance approach).
 - Binding by Numa/Core/HwThread is better than default binding.
-- For hybrid parallel applications (OpenMP+MPI), use 4 threads and 1 MPI rank per CCX on HB and HBv2 VM sizes.
-- For pure MPI applications, experiment with 1-4 MPI ranks per CCX for optimal performance on HB and HBv2 VM sizes.
-- Some applications with extreme sensitivity to memory bandwidth may benefit from using a reduced number of cores per CCX. For these applications, using 3 or 2 cores per CCX may reduce memory bandwidth contention and yield higher real-world performance or more consistent scalability. In particular, MPI Allreduce may benefit from this approach.
-- For larger scale runs, it's recommended to use UD or hybrid RC+UD transports. Many MPI libraries/runtime libraries does this internally (such as UCX or MVAPICH2). Check your transport configurations for large-scale runs.
-
+- For hybrid parallel applications (OpenMP+MPI), use four threads and one MPI rank per [CCX](/azure/virtual-machines/hb-series-overview) on HB and HBv2 VM sizes.
+- For pure MPI applications, experiment with one to four MPI ranks per CCX for optimal performance on HB and HBv2 VM sizes.
+- Some applications with extreme sensitivity to memory bandwidth may benefit from using a reduced number of cores per CCX. For these applications, using three or two cores per CCX may reduce memory bandwidth contention and yield higher real-world performance or more consistent scalability. In particular, MPI `Allreduce` may benefit from this approach.
+- For larger scale runs, it's recommended to use UD or hybrid RC+UD transports. Many MPI libraries/runtime libraries use these transports internally (such as UCX or MVAPICH2). Check your transport configurations for large-scale runs.
+
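The sequential, one-rank-per-CCX placement described in the bullets above can be sketched as a simple pin map. This is a hypothetical illustration that assumes four cores per CCX (as on HB and HBv2 after SMT is disabled); verify your VM's actual topology before using such a layout:

```python
# Sequential pinning sketch: one MPI rank per CCX, with each rank's
# OpenMP threads placed on that CCX's cores. Assumes 4 cores per CCX
# (hypothetical simplification); the helper name is illustrative,
# not an Azure or MPI API.

def sequential_ccx_pinning(num_ranks, cores_per_ccx=4):
    return {rank: list(range(rank * cores_per_ccx, (rank + 1) * cores_per_ccx))
            for rank in range(num_ranks)}

print(sequential_ccx_pinning(3))
# {0: [0, 1, 2, 3], 1: [4, 5, 6, 7], 2: [8, 9, 10, 11]}
```

With Open MPI, a comparable placement can typically be requested with mapping options along the lines of `--map-by ppr:1:l3cache:pe=4`; the exact syntax varies by MPI library, so check your library's binding documentation.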
 ## Compiling applications
 <br>
 <details>
@@ -71,7 +63,7 @@ Clang supports the `-march=znver1` flag to enable best code generation and tuni
 
 ### FLANG
 
-The FLANG compiler is a recent addition to the AOCC suite (added April 2018) and is currently in pre-release for developers to download and test. Based on Fortran 2008, AMD extends the GitHub version of FLANG (https://github.com/flang-compiler/flang). The FLANG compiler supports all Clang compiler options and other number of FLANG-specific compiler options.
+The FLANG compiler is a recent addition to the AOCC suite (added April 2018) and is currently in prerelease for developers to download and test. Based on Fortran 2008, AMD extends the GitHub version of FLANG (https://github.com/flang-compiler/flang). The FLANG compiler supports all Clang compiler options as well as a number of FLANG-specific compiler options.
 
 ### DragonEgg
 
@@ -85,23 +77,23 @@ $ gfortran [gFortran flags]
   FortranPlugin/dragonegg.so [plugin optimization flags]
   -c xyz.f90
 $ clang -O3 -lgfortran -o xyz xyz.o
 $ ./xyz
 ```
-
 ### PGI Compiler
-PGI Community Edition 17 is confirmed to work with AMD EPYC. A PGI-compiled version of STREAM does deliver full memory bandwidth of the platform. The newer Community Edition 18.10 (Nov 2018) should likewise work well. A sample CLI to compiler optimally with the Intel Compiler:
+PGI Community Edition 17 is confirmed to work with AMD EPYC. A PGI-compiled version of STREAM does deliver full memory bandwidth of the platform. The newer Community Edition 18.10 (Nov 2018) should likewise work well. Use this CLI command to compile with the PGI compiler:
+
 
 ```bash
 pgcc $(OPTIMIZATIONS_PGI) $(STACK) -DSTREAM_ARRAY_SIZE=800000000 stream.c -o stream.pgi
 ```
 
 ### Intel Compiler
-Intel Compiler 18 is confirmed to work with AMD EPYC. Below is sample CLI to compiler optimally with the Intel Compiler.
+Intel Compiler 18 is confirmed to work with AMD EPYC. Use this CLI command to compile with the Intel Compiler.
 
 ```bash
 icc -o stream.intel stream.c -DSTATIC -DSTREAM_ARRAY_SIZE=800000000 -mcmodel=large -shared-intel -Ofast -qopenmp
 ```
 
 ### GCC Compiler
-For HPC, AMD recommends GCC compiler 7.3 or newer. Older versions, such as 4.8.5 included with RHEL/CentOS 7.4, aren't recommended. GCC 7.3, and newer, delivers higher performance on HPL, HPCG, and DGEMM tests.
+For HPC workloads, AMD recommends GCC compiler 7.3 or newer. Older versions, such as 4.8.5 included with RHEL/CentOS 7.4, aren't recommended. GCC 7.3 and newer deliver higher performance on HPL, HPCG, and DGEMM tests.
 
 ```bash
 gcc $(OPTIMIZATIONS) $(OMP) $(STACK) $(STREAM_PARAMETERS) stream.c -o stream.gcc
@@ -114,3 +106,4 @@ gcc $(OPTIMIZATIONS) $(OMP) $(STACK) $(STREAM_PARAMETERS) stream.c -o stream.gcc
 - Review the [HBv3-series overview](hbv3-series-overview.md) and [HC-series overview](hc-series-overview.md).
 - Read about the latest announcements, HPC workload examples, and performance results at the [Azure Compute Tech Community Blogs](https://techcommunity.microsoft.com/t5/azure-compute/bg-p/AzureCompute).
 - Learn more about [HPC](/azure/architecture/topics/high-performance-computing/) on Azure.
+