Commit 7acfbfb

Learn Editor: Update compiling-scaling-applications.md
1 parent eeb72d5 commit 7acfbfb

1 file changed

+11
-17
lines changed

articles/virtual-machines/compiling-scaling-applications.md

Lines changed: 11 additions & 17 deletions
@@ -20,39 +20,33 @@ Optimal scale-up and scale-out performance of HPC applications on Azure requires
2020
The [azurehpc repo](https://github.com/Azure/azurehpc) contains many examples of:
2121
- Setting up and running [applications](https://github.com/Azure/azurehpc/tree/master/apps) optimally.
2222
- Configuration of [file systems, and clusters](https://github.com/Azure/azurehpc/tree/master/examples).
23-
- [Tutorials](https://github.com/Azure/azurehpc/tree/master/tutorials) on how to get started easily with some common application workflows.
23+
- - [Tutorials](https://github.com/Azure/azurehpc/tree/master/tutorials) on how to get started easily with some common application workflows.
2424

2525
## Optimally scaling MPI
2626

2727
The following suggestions apply for optimal application scaling efficiency, performance, and consistency:
2828

29-
- For smaller scale jobs (that is, < 256K connections) use the option:
30-
```bash
31-
UCX_TLS=rc,sm
32-
```
33-
34-
- For larger scale jobs (that is, > 256K connections) use the option:
35-
```bash
36-
UCX_TLS=dc,sm
37-
```
38-
29+
- For smaller scale jobs (< 256 K connections) use the option:
30+
```bash UCX_TLS=rc,sm ```
31+
- For larger scale jobs (> 256 K connections) use the option:
32+
```bash UCX_TLS=dc,sm ```
3933
- In the above, to calculate the number of connections for your MPI job, use:
4034
```bash
4135
Max Connections = (processes per node) x (number of nodes per job) x (number of nodes per job)
4236
```
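The connection formula and the two `UCX_TLS` settings above can be combined in a small helper. This is only a sketch: the function names `max_connections` and `pick_ucx_tls` are hypothetical, reading "256 K" as 256 x 1024 is an assumption, and the behavior at exactly the threshold is not specified in the text, so this sketch falls back to `dc,sm` there.

```bash
#!/usr/bin/env bash
# Sketch: estimate MPI connections with the formula above and pick a UCX transport list.

# "256 K" connections; interpreting K as 1024 is an assumption.
THRESHOLD=$((256 * 1024))

# max_connections PPN NODES -> (processes per node) x (nodes per job) x (nodes per job)
max_connections() {
  local ppn=$1 nodes=$2
  echo $(( ppn * nodes * nodes ))
}

# pick_ucx_tls PPN NODES -> prints rc,sm below the threshold, dc,sm otherwise
pick_ucx_tls() {
  local conns
  conns=$(max_connections "$1" "$2")
  if (( conns < THRESHOLD )); then
    echo "rc,sm"
  else
    echo "dc,sm"
  fi
}

# Example with a hypothetical job size: 120 processes per node on 16 nodes
export UCX_TLS="$(pick_ucx_tls 120 16)"
echo "connections=$(max_connections 120 16) UCX_TLS=${UCX_TLS}"
```

Exporting `UCX_TLS` before launching the job lets an MPI library built on UCX pick it up from the environment.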
4337

4438
## Adaptive Routing
45-
Adaptive Routing (AR) allows Azure Virtual Machines (VMs) running EDR and HDR InfiniBand to automatically detect and avoid network congestion by dynamically selecting optimal network paths. As a result, AR offers improved latency and bandwidth on the InfiniBand network, which in turn drives higher performance and scaling efficiency. For more details, see [TechCommunity article](https://techcommunity.microsoft.com/t5/azure-compute/adaptive-routing-on-azure-hpc/ba-p/1205217).
39+
Adaptive Routing (AR) allows Azure Virtual Machines (VMs) running EDR and HDR InfiniBand to automatically detect and avoid network congestion by dynamically selecting optimal network paths. As a result, AR offers improved latency and bandwidth on the InfiniBand network, which in turn drives higher performance and scaling efficiency. For more information, see [TechCommunity article](https://techcommunity.microsoft.com/t5/azure-compute/adaptive-routing-on-azure-hpc/ba-p/1205217).
4640

4741
## Process pinning
4842

4943
- Pin processes to cores using a sequential pinning approach (as opposed to an autobalance approach).
5044
- Binding by Numa/Core/HwThread is better than default binding.
5145
- For hybrid parallel applications (OpenMP+MPI), use 4 threads and 1 MPI rank per [CCX](/azure/virtual-machines/hb-series-overview) on HB and HBv2 VM sizes.
5246
- For pure MPI applications, experiment with 1-4 MPI ranks per CCX for optimal performance on HB and HBv2 VM sizes.
53-
- Some applications with extreme sensitivity to memory bandwidth may benefit from using a reduced number of cores per CCX. For these applications, using 3 or 2 cores per CCX may reduce memory bandwidth contention and yield higher real-world performance or more consistent scalability. In particular, MPI Allreduce may benefit from this approach.
54-
- For larger scale runs, it's recommended to use UD or hybrid RC+UD transports. Many MPI libraries/runtime libraries does this internally (such as UCX or MVAPICH2). Check your transport configurations for large-scale runs.
55-
47+
- Some applications with extreme sensitivity to memory bandwidth may benefit from using a reduced number of cores per CCX. For these applications, using three or two cores per CCX may reduce memory bandwidth contention and yield higher real-world performance or more consistent scalability. In particular, MPI 'Allreduce' may benefit from this approach.
48+
- For larger scale runs, it's recommended to use UD or hybrid RC+UD transports. Many MPI libraries/runtime libraries do this internally (such as UCX or MVAPICH2). Check your transport configurations for large-scale runs.
49+
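To illustrate the reduced-cores-per-CCX suggestion above, the following sketch builds a sequential core list that uses only the first few cores of each CCX. It assumes 4 cores per CCX with core IDs numbered consecutively within each CCX; the helper name `ccx_core_list` is hypothetical, and the actual topology should be checked (for example with `lscpu`) before using such a list.

```bash
#!/usr/bin/env bash
# Sketch: list the core IDs to pin to when using only N cores of every CCX.
# Assumes core IDs are consecutive within a CCX; verify on the target VM.

CORES_PER_CCX=4   # assumption, matching "4 threads and 1 MPI rank per CCX" above

# ccx_core_list USED_PER_CCX TOTAL_CORES -> comma-separated core IDs in sequential order
ccx_core_list() {
  local used=$1 total=$2 list="" core
  for (( core = 0; core < total; core++ )); do
    # Keep a core only if its position inside its CCX is below the requested count.
    if (( core % CORES_PER_CCX < used )); then
      list+="${list:+,}${core}"
    fi
  done
  echo "$list"
}

# Example: 3 of 4 cores per CCX on a hypothetical 16-core layout
ccx_core_list 3 16
```

The resulting list can then be handed to whatever pinning mechanism the MPI library provides, for instance a processor-list setting such as Intel MPI's `I_MPI_PIN_PROCESSOR_LIST`.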
5650
## Compiling applications
5751
<br>
5852
<details>
@@ -71,7 +65,7 @@ Clang supports the `-march=znver1` flag to enable best code generation and tuni
7165

7266
### FLANG
7367

74-
The FLANG compiler is a recent addition to the AOCC suite (added April 2018) and is currently in pre-release for developers to download and test. Based on Fortran 2008, AMD extends the GitHub version of FLANG (https://github.com/flang-compiler/flang). The FLANG compiler supports all Clang compiler options and other number of FLANG-specific compiler options.
68+
The FLANG compiler is a recent addition to the AOCC suite (added April 2018) and is currently in prerelease for developers to download and test. Based on Fortran 2008, AMD extends the GitHub version of FLANG (https://github.com/flang-compiler/flang). The FLANG compiler supports all Clang compiler options and a number of other FLANG-specific compiler options.
7569

7670
### DragonEgg
7771

@@ -100,7 +94,7 @@ icc -o stream.intel stream.c -DSTATIC -DSTREAM_ARRAY_SIZE=800000000 -mcmodel=lar
10094
```
10195

10296
### GCC Compiler
103-
For HPC, AMD recommends GCC compiler 7.3 or newer. Older versions, such as 4.8.5 included with RHEL/CentOS 7.4, aren't recommended. GCC 7.3, and newer, delivers higher performance on HPL, HPCG, and DGEMM tests.
97+
For HPC workloads, AMD recommends GCC compiler 7.3 or newer. Older versions, such as 4.8.5 included with RHEL/CentOS 7.4, aren't recommended. GCC 7.3, and newer, delivers higher performance on HPL, HPCG, and DGEMM tests.
10498

10599
```bash
106100
gcc $(OPTIMIZATIONS) $(OMP) $(STACK) $(STREAM_PARAMETERS) stream.c -o stream.gcc
