Commit d2209ac

Merge pull request #1562 from kieranhejmadi01/main
Learn Basic C++ Optimisation Techniques using the G++ Compiler - LP
2 parents cdead2c + 16e7885 commit d2209ac

6 files changed

+502
-0
lines changed

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
1+
---
2+
title: Basics of Compilers
3+
weight: 2
4+
5+
### FIXED, DO NOT MODIFY
6+
layout: learningpathall
7+
---
8+
9+
## Introduction to C++ and Compilers
10+
11+
C++ gives programmers the freedom to be expressive in how they write code, including low-level manipulation of memory and data structures. Compared to managed languages such as Java, C++ is generally less portable: the source code must be recompiled for each target architecture, in this case Arm. When optimizing C++ workloads on Arm, significant performance improvements can often be achieved without modifying the source code, simply by using the compiler correctly.
12+
13+
Writing performant C++ code is a topic in itself and is out of scope for this learning path. Instead, we will focus on how to use the compiler effectively to target Arm instances in a cloud environment.
14+
15+
## Purpose of a Compiler
16+
17+
The g++ compiler is part of the GNU Compiler Collection (GCC), which is a set of compilers for various programming languages, including C++. The primary objective of the g++ compiler is to translate C++ source code into machine code that can be executed by a computer. This process involves several high-level stages:
18+
19+
- Preprocessing: In this initial stage, the preprocessor handles directives that start with a # symbol, such as `#include`, `#define`, and `#if`. It expands included header files, replaces macros, and processes conditional compilation statements.
20+
21+
- Compilation: The compiler translates the preprocessed source code into an intermediate representation specific to the target processor architecture. This step includes syntax checking, semantic analysis, and generating error messages for any issues encountered in the source code.
22+
23+
- Assembly: The intermediate representation is converted into assembly language, which uses mnemonics and syntax specific to the target processor architecture. Assemblers then convert this assembly code into object code (machine code).
24+
25+
- Linking: The final stage involves linking the object code with necessary libraries and other object files. The linker merges multiple object files and libraries, resolves external references, allocates memory addresses for functions and variables, and generates an executable file that can be run on the target platform.
26+
27+
An interesting fact about the g++ compiler is that it is designed to optimize both the performance and the size of the generated code. The compiler performs various optimizations based on the knowledge it has of the program, and it can be configured to prioritize reducing the size of the generated executable.
28+
29+
30+
### Compiler Versioning
31+
32+
Two popular C++ compilers are the GNU Compiler Collection (GCC) and LLVM. Both are open source and receive contributions from Arm engineers to support the latest architectures. Proprietary or vendor-specific compilers, such as `nvcc` for compiling for NVIDIA GPUs, are often based on these open-source compilers. Alternative proprietary compilers are often designed for specific use cases. For example, safety-critical applications may need to comply with various ISO standards, which also apply to the compiler. The functional safety [Arm Compiler for Embedded](https://developer.arm.com/Tools%20and%20Software/Arm%20Compiler%20for%20Embedded%20FuSa) is one example of such a C/C++ compiler.
33+
34+
Most application developers are not in this safety qualification domain, so we will use the open-source GCC/G++ compiler for this learning path.
35+
36+
There are multiple Linux distributions available to choose from, and each one has a default compiler. For example, after installing the default g++ package on an `r8g` AWS instance, the default g++ version as of January 2025 is shown below.
37+
38+
``` output
39+
g++ --version
40+
g++ (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
41+
Copyright (C) 2023 Free Software Foundation, Inc.
42+
This is free software; see the source for copying conditions. There is NO
43+
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
44+
```
45+
46+
The table below shows the default (*) and available compilers for a variety of Linux distributions. It is taken from the [AWS Graviton performance runbook](https://github.com/aws/aws-graviton-getting-started/blob/main/c-c%2B%2B.md).
47+
48+
49+
Distribution | GCC | Clang/LLVM
50+
----------------|----------------------|-------------
51+
Amazon Linux 2023 | 11* | 15*
52+
Amazon Linux 2 | 7*, 10 | 7, 11*
53+
Ubuntu 24.04 | 9, 10, 11, 12, 13*, 14 | 14, 15, 16, 17, 18*
54+
Ubuntu 22.04 | 9, 10, 11*, 12 | 11, 12, 13, 14*
55+
Ubuntu 20.04 | 7, 8, 9*, 10 | 6, 7, 8, 9, 10, 11, 12
56+
Ubuntu 18.04 | 4.8, 5, 6, 7*, 8 | 3.9, 4, 5, 6, 7, 8, 9, 10
57+
Debian 10       | 7, 8*                | 6, 7, 8
58+
Red Hat EL8 | 8*, 9, 10 | 10
59+
SUSE Linux ES15 | 7*, 9, 10 | 7
60+
61+
62+
The largest and simplest performance gain often comes from using the most recent compiler available, because the newest optimisations and CPU support are only available in recent compiler versions.
63+
64+
Taking GCC as an example, the most recent version available at the time of writing, 14.2, lists the following newly supported CPUs in its [release notes](https://gcc.gnu.org/gcc-14/changes.html).
65+
66+
```output
67+
A number of new CPUs are supported through the -mcpu and -mtune options (GCC identifiers in parentheses).
68+
- Ampere-1B (ampere1b).
69+
- Arm Cortex-A520 (cortex-a520).
70+
- Arm Cortex-A720 (cortex-a720).
71+
- Arm Cortex-X4 (cortex-x4).
72+
- Microsoft Cobalt-100 (cobalt-100).
73+
...
74+
```
75+
76+
Take care when updating your C++ compiler, because the process may reveal bugs in your source code. These bugs are often cases of undefined behaviour caused by not adhering to the C++ standard. It is rare for the compiler itself to introduce a bug. When this does happen, known bugs are made publicly available in the compiler documentation.
77+
78+
## Basic g++ Optimisation Levels
79+
80+
Using the g++ compiler as an example, the most coarse-grained dial you can adjust is the optimisation level, denoted with `-O<x>`. This adjusts a variety of lower-level optimisation flags at the expense of increased compilation time, memory use and reduced debuggability. When aggressive optimisation is used, the optimised binary may not show the expected behaviour when attached to a debugger such as `gdb`. This is because the generated code may no longer match the original source code or program order, for example as a result of loop unrolling and vectorisation.
81+
82+
A few of the most common optimization levels are in the table below.
83+
84+
| Optimization Level | Description |
85+
|--------------------|----------------------------------------------------------------------------------------------|
86+
| `-O0` | No optimization; useful for debugging. |
87+
| `-O1` | Basic optimizations that improve performance without significantly increasing compilation time. |
88+
| `-O2` | More aggressive optimizations that further enhance performance. |
89+
| `-O3` | Enables aggressive optimizations that can significantly improve execution speed. |
90+
| `-Os` | Optimizes code size, reducing the overall binary size. |
91+
| `-Ofast`           | Enables all `-O3` optimizations plus others, such as `-ffast-math`, that can break strict standards compliance. |
92+
93+
Please refer to your compiler documentation for full details on the optimisation level, for example [GCC](https://gcc.gnu.org/onlinedocs/gcc-14.2.0/gcc/Optimize-Options.html).
Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
1+
---
2+
title: Setup Your Environment
3+
weight: 3
4+
5+
### FIXED, DO NOT MODIFY
6+
layout: learningpathall
7+
---
8+
9+
If you are new to cloud computing, please refer to our learning path on [Getting started with Servers and Cloud Computing](https://learn.arm.com/learning-paths/servers-and-cloud-computing/intro/).
10+
11+
## Connect to an AWS Arm-based Instance
12+
13+
In this example we will build and run our C++ application on an AWS Graviton 4 (`r8g.xlarge`) instance running Ubuntu 24.04 LTS. Once connected, run the following command to confirm the operating system version.
14+
15+
```bash
16+
cat /etc/*lsb*
17+
```
18+
19+
You will see an output such as the following:
20+
21+
```output
22+
DISTRIB_ID=Ubuntu
23+
DISTRIB_RELEASE=24.04
24+
DISTRIB_CODENAME=noble
25+
DISTRIB_DESCRIPTION="Ubuntu 24.04.1 LTS"
26+
```
27+
28+
Next, confirm that you are using a 64-bit Arm-based system with the following command:
29+
30+
```bash
31+
uname -m
32+
```
33+
34+
You will see the following output.
35+
36+
```output
37+
aarch64
38+
```
39+
40+
## Enable Environment modules
41+
42+
Environment modules are a tool to quickly modify your shell configuration and environment variables. For this learning path, they allow us to quickly switch between different compiler versions to demonstrate potential improvements.
43+
44+
Install Environment Modules
45+
46+
First, you need to install the environment modules package. Open your terminal and run the following command:
47+
```bash
48+
sudo apt update
49+
sudo apt install environment-modules
50+
```
51+
52+
Load environment modules after the package is installed.
53+
54+
```bash
55+
sudo chmod 755 /usr/share/modules/init/bash
56+
source /usr/share/modules/init/bash
57+
```
58+
Reload your shell configuration.
59+
60+
```bash
61+
source ~/.bashrc
62+
```
63+
64+
Install additional compiler versions on your Ubuntu system. For this example we will install version 9 of the gcc/g++ compiler to demonstrate the potential improvements your application could achieve with a newer compiler.
65+
66+
```bash
67+
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
68+
sudo apt update
69+
sudo apt install gcc-9 g++-9
70+
```
71+
72+
Create a module file for each compiler installed.
73+
74+
```bash
75+
mkdir -p ~/modules/gcc
76+
nano ~/modules/gcc/9
77+
```
78+
Copy and paste the text below into the nano text editor and save the file.
79+
```output
80+
#%Module1.0
81+
prepend-path PATH /usr/bin/gcc-9
82+
prepend-path PATH /usr/bin/g++-9
83+
```
84+
Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
1+
---
2+
title: Finding Supported Neoverse Features
3+
weight: 3
4+
5+
### FIXED, DO NOT MODIFY
6+
layout: learningpathall
7+
---
8+
9+
## Identify the Neoverse Version
10+
11+
To understand which Neoverse version a cloud instance uses, check the [Arm partner webpage](https://www.arm.com/partners/aws).
12+
13+
Alternatively, if you already have access to the instance, run the `lscpu` command and look for the underlying Neoverse core in the `Model name` row.
14+
15+
```output
16+
lscpu | grep -i model
17+
Model name: Neoverse-V2
18+
Model: 1
19+
```
20+
Here you can confirm that the AWS `r8g.xlarge` instance is based on the Neoverse-V2 Arm IP. We will use this instance for the remainder of this learning path.
21+
22+
## Understand Supported CPU Features
23+
24+
Next, to identify the CPU extensions supported by this architecture at runtime, we can read the Linux hardware capabilities (HWCAP) vector. The C source code below reads the auxiliary vector entry that contains this information.
25+
26+
Copy and paste the C program into a file named `hw_cap.c`.
27+
28+
```c
#include <stdio.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>

int main(void)
{
    /* Read the hardware capability bitmask from the auxiliary vector. */
    unsigned long hwcaps = getauxval(AT_HWCAP);

    if (hwcaps & HWCAP_AES) {
        printf("AES instructions are available\n");
    } else {
        printf("AES instructions are not available\n");
    }
    if (hwcaps & HWCAP_CRC32) {
        printf("CRC32 instructions are available\n");
    } else {
        printf("CRC32 instructions are not available\n");
    }
    if (hwcaps & HWCAP_PMULL) {
        printf("PMULL/PMULL2 instructions that operate on 64-bit data are available\n");
    } else {
        printf("PMULL/PMULL2 instructions are not available\n");
    }
    if (hwcaps & HWCAP_SHA1) {
        printf("SHA1 instructions are available\n");
    } else {
        printf("SHA1 instructions are not available\n");
    }
    if (hwcaps & HWCAP_SHA2) {
        printf("SHA2 instructions are available\n");
    } else {
        printf("SHA2 instructions are not available\n");
    }
    if (hwcaps & HWCAP_SVE) {
        printf("Scalable Vector Extension (SVE) instructions are available\n");
    } else {
        printf("Scalable Vector Extension (SVE) instructions are not available\n");
    }

    return 0;
}
```
73+
74+
Compile and run with the command below.
75+
76+
```bash
77+
gcc hw_cap.c -o hw_cap
78+
./hw_cap
79+
```
80+
81+
On Graviton 4, the output below confirms that the Scalable Vector Extension (SVE) is available.
82+
83+
```output
84+
AES instructions are available
85+
CRC32 instructions are available
86+
PMULL/PMULL2 instructions that operate on 64-bit data are available
87+
SHA1 instructions are available
88+
SHA2 instructions are available
89+
Scalable Vector Extension (SVE) instructions are available
90+
```
91+
92+
For the latest list of all hardware capabilities available for a specific Linux kernel version, refer to the `arch/arm64/include/uapi/asm/hwcap.h` header file in the Linux kernel source code.
93+
94+
Further, knowing the width of SVE (Scalable Vector Extension) can be useful for optimizing software performance, as it allows developers to tailor their code to fully utilize the available vector processing capabilities of the hardware. Copy the following C code into a file named `sve_width.c`.
95+
96+
```c
#include <arm_sve.h>
#include <stdint.h>
#include <stdio.h>

int main() {
    /* svcntb() returns the number of 8-bit lanes, that is, the vector length in bytes. */
    uint64_t sve_width = svcntb();
    printf("SVE vector length: %llu bytes\n", (unsigned long long)sve_width);
    return 0;
}
```
106+
107+
Compile and run with the following commands.

```bash
g++ sve_width.c -o sve_width -mcpu=neoverse-v2
./sve_width
```
112+
113+
This shows that the Neoverse-V2 based Graviton 4 instance has an SVE width of 16 bytes (128 bits).
114+
115+
```output
116+
SVE vector length: 16 bytes
117+
```
118+
119+
## Supported Compiler Features
120+
121+
Fortunately, the g++ compiler can identify the host system's capabilities automatically. The `-###` argument shows the full set of options used when compiling.
122+
123+
If the host is the same platform you are compiling for, you can observe which CPUs are potential targets for your command with the following g++ command.
124+
125+
```output
126+
g++ -E -mcpu=help -xc /dev/null
127+
cc1: note: valid arguments are: cortex-a34 cortex-a35 cortex-a53 cortex-a57 cortex-a72 cortex-a73 thunderx thunderxt88p1 thunderxt88 octeontx octeontx81 octeontx83 thunderxt81 thunderxt83 ampere1 ampere1a emag xgene1 falkor qdf24xx exynos-m1 phecda thunderx2t99p1 vulcan thunderx2t99 cortex-a55 cortex-a75 cortex-a76 cortex-a76ae cortex-a77 cortex-a78 cortex-a78ae cortex-a78c cortex-a65 cortex-a65ae cortex-x1 cortex-x1c **neoverse-n1** ares neoverse-e1 octeontx2 octeontx2t98 octeontx2t96 octeontx2t93 octeontx2f95 octeontx2f95n octeontx2f95mm a64fx tsv110 thunderx3t110 neoverse-v1 zeus neoverse-512tvb saphira cortex-a57.cortex-a53 cortex-a72.cortex-a53 cortex-a73.cortex-a35 cortex-a73.cortex-a53 cortex-a75.cortex-a55 cortex-a76.cortex-a55 cortex-r82 cortex-a510 cortex-a710 cortex-a715 cortex-x2 cortex-x3 neoverse-n2 cobalt-100 neoverse-v2 grace demeter generic
128+
```
129+
130+
Running the same command with `g++-9` shows fewer CPU targets to optimise for, because recently released CPUs such as the Neoverse V2 are omitted.
131+
132+
```
133+
g++-9 -E -mcpu=help -xc /dev/null
134+
cc1: note: valid arguments are: cortex-a35 cortex-a53 cortex-a57 cortex-a72 cortex-a73 thunderx thunderxt88p1 thunderxt88 octeontx octeontx81 octeontx83 thunderxt81 thunderxt83 emag xgene1 falkor qdf24xx exynos-m1 phecda thunderx2t99p1 vulcan thunderx2t99 cortex-a55 cortex-a75 cortex-a76 ares neoverse-n1 neoverse-e1 a64fx tsv110 zeus neoverse-v1 neoverse-512tvb saphira neoverse-n2 cortex-a57.cortex-a53 cortex-a72.cortex-a53 cortex-a73.cortex-a35 cortex-a73.cortex-a53 cortex-a75.cortex-a55 cortex-a76.cortex-a55 generic
135+
```
136+
137+
138+
139+
