Original file line number Diff line number Diff line change
@@ -0,0 +1,98 @@
---
title: Build Linux image
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Install packages

```bash
sudo apt update
sudo apt install -y which sed make binutils build-essential diffutils gcc g++ bash patch gzip \
bzip2 perl tar cpio unzip rsync file bc findutils gawk libncurses-dev python-is-python3 \
gcc-arm-none-eabi
```

## Build a debuggable kernel image

For this learning path you will be using [Buildroot](https://github.com/buildroot/buildroot) to build a Linux image for Raspberry Pi 3B+ with a debuggable Linux kernel. You will profile Linux kernel modules built out-of-tree and Linux device drivers built in the Linux source code tree.

1. Clone the Buildroot Repository and initialize the build system with the default configurations.

```bash
git clone https://github.com/buildroot/buildroot.git
cd buildroot
export BUILDROOT_HOME=$(pwd)
make raspberrypi3_64_defconfig
```
{{% notice Using a different board %}}
If you're not using a Raspberry Pi 3 for this Learning Path, replace `raspberrypi3_64_defconfig` with the configuration in `$BUILDROOT_HOME/configs` that matches your hardware.
{{% /notice %}}
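
To find the configuration name for another board, you can search the `configs` directory. The helper below is a minimal sketch; the function name `list_board_defconfigs` is hypothetical and not part of Buildroot.

```bash
# Hypothetical helper (not part of Buildroot): print the defconfig file names
# in <buildroot_dir>/configs whose name contains the given keyword.
list_board_defconfigs() {
    buildroot_dir=$1
    keyword=$2
    for f in "$buildroot_dir"/configs/*"$keyword"*_defconfig; do
        # Skip the unexpanded pattern when nothing matches
        [ -e "$f" ] && basename "$f"
    done
    return 0
}

# Example: list_board_defconfigs "$BUILDROOT_HOME" raspberrypi
```

Pass one of the printed names to `make` in place of `raspberrypi3_64_defconfig`. Buildroot also provides `make list-defconfigs`, which prints every available configuration.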

2. You will use `menuconfig` to configure the setup. Invoke it with the following command:

```bash
make menuconfig
```

![Menuconfig UI for Buildroot configuration](./images/menuconfig.png)

Change Buildroot configurations to enable debugging symbols and SSH access.

```plaintext
Build options --->
[*] build packages with debugging symbols
gcc debug level (debug level 3)
[*] build packages with runtime debugging info
gcc optimization level (optimize for debugging) --->

System configuration --->
[*] Enable root login with password
(****) Root password # Choose root password here

Kernel --->
Linux Kernel Tools --->
[*] perf

Target packages --->
Networking applications --->
[*] openssh
[*] server
[*] key utilities
```

You might also need to change your default `sshd_config` file according to your network settings. To do that, you need to modify System configuration→ Root filesystem overlay directories to add a directory that contains your modified `sshd_config` file.
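
The overlay step can be sketched as follows. This is an illustration under assumptions: the helper name `stage_sshd_overlay` and the overlay directory name are hypothetical, and it assumes that `sshd` on the target reads its configuration from `/etc/ssh/sshd_config`.

```bash
# Hypothetical helper: stage a custom sshd_config inside an overlay directory.
# Buildroot copies the overlay's contents over the target root filesystem,
# so the file must sit at the same path it should have on the target.
stage_sshd_overlay() {
    overlay_dir=$1      # for example "$BUILDROOT_HOME/board/my_overlay" (assumed name)
    sshd_config_src=$2  # your edited sshd_config on the host
    mkdir -p "$overlay_dir/etc/ssh" &&
        cp "$sshd_config_src" "$overlay_dir/etc/ssh/sshd_config"
}

# Example: stage_sshd_overlay "$BUILDROOT_HOME/board/my_overlay" ./sshd_config
```

After staging the file, enter the overlay directory path in the Root filesystem overlay directories option so that the next build picks it up.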

3. By default the Linux kernel images are stripped. You will need to make the image debuggable as you'll be using it later.

Invoke `linux-menuconfig` and uncheck the option as shown.

```bash
make linux-menuconfig
```

```plaintext
Kernel hacking --->
-*- Kernel debugging
Compile-time checks and compiler options --->
Debug information (Rely on the toolchain's implicit default DWARF version)
[ ] Reduce debugging information # un-check
```

4. Now you can build the Linux image and flash it to the SD card to run it on the Raspberry Pi.

```bash
make -j$(nproc)
```

It will take some time to build the Linux image. When it completes, the output will be in `$BUILDROOT_HOME/output/images/sdcard.img`:

```bash
ls $BUILDROOT_HOME/output/images/ | grep sdcard.img
```
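
You may also want to confirm that the kernel image kept its debug information. The sketch below filters the section table printed by `readelf` (part of the `binutils` package installed earlier) for DWARF sections. The helper name `list_debug_sections` is hypothetical, and the `vmlinux` path assumes the kernel was built under `output/build/linux-custom`, which may differ for your configuration.

```bash
# Hypothetical helper: read `readelf -S` output on stdin and print the
# distinct DWARF section names (.debug_*) it mentions, one per line.
list_debug_sections() {
    grep -o '\.debug_[a-z_]*' | sort -u
}

# Example, assuming the kernel was built under output/build/linux-custom:
#   readelf -S "$BUILDROOT_HOME/output/build/linux-custom/vmlinux" | list_debug_sections
```

If the command prints nothing, the image has no DWARF sections and the kernel debug options are worth rechecking.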

For details on flashing the SD card image, see [this helpful article](https://www.ev3dev.org/docs/tutorials/writing-sd-card-image-ubuntu-disk-image-writer/).

Now that you have a target running Linux with a debuggable kernel image, you can start writing your kernel module that you want to profile.
Expand Up @@ -8,15 +8,15 @@ layout: learningpathall

## Creating the Linux Kernel Module

We will now learn how to create an example Linux kernel module (Character device) that demonstrates a cache miss issue caused by traversing a 2D array in column-major order. This access pattern is not cache-friendly, as it skips over most of the neighboring elements in memory during each iteration.
You will now create an example Linux kernel module (Character device) that demonstrates a cache miss issue caused by traversing a 2D array in column-major order. This access pattern is not cache-friendly, as it skips over most of the neighboring elements in memory during each iteration.

To build the Linux kernel module, start by creating a new directory—We will call it **example_module**—in any location of your choice. Inside this directory, add two files: `mychardrv.c` and `Makefile`.
To build the Linux kernel module, start by creating a new directory, for example `example_module`. Inside this directory, add two files: `mychardrv.c` and `Makefile`.

**Makefile**

```makefile
obj-m += mychardrv.o
BUILDROOT_OUT := /opt/rpi-linux/buildroot/output # Change this to your buildroot output directory
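# Change this to your Buildroot output directory. Keep this comment on its own line:
# make would otherwise include the whitespace before a trailing comment in the value.
BUILDROOT_OUT := $(BUILDROOT_HOME)/output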
KDIR := $(BUILDROOT_OUT)/build/linux-custom
CROSS_COMPILE := $(BUILDROOT_OUT)/host/bin/aarch64-buildroot-linux-gnu-
ARCH := arm64
Expand All @@ -29,7 +29,7 @@ clean:
```

{{% notice Note %}}
Change **BUILDROOT_OUT** to the correct buildroot output directory on your host machine
Change **BUILDROOT_OUT** to the correct buildroot output directory on your host machine.
{{% /notice %}}

**mychardrv.c**
Expand Down Expand Up @@ -201,40 +201,45 @@ MODULE_AUTHOR("Yahya Abouelseoud");
MODULE_DESCRIPTION("A simple char driver with cache misses issue");
```

The module above receives the size of a 2D array as a string through the `char_dev_write()` function, converts it to an integer, and passes it to the `char_dev_cache_traverse()` function. This function then creates the 2D array, initializes it with simple data, traverses it in a column-major (cache-unfriendly) order, computes the sum of its elements, and prints the result to the kernel log.
The module above receives the size of a 2D array as a string through the `char_dev_write()` function, converts it to an integer, and passes it to the `char_dev_cache_traverse()` function. This function then creates the 2D array, initializes it with simple data, traverses it in a column-major (cache-unfriendly) order, computes the sum of its elements, and prints the result to the kernel log. The cache-unfriendly traversal gives you a bottleneck to inspect with Streamline in the next section.

## Building and Running the Kernel Module

1. To compile the kernel module, run `make` inside the `example_module` directory. This will generate the output file `mychardrv.ko`.

2. Transfer the .ko file to the target using scp command and then insert it using insmod command. After inserting the module, we create a character device node using mknod command. Finally, we can test the module by writing a size value (e.g., 10000) to the device file and measuring the time taken for the operation using the `time` command.
2. Transfer the `.ko` file to the target using the `scp` command, then insert it using the `insmod` command. After inserting the module, create a character device node using the `mknod` command. Finally, test the module by writing a size value (for example, 10000) to the device file and measuring the time taken for the operation using the `time` command.

```bash
scp mychardrv.ko root@<target-ip>:/root/
```

{{% notice Note %}}
Replace \<target-ip> with your own target IP address
Replace \<target-ip> with your target's IP address.
{{% /notice %}}

3. To run the module on the target, we need to run the following commands on the target:
3. SSH onto your target device:

```bash
ssh root@<your-target-ip>

# Run the following commands on the target device

```

4. Execute the following commands on the target to run the module:
```bash
insmod /root/mychardrv.ko
mknod /dev/mychardrv c 42 0
```

{{% notice Note %}}
42 and 0 are the major and minor number we chose in our module code above
42 and 0 are the major and minor numbers specified in the module code above.
{{% /notice %}}
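
To double-check the major number before running `mknod`, you can look the driver up in `/proc/devices` on the target. This is a minimal sketch: the helper name `major_of` is hypothetical, it assumes the module registers itself under the name `mychardrv`, and it assumes `awk` is available on the target.

```bash
# Hypothetical helper: read /proc/devices-style text on stdin and print the
# major number registered under the given driver name.
major_of() {
    awk -v name="$1" '$2 == name { print $1 }'
}

# Example on the target, assuming the module registers the name "mychardrv":
#   major_of mychardrv < /proc/devices
```

The printed number should match the major number you pass to `mknod`.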

4. Now if you run dmesg you should see something like:
5. To verify that the module is active, run `dmesg`. The output should look similar to the following:

```bash
dmesg
```

```log
```output
[12381.654983] mychardrv is open - Major(42) Minor(0)
```

Expand All @@ -249,4 +254,4 @@ The module above receives the size of a 2D array as a string through the `char_d

The command above passes 10000 to the module, which specifies the size of the 2D array to be created and traversed. The **echo** command takes a long time to complete (around 38 seconds) due to the cache-unfriendly traversal implemented in the `char_dev_cache_traverse()` function.
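
To see why the column-major walk is cache-unfriendly, it helps to compare the distance in memory between two consecutive accesses. The sketch below is illustrative arithmetic only: the helper name `stride_bytes` is hypothetical, and the 4-byte element size assumes the array holds `int` values, which the module code may not match.

```bash
# Bytes between two consecutive accesses when walking an N x N array.
# Row-major order steps by one element; column-major order steps by a full row.
stride_bytes() {
    order=$1       # "row" or "column"
    n=$2           # array dimension N
    elem_size=$3   # bytes per element (4 assumes an int element)
    case $order in
        row)    echo "$elem_size" ;;
        column) echo $(( n * elem_size )) ;;
        *)      echo "unknown order: $order" >&2; return 1 ;;
    esac
}

stride_bytes row 10000 4      # prints 4
stride_bytes column 10000 4   # prints 40000
```

With a 64-byte cache line, as on the Cortex-A53 cores of the Raspberry Pi 3B+, a 4-byte stride reuses each fetched line for 16 elements, while a 40000-byte stride lands on a different line for every access.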

With the kernel module built, the next step is to profile it using Arm Streamline. We will use it to capture runtime behavior, highlight performance bottlenecks, and help identifying issues such as the cache-unfriendly traversal in our module.
With the kernel module built, the next step is to profile it using Arm Streamline. You will use it to capture runtime behavior, highlight performance bottlenecks, and help identify issues such as the cache-unfriendly traversal in your module.
Expand Up @@ -10,25 +10,33 @@ layout: learningpathall

Arm Streamline is a tool that uses sampling to measure system performance. Instead of recording every single event (like instrumentation does, which can slow things down), it takes snapshots of hardware counters and system registers at regular intervals. This gives a statistical view of how the system runs, while keeping the overhead small.

Streamline tracks many performance metrics such as CPU usage, execution cycles, memory access, cache hits and misses, and GPU activity. By putting this information together, it helps developers see how their code is using the hardware. Captured data is presented on a timeline, so you can see how performance changes as your program runs. This makes it easier to notice patterns, find bottlenecks, and link performance issues to specific parts of your application.
Streamline tracks performance metrics such as CPU usage, execution cycles, memory access, cache hits and misses, and GPU activity. By putting this information together, it helps developers see how their code is using the hardware. Captured data is presented on a timeline, so you can see how performance changes as your program runs. This makes it easier to notice patterns, find bottlenecks, and link performance issues to specific parts of your application.

For more details about Streamline and its features, refer to the [Streamline user guide](https://developer.arm.com/documentation/101816/latest/Getting-started-with-Streamline/Introduction-to-Streamline).

Streamline is included with Arm Performance Studio, which you can download and use for free from [Arm Performance Studio downloads](https://developer.arm.com/Tools%20and%20Software/Arm%20Performance%20Studio#Downloads).
### Download Streamline

Streamline is included with Arm Performance Studio, which you can download and use for free. Download it by following the link below.

[Arm Performance Studio downloads](https://developer.arm.com/Tools%20and%20Software/Arm%20Performance%20Studio#Downloads).

For step-by-step guidance on setting up Streamline on your host machine, follow the installation instructions provided in [Streamline installation guide](https://developer.arm.com/documentation/101816/latest/Getting-started-with-Streamline/Install-Streamline).

### Pushing Gator to the Target and Making a Capture

Once Streamline is installed on the host machine, you can capture trace data of our Linux kernel module.
Once Streamline is installed on the host machine, you can capture trace data for the Linux kernel module. On Linux, the binaries are located in the directory where you extracted the package.

1. To communicate with the target, Streamline requires a daemon, called **gatord**, to be installed and running on the target. gatord must be running before you can capture trace data. There are two pre-built gatord binaries available in Streamline's install directory, one for *Armv7 (AArch32)* and one for *Armv8 or later (AArch64)*. Push **gatord** to the target device using **scp**.

```bash
scp <install_directory>/streamline/bin/linux/arm64/gatord root@<target-ip>:/root/gatord
# use arm instead of arm64 if you are using an AArch32 target
```

{{% notice Note %}}
If you are using an AArch32 target, use `arm` instead of `arm64`.
{{% /notice %}}


2. Run gatord on the target to start system-wide capture mode.

```bash
Expand All @@ -42,25 +50,27 @@ Once Streamline is installed on the host machine, you can capture trace data of
4. Enter your target hostname or IP address.
![Streamline TCP settings#center](./images/img02_streamline_tcp.png)

5. Click on *Select counters* to open the counter configuration dialogue, to learn more about counters and how to configure them please refer to [counter configuration guide](https://developer.arm.com/documentation/101816/latest/Capture-a-Streamline-profile/Counter-Configuration)
5. Click on *Select counters* to open the counter configuration dialogue.

6. Add `L1 data Cache: Refill` and `L1 Data Cache: Access` and enable Event-Based Sampling (EBS) for both of them as shown in the screenshot and click *Save*.

{{% notice %}}
{{% notice Further reading %}}
To learn more about counters and how to configure them, refer to the [counter configuration guide](https://developer.arm.com/documentation/101816/latest/Capture-a-Streamline-profile/Counter-Configuration).

To learn more about EBS, refer to the [Streamline user guide](https://developer.arm.com/documentation/101816/9-7/Capture-a-Streamline-profile/Counter-Configuration/Setting-up-event-based-sampling).
{{% /notice %}}

![Counter configuration#center](./images/img03_counter_config.png)

7. In the Command section, we will add the same shell command we used earlier to test our Linux module.
7. In the Command section, add the same shell command you used earlier to test the Linux module.

```bash
sh -c "echo 10000 > /dev/mychardrv"
```

![Streamline command#center](./images/img04_streamline_cmd.png)

8. In the Capture settings dialog, select Add image, add your kernel module file `mychardrv.ko` and click Save.
8. In the Capture settings dialog, select Add image, add the absolute path of your kernel module file `mychardrv.ko` and click Save.
![Capture settings#center](./images/img05_capture_settings.png)

9. Start the capture and enter a name and location for the capture file. Streamline will start collecting data and the charts will show activity being captured from the target.
Expand All @@ -70,21 +80,21 @@ Once Streamline is installed on the host machine, you can capture trace data of

Once the capture is stopped, Streamline automatically analyzes the collected data and provides insights to help identify performance issues and bottlenecks. This section describes how to view these insights, starting with locating the functions related to our kernel module and narrowing down to the exact lines of code that may be responsible for the performance problems.

1. Open the *Functions tab*. In the counters list, select one of the counters we selected earlier in the counter configuration dialog, as shown:
1. Open the *Functions tab*. In the counters list, select one of the counters you selected earlier in the counter configuration dialog, as shown:

![Counter selection#center](./images/img07_select_datasource.png)

2. In the Functions tab, observe that the function `char_dev_cache_traverse()` has the highest L1 Cache refill rate, which we already expected.
2. In the Functions tab, observe that the function `char_dev_cache_traverse()` has the highest L1 Cache refill rate, which is expected.
Also notice the Image name on the right, which is our module file name `mychardrv.ko`:

![Functions tab#center](./images/img08_Functions_Tab.png)

3. To view the call path of this function, right click on the function name and choose *Select in Call Paths*.

4. You can now see the exact function that called `char_dev_cache_traverse()`. In the Locations column, notice that the function calls started in the userspace (echo command) and terminated in the kernel space module `mychardrv.ko`:
4. You can now see the exact function that called `char_dev_cache_traverse()`. In the Locations column, notice that the function calls started in the userspace (`echo` command) and terminated in the kernel space module `mychardrv.ko`:
![Call paths tab#center](./images/img09_callpaths_tab.png)

5. Since we compiled our kernel module with debug info, we will be able to see the exact code lines that are causing these cache misses.
5. Since you compiled the kernel module with debug info, you will be able to see the exact code lines that are causing these cache misses.
To do so, double-click on the function name and the *Code tab* opens. This view shows you how much each code line contributed to the cache misses. In the bottom half of the code view, you can also see the disassembly of these lines with the counter values of each assembly instruction:
![Code tab#center](./images/img10_code_tab.png)
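
The two counters selected earlier can be related with a simple ratio: refills divided by accesses gives the share of L1 data cache accesses that missed. The sketch below shows the arithmetic only. The helper name `refill_percent` is hypothetical and the counter values are made up, not taken from a real capture.

```bash
# Integer percentage of L1 data cache accesses that caused a refill.
refill_percent() {
    refills=$1
    accesses=$2
    if [ "$accesses" -eq 0 ]; then
        echo "no accesses recorded" >&2
        return 1
    fi
    echo $(( refills * 100 / accesses ))
}

# Made-up counter values, not taken from a real capture
refill_percent 250 1000   # prints 25
```

A function with a much higher percentage than the rest of the capture, as you would expect for `char_dev_cache_traverse()`, is a good candidate for a closer look in the Code tab.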
