sudo apt install -y which sed make binutils build-essential diffutils gcc g++ bash patch gzip \
14
+
bzip2 perl tar cpio unzip rsync file bc findutils gawk libncurses-dev python-is-python3 \
15
+
gcc-arm-none-eabi
16
+
```
17
+
18
+
## Build a debuggable kernel image
19
+
20
+
For this learning path you will be using [Buildroot](https://github.com/buildroot/buildroot) to build a Linux image for Raspberry Pi 3B+ with a debuggable Linux kernel. You will profile Linux kernel modules built out-of-tree and Linux device drivers built in the Linux source code tree.
21
+
22
+
1. Clone the Buildroot Repository and initialize the build system with the default configurations.
If you're not using a Raspberry Pi 3 for this Learning Path, change the `raspberrypi3_64_defconfig` to the option that matches your hardware in `$(BUILDROOT_HOME)/configs`
32
+
{{% /notice %}}
33
+
34
+
2. You will use `menuconfig` to configure the setup. Invoke it with the following command:
35
+
36
+
```
37
+
make menuconfig
38
+
```
39
+
40
+

41
+
42
+
Change Buildroot configurations to enable debugging symbols and SSH access.
43
+
44
+
```plaintext
45
+
Build options --->
46
+
[*] build packages with debugging symbols
47
+
gcc debug level (debug level 3)
48
+
[*] build packages with runtime debugging info
49
+
gcc optimization level (optimize for debugging) --->
50
+
51
+
System configuration --->
52
+
[*] Enable root login with password
53
+
(****) Root password # Choose root password here
54
+
55
+
Kernel --->
56
+
Linux Kernel Tools --->
57
+
[*] perf
58
+
59
+
Target packages --->
60
+
Networking applications --->
61
+
[*] openssh
62
+
[*] server
63
+
[*] key utilities
64
+
```
65
+
66
+
You might also need to adapt the default `sshd_config` file to your network settings. To do so, set System configuration → Root filesystem overlay directories to a directory that contains your modified `sshd_config` file.
67
+
68
+
3. By default the Linux kernel images are stripped. You will need to make the image debuggable as you'll be using it later.
69
+
70
+
Invoke `linux-menuconfig` and uncheck the option as shown.
71
+
72
+
```bash
73
+
make linux-menuconfig
74
+
```
75
+
76
+
```plaintext
77
+
Kernel hacking --->
78
+
-*- Kernel debugging
79
+
Compile-time checks and compiler options --->
80
+
Debug information (Rely on the toolchain's implicit default DWARF version)
81
+
[ ] Reduce debugging information # un-check
82
+
```
83
+
84
+
4. Now you can build the Linux image and flash it to the SD card to run it on the Raspberry Pi.
85
+
86
+
```bash
87
+
make -j$(nproc)
88
+
```
89
+
90
+
It will take some time to build the Linux image. When it completes, the output will be in `$BUILDROOT_HOME/output/images/sdcard.img`:
91
+
92
+
```bash
93
+
ls $BUILDROOT_HOME/output/images/ | grep sdcard.img
94
+
```
95
+
96
+
For details on flashing the SD card image, see [this helpful article](https://www.ev3dev.org/docs/tutorials/writing-sd-card-image-ubuntu-disk-image-writer/).
97
+
98
+
Now that you have a target running Linux with a debuggable kernel image, you can start writing your kernel module that you want to profile.
content/learning-paths/embedded-and-microcontrollers/streamline-kernel-module/3_oot_module.md
+20 -15 (20 additions & 15 deletions)
@@ -8,15 +8,15 @@ layout: learningpathall
8
8
9
9
## Creating the Linux Kernel Module
10
10
11
-
We will now learn how to create an example Linux kernel module (Character device) that demonstrates a cache miss issue caused by traversing a 2D array in column-major order. This access pattern is not cache-friendly, as it skips over most of the neighboring elements in memory during each iteration.
11
+
You will now create an example Linux kernel module (a character device) that demonstrates a cache-miss issue caused by traversing a 2D array in column-major order. This access pattern is not cache-friendly because each iteration skips over most of the neighboring elements in memory.
12
12
13
-
To build the Linux kernel module, start by creating a new directory—We will call it **example_module**—in any location of your choice. Inside this directory, add two files: `mychardrv.c` and `Makefile`.
13
+
To build the Linux kernel module, start by creating a new directory, for example `example_module`. Inside this directory, add two files: `mychardrv.c` and `Makefile`.
14
14
15
15
**Makefile**
16
16
17
17
```makefile
18
18
obj-m += mychardrv.o
19
-
BUILDROOT_OUT := /opt/rpi-linux/buildroot/output # Change this to your buildroot output directory
19
+
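# Change this to your Buildroot output directory. The comment sits on its own
# line so that no trailing whitespace ends up in the variable's value.
BUILDROOT_OUT := $(BUILDROOT_HOME)/output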
MODULE_DESCRIPTION("A simple char driver with cache misses issue");
202
202
```
203
203
204
-
The module above receives the size of a 2D array as a string through the `char_dev_write()` function, converts it to an integer, and passes it to the `char_dev_cache_traverse()` function. This function then creates the 2D array, initializes it with simple data, traverses it in a column-major (cache-unfriendly) order, computes the sum of its elements, and prints the result to the kernel log.
204
+
The module above receives the size of a 2D array as a string through the `char_dev_write()` function, converts it to an integer, and passes it to the `char_dev_cache_traverse()` function. This function then creates the 2D array, initializes it with simple data, traverses it in a column-major (cache-unfriendly) order, computes the sum of its elements, and prints the result to the kernel log. This cache-unfriendly access pattern gives you a bottleneck to inspect with Streamline in the next section.
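The difference between the two traversal orders can be illustrated with a small user-space sketch. This is not the module's code (the module source is elided in this diff); the function names `matrix_create`, `sum_row_major`, and `sum_col_major` are hypothetical, and the array is assumed to be one contiguous block of `int` values.

```c
#include <stddef.h>
#include <stdlib.h>

/* Allocate an n x n matrix as one contiguous block and fill it with 1s.
 * Returns NULL if the allocation fails. (Hypothetical helper.) */
int *matrix_create(size_t n)
{
    int *m = malloc(n * n * sizeof(*m));
    if (m == NULL)
        return NULL;
    for (size_t i = 0; i < n * n; i++)
        m[i] = 1;
    return m;
}

/* Row-major traversal: the inner loop visits adjacent addresses,
 * so consecutive accesses tend to fall in the same cache line. */
long sum_row_major(const int *m, size_t n)
{
    long sum = 0;
    for (size_t row = 0; row < n; row++)
        for (size_t col = 0; col < n; col++)
            sum += m[row * n + col];
    return sum;
}

/* Column-major traversal: the inner loop jumps n elements per step,
 * so for large n each access tends to touch a different cache line. */
long sum_col_major(const int *m, size_t n)
{
    long sum = 0;
    for (size_t col = 0; col < n; col++)
        for (size_t row = 0; row < n; row++)
            sum += m[row * n + col];
    return sum;
}
```

Both functions return the same sum; only the order of memory accesses differs, which is what makes the column-major variant a useful profiling target.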
205
205
206
206
## Building and Running the Kernel Module
207
207
208
208
1. To compile the kernel module, run `make` inside the `example_module` directory. This generates the output file `mychardrv.ko`.
209
209
210
-
2. Transfer the .ko file to the target using scp command and then insert it using insmod command. After inserting the module, we create a character device node using mknod command. Finally, we can test the module by writing a size value (e.g., 10000) to the device file and measuring the time taken for the operation using the `time` command.
210
+
2. Transfer the `.ko` file to the target using the `scp` command and then insert it using the `insmod` command. After inserting the module, create a character device node using the `mknod` command. Finally, test the module by writing a size value (for example, 10000) to the device file and measuring the time taken for the operation using the `time` command.
211
211
212
212
```bash
213
213
scp mychardrv.ko root@<target-ip>:/root/
214
214
```
215
215
216
216
{{% notice Note %}}
217
-
Replace \<target-ip> with your own target IP address
217
+
Replace \<target-ip> with your target's IP address
218
218
{{% /notice %}}
219
219
220
-
3. To run the module on the target, we need to run the following commands on the target:
220
+
3. SSH onto your target device:
221
221
222
222
```bash
223
223
ssh root@<target-ip>
224
-
225
-
#The following commands should be running on target device
226
-
224
+
```
225
+
226
+
4. Execute the following commands on the target to run the module:
227
+
```bash
227
228
insmod /root/mychardrv.ko
228
229
mknod /dev/mychardrv c 42 0
229
230
```
230
231
231
232
{{% notice Note %}}
232
-
42 and 0 are the major and minor number we chose in our module code above
233
+
42 and 0 are the major and minor numbers specified in the module code above
233
234
{{% /notice %}}
234
235
235
-
4. Now if you run dmesg you should see something like:
236
+
5. To verify that the module is active, run `dmesg`. The output should look similar to the following:
237
+
238
+
```bash
239
+
dmesg
240
+
```
236
241
237
-
```log
242
+
```output
238
243
[12381.654983] mychardrv is open - Major(42) Minor(0)
239
244
```
240
245
@@ -249,4 +254,4 @@ The module above receives the size of a 2D array as a string through the `char_d
249
254
250
255
The command above passes 10000 to the module, which specifies the size of the 2D array to be created and traversed. The **echo** command takes a long time to complete (around 38 seconds) due to the cache-unfriendly traversal implemented in the `char_dev_cache_traverse()` function.
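A rough back-of-the-envelope check shows why a size of 10000 is so costly. The sketch below assumes 4-byte `int` elements stored contiguously and a 64-byte L1 data cache line; the module source is not shown in this diff, so these values and the function names are assumptions for illustration only.

```c
#include <stddef.h>

/* Distance in bytes between two vertically adjacent elements
 * (same column, consecutive rows) of a contiguous n x n array. */
size_t column_stride_bytes(size_t n, size_t elem_size)
{
    return n * elem_size;
}

/* Returns 1 when every step down a column is guaranteed to land on a
 * different cache line, that is, when the stride is at least one line. */
int column_step_leaves_cache_line(size_t n, size_t elem_size, size_t line_size)
{
    return column_stride_bytes(n, elem_size) >= line_size;
}
```

Under these assumptions, `column_stride_bytes(10000, 4)` is 40000 bytes, far larger than a 64-byte line, so each step of the column-major inner loop lands on a different cache line.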
251
256
252
-
With the kernel module built, the next step is to profile it using Arm Streamline. We will use it to capture runtime behavior, highlight performance bottlenecks, and help identifying issues such as the cache-unfriendly traversal in our module.
257
+
With the kernel module built, the next step is to profile it using Arm Streamline. You will use it to capture runtime behavior, highlight performance bottlenecks, and help identify issues such as the cache-unfriendly traversal in your module.
content/learning-paths/embedded-and-microcontrollers/streamline-kernel-module/4_sl_profile_oot.md
+22 -12 (22 additions & 12 deletions)
@@ -10,25 +10,33 @@ layout: learningpathall
10
10
11
11
Arm Streamline is a tool that uses sampling to measure system performance. Instead of recording every single event (like instrumentation does, which can slow things down), it takes snapshots of hardware counters and system registers at regular intervals. This gives a statistical view of how the system runs, while keeping the overhead small.
12
12
13
-
Streamline tracks many performance metrics such as CPU usage, execution cycles, memory access, cache hits and misses, and GPU activity. By putting this information together, it helps developers see how their code is using the hardware. Captured data is presented on a timeline, so you can see how performance changes as your program runs. This makes it easier to notice patterns, find bottlenecks, and link performance issues to specific parts of your application.
13
+
Streamline tracks performance metrics such as CPU usage, execution cycles, memory access, cache hits and misses, and GPU activity. By putting this information together, it helps developers see how their code is using the hardware. Captured data is presented on a timeline, so you can see how performance changes as your program runs. This makes it easier to notice patterns, find bottlenecks, and link performance issues to specific parts of your application.
14
14
15
15
For more details about Streamline and its features, refer to the [Streamline user guide](https://developer.arm.com/documentation/101816/latest/Getting-started-with-Streamline/Introduction-to-Streamline).
16
16
17
-
Streamline is included with Arm Performance Studio, which you can download and use for free from [Arm Performance Studio downloads](https://developer.arm.com/Tools%20and%20Software/Arm%20Performance%20Studio#Downloads).
17
+
### Download Streamline
18
+
19
+
Streamline is included with Arm Performance Studio, which you can download and use for free. Download it by following the link below.
20
+
21
+
[Arm Performance Studio downloads](https://developer.arm.com/Tools%20and%20Software/Arm%20Performance%20Studio#Downloads).
18
22
19
23
For step-by-step guidance on setting up Streamline on your host machine, follow the installation instructions provided in [Streamline installation guide](https://developer.arm.com/documentation/101816/latest/Getting-started-with-Streamline/Install-Streamline).
20
24
21
25
### Pushing Gator to the Target and Making a Capture
22
26
23
-
Once Streamline is installed on the host machine, you can capture trace data of our Linux kernel module.
27
+
Once Streamline is installed on the host machine, you can capture trace data of your Linux kernel module. On Linux, the binaries are located in the directory where you extracted the package.
24
28
25
29
1. To communicate with the target, Streamline requires a daemon, called **gatord**, to be installed and running on the target. gatord must be running before you can capture trace data. There are two pre-built gatord binaries available in Streamline's install directory, one for *Armv7 (AArch32)* and one for *Armv8 or later (AArch64)*. Push **gatord** to the target device using **scp**.
5. Click on *Select counters* to open the counter configuration dialogue, to learn more about counters and how to configure them please refer to [counter configuration guide](https://developer.arm.com/documentation/101816/latest/Capture-a-Streamline-profile/Counter-Configuration)
53
+
5. Click on *Select counters* to open the counter configuration dialogue.
46
54
47
55
6. Add `L1 data Cache: Refill` and `L1 Data Cache: Access`, enable Event-Based Sampling (EBS) for both of them as shown in the screenshot, and click *Save*.
48
56
49
-
{{% notice %}}
57
+
{{% notice Further reading %}}
58
+
To learn more about counters and how to configure them, refer to the [counter configuration guide](https://developer.arm.com/documentation/101816/latest/Capture-a-Streamline-profile/Counter-Configuration).
59
+
50
60
To learn more about EBS, please refer to [Streamline user guide](https://developer.arm.com/documentation/101816/9-7/Capture-a-Streamline-profile/Counter-Configuration/Setting-up-event-based-sampling)
9. Start the capture and enter a name and location for the capture file. Streamline will start collecting data and the charts will show activity being captured from the target.
@@ -70,21 +80,21 @@ Once Streamline is installed on the host machine, you can capture trace data of
70
80
71
81
Once the capture is stopped, Streamline automatically analyzes the collected data and provides insights to help identify performance issues and bottlenecks. This section describes how to view these insights, starting with locating the functions related to your kernel module and narrowing down to the exact lines of code that may be responsible for the performance problems.
72
82
73
-
1. Open the *Functions tab*. In the counters list, selectone of the counters we selected earlier in the counter configuration dialog, as shown:
83
+
1. Open the *Functions tab*. In the counters list, select one of the counters you selected earlier in the counter configuration dialog, as shown:
3. To view the call path of this function, right-click on the function name and choose *Select in Call Paths*.
83
93
84
-
4. You can now see the exact functionthat called `char_dev_cache_traverse()`. In the Locations column, notice that the functioncalls started in the userspace (echo command) and terminated in the kernel space module `mychardrv.ko`:
94
+
4. You can now see the exact function that called `char_dev_cache_traverse()`. In the Locations column, notice that the function calls started in user space (`echo` command) and terminated in the kernel space module `mychardrv.ko`:
5. Since we compiled our kernel module with debug info, we will be able to see the exact code lines that are causing these cache misses.
97
+
5. Since you compiled the kernel module with debug info, you will be able to see the exact code lines that are causing these cache misses.
88
98
To do so, double-click on the function name and the *Code tab* opens. This view shows you how much each code line contributed to the cache misses. In the bottom half of the code view, you can also see the disassembly of these lines with the counter values of each assembly instruction:
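When reading the two counters selected earlier, it can help to combine them into a single figure. The sketch below assumes that a refill corresponds to a cache line loaded after a miss, so that refills divided by accesses approximates the L1 data cache miss ratio. The function name is hypothetical and this is not a Streamline API.

```c
/* Approximate L1 data cache miss ratio, in percent, from two counter
 * values: refills (lines loaded after a miss) and total accesses.
 * Returns 0.0 when no accesses were recorded to avoid dividing by zero. */
double l1d_miss_ratio_percent(unsigned long long refills,
                              unsigned long long accesses)
{
    if (accesses == 0)
        return 0.0;
    return 100.0 * (double)refills / (double)accesses;
}
```

For example, 25 refills out of 100 accesses gives a ratio of 25 percent; a noticeably higher ratio in `char_dev_cache_traverse()` than in the rest of the system would be consistent with the cache-unfriendly traversal.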