Commit db0c5a0

Merge pull request #1441 from jasonrandrews/review
Complete review of WindowsPerf with SPE
2 parents f6f5330 + 0b60dcc commit db0c5a0

5 files changed

+89
-61
lines changed

content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/_index.md

Lines changed: 2 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -1,5 +1,5 @@
11
---
2-
title: Sampling CPython with Arm SPE with WindowsPerf
2+
title: Sampling CPython with WindowsPerf and Arm SPE
33
draft: true
44
cascade:
55
draft: true
@@ -16,7 +16,7 @@ learning_objectives:
1616

1717
prerequisites:
1818
- Windows on Arm desktop or development machine with [WindowsPerf](/install-guides/wperf), [Visual Studio](/install-guides/vs-woa/), and [Git](/install-guides/git-woa/) installed.
19-
- The system must also have an Arm CPU with SPE support.
19+
- The Windows on Arm system must have an Arm CPU with SPE support.
2020

2121
author_primary: Przemyslaw Wirkus
2222

content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/_review.md

Lines changed: 4 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -32,7 +32,7 @@ review:
3232
3333
- questions:
3434
question: >
35-
WindowsPerf can be used and executed only on native ARM64 WOA hardware, and not in a virtual environment.
35+
WindowsPerf can be used and executed only on native Windows on Arm hardware, and not in a virtual environment.
3636
answers:
3737
- "True"
3838
- "False"
@@ -62,7 +62,7 @@ review:
6262
6363
- questions:
6464
question: >
65-
Is load_filter is one of SPE filters supported by WindowsPerf?
65+
load_filter is one of the SPE filters supported by WindowsPerf.
6666
answers:
6767
- "True"
6868
- "False"
@@ -72,7 +72,7 @@ review:
7272
7373
- questions:
7474
question: >
75-
Is store_filter is one of SPE filters supported by WindowsPerf?
75+
store_filter is one of the SPE filters supported by WindowsPerf.
7676
answers:
7777
- "True"
7878
- "False"
@@ -82,7 +82,7 @@ review:
8282
8383
- questions:
8484
question: >
85-
Is branch_filter is one of SPE filters supported by WindowsPerf?
85+
branch_filter is one of the SPE filters supported by WindowsPerf.
8686
answers:
8787
- "True"
8888
- "False"

content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/windowsperf_sampling_cpython_spe.md

Lines changed: 4 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -4,7 +4,7 @@ title: An overview of CPython sampling with SPE
44
weight: 2
55
---
66

7-
In this example, you will build a debug build of CPython from sources and execute simple instructions in the Python interactive mode to obtain WindowsPerf sampling results from the CPython runtime image.
7+
In this example, you will build a debug build of CPython from source and execute simple instructions in the Python interactive mode to obtain WindowsPerf sampling results from the CPython runtime image.
88

99
## Introduction to the Arm Statistical Profiling Extension (SPE)
1010

@@ -21,7 +21,8 @@ WindowsPerf includes `record` support for the Arm Statistical Profiling Extensio
2121
SPE is an optional feature in ARMv8.2 hardware that allows CPU instructions to be sampled and associated with the source code location where that instruction occurred.
2222

2323
{{% notice Note %}}
24-
Currently SPE is available on Windows On Arm in Test Mode only!
24+
SPE is only available on Windows on Arm in Test Mode.
25+
Windows Test Mode is a feature that allows you to install and test drivers that have not been digitally signed by Microsoft.
2526
{{% /notice %}}
2627

2728
## Before you begin
@@ -31,7 +32,7 @@ For this Learning Path you will need:
3132
* A Windows on Arm (ARM64) native machine with pre-installed WindowsPerf (both driver and `wperf` CLI tool). Refer to the [WindowsPerf Install Guide](/install-guides/wperf/) for more details.
3233
* Note: The [WindowsPerf release 3.8.0](https://github.com/arm-developer-tools/windowsperf/releases/tag/3.8.0) includes a separate build with Arm SPE (Statistical Profiling Extension) support enabled. To install this version download release asset and you will find WindowsPerf SPE build in the `SPE/` subdirectory.
3334
* [Visual Studio](/install-guides/vs-woa/) and [Git](/install-guides/git-woa/) installed.
34-
* The CPU must support the Arm SPE extension, an optional feature in ARMv8.2 hardware. You can check your CPU compatibility using the WindowsPerf command-line tool (explained below).
35+
* The CPU must support the Arm SPE extension, an optional feature in ARMv8.2 hardware. You can check your CPU compatibility using the WindowsPerf command-line tool as explained below.
3536

3637
### How do I check if my Arm CPU supports the Arm SPE extension?
3738

content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/windowsperf_sampling_cpython_spe_example_1.md

Lines changed: 66 additions & 42 deletions
Original file line number | Diff line number | Diff line change
@@ -4,33 +4,39 @@ title: WindowsPerf sample using SPE example
44
weight: 3
55
---
66

7-
## Example 1: Sampling of CPython calculating Googolplex using SPE
7+
## Example 1: Sample CPython using SPE
88

9-
{{% notice Note %}}
10-
All the steps in these following sections are done on a native ARM64 Windows on Arm machine.
11-
{{% /notice %}}
9+
You can use the [CPython](https://github.com/python/cpython) binary you built from source in debug mode to compute a large integer number called a [Googolplex](https://en.wikipedia.org/wiki/Googolplex). This is a good way to stress CPython to demonstrate profiling.
10+
11+
The steps are:
12+
- Pin the `python_d.exe` interactive console to an arbitrary CPU core and calculate `10^10^100`.
13+
- Run counting and sampling to obtain event information.
1214

13-
You will use the pre-built [CPython](https://github.com/python/cpython) binaries targeting ARM64 from sources in the debug mode from the previous step and then complete the following:
14-
- Pin `python_d.exe` interactive console to an arbitrary CPU core, calculate `10^10^100` expression, a large integer number [Googolplex](https://en.wikipedia.org/wiki/Googolplex) to stress the CPython application and get a simple workload.
15-
- Run counting and sampling to obtain some simple event information.
15+
### Pin CPython to CPU core 1
1616

17-
### Pin the new CPython process to a CPU core 1
17+
You can use the Windows `start` command to execute the `python_d.exe` process and pin it to CPU core 1.
1818

19-
Use the Windows `start` command to execute and pin `python_d.exe` process to CPU core number 1. Below command is executing computation intensive calculations of `10^10^100`, a [Googolplex](https://en.wikipedia.org/wiki/Googolplex) number, with CPython.
19+
Run the command below at a Windows Command Prompt to execute the computation-intensive calculation:
2020

2121
```command
2222
start /affinity 2 cpython\PCbuild\arm64\python_d.exe -c 10**10**100
2323
```
2424

2525
{{% notice Note %}}
26-
The [start](https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/start) command line switch `/affinity <hexaffinity>` applies the specified processor affinity mask (expressed as a hexadecimal number) to the new application. In our example decimal `2` is `0x02` or `0b0010`. This value denotes core no. `1` as `1` is a first bit in the mask, where the mask is indexed from `0` (zero).
26+
The [start](https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/start) command-line switch `/affinity <hexaffinity>` applies the specified processor affinity mask (expressed as a hexadecimal number) to the new process. In this example the mask `2` is `0x02` or `0b0010`. Bit `1` is the only bit set, and because mask bits are indexed from `0`, this value selects core number `1`.
2727
{{% /notice %}}
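
The mask arithmetic in the note above can be sketched in Python. This is only an illustration: the helper name `affinity_mask` is hypothetical and is not part of WindowsPerf or the Windows `start` command. It sets one bit per zero-based core index and prints the result as the hexadecimal string that `/affinity` expects.

```python
def affinity_mask(cores):
    """Return a hexadecimal affinity mask string for zero-based core indices.

    Hypothetical helper for illustration; not part of WindowsPerf or Windows.
    """
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core index must be non-negative")
        # Set the bit whose position equals the core index.
        mask |= 1 << core
    # Upper-case hexadecimal without the 0x prefix.
    return format(mask, "X")


print(affinity_mask([1]))     # core 1 only, prints 2
print(affinity_mask([0, 3]))  # cores 0 and 3, prints 9
```

Passing `[1]` reproduces the value `2` used in the `start /affinity 2` command above.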
2828

29-
You can use the Windows Task Manager to confirm that `python_d.exe` is running on CPU core no. 1.
29+
You can use the Windows Task Manager to confirm that `python_d.exe` is running on CPU core 1.
30+
31+
### WindowsPerf introduces SPE filters
32+
33+
You can specify SPE filters using the `-e` command line option with `arm_spe_0//`.
34+
35+
The `arm_spe_0/*/` notation is available for the `sample` and `record` commands, where `*` represents a comma-separated list of supported filters.
3036

31-
### SPE introduces new option for command line switch -e arm_spe_0//
37+
Currently, the supported filters are `store_filter=`, `load_filter=`, and `branch_filter=`, along with their short equivalents `st=`, `ld=`, and `b=`. Use `0` or `1` to disable or enable a given filter.
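
As a rough sketch of the notation described above, the following Python snippet assembles an `arm_spe_0/*/` event string from the long filter names or their short equivalents. The function `spe_event` and the `SHORT_NAMES` table are hypothetical names introduced here for illustration; they are not part of WindowsPerf.

```python
# Long filter names and their short equivalents, as listed in the text.
SHORT_NAMES = {"store_filter": "st", "load_filter": "ld", "branch_filter": "b"}


def spe_event(short=False, **filters):
    """Build an arm_spe_0/<filters>/ string (hypothetical helper)."""
    parts = []
    for name, enabled in filters.items():
        if name not in SHORT_NAMES:
            raise ValueError(f"unsupported SPE filter: {name}")
        key = SHORT_NAMES[name] if short else name
        # Each filter takes 1 (enabled) or 0 (disabled).
        parts.append(f"{key}={1 if enabled else 0}")
    # Filters are joined into a comma-separated list between the slashes.
    return "arm_spe_0/" + ",".join(parts) + "/"


print(spe_event(branch_filter=True))
print(spe_event(short=True, store_filter=False, load_filter=False, branch_filter=True))
```

The two calls print `arm_spe_0/branch_filter=1/` and `arm_spe_0/st=0,ld=0,b=1/`, and the resulting string is what you would pass to the `-e` option.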
3238

33-
Users can specify SPE filters using the `-e` command line option with `arm_spe_0//`. We've introduced the `arm_spe_0/*/` notation for the `sample` and `record` command, where `*` represents a comma-separated list of supported filters. Currently, we support filters such as `store_filter=`, `load_filter=`, and `branch_filter=`, or their short equivalents like `st=`, `ld=`, and `b=`. Use `0` or `1` to disable or enable a given filter. For example:
39+
Here are some filter examples:
3440

3541
```output
3642
arm_spe_0/branch_filter=1/
@@ -41,24 +47,31 @@ arm_spe_0/st=0,ld=0,b=1/
4147

4248
#### Filtering sample records
4349

44-
SPE register `PMSFCR_EL1.FT` enables filtering by operation type. When enabled `PMSFCR_EL1.{ST, LD, B}` define the collected types:
50+
The SPE register `PMSFCR_EL1.FT` enables filtering by operation type.
51+
52+
When enabled, the `PMSFCR_EL1.{ST, LD, B}` fields define the collected types:
53+
4554
- `ST` enables collection of store sampled operations, including all atomic operations.
4655
- `LD` enables collection of load sampled operations, including atomic operations that return a value to a register.
4756
- `B` enables collection of branch sampled operations, including direct and indirect branches and exception returns.
4857

49-
### Sampling using SPE the CPython application running the Googolplex calculation on CPU core 1
58+
### Sample CPython using SPE
59+
60+
The command below samples the running `python_d.exe` process.
5061

51-
Below command will sample already running process `python_d.exe` (denoted with `--image_name python_d.exe`) on CPU core no. 1. SPE filter `ld=1` enables collection of load sampled operations, including atomic operations that return a value to a register.
62+
The SPE filter `ld=1` enables collection of load sampled operations, including atomic operations that return a value to a register.
5263

5364
```command
5465
wperf sample -e arm_spe_0/ld=1/ --pe_file cpython\PCbuild\arm64\python_d.exe --image_name python_d.exe -c 1
5566
```
5667

5768
{{% notice Note%}}
58-
You can use the same sampling `--annotate` and `--disassemble` command line interface of WindowsPerf with SPE extension. See example outputs below.
69+
You can use the same `--annotate` and `--disassemble` command line options when sampling with SPE. This is shown in the example output below.
5970
{{% /notice %}}
6071

61-
Please wait a few seconds for the samples to arrive from the Kernel driver and then press `Ctrl+C` to stop sampling. You should see:
72+
Please wait a few seconds for the samples to arrive from the Kernel driver and then press `Ctrl+C` to stop sampling.
73+
74+
You see output similar to:
6275

6376
```output
6477
base address of 'python_d.exe': 0x7ff765fe1288, runtime delta: 0x7ff625fe0000
@@ -84,36 +97,44 @@ note: 'e' - normal event, 'gN' - grouped event with group number N, metric name
8497
9.853 seconds time elapsed
8598
```
8699

87-
{{% notice Note%}}
88-
You can close the command line window with `python_d.exe` running when you have finished sampling. Sampling will also automatically end when the sample process has finished.
89-
{{% /notice %}}
100+
You can close the command line window running `python_d.exe` when you have finished sampling.
90101

102+
Sampling will also automatically end when the sampled process exits.
91103

92104
#### SPE sampling output
93105

94-
- In the above example, you can see that the majority of "overhead" is generated by `python_d.exe` executable resides inside the `python312_d.dll` DLL, in `x_mul` symbol.
95-
- SPE sampling output contains also PMU events for SPE registered during sampling:
96-
- `sample_pop` - Statistical Profiling sample population. Counts statistical profiling sample population, the count of all operations that could be sampled but may or may not be chosen for sampling.
97-
- `sample_feed` - Statistical Profiling sample taken. Counts statistical profiling samples taken for sampling.
98-
- `sample_filtrate` - Statistical Profiling sample taken and not removed by filtering. Counts statistical profiling samples taken which are not removed by filtering.
99-
- `sample_collision` - Statistical Profiling sample collided with previous sample. Counts statistical profiling samples that have collided with a previous sample and so therefore not taken.
100-
- Note that in sampling `....eee....e` is a progressing printout where:
101-
- character `.` represents a SPE sample payload received from the WindowsPerf Kernel driver and
102-
- character `e` represents an unsuccessful attempt (empty SPE fill buffer) to fetch the whole sample payload.
106+
In the output above, you see that the majority of "overhead" generated by `python_d.exe` resides in the `python312_d.dll` DLL, in the `x_mul` symbol.
107+
108+
SPE sampling output also contains the SPE-related PMU events registered during sampling.
109+
110+
Here are some helpful definitions:
111+
112+
- `sample_pop` - Counts statistical profiling sample population, the count of all operations that could be sampled but may or may not be chosen for sampling.
113+
- `sample_feed` - Counts statistical profiling samples taken.
114+
- `sample_filtrate` - Counts statistical profiling samples taken which are not removed by filtering.
115+
- `sample_collision` - Counts statistical profiling samples that have collided with a previous sample and were therefore not taken.
116+
117+
During sampling the `....eee....e` output is a progressing printout where:
118+
- each `.` character represents an SPE sample payload received from the WindowsPerf Kernel driver
119+
- each `e` character represents an unsuccessful attempt (empty SPE fill buffer) to fetch the whole sample payload
103120

104121
{{% notice Note%}}
105-
You can also output `wperf sample` command in JSON format. Use the `--json` command line option to enable the JSON output.
122+
You can also generate `wperf sample` output in JSON format. Use the `--json` command line option to enable the JSON output.
106123
Use the `-v` command line option `verbose` to add more information about sampling.
107124
{{% /notice %}}
108125

109126
#### Example output with annotate enabled
110127

111-
Command line option `--annotate` enables translating addresses taken from samples in sample/record mode into source code line numbers.
128+
The `--annotate` command line option enables translating addresses taken from samples in sample/record mode into source code line numbers.
129+
130+
For example:
112131

113132
```console
114133
wperf sample -e arm_spe_0/ld=1/ --annotate --pe_file cpython\PCbuild\arm64\python_d.exe --image_name python_d.exe -c 1
115134
```
116135

136+
The output is similar to:
137+
117138
```output
118139
base address of 'python_d.exe': 0x7ff765fe1288, runtime delta: 0x7ff625fe0000
119140
sampling ....ee.Ctrl-C received, quit counting...e done!
@@ -142,18 +163,21 @@ x_mul:python312_d.dll
142163
5.199 seconds time elapsed
143164
```
144165

145-
Note: Above SPE sampling pass recorded:
146-
- function `x_mul:python312_d.dll`:
147-
- in source file `C:\path\to\cpython\Objects\longobject.c`, line `3590` as a hot-spot for `load_filter` enabled.
166+
The above SPE sampling pass records that the function `x_mul:python312_d.dll`
167+
in source file `C:\path\to\cpython\Objects\longobject.c`, line `3590` is a hot-spot for the `load_filter`.
148168

149169
#### Example output with disassemble enabled
150170

151-
Command line option `--disassemble` enables disassemble output on sampling mode. Implies `--annotate`.
171+
The `--disassemble` command line option enables disassembly output, and also implies `--annotate`.
172+
173+
For example:
152174

153175
```console
154176
wperf sample -e arm_spe_0/ld=1/ --disassemble --pe_file cpython\PCbuild\arm64\python_d.exe --image_name python_d.exe -c 1
155177
```
156178

179+
The output is similar to:
180+
157181
```output
158182
base address of 'python_d.exe': 0x7ff765fe1288, runtime delta: 0x7ff625fe0000
159183
sampling ......eCtrl-C received, quit counting... done!
@@ -207,9 +231,9 @@ v_isub:python312_d.dll
207231
4.422 seconds time elapsed
208232
```
209233

210-
Note: Above SPE sampling pass recorded:
211-
- function `x_mul:python312_d.dll`:
212-
- in source file `C:\path\to\cpython\Objects\longobject.c`, line `3591`, instruction `ldr x9, [sp, #0x20]` at address `0x4043b4` as potential hot-spot.
213-
- in source file `C:\path\to\cpython\Objects\longobject.c`, line `3589`, instruction `ldr x8, [sp, #0x58]` at address `0x404360` as potential hot-spot.
214-
- Function `v_isub:python312_d.dll`:
215-
- in source file `C:\path\to\cpython\Objects\longobject.c`, line `1603`, instruction `ldr w8, [sp, #0x10]` at address `0x402a60` as potential hot-spot.
234+
The output above shows that the function `x_mul:python312_d.dll` is a hot spot which comes from the following source code lines:
235+
- File `C:\path\to\cpython\Objects\longobject.c`, line `3591`, instruction `ldr x9, [sp, #0x20]` at address `0x4043b4` as potential hot-spot.
236+
- File `C:\path\to\cpython\Objects\longobject.c`, line `3589`, instruction `ldr x8, [sp, #0x58]` at address `0x404360` as potential hot-spot.
237+
238+
Another potential hot spot is in the function `v_isub:python312_d.dll` in the source file `C:\path\to\cpython\Objects\longobject.c`, line `1603`, instruction `ldr w8, [sp, #0x10]` at address `0x402a60`.
239+

content/learning-paths/cross-platform/windowsperf_sampling_cpython_spe/windowsperf_sampling_cpython_spe_example_2.md

Lines changed: 13 additions & 10 deletions
Original file line number | Diff line number | Diff line change
@@ -4,22 +4,23 @@ title: WindowsPerf record using SPE example
44
weight: 4
55
---
66

7-
## Example 2: Using the `record` command to simplify things
7+
## Example 2: Record CPython using SPE
88

9-
- The `record` command spawns the process and pins it to the core specified by the `-c` option.
10-
- A double-dash (`--`) is a syntax used in shell commands to signify end of command options and beginning of positional arguments. In other words, it separates `wperf` CLI options from arguments that the command operates on. Use `--` to separate `wperf.exe` command line options from the process you want to spawn followed by its verbatim arguments.
9+
You can use the `record` command to spawn the Python process and pin it to the core specified by the `-c` option.
10+
11+
A double-dash (`--`) syntax in shell commands signifies the end of command options and beginning of positional arguments. In other words, it separates the `wperf` CLI options from the arguments passed to the profiled program, `python_d.exe`.
12+
13+
Run the `record` command to collect load events using SPE:
1114

1215
```console
1316
wperf record -e arm_spe_0/ld=1/ -c 1 --timeout 5 -- cpython\PCbuild\arm64\python_d.exe -c 10**10**100
1417
```
1518

16-
{{% notice Note%}}
17-
You can use the same sampling `--annotate` and `--disassemble` command line interface of WindowsPerf with SPE extension.
18-
{{% /notice %}}
19+
You can use the same `--annotate` and `--disassemble` command line arguments with the SPE extension.
1920

2021
The WindowsPerf `record` command is versatile, allowing you to start and stop the sampling process easily. It also simplifies the command line syntax, making it user-friendly and efficient.
2122

22-
Example 2 can be replaced by these two commands:
23+
The example above can be replaced by these two commands:
2324

2425
```console
2526
start /affinity 2 cpython\PCbuild\arm64\python_d.exe -c 10**10**100
@@ -28,8 +29,10 @@ wperf sample -e arm_spe_0/ld=1/ --pe_file cpython\PCbuild\arm64\python_d.exe --i
2829

2930
## Summary
3031

31-
WindowsPerf is a versatile performance analysis tool that can support both software (with CPU PMU events) and hardware sampling (with SPE extension). The type of sampling it can perform depends on the availability of the Arm Statistical Profiling Extension (SPE) in the ARM64 CPU. If the Arm SPE extension is present, WindowsPerf can leverage hardware sampling to provide detailed performance insights. Otherwise, it will rely on software sampling to gather performance data. This flexibility ensures that WindowsPerf can adapt to different hardware configurations and still deliver valuable performance metrics.
32+
WindowsPerf is a versatile performance analysis tool supporting both software (with CPU PMU events) and hardware sampling (with the SPE extension).
33+
34+
The type of sampling it can perform depends on the availability of the Arm Statistical Profiling Extension (SPE) in the CPU. If the Arm SPE extension is present, WindowsPerf can leverage hardware sampling to provide detailed performance insights. Otherwise, it will rely on software sampling to gather performance data. This flexibility ensures that WindowsPerf can adapt to different hardware configurations and still deliver valuable performance metrics.
3235

33-
Use `wperf sample`, a sampling mode, for determining the frequencies of event occurrences produced by program locations at the function, basic block, and/or instruction levels.
36+
Use `wperf sample`, the sampling mode, to determine the frequencies of event occurrences produced by program locations at the function, basic block, and/or instruction levels.
3437

35-
Use `wperf record`, same as sample but also automatically spawns the process and pins it to the core specified by `-c`. Process name is defined by COMMAND. User can pass verbatim arguments to the process.
38+
Use `wperf record`, which is the same as `sample` but also automatically spawns the process and pins it to the core specified by `-c`. You can use `record` to pass verbatim arguments to the process.
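
To illustrate how the `--` separator keeps `wperf` options apart from the spawned program and its verbatim arguments, here is a minimal Python sketch that assembles the `record` command line used in this Learning Path as an argument list. The helper `wperf_record_argv` is a hypothetical name, and it uses only the options shown in the examples above (`-e`, `-c`, `--timeout`, and `--`). The sketch only builds and prints the command; it does not run `wperf`.

```python
def wperf_record_argv(event, core, program, program_args, timeout=None):
    """Assemble a `wperf record` argument list (hypothetical helper)."""
    argv = ["wperf", "record", "-e", event, "-c", str(core)]
    if timeout is not None:
        argv += ["--timeout", str(timeout)]
    # Everything after "--" is the program to spawn and its verbatim arguments.
    argv.append("--")
    argv.append(program)
    argv.extend(program_args)
    return argv


argv = wperf_record_argv(
    "arm_spe_0/ld=1/",
    1,
    r"cpython\PCbuild\arm64\python_d.exe",
    ["-c", "10**10**100"],
    timeout=5,
)
print(" ".join(argv))
```

On a Windows on Arm machine with WindowsPerf installed, a list like this could be handed to `subprocess.run`, assuming `wperf` is on the `PATH`.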
