Commit 68b5973

Merge tag 'perf-tools-for-v6.11-2024-07-16' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
Pull perf tools updates from Namhyung Kim:
 "Build:

   - Build each directory as a library so that the dependency check for
     the python extension module can be automatic

   - Use pkg-config to check libtraceevent and libtracefs

  perf sched:

   - Add --task-name and --fuzzy-name options for `perf sched map`

     It focuses on selected tasks only by removing unrelated tasks in
     the output. It matches the task comm with the given string and the
     --fuzzy-name option allows partial matching:

       $ sudo perf sched record -a sleep 1
       $ sudo perf sched map --task-name kworker --fuzzy-name
        .   .   .   .   -  *A0  .   .    481065.315131 secs A0 => kworker/5:2-i91:438521
        .   .   .   .   -  *-   .   .    481065.315160 secs
       *B0  .   .   .   -   .   .   .    481065.316435 secs B0 => kworker/0:0-i91:437860
       *-   .   .   .   .   .   .   .    481065.316441 secs
        .   .   .   .   .  *A0  .   .    481065.318703 secs
        .   .   .   .   .  *-   .   .    481065.318717 secs
        .   .  *C0  .   .   .   .   .    481065.320544 secs C0 => kworker/u16:30-:430186
        .   .  *-   .   .   .   .   .    481065.320555 secs
        .   .  *D0  .   .   .   .   .    481065.328524 secs D0 => kworker/2:0-kdm:429654
       *B0  .   D0  .   -   .   .   .    481065.328527 secs
       *-   .   D0  .   -   .   .   .    481065.328535 secs
        .   .  *-   .   .   .   .   .    481065.328535 secs

   - Fix -r/--repeat option of perf sched replay

     The documentation said -1 will work as infinity but it didn't
     accept the value. Update the code and document to use 0 instead

   - Fix perf sched timehist to account the delay time for preempted
     tasks

  Perf event filtering:

   - perf top gained filtering support on regular events using BPF like
     perf record. Previously it was able to use it for tracepoints only

   - The BPF filter now supports filtering by UID/GID. This should be
     preferred over the -u <UID> option as it's racy to scan /proc to
     check tasks for the user and it fails to open an event for the task
     if it's already gone

       $ sudo perf top -e cycles --filter "uid == $(id -u)"

  perf report:

   - Skip dummy events in the group output by default. The --skip-empty
     option controls display of empty events without samples. But perf
     report can force display of all events in a group

     In this case, an auto-added dummy event (for a system-wide record)
     ends up in the output. Now it can skip those empty events even in
     the group display mode

     To preserve the old behavior, run this:

       $ perf report --group --no-skip-empty

  perf stat:

   - Choose the most disaggregated option when multiple aggregation
     options are given. It used to pick the last option in the command
     line but it can be confusing and not consistent. Now it'll choose
     the smallest unit

     For example, it'd aggregate the result per-core when the user gave
     both --per-socket and --per-core options at the same time

  Internals:

   - Fix `perf bench` when some CPUs are offline

   - Fix handling of JIT symbol mappings to accept
     "/tmp/perf-${PID}.map" patterns only so that it can not be confused
     by other /tmp/perf-* files

   - Many improvements and fixes for `perf test`

  Others:

   - Support some new instructions for Intel-PT

   - Fix syscall ID mapping in perf trace

   - Document AMD IBS PMU usages

   - Change `perf lock info` to show map and thread info by default

  Vendor JSON events:

   - Update Intel events and metrics

   - Add i.MX9[35] DDR metrics"

* tag 'perf-tools-for-v6.11-2024-07-16' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (125 commits)
  perf trace: Fix iteration of syscall ids in syscalltbl->entries
  perf dso: Fix address sanitizer build
  perf mem: Warn if memory events are not supported on all CPUs
  perf arm-spe: Support multiple Arm SPE PMUs
  perf build x86: Fix SC2034 error in syscalltbl.sh
  perf record: Fix memset out-of-range error
  perf sched map: Add --fuzzy-name option for fuzzy matching in task names
  perf sched map: Add support for multiple task names using CSV
  perf sched map: Add task-name option to filter the output map
  perf build: Conditionally add feature check flags for libtrace{event,fs}
  perf install: Don't propagate subdir to Documentation submake
  perf vendor events arm64:: Add i.MX95 DDR Performance Monitor metrics
  perf vendor events arm64:: Add i.MX93 DDR Performance Monitor metrics
  perf dsos: When adding a dso into sorted dsos maintain the sort order
  perf comm str: Avoid sort during insert
  perf report: Calling available function for stats printing
  perf intel-pt: Fix exclude_guest setting
  perf intel-pt: Fix aux_watermark calculation for 64-bit size
  perf sched replay: Fix -r/--repeat command line option for infinity
  perf: pmus: Remove unneeded semicolon
  ...
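
The JIT symbol-map item in the message above says that only "/tmp/perf-${PID}.map" paths are accepted. The following is a minimal sketch of such a check, not the code perf uses: the helper names is_perf_jit_map_path() and jit_map_selfcheck() are hypothetical, and the sketch assumes ${PID} means one or more decimal digits.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper (not a perf function): return true only for paths of
 * the form /tmp/perf-<digits>.map, with nothing before or after.
 */
static bool is_perf_jit_map_path(const char *path)
{
	static const char prefix[] = "/tmp/perf-";
	static const char suffix[] = ".map";
	const char *p;

	if (strncmp(path, prefix, sizeof(prefix) - 1) != 0)
		return false;

	p = path + sizeof(prefix) - 1;

	/* Assumption: the PID part is at least one decimal digit. */
	if (!isdigit((unsigned char)*p))
		return false;
	while (isdigit((unsigned char)*p))
		p++;

	/* Whatever follows the digits must be exactly ".map". */
	return strcmp(p, suffix) == 0;
}

/* Small self-check showing which neighbours in /tmp are rejected. */
static void jit_map_selfcheck(void)
{
	assert(is_perf_jit_map_path("/tmp/perf-4242.map"));
	assert(!is_perf_jit_map_path("/tmp/perf-4242.map.old"));
}
```

Matching the whole string, rather than only the "/tmp/perf-" prefix, is what keeps other /tmp/perf-* files from being treated as symbol maps.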
2 parents f669aac + 7a2fb56 commit 68b5973

581 files changed, +81285 -4379 lines changed


tools/lib/api/io.h

Lines changed: 38 additions & 31 deletions
@@ -43,48 +43,55 @@ static inline void io__init(struct io *io, int fd,
 	io->eof = false;
 }
 
-/* Reads one character from the "io" file with similar semantics to fgetc. */
-static inline int io__get_char(struct io *io)
+/* Read from fd filling the buffer. Called when io->data == io->end. */
+static inline int io__fill_buffer(struct io *io)
 {
-	char *ptr = io->data;
+	ssize_t n;
 
 	if (io->eof)
 		return -1;
 
-	if (ptr == io->end) {
-		ssize_t n;
-
-		if (io->timeout_ms != 0) {
-			struct pollfd pfds[] = {
-				{
-					.fd = io->fd,
-					.events = POLLIN,
-				},
-			};
-
-			n = poll(pfds, 1, io->timeout_ms);
-			if (n == 0)
-				errno = ETIMEDOUT;
-			if (n > 0 && !(pfds[0].revents & POLLIN)) {
-				errno = EIO;
-				n = -1;
-			}
-			if (n <= 0) {
-				io->eof = true;
-				return -1;
-			}
+	if (io->timeout_ms != 0) {
+		struct pollfd pfds[] = {
+			{
+				.fd = io->fd,
+				.events = POLLIN,
+			},
+		};
+
+		n = poll(pfds, 1, io->timeout_ms);
+		if (n == 0)
+			errno = ETIMEDOUT;
+		if (n > 0 && !(pfds[0].revents & POLLIN)) {
+			errno = EIO;
+			n = -1;
 		}
-		n = read(io->fd, io->buf, io->buf_len);
-
 		if (n <= 0) {
 			io->eof = true;
 			return -1;
 		}
-		ptr = &io->buf[0];
-		io->end = &io->buf[n];
 	}
-	io->data = ptr + 1;
-	return *ptr;
+	n = read(io->fd, io->buf, io->buf_len);
+
+	if (n <= 0) {
+		io->eof = true;
+		return -1;
+	}
+	io->data = &io->buf[0];
+	io->end = &io->buf[n];
+	return 0;
+}
+
+/* Reads one character from the "io" file with similar semantics to fgetc. */
+static inline int io__get_char(struct io *io)
+{
+	if (io->data == io->end) {
+		int ret = io__fill_buffer(io);
+
+		if (ret)
+			return ret;
+	}
+	return *io->data++;
 }
 
 /* Read a hexadecimal value with no 0x prefix into the out argument hex. If the
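
The hunk above splits the buffer refill out of io__get_char() so the character reader only refills when io->data reaches io->end. Below is a self-contained sketch of that refill-then-consume pattern, with assumptions: struct mini_io and every mini_io_* name are hypothetical stand-ins rather than the definitions in tools/lib/api/io.h, and the poll() timeout path is left out. Unlike the hunk, the sketch casts the returned byte to unsigned char so that bytes of 0x80 and above cannot be mistaken for the -1 end marker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for struct io; only what this sketch needs. */
struct mini_io {
	int fd;
	char buf[4];	/* deliberately tiny so refills happen often */
	char *data;	/* next unread byte */
	char *end;	/* one past the last valid byte */
	bool eof;
};

static void mini_io_init(struct mini_io *io, int fd)
{
	io->fd = fd;
	io->data = io->buf;
	io->end = io->buf;
	io->eof = false;
}

/* Refill the buffer from fd. Returns 0 on success, -1 on EOF or error. */
static int mini_io_fill(struct mini_io *io)
{
	ssize_t n;

	if (io->eof)
		return -1;

	n = read(io->fd, io->buf, sizeof(io->buf));
	if (n <= 0) {
		io->eof = true;
		return -1;
	}
	io->data = io->buf;
	io->end = io->buf + n;
	return 0;
}

/* Return the next byte, refilling only when the buffer is drained. */
static int mini_io_get_char(struct mini_io *io)
{
	if (io->data == io->end) {
		int ret = mini_io_fill(io);

		if (ret)
			return ret;
	}
	return (unsigned char)*io->data++;
}

/* Write msg into a pipe, read it back byte by byte, return the length. */
static size_t mini_io_demo(const char *msg, char *out, size_t out_len)
{
	struct mini_io io;
	size_t len = strlen(msg), i = 0;
	int fds[2], ch;

	assert(out_len > 0);
	out[0] = '\0';
	if (pipe(fds) != 0)
		return 0;
	if (write(fds[1], msg, len) != (ssize_t)len) {
		close(fds[0]);
		close(fds[1]);
		return 0;
	}
	close(fds[1]);	/* the reader sees EOF after the message */

	mini_io_init(&io, fds[0]);
	while (i + 1 < out_len && (ch = mini_io_get_char(&io)) >= 0)
		out[i++] = (char)ch;
	out[i] = '\0';
	close(fds[0]);
	return i;
}
```

With the 4-byte buffer, reading an 11-byte message takes three refills, which is the case the separate fill function isolates.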

tools/lib/perf/include/perf/event.h

Lines changed: 6 additions & 0 deletions
@@ -77,6 +77,12 @@ struct perf_record_lost_samples {
 	__u64 lost;
 };
 
+#define MAX_ID_HDR_ENTRIES 6
+struct perf_record_lost_samples_and_ids {
+	struct perf_record_lost_samples lost;
+	__u64 sample_ids[MAX_ID_HDR_ENTRIES];
+};
+
 /*
  * PERF_FORMAT_ENABLED | PERF_FORMAT_RUNNING | PERF_FORMAT_ID | PERF_FORMAT_LOST
  */

tools/perf/Build

Lines changed: 8 additions & 6 deletions
@@ -1,4 +1,4 @@
-perf-y += builtin-bench.o
+perf-bench-y += builtin-bench.o
 perf-y += builtin-annotate.o
 perf-y += builtin-config.o
 perf-y += builtin-diff.o
@@ -35,8 +35,8 @@ endif
 
 perf-$(CONFIG_LIBELF) += builtin-probe.o
 
-perf-y += bench/
-perf-y += tests/
+perf-bench-y += bench/
+perf-test-y += tests/
 
 perf-y += perf.o
 
@@ -53,10 +53,12 @@ CFLAGS_builtin-trace.o += -DSTRACE_GROUPS_DIR="BUILD_STR($(STRACE_GROUPS_DIR_
 CFLAGS_builtin-report.o += -DTIPDIR="BUILD_STR($(tipdir_SQ))"
 CFLAGS_builtin-report.o += -DDOCDIR="BUILD_STR($(srcdir_SQ)/Documentation)"
 
-perf-y += util/
+perf-util-y += util/
+perf-util-y += arch/
 perf-y += arch/
-perf-y += ui/
-perf-y += scripts/
+perf-test-y += arch/
+perf-ui-y += ui/
+perf-util-y += scripts/
 
 gtk-y += ui/gtk/
 

Lines changed: 189 additions & 0 deletions
@@ -0,0 +1,189 @@
+perf-amd-ibs(1)
+===============
+
+NAME
+----
+perf-amd-ibs - Support for AMD Instruction-Based Sampling (IBS) with perf tool
+
+SYNOPSIS
+--------
+[verse]
+'perf record' -e ibs_op//
+'perf record' -e ibs_fetch//
+
+DESCRIPTION
+-----------
+
+Instruction-Based Sampling (IBS) provides precise Instruction Pointer (IP)
+profiling support on AMD platforms. IBS has two independent components: IBS
+Op and IBS Fetch. IBS Op sampling provides information about instruction
+execution (micro-op execution to be precise) with details like d-cache
+hit/miss, d-TLB hit/miss, cache miss latency, load/store data source, branch
+behavior etc. IBS Fetch sampling provides information about instruction fetch
+with details like i-cache hit/miss, i-TLB hit/miss, fetch latency etc. IBS is
+per-smt-thread i.e. each SMT hardware thread contains standalone IBS units.
+
+Both, IBS Op and IBS Fetch, are exposed as PMUs by Linux and can be exploited
+using the Linux perf utility. The following files will be created at boot time
+if IBS is supported by the hardware and kernel.
+
+	/sys/bus/event_source/devices/ibs_op/
+	/sys/bus/event_source/devices/ibs_fetch/
+
+IBS Op PMU supports two events: cycles and micro ops. IBS Fetch PMU supports
+one event: fetch ops.
+
+IBS PMUs do not have user/kernel filtering capability and thus it requires
+CAP_SYS_ADMIN or CAP_PERFMON privilege.
+
+IBS VS. REGULAR CORE PMU
+------------------------
+
+IBS gives samples with precise IP, i.e. the IP recorded with IBS sample has
+no skid. Whereas the IP recorded by regular core PMU will have some skid
+(sample was generated at IP X but perf would record it at IP X+n). Hence,
+regular core PMU might not help for profiling with instruction level
+precision. Further, IBS provides additional information about the sample in
+question. On the other hand, regular core PMU has it's own advantages like
+plethora of events, counting mode (less interference), up to 6 parallel
+counters, event grouping support, filtering capabilities etc.
+
+Three regular core PMU events are internally forwarded to IBS Op PMU when
+precise_ip attribute is set:
+
+	-e cpu-cycles:p becomes -e ibs_op//
+	-e r076:p becomes -e ibs_op//
+	-e r0C1:p becomes -e ibs_op/cnt_ctl=1/
+
+EXAMPLES
+--------
+
+IBS Op PMU
+~~~~~~~~~~
+
+System-wide profile, cycles event, sampling period: 100000
+
+	# perf record -e ibs_op// -c 100000 -a
+
+Per-cpu profile (cpu10), cycles event, sampling period: 100000
+
+	# perf record -e ibs_op// -c 100000 -C 10
+
+Per-cpu profile (cpu10), cycles event, sampling freq: 1000
+
+	# perf record -e ibs_op// -F 1000 -C 10
+
+System-wide profile, uOps event, sampling period: 100000
+
+	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a
+
+Same command, but also capture IBS register raw dump along with perf sample:
+
+	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a --raw-samples
+
+System-wide profile, uOps event, sampling period: 100000, L3MissOnly (Zen4 onward)
+
+	# perf record -e ibs_op/cnt_ctl=1,l3missonly=1/ -c 100000 -a
+
+Per process(upstream v6.2 onward), uOps event, sampling period: 100000
+
+	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -p 1234
+
+Per process(upstream v6.2 onward), uOps event, sampling period: 100000
+
+	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -- ls
+
+To analyse recorded profile in aggregate mode
+
+	# perf report
+	/* Select a line and press 'a' to drill down at instruction level. */
+
+To go over each sample
+
+	# perf script
+
+Raw dump of IBS registers when profiled with --raw-samples
+
+	# perf report -D
+	/* Look for PERF_RECORD_SAMPLE */
+
+	Example register raw dump:
+
+	ibs_op_ctl: 000002c30006186a MaxCnt 100000 L3MissOnly 0 En 1
+		Val 1 CntCtl 0=cycles CurCnt 707
+	IbsOpRip: ffffffff8204aea7
+	ibs_op_data: 0000010002550001 CompToRetCtr 1 TagToRetCtr 597
+		BrnRet 0 RipInvalid 0 BrnFuse 0 Microcode 1
+	ibs_op_data2: 0000000000000013 RmtNode 1 DataSrc 3=DRAM
+	ibs_op_data3: 0000000031960092 LdOp 0 StOp 1 DcL1TlbMiss 0
+		DcL2TlbMiss 0 DcL1TlbHit2M 1 DcL1TlbHit1G 0 DcL2TlbHit2M 0
+		DcMiss 1 DcMisAcc 0 DcWcMemAcc 0 DcUcMemAcc 0 DcLockedOp 0
+		DcMissNoMabAlloc 0 DcLinAddrValid 1 DcPhyAddrValid 1
+		DcL2TlbHit1G 0 L2Miss 1 SwPf 0 OpMemWidth 32 bytes
+		OpDcMissOpenMemReqs 12 DcMissLat 0 TlbRefillLat 0
+	IbsDCLinAd: ff110008a5398920
+	IbsDCPhysAd: 00000008a5398920
+
+IBS applied in a real world usecase
+
+	~90% regression was observed in tbench with specific scheduler hint
+	which was counter intuitive. IBS profile of good and bad run captured
+	using perf helped in identifying exact cause of the problem:
+
+	https://lore.kernel.org/r/[email protected]
+
+IBS Fetch PMU
+~~~~~~~~~~~~~
+
+Similar commands can be used with Fetch PMU as well.
+
+System-wide profile, fetch ops event, sampling period: 100000
+
+	# perf record -e ibs_fetch// -c 100000 -a
+
+System-wide profile, fetch ops event, sampling period: 100000, Random enable
+
+	# perf record -e ibs_fetch/rand_en=1/ -c 100000 -a
+
+	Random enable adds small degree of variability to sample period. This
+	helps in cases like long running loops where PMU is tagging the same
+	instruction over and over because of fixed sample period.
+
+etc.
+
+PERF MEM AND PERF C2C
+---------------------
+
+perf mem is a memory access profiler tool and perf c2c is a shared data
+cacheline analyser tool. Both of them internally uses IBS Op PMU on AMD.
+Below is a simple example of the perf mem tool.
+
+	# perf mem record -c 100000 -- make
+	# perf mem report
+
+A normal perf mem report output will provide detailed memory access profile.
+However, it can also be aggregated based on output fields. For example:
+
+	# perf mem report -F mem,sample,snoop
+	Samples: 3M of event 'ibs_op//', Event count (approx.): 23524876
+	Memory access                             Samples  Snoop
+	N/A                                       1903343  N/A
+	L1 hit                                    1056754  N/A
+	L2 hit                                      75231  N/A
+	L3 hit                                       9496  HitM
+	L3 hit                                       2270  N/A
+	RAM hit                                      8710  N/A
+	Remote node, same socket RAM hit             3241  N/A
+	Remote core, same node Any cache hit         1572  HitM
+	Remote core, same node Any cache hit          514  N/A
+	Remote node, same socket Any cache hit       1216  HitM
+	Remote node, same socket Any cache hit        350  N/A
+	Uncached hit                                   18  N/A
+
+Please refer to their man page for more detail.
+
+SEE ALSO
+--------
+
+linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
+linkperf:perf-mem[1], linkperf:perf-c2c[1]
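
The example raw dump in the man page above prints ibs_op_ctl as 000002c30006186a together with its decoded fields. The sketch below shows how such a value could be unpacked. The bit positions are an assumption taken from the usual description of the IBS Op Control register (MaxCnt[19:4] in bits 15:0, L3MissOnly bit 16, En bit 17, Val bit 18, CntCtl bit 19, MaxCnt[26:20] in bits 26:20, CurCnt in bits 58:32), and the struct and function names are hypothetical, not perf's.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the fields printed in the raw dump. */
struct ibs_op_ctl_fields {
	uint32_t max_cnt;	/* sampling period, in cycles or uops */
	unsigned int l3_miss_only;
	unsigned int en;
	unsigned int val;
	unsigned int cnt_ctl;	/* 0 = cycles, 1 = uops */
	uint32_t cur_cnt;
};

/* Unpack an ibs_op_ctl value using the assumed layout described above. */
static struct ibs_op_ctl_fields decode_ibs_op_ctl(uint64_t v)
{
	struct ibs_op_ctl_fields f;
	uint32_t max_lo = (uint32_t)(v & 0xffff);	  /* MaxCnt[19:4]  */
	uint32_t max_ext = (uint32_t)((v >> 20) & 0x7f); /* MaxCnt[26:20] */

	/* The low 4 bits of MaxCnt are not stored and read as zero. */
	f.max_cnt = (max_ext << 20) | (max_lo << 4);
	f.l3_miss_only = (unsigned int)((v >> 16) & 1);
	f.en = (unsigned int)((v >> 17) & 1);
	f.val = (unsigned int)((v >> 18) & 1);
	f.cnt_ctl = (unsigned int)((v >> 19) & 1);
	f.cur_cnt = (uint32_t)((v >> 32) & 0x7ffffff);	  /* 27 bits */
	return f;
}

/* Check the layout against the register value shown in the man page. */
static void ibs_op_ctl_selfcheck(void)
{
	struct ibs_op_ctl_fields f = decode_ibs_op_ctl(0x000002c30006186aULL);

	assert(f.max_cnt == 100000);
	assert(f.cur_cnt == 707);
}
```

Decoding the documented value this way gives MaxCnt 100000, L3MissOnly 0, En 1, Val 1, CntCtl 0 and CurCnt 707, which agrees with the dump, so the assumed layout is at least consistent with that one sample.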

tools/perf/Documentation/perf-kwork.txt

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-perf-kowrk(1)
+perf-kwork(1)
 =============
 
 NAME
@@ -35,7 +35,7 @@ There are several variants of 'perf kwork':
     perf kwork top
     perf kwork top -b
 
-By default it shows the individual work events such as irq, workqeueu,
+By default it shows the individual work events such as irq, workqueue,
 including the run time and delay (time between raise and actually entry):
 
     Runtime start Runtime end Cpu Kwork name Runtime Delaytime

tools/perf/Documentation/perf-lock.txt

Lines changed: 2 additions & 2 deletions
@@ -111,11 +111,11 @@ INFO OPTIONS
 
 -t::
 --threads::
-	dump thread list in perf.data
+	dump only the thread list in perf.data
 
 -m::
 --map::
-	dump map of lock instances (address:name table)
+	dump only the map of lock instances (address:name table)
 
 
 CONTENTION OPTIONS

tools/perf/Documentation/perf-mem.txt

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ and stores are sampled. Use the -t option to limit to loads or stores.
 
 Note that on Intel systems the memory latency reported is the use-latency,
 not the pure load (or store latency). Use latency includes any pipeline
-queueing delays in addition to the memory subsystem latency.
+queuing delays in addition to the memory subsystem latency.
 
 On Arm64 this uses SPE to sample load and store operations, therefore hardware
 and kernel support is required. See linkperf:perf-arm-spe[1] for a setup guide.

tools/perf/Documentation/perf-record.txt

Lines changed: 2 additions & 2 deletions
@@ -200,7 +200,7 @@ OPTIONS
 	  ip, id, tid, pid, cpu, time, addr, period, txn, weight, phys_addr,
 	  code_pgsz, data_pgsz, weight1, weight2, weight3, ins_lat, retire_lat,
 	  p_stage_cyc, mem_op, mem_lvl, mem_snoop, mem_remote, mem_lock,
-	  mem_dtlb, mem_blk, mem_hops
+	  mem_dtlb, mem_blk, mem_hops, uid, gid
 
 	The <operator> can be one of:
 	  ==, !=, >, >=, <, <=, &
@@ -311,7 +311,7 @@ OPTIONS
 	User can change the size by passing the size after comma like
 	"--call-graph dwarf,4096".
 
-	When "fp" recording is used, perf tries to save stack enties
+	When "fp" recording is used, perf tries to save stack entries
 	up to the number specified in sysctl.kernel.perf_event_max_stack
 	by default. User can change the number by passing it after comma
 	like "--call-graph fp,32".
