Commit d6f38c1

Merge tag 'trace-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:

 - Deprecate auto-mounting tracefs to /sys/kernel/debug/tracing

   When tracefs was first introduced back in 2014, the directory
   /sys/kernel/tracing was added and is the designated location to
   mount tracefs. To keep backward compatibility, tracefs was
   auto-mounted in /sys/kernel/debug/tracing as well.

   All distros now mount tracefs on /sys/kernel/tracing. Having it seen
   in two different locations has led to various issues and
   inconsistencies. The VFS folks also have to maintain
   debugfs_create_automount() for this single user.

   It's been over 10 years. Tooling and scripts should start replacing
   the debugfs location with the tracefs one. The reason tracefs was
   created in the first place was to allow access to the tracing
   facilities without the need to configure debugfs into the kernel.
   Using tracefs should now be more robust.

   A new config is created: CONFIG_TRACEFS_AUTOMOUNT_DEPRECATED which is
   default y, so that the kernel is still built with the automount. This
   config allows those that want to remove the automount from debugfs to
   do so.

   When tracefs is accessed from /sys/kernel/debug/tracing, the
   following printk is triggered:

     pr_warn("NOTICE: Automounting of tracing to debugfs is deprecated and will be removed in 2030\n");

   This gives users another 5 years to fix their scripts.

 - Use queue_rcu_work() instead of call_rcu() for freeing event filters

   The number of filters to be freed can be large, depending on the
   number of events within an event system. Freeing them from softirq
   context can potentially cause undesired latency. Use the RCU
   workqueue to free them instead.

 - Remove pointless memory barriers in latency code

   Memory barriers were added to some of the latency code a long time
   ago with the idea of "making them visible", but that's not what
   memory barriers are for. They are to synchronize access between
   different variables. There was no synchronization here, making them
   pointless.

 - Remove "__attribute__()" from the type field of event format

   When LLVM is used to compile the kernel with CONFIG_DEBUG_INFO_BTF=y
   and PAHOLE_HAS_BTF_TAG=y, some of the format fields get expanded.
   The following:

     field:const char * filename; offset:24; size:8; signed:0;

   Turns into:

     field:const char __attribute__((btf_type_tag("user"))) * filename; offset:24; size:8; signed:0;

   This confuses parsers. Add code to strip these tags from the strings.

 - Add eprobe config option CONFIG_EPROBE_EVENTS

   Eprobes were added back in 5.15 but were only enabled when another
   probe was enabled (kprobe, fprobe, uprobe, etc). The eprobes had no
   config option of their own. Add one as they should be a separate
   entity.

   It's default y to keep with the old kernels but still has
   dependencies on TRACING and HAVE_REGS_AND_STACK_ACCESS_API.

 - Add eprobe documentation

   When eprobes were added back in 5.15, no documentation was added to
   describe them. This needs to be rectified.

 - Replace open coded cpumask_next_wrap() in move_to_next_cpu()

 - Have preemptirq_delay_run() use off-stack CPU mask

 - Remove obsolete comment about pelt_cfs event

   DECLARE_TRACE() appends "_tp" to trace events now, but the comment
   above pelt_cfs still mentioned appending it manually.

 - Remove EVENT_FILE_FL_SOFT_MODE flag

   The SOFT_MODE flag was required when the soft enabling and disabling
   of trace events was first introduced. But there was a bug with this
   approach as it only worked for a single instance. When multiple
   users required soft enabling and disabling, the code was changed to
   have a ref count. The SOFT_MODE flag is now set iff the ref count is
   non zero. This is redundant and just reading the ref count is good
   enough.

 - Fix typo in comment

* tag 'trace-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  Documentation: tracing: Add documentation about eprobes
  tracing: Have eprobes have their own config option
  tracing: Remove "__attribute__()" from the type field of event format
  tracing: Deprecate auto-mounting tracefs in debugfs
  tracing: Fix comment in trace_module_remove_events()
  tracing: Remove EVENT_FILE_FL_SOFT_MODE flag
  tracing: Remove pointless memory barriers
  tracing/sched: Remove obsolete comment on suffixes
  kernel: trace: preemptirq_delay_test: use offstack cpu mask
  tracing: Use queue_rcu_work() to free filters
  tracing: Replace opencoded cpumask_next_wrap() in move_to_next_cpu()
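
The merge message says that `__attribute__((...))` tags in the type field of an event format confuse parsers. For user-space tooling that may still read format files from kernels without this fix, the same clean-up could be sketched in Python as below. This is an illustrative sketch and not the kernel's C implementation; the function name `strip_attributes` is hypothetical, and it assumes that no parenthesis appears inside a quoted string within the attribute.

```python
def strip_attributes(text: str) -> str:
    """Remove every __attribute__((...)) group and collapse leftover spaces.

    Assumption: parentheses inside the attribute are balanced and none are
    hidden inside a quoted string.
    """
    marker = "__attribute__"
    kept = []
    i = 0
    while i < len(text):
        if text.startswith(marker, i):
            j = i + len(marker)
            # Allow optional whitespace between the keyword and '('.
            while j < len(text) and text[j].isspace():
                j += 1
            if j < len(text) and text[j] == "(":
                depth = 0
                # Walk forward until the opening parenthesis is closed.
                while j < len(text):
                    if text[j] == "(":
                        depth += 1
                    elif text[j] == ")":
                        depth -= 1
                        if depth == 0:
                            j += 1
                            break
                    j += 1
                i = j
                continue
        kept.append(text[i])
        i += 1
    # Collapse the double space left where the attribute used to be.
    return " ".join("".join(kept).split())


if __name__ == "__main__":
    line = ('field:const char __attribute__((btf_type_tag("user"))) * filename; '
            'offset:24; size:8; signed:0;')
    print(strip_attributes(line))
```

Applied to the expanded field line quoted in the merge message, the sketch yields the plain `const char * filename` form again.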
2 parents 5172a77 + 623526b commit d6f38c1

14 files changed, 498 insertions(+), 85 deletions(-)
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+What:		/sys/kernel/debug/tracing
+Date:		May 2008
+KernelVersion:	2.6.27
+
+Description:
+
+	When ftrace was first added to the kernel, its interface was placed
+	into the debugfs file system under the "tracing" directory. Access
+	to the files was in /sys/kernel/debug/tracing. As systems wanted
+	access to the tracing interface without having to enable debugfs, a
+	new interface was created called "tracefs". This was a stand alone
+	file system and was usually mounted in /sys/kernel/tracing.
+
+	To allow older tooling to continue to operate, when mounting
+	debugfs, the tracefs file system would automatically get mounted in
+	the "tracing" directory of debugfs. The tracefs interface was added
+	in January 2015 in the v4.1 kernel.
+
+	All tooling should now be using tracefs directly and the "tracing"
+	directory in debugfs should be removed by January 2030.
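
Tooling that still hard-codes the debugfs location can be migrated by preferring the tracefs mount point and only falling back to the old path on kernels where nothing else is available. The Python sketch below shows one way to do that. The helper name `find_tracing_dir` is hypothetical, and probing for the `trace` file to decide whether tracefs is present is an assumption of this sketch, not something the ABI entry prescribes.

```python
import os

TRACEFS_DIR = "/sys/kernel/tracing"
DEBUGFS_TRACING_DIR = "/sys/kernel/debug/tracing"  # deprecated automount


def _has_trace_file(directory):
    # Assumption: a mounted tracefs exposes a "trace" file at its top level.
    return os.path.exists(os.path.join(directory, "trace"))


def find_tracing_dir(has_tracefs=_has_trace_file):
    """Return the directory to use for tracing, or None if none is usable.

    The tracefs location is always tried first so that the deprecated
    debugfs path is only touched when there is no alternative.
    """
    for candidate in (TRACEFS_DIR, DEBUGFS_TRACING_DIR):
        if has_tracefs(candidate):
            return candidate
    return None


if __name__ == "__main__":
    print(find_tracing_dir())
```

The predicate is injectable so that the selection order can be exercised without root access or a mounted tracefs.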

Documentation/trace/eprobetrace.rst

Lines changed: 269 additions & 0 deletions
@@ -0,0 +1,269 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+Eprobe - Event-based Probe Tracing
+==================================
+
+:Author: Steven Rostedt <[email protected]>
+
+- Written for v6.17
+
+Overview
+========
+
+Eprobes are dynamic events that are placed on existing events to either
+dereference a field that is a pointer, or simply to limit what fields are
+recorded in the trace event.
+
+Eprobes have their own config option. To enable this feature, build your
+kernel with CONFIG_EPROBE_EVENTS=y.
+
+Eprobes are created via the /sys/kernel/tracing/dynamic_events file.
+
+Synopsis of eprobe_events
+-------------------------
+::
+
+  e[:[EGRP/][EEVENT]] GRP.EVENT [FETCHARGS]  : Set a probe
+  -:[EGRP/][EEVENT]                          : Clear a probe
+
+  EGRP           : Group name of the new event. If omitted, use "eprobes" for it.
+  EEVENT         : Event name. If omitted, the event name is generated and will
+                   be the same event name as the event it is attached to.
+  GRP            : Group name of the event to attach to.
+  EVENT          : Event name of the event to attach to.
+
+  FETCHARGS      : Arguments. Each probe can have up to 128 args.
+   $FIELD        : Fetch the value of the event field called FIELD.
+   @ADDR         : Fetch memory at ADDR (ADDR should be in kernel)
+   @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
+   $comm         : Fetch current task comm.
+   +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
+   \IMM          : Store an immediate value to the argument.
+   NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
+   FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
+                   (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
+                   (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
+                   "string", "ustring", "symbol", "symstr" and "bitfield" are
+                   supported.
+
+Types
+-----
+The FETCHARGS above is very similar to the kprobe events as described in
+Documentation/trace/kprobetrace.rst.
+
+The difference between eprobes and kprobes FETCHARGS is that eprobes have a
+$FIELD command that returns the content of the event field of the event
+that they are attached to. Eprobes do not have access to the registers,
+stacks and function arguments that kprobes have.
+
+If a field argument is a pointer, it may be dereferenced just like a memory
+address using the FETCHARGS syntax.
+
+
+Attaching to dynamic events
+---------------------------
+
+Eprobes may attach to dynamic events as well as to normal events. They may
+attach to a kprobe event, a synthetic event or an fprobe event. This is useful
+if the type of a field needs to be changed. See Example 2 below.
+
+Usage examples
+==============
+
+Example 1
+---------
+
+The basic usage of eprobes is to limit the data that is being recorded into
+the tracing buffer. For example, a common event to trace is the sched_switch
+trace event. That has a format of::
+
+  field:unsigned short common_type; offset:0; size:2; signed:0;
+  field:unsigned char common_flags; offset:2; size:1; signed:0;
+  field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
+  field:int common_pid; offset:4; size:4; signed:1;
+
+  field:char prev_comm[16]; offset:8; size:16; signed:0;
+  field:pid_t prev_pid; offset:24; size:4; signed:1;
+  field:int prev_prio; offset:28; size:4; signed:1;
+  field:long prev_state; offset:32; size:8; signed:1;
+  field:char next_comm[16]; offset:40; size:16; signed:0;
+  field:pid_t next_pid; offset:56; size:4; signed:1;
+  field:int next_prio; offset:60; size:4; signed:1;
+
+The first four fields are common to all events and cannot be limited. But the
+rest of the event has 56 bytes of information. It records the names of the
+previous and next tasks being scheduled out and in, as well as their pids and
+priorities. It also records the state of the previous task. If only the pids
+of the tasks are of interest, why waste the ring buffer with all the other
+fields?
+
+An eprobe can limit what gets recorded. Note, it does not help in performance,
+as all the fields are recorded in a temporary buffer to process the eprobe.
+::
+
+  # echo 'e:sched/switch sched.sched_switch prev=$prev_pid:u32 next=$next_pid:u32' >> /sys/kernel/tracing/dynamic_events
+  # echo 1 > /sys/kernel/tracing/events/sched/switch/enable
+  # cat /sys/kernel/tracing/trace
+
+  # tracer: nop
+  #
+  # entries-in-buffer/entries-written: 2721/2721   #P:8
+  #
+  #                                _-----=> irqs-off/BH-disabled
+  #                               / _----=> need-resched
+  #                              | / _---=> hardirq/softirq
+  #                              || / _--=> preempt-depth
+  #                              ||| / _-=> migrate-disable
+  #                              |||| /     delay
+  #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
+  #              | |         |   |||||     |         |
+      sshd-session-1082    [004] d..4.  5041.239906: switch: (sched.sched_switch) prev=1082 next=0
+              bash-1085    [001] d..4.  5041.240198: switch: (sched.sched_switch) prev=1085 next=141
+     kworker/u34:5-141     [001] d..4.  5041.240259: switch: (sched.sched_switch) prev=141 next=1085
+            <idle>-0       [004] d..4.  5041.240354: switch: (sched.sched_switch) prev=0 next=1082
+              bash-1085    [001] d..4.  5041.240385: switch: (sched.sched_switch) prev=1085 next=141
+     kworker/u34:5-141     [001] d..4.  5041.240410: switch: (sched.sched_switch) prev=141 next=1085
+              bash-1085    [001] d..4.  5041.240478: switch: (sched.sched_switch) prev=1085 next=0
+      sshd-session-1082    [004] d..4.  5041.240526: switch: (sched.sched_switch) prev=1082 next=0
+            <idle>-0       [001] d..4.  5041.247524: switch: (sched.sched_switch) prev=0 next=90
+            <idle>-0       [002] d..4.  5041.247545: switch: (sched.sched_switch) prev=0 next=16
+       kworker/1:1-90      [001] d..4.  5041.247580: switch: (sched.sched_switch) prev=90 next=0
+         rcu_sched-16      [002] d..4.  5041.247591: switch: (sched.sched_switch) prev=16 next=0
+            <idle>-0       [002] d..4.  5041.257536: switch: (sched.sched_switch) prev=0 next=16
+         rcu_sched-16      [002] d..4.  5041.257573: switch: (sched.sched_switch) prev=16 next=0
+
+Note, without adding the "u32" after the prev_pid and next_pid, the values
+would default to showing in hexadecimal.
+
+Example 2
+---------
+
+If a specific system call is to be recorded but the syscalls events are not
+enabled, the raw_syscalls can still be used (system call
+events are not normal events, but are created from the raw_syscalls events
+within the kernel). In order to trace the openat system call, one can create
+an event probe on top of the raw_syscalls event:
+::
+
+  # cd /sys/kernel/tracing
+  # cat events/raw_syscalls/sys_enter/format
+  name: sys_enter
+  ID: 395
+  format:
+  	field:unsigned short common_type; offset:0; size:2; signed:0;
+  	field:unsigned char common_flags; offset:2; size:1; signed:0;
+  	field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
+  	field:int common_pid; offset:4; size:4; signed:1;
+
+  	field:long id; offset:8; size:8; signed:1;
+  	field:unsigned long args[6]; offset:16; size:48; signed:0;
+
+  print fmt: "NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)", REC->id, REC->args[0], REC->args[1], REC->args[2], REC->args[3], REC->args[4], REC->args[5]
+
+From the source code, the sys_openat() has:
+::
+
+  int sys_openat(int dirfd, const char *path, int flags, mode_t mode)
+  {
+  	return my_syscall4(__NR_openat, dirfd, path, flags, mode);
+  }
+
+The path is the second parameter, and that is what is wanted.
+::
+
+  # echo 'e:openat raw_syscalls.sys_enter nr=$id filename=+8($args):ustring' >> dynamic_events
+
+This is being run on x86_64 where the word size is 8 bytes and the openat
+system call __NR_openat is set at 257.
+::
+
+  # echo 'nr == 257' > events/eprobes/openat/filter
+
+Now enable the event and look at the trace.
+::
+
+  # echo 1 > events/eprobes/openat/enable
+  # cat trace
+
+  # tracer: nop
+  #
+  # entries-in-buffer/entries-written: 4/4   #P:8
+  #
+  #                                _-----=> irqs-off/BH-disabled
+  #                               / _----=> need-resched
+  #                              | / _---=> hardirq/softirq
+  #                              || / _--=> preempt-depth
+  #                              ||| / _-=> migrate-disable
+  #                              |||| /     delay
+  #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
+  #              | |         |   |||||     |         |
+               cat-1298    [003] ...2.  2060.875970: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+               cat-1298    [003] ...2.  2060.876197: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+               cat-1298    [003] ...2.  2060.879126: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+               cat-1298    [003] ...2.  2060.879639: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+
+The filename shows "(fault)". This is likely because the filename has not been
+pulled into memory yet and currently trace events cannot fault in memory that
+is not present. When an eprobe tries to read memory that has not been faulted
+in yet, it will show the "(fault)" text.
+
+To get around this, as the kernel will likely pull in this filename and make
+it present, attach a synthetic event that passes the address of the
+filename from the entry of the system call to its exit. This can be used
+to show the filename when the system call returns.
+
+Remove the old eprobe::
+
+  # echo 0 > events/eprobes/openat/enable
+  # echo '-:openat' >> dynamic_events
+
+This time make an eprobe where the address of the filename is saved::
+
+  # echo 'e:openat_start raw_syscalls.sys_enter nr=$id filename=+8($args):x64' >> dynamic_events
+
+Create a synthetic event that passes the address of the filename to the
+end of the event::
+
+  # echo 's:filename u64 file' >> dynamic_events
+  # echo 'hist:keys=common_pid:f=filename if nr == 257' > events/eprobes/openat_start/trigger
+  # echo 'hist:keys=common_pid:file=$f:onmatch(eprobes.openat_start).trace(filename,$file) if id == 257' > events/raw_syscalls/sys_exit/trigger
+
+Now that the address of the filename has been passed to the end of the
+system call, create another eprobe to attach to the exit event to show the
+string::
+
+  # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events
+  # echo 1 > events/eprobes/openat/enable
+  # cat trace
+
+  # tracer: nop
+  #
+  # entries-in-buffer/entries-written: 4/4   #P:8
+  #
+  #                                _-----=> irqs-off/BH-disabled
+  #                               / _----=> need-resched
+  #                              | / _---=> hardirq/softirq
+  #                              || / _--=> preempt-depth
+  #                              ||| / _-=> migrate-disable
+  #                              |||| /     delay
+  #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
+  #              | |         |   |||||     |         |
+               cat-1331    [001] ...5.  2944.787977: openat: (synthetic.filename) filename="/etc/ld.so.cache"
+               cat-1331    [001] ...5.  2944.788480: openat: (synthetic.filename) filename="/lib/x86_64-linux-gnu/libc.so.6"
+               cat-1331    [001] ...5.  2944.793426: openat: (synthetic.filename) filename="/usr/lib/locale/locale-archive"
+               cat-1331    [001] ...5.  2944.831362: openat: (synthetic.filename) filename="trace"
+
+Example 3
+---------
+
+If syscall trace events are available, the above would not need the first
+eprobe, but it would still need the last one::
+
+  # echo 's:filename u64 file' >> dynamic_events
+  # echo 'hist:keys=common_pid:f=filename' > events/syscalls/sys_enter_openat/trigger
+  # echo 'hist:keys=common_pid:file=$f:onmatch(syscalls.sys_enter_openat).trace(filename,$file)' > events/syscalls/sys_exit_openat/trigger
+  # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events
+  # echo 1 > events/eprobes/openat/enable
+
+And this would produce the same result as Example 2.
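
The synopsis in the new document defines the text written to dynamic_events as `e[:[EGRP/][EEVENT]] GRP.EVENT [FETCHARGS]`. As a rough illustration of how a script might assemble such lines, here is a small Python sketch. The helper names (`eprobe_def`, `eprobe_remove`, `args_fetcharg`) are hypothetical and are not part of any kernel or library API. The sketch only builds strings, it does not write to tracefs, and the 8-byte word size default reflects the x86_64 assumption stated in Example 2.

```python
def eprobe_def(group, event, fetchargs=(), egroup=None, eevent=None):
    """Build a 'set a probe' line: e[:[EGRP/][EEVENT]] GRP.EVENT [FETCHARGS]."""
    head = "e"
    if egroup or eevent:
        # Both parts of the new event's name are optional in the synopsis.
        head += ":" + (f"{egroup}/" if egroup else "") + (eevent or "")
    return " ".join([head, f"{group}.{event}", *fetchargs])


def eprobe_remove(eevent, egroup=None):
    """Build a 'clear a probe' line: -:[EGRP/][EEVENT]."""
    return "-:" + (f"{egroup}/" if egroup else "") + eevent


def args_fetcharg(index, fetch_type, word_size=8):
    """FETCHARG for element `index` of the args field, as used in Example 2.

    word_size=8 is an assumption that matches x86_64.
    """
    return f"+{index * word_size}($args):{fetch_type}"


if __name__ == "__main__":
    # Example 1 of the document.
    print(eprobe_def("sched", "sched_switch",
                     ["prev=$prev_pid:u32", "next=$next_pid:u32"],
                     egroup="sched", eevent="switch"))
    # First eprobe of Example 2: the path is the second syscall argument.
    print(eprobe_def("raw_syscalls", "sys_enter",
                     ["nr=$id", "filename=" + args_fetcharg(1, "ustring")],
                     eevent="openat"))
    print(eprobe_remove("openat"))
```

Keeping the offset computation in one helper makes the dependence on the word size explicit, which matters if the same script is run on a 32-bit system.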

Documentation/trace/index.rst

Lines changed: 1 addition & 0 deletions
@@ -36,6 +36,7 @@ the Linux kernel.
    kprobes
    kprobetrace
    fprobetrace
+   eprobetrace
    fprobe
    ring-buffer-design
 

include/linux/trace_events.h

Lines changed: 0 additions & 3 deletions
@@ -480,7 +480,6 @@ enum {
 	EVENT_FILE_FL_RECORDED_TGID_BIT,
 	EVENT_FILE_FL_FILTERED_BIT,
 	EVENT_FILE_FL_NO_SET_FILTER_BIT,
-	EVENT_FILE_FL_SOFT_MODE_BIT,
 	EVENT_FILE_FL_SOFT_DISABLED_BIT,
 	EVENT_FILE_FL_TRIGGER_MODE_BIT,
 	EVENT_FILE_FL_TRIGGER_COND_BIT,
@@ -618,7 +617,6 @@ extern int __kprobe_event_add_fields(struct dynevent_cmd *cmd, ...);
  * RECORDED_TGID - The tgids should be recorded at sched_switch
  * FILTERED - The event has a filter attached
  * NO_SET_FILTER - Set when filter has error and is to be ignored
- * SOFT_MODE - The event is enabled/disabled by SOFT_DISABLED
  * SOFT_DISABLED - When set, do not trace the event (even though its
  *                 tracepoint may be enabled)
  * TRIGGER_MODE - When set, invoke the triggers associated with the event
@@ -633,7 +631,6 @@ enum {
 	EVENT_FILE_FL_RECORDED_TGID = (1 << EVENT_FILE_FL_RECORDED_TGID_BIT),
 	EVENT_FILE_FL_FILTERED = (1 << EVENT_FILE_FL_FILTERED_BIT),
 	EVENT_FILE_FL_NO_SET_FILTER = (1 << EVENT_FILE_FL_NO_SET_FILTER_BIT),
-	EVENT_FILE_FL_SOFT_MODE = (1 << EVENT_FILE_FL_SOFT_MODE_BIT),
 	EVENT_FILE_FL_SOFT_DISABLED = (1 << EVENT_FILE_FL_SOFT_DISABLED_BIT),
 	EVENT_FILE_FL_TRIGGER_MODE = (1 << EVENT_FILE_FL_TRIGGER_MODE_BIT),
 	EVENT_FILE_FL_TRIGGER_COND = (1 << EVENT_FILE_FL_TRIGGER_COND_BIT),

include/trace/events/sched.h

Lines changed: 0 additions & 2 deletions
@@ -829,8 +829,6 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
 /*
  * Following tracepoints are not exported in tracefs and provide hooking
  * mechanisms only for testing and debugging purposes.
- *
- * Postfixed with _tp to make them easily identifiable in the code.
  */
 DECLARE_TRACE(pelt_cfs,
 	TP_PROTO(struct cfs_rq *cfs_rq),

kernel/trace/Kconfig

Lines changed: 27 additions & 0 deletions
@@ -200,6 +200,19 @@ menuconfig FTRACE
 
 if FTRACE
 
+config TRACEFS_AUTOMOUNT_DEPRECATED
+	bool "Automount tracefs on debugfs [DEPRECATED]"
+	depends on TRACING
+	default y
+	help
+	  The tracing interface was moved from /sys/kernel/debug/tracing
+	  to /sys/kernel/tracing in 2015, but the tracing file system
+	  was still automounted in /sys/kernel/debug for backward
+	  compatibility with tooling.
+
+	  The new interface has been around for more than 10 years and
+	  the old debug mount will soon be removed.
+
 config BOOTTIME_TRACING
 	bool "Boot-time Tracing support"
 	depends on TRACING
@@ -780,6 +793,20 @@ config UPROBE_EVENTS
 	  This option is required if you plan to use perf-probe subcommand
 	  of perf tools on user space applications.
 
+config EPROBE_EVENTS
+	bool "Enable event-based dynamic events"
+	depends on TRACING
+	depends on HAVE_REGS_AND_STACK_ACCESS_API
+	select PROBE_EVENTS
+	select DYNAMIC_EVENTS
+	default y
+	help
+	  Eprobes are dynamic events that can be placed on other existing
+	  events. They can be used to limit what fields are recorded in
+	  an event or even dereference a field of an event. They can
+	  convert the type of an event field. For example, turn an
+	  address into a string.
+
 config BPF_EVENTS
 	depends on BPF_SYSCALL
 	depends on (KPROBE_EVENTS || UPROBE_EVENTS) && PERF_EVENTS
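
Because both new options default to y but can be turned off, a test script may want to check how a given kernel was configured before relying on eprobes or on the debugfs automount. The Python sketch below parses `.config`-style text for that purpose. The helper names `parse_config` and `is_enabled` are hypothetical, the sample text is made up for illustration, and where the configuration text comes from (for example a `config-*` file under /boot) depends on the distribution and is left to the caller.

```python
def parse_config(text):
    """Map CONFIG_* option names to their values from .config-style text.

    Options written as '# CONFIG_FOO is not set' are recorded as "n".
    """
    suffix = " is not set"
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def is_enabled(options, name):
    """True if the option is built in (y) or built as a module (m)."""
    return options.get(name) in ("y", "m")


if __name__ == "__main__":
    # Made-up sample input, not taken from a real kernel build.
    sample = (
        "CONFIG_TRACING=y\n"
        "CONFIG_EPROBE_EVENTS=y\n"
        "# CONFIG_TRACEFS_AUTOMOUNT_DEPRECATED is not set\n"
    )
    opts = parse_config(sample)
    print("eprobes available:", is_enabled(opts, "CONFIG_EPROBE_EVENTS"))
    print("debugfs automount:", is_enabled(opts, "CONFIG_TRACEFS_AUTOMOUNT_DEPRECATED"))
```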

kernel/trace/Makefile

Lines changed: 1 addition & 1 deletion
@@ -82,7 +82,7 @@ obj-$(CONFIG_EVENT_TRACING) += trace_event_perf.o
 endif
 obj-$(CONFIG_EVENT_TRACING) += trace_events_filter.o
 obj-$(CONFIG_EVENT_TRACING) += trace_events_trigger.o
-obj-$(CONFIG_PROBE_EVENTS) += trace_eprobe.o
+obj-$(CONFIG_EPROBE_EVENTS) += trace_eprobe.o
 obj-$(CONFIG_TRACE_EVENT_INJECT) += trace_events_inject.o
 obj-$(CONFIG_SYNTH_EVENTS) += trace_events_synth.o
 obj-$(CONFIG_HIST_TRIGGERS) += trace_events_hist.o
