Skip to content

Commit 590e7b2

Browse files
committed
Documentation: tracing/probes: Add fprobe event tracing document
Add a documentation about fprobe event tracing including tracepoint probe event and BTF argument. Link: https://lore.kernel.org/all/168507479345.913472.2804569685436422001.stgit@mhiramat.roam.corp.google.com/ Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Reviewed-by: Bagas Sanjaya <[email protected]>
1 parent 4231f30 commit 590e7b2

File tree

3 files changed

+191
-0
lines changed

3 files changed

+191
-0
lines changed

Documentation/trace/fprobetrace.rst

Lines changed: 188 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,188 @@
1+
.. SPDX-License-Identifier: GPL-2.0
2+
3+
==========================
4+
Fprobe-based Event Tracing
5+
==========================
6+
7+
.. Author: Masami Hiramatsu <[email protected]>
8+
9+
Overview
10+
--------
11+
12+
Fprobe event is similar to the kprobe event, but limited to probe on
13+
the function entry and exit only. It is good enough for many use cases
14+
which only traces some specific functions.
15+
16+
This document also covers tracepoint probe events (tprobe) since this
17+
is also works only on the tracepoint entry. User can trace a part of
18+
tracepoint argument, or the tracepoint without trace-event, which is
19+
not exposed on tracefs.
20+
21+
As same as other dynamic events, fprobe events and tracepoint probe
22+
events are defined via `dynamic_events` interface file on tracefs.
23+
24+
Synopsis of fprobe-events
25+
-------------------------
26+
::
27+
28+
f[:[GRP1/][EVENT1]] SYM [FETCHARGS] : Probe on function entry
29+
f[MAXACTIVE][:[GRP1/][EVENT1]] SYM%return [FETCHARGS] : Probe on function exit
30+
t[:[GRP2/][EVENT2]] TRACEPOINT [FETCHARGS] : Probe on tracepoint
31+
32+
GRP1 : Group name for fprobe. If omitted, use "fprobes" for it.
33+
GRP2 : Group name for tprobe. If omitted, use "tracepoints" for it.
34+
EVENT1 : Event name for fprobe. If omitted, the event name is
35+
"SYM__entry" or "SYM__exit".
36+
EVENT2 : Event name for tprobe. If omitted, the event name is
37+
the same as "TRACEPOINT", but if the "TRACEPOINT" starts
38+
with a digit character, "_TRACEPOINT" is used.
39+
MAXACTIVE : Maximum number of instances of the specified function that
40+
can be probed simultaneously, or 0 for the default value
41+
as defined in Documentation/trace/fprobes.rst
42+
43+
FETCHARGS : Arguments. Each probe can have up to 128 args.
44+
ARG : Fetch "ARG" function argument using BTF (only for function
45+
entry or tracepoint.) (\*1)
46+
@ADDR : Fetch memory at ADDR (ADDR should be in kernel)
47+
@SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
48+
$stackN : Fetch Nth entry of stack (N >= 0)
49+
$stack : Fetch stack address.
50+
$argN : Fetch the Nth function argument. (N >= 1) (\*2)
51+
$retval : Fetch return value.(\*3)
52+
$comm : Fetch current task comm.
53+
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*4)(\*5)
54+
\IMM : Store an immediate value to the argument.
55+
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
56+
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
57+
(u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
58+
(x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr"
59+
and bitfield are supported.
60+
61+
(\*1) This is available only when BTF is enabled.
62+
(\*2) only for the probe on function entry (offs == 0).
63+
(\*3) only for return probe.
64+
(\*4) this is useful for fetching a field of data structures.
65+
(\*5) "u" means user-space dereference.
66+
67+
For the details of TYPE, see :ref:`kprobetrace documentation <kprobetrace_types>`.
68+
69+
BTF arguments
70+
-------------
71+
BTF (BPF Type Format) argument allows user to trace function and tracepoint
72+
parameters by its name instead of ``$argN``. This feature is available if the
73+
kernel is configured with CONFIG_BPF_SYSCALL and CONFIG_DEBUG_INFO_BTF.
74+
If user only specify the BTF argument, the event's argument name is also
75+
automatically set by the given name. ::
76+
77+
# echo 'f:myprobe vfs_read count pos' >> dynamic_events
78+
# cat dynamic_events
79+
f:fprobes/myprobe vfs_read count=count pos=pos
80+
81+
It also chooses the fetch type from BTF information. For example, in the above
82+
example, the ``count`` is unsigned long, and the ``pos`` is a pointer. Thus, both
83+
are converted to 64bit unsigned long, but only ``pos`` has "%Lx" print-format as
84+
below ::
85+
86+
# cat events/fprobes/myprobe/format
87+
name: myprobe
88+
ID: 1313
89+
format:
90+
field:unsigned short common_type; offset:0; size:2; signed:0;
91+
field:unsigned char common_flags; offset:2; size:1; signed:0;
92+
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
93+
field:int common_pid; offset:4; size:4; signed:1;
94+
95+
field:unsigned long __probe_ip; offset:8; size:8; signed:0;
96+
field:u64 count; offset:16; size:8; signed:0;
97+
field:u64 pos; offset:24; size:8; signed:0;
98+
99+
print fmt: "(%lx) count=%Lu pos=0x%Lx", REC->__probe_ip, REC->count, REC->pos
100+
101+
If user unsures the name of arguments, ``$arg*`` will be helpful. The ``$arg*``
102+
is expanded to all function arguments of the function or the tracepoint. ::
103+
104+
# echo 'f:myprobe vfs_read $arg*' >> dynamic_events
105+
# cat dynamic_events
106+
f:fprobes/myprobe vfs_read file=file buf=buf count=count pos=pos
107+
108+
BTF also affects the ``$retval``. If user doesn't set any type, the retval type is
109+
automatically picked from the BTF. If the function returns ``void``, ``$retval``
110+
is rejected.
111+
112+
Usage examples
113+
--------------
114+
Here is an example to add fprobe events on ``vfs_read()`` function entry
115+
and exit, with BTF arguments.
116+
::
117+
118+
# echo 'f vfs_read $arg*' >> dynamic_events
119+
# echo 'f vfs_read%return $retval' >> dynamic_events
120+
# cat dynamic_events
121+
f:fprobes/vfs_read__entry vfs_read file=file buf=buf count=count pos=pos
122+
f:fprobes/vfs_read__exit vfs_read%return arg1=$retval
123+
# echo 1 > events/fprobes/enable
124+
# head -n 20 trace | tail
125+
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
126+
# | | | ||||| | |
127+
sh-70 [000] ...1. 335.883195: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08
128+
sh-70 [000] ..... 335.883208: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
129+
sh-70 [000] ...1. 335.883220: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08
130+
sh-70 [000] ..... 335.883224: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
131+
sh-70 [000] ...1. 335.883232: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c687a count=1 pos=0xffffc900005aff08
132+
sh-70 [000] ..... 335.883237: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
133+
sh-70 [000] ...1. 336.050329: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08
134+
sh-70 [000] ..... 336.050343: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
135+
136+
You can see all function arguments and return values are recorded as signed int.
137+
138+
Also, here is an example of tracepoint events on ``sched_switch`` tracepoint.
139+
To compare the result, this also enables the ``sched_switch`` traceevent too.
140+
::
141+
142+
# echo 't sched_switch $arg*' >> dynamic_events
143+
# echo 1 > events/sched/sched_switch/enable
144+
# echo 1 > events/tracepoints/sched_switch/enable
145+
# echo > trace
146+
# head -n 20 trace | tail
147+
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
148+
# | | | ||||| | |
149+
sh-70 [000] d..2. 3912.083993: sched_switch: prev_comm=sh prev_pid=70 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
150+
sh-70 [000] d..3. 3912.083995: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffff88800664e100 next=0xffffffff828229c0 prev_state=1
151+
<idle>-0 [000] d..2. 3912.084183: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=16 next_prio=120
152+
<idle>-0 [000] d..3. 3912.084184: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffffffff828229c0 next=0xffff888004208000 prev_state=0
153+
rcu_preempt-16 [000] d..2. 3912.084196: sched_switch: prev_comm=rcu_preempt prev_pid=16 prev_prio=120 prev_state=I ==> next_comm=swapper/0 next_pid=0 next_prio=120
154+
rcu_preempt-16 [000] d..3. 3912.084196: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffff888004208000 next=0xffffffff828229c0 prev_state=1026
155+
<idle>-0 [000] d..2. 3912.085191: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=16 next_prio=120
156+
<idle>-0 [000] d..3. 3912.085191: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffffffff828229c0 next=0xffff888004208000 prev_state=0
157+
158+
As you can see, the ``sched_switch`` trace-event shows *cooked* parameters, on
159+
the other hand, the ``sched_switch`` tracepoint probe event shows *raw*
160+
parameters. This means you can access any field values in the task
161+
structure pointed by the ``prev`` and ``next`` arguments.
162+
163+
For example, usually ``task_struct::start_time`` is not traced, but with this
164+
traceprobe event, you can trace it as below.
165+
::
166+
167+
# echo 't sched_switch comm=+1896(next):string start_time=+1728(next):u64' > dynamic_events
168+
# head -n 20 trace | tail
169+
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
170+
# | | | ||||| | |
171+
sh-70 [000] d..3. 5606.686577: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="rcu_preempt" usage=1 start_time=245000000
172+
rcu_preempt-16 [000] d..3. 5606.686602: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="sh" usage=1 start_time=1596095526
173+
sh-70 [000] d..3. 5606.686637: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0
174+
<idle>-0 [000] d..3. 5606.687190: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="rcu_preempt" usage=1 start_time=245000000
175+
rcu_preempt-16 [000] d..3. 5606.687202: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0
176+
<idle>-0 [000] d..3. 5606.690317: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="kworker/0:1" usage=1 start_time=137000000
177+
kworker/0:1-14 [000] d..3. 5606.690339: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0
178+
<idle>-0 [000] d..3. 5606.692368: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="kworker/0:1" usage=1 start_time=137000000
179+
180+
Currently, to find the offset of a specific field in the data structure,
181+
you need to build kernel with debuginfo and run `perf probe` command with
182+
`-D` option. e.g.
183+
::
184+
185+
# perf probe -D "__probestub_sched_switch next->comm:string next->start_time"
186+
p:probe/__probestub_sched_switch __probestub_sched_switch+0 comm=+1896(%cx):string start_time=+1728(%cx):u64
187+
188+
And replace the ``%cx`` with the ``next``.

Documentation/trace/index.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,7 @@ Linux Tracing Technologies
1313
kprobes
1414
kprobetrace
1515
uprobetracer
16+
fprobetrace
1617
tracepoints
1718
events
1819
events-kmem

Documentation/trace/kprobetrace.rst

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -66,6 +66,8 @@ Synopsis of kprobe_events
6666
(\*3) this is useful for fetching a field of data structures.
6767
(\*4) "u" means user-space dereference. See :ref:`user_mem_access`.
6868

69+
.. _kprobetrace_types:
70+
6971
Types
7072
-----
7173
Several types are supported for fetchargs. Kprobe tracer will access memory

0 commit comments

Comments
 (0)