|
| 1 | +.. SPDX-License-Identifier: GPL-2.0 |
| 2 | +
|
| 3 | +================================== |
| 4 | +Eprobe - Event-based Probe Tracing |
| 5 | +================================== |
| 6 | + |
| 7 | +:Author: Steven Rostedt < [email protected]> |
| 8 | + |
| 9 | +- Written for v6.17 |
| 10 | + |
| 11 | +Overview |
| 12 | +======== |
| 13 | + |
| 14 | +Eprobes are dynamic events that are placed on existing events to either |
| 15 | +dereference a field that is a pointer, or simply to limit what fields are |
| 16 | +recorded in the trace event. |
| 17 | + |
| 18 | +Eprobes depend on kprobe events so to enable this feature, build your kernel |
| 19 | +with CONFIG_EPROBE_EVENTS=y. |
| 20 | + |
| 21 | +Eprobes are created via the /sys/kernel/tracing/dynamic_events file. |
| 22 | + |
| 23 | +Synopsis of eprobe_events |
| 24 | +------------------------- |
| 25 | +:: |
| 26 | + |
| 27 | + e[:[EGRP/][EEVENT]] GRP.EVENT [FETCHARGS] : Set a probe |
| 28 | + -:[EGRP/][EEVENT] : Clear a probe |
| 29 | + |
| 30 | + EGRP : Group name of the new event. If omitted, use "eprobes" for it. |
| 31 | + EEVENT : Event name. If omitted, the event name is generated and will |
| 32 | + be the same event name as the event it attached to. |
| 33 | + GRP : Group name of the event to attach to. |
| 34 | + EVENT : Event name of the event to attach to. |
| 35 | + |
| 36 | + FETCHARGS : Arguments. Each probe can have up to 128 args. |
| 37 | + $FIELD : Fetch the value of the event field called FIELD. |
| 38 | + @ADDR : Fetch memory at ADDR (ADDR should be in kernel) |
| 39 | + @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol) |
| 40 | + $comm : Fetch current task comm. |
| 41 | + +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4) |
| 42 | + \IMM : Store an immediate value to the argument. |
| 43 | + NAME=FETCHARG : Set NAME as the argument name of FETCHARG. |
| 44 | + FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types |
| 45 | + (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types |
| 46 | + (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char", |
| 47 | + "string", "ustring", "symbol", "symstr" and "bitfield" are |
| 48 | + supported. |
| 49 | + |
| 50 | +Types |
| 51 | +----- |
| 52 | +The FETCHARGS above is very similar to the kprobe events as described in |
| 53 | +Documentation/trace/kprobetrace.rst. |
| 54 | + |
| 55 | +The difference between eprobes and kprobes FETCHARGS is that eprobes has a |
| 56 | +$FIELD command that returns the content of the event field of the event |
| 57 | +that is attached. Eprobes do not have access to registers, stacks and function |
| 58 | +arguments that kprobes has. |
| 59 | + |
| 60 | +If a field argument is a pointer, it may be dereferenced just like a memory |
| 61 | +address using the FETCHARGS syntax. |
| 62 | + |
| 63 | + |
| 64 | +Attaching to dynamic events |
| 65 | +--------------------------- |
| 66 | + |
| 67 | +Eprobes may attach to dynamic events as well as to normal events. It may |
| 68 | +attach to a kprobe event, a synthetic event or a fprobe event. This is useful |
| 69 | +if the type of a field needs to be changed. See Example 2 below. |
| 70 | + |
| 71 | +Usage examples |
| 72 | +============== |
| 73 | + |
| 74 | +Example 1 |
| 75 | +--------- |
| 76 | + |
| 77 | +The basic usage of eprobes is to limit the data that is being recorded into |
| 78 | +the tracing buffer. For example, a common event to trace is the sched_switch |
| 79 | +trace event. That has a format of:: |
| 80 | + |
| 81 | + field:unsigned short common_type; offset:0; size:2; signed:0; |
| 82 | + field:unsigned char common_flags; offset:2; size:1; signed:0; |
| 83 | + field:unsigned char common_preempt_count; offset:3; size:1; signed:0; |
| 84 | + field:int common_pid; offset:4; size:4; signed:1; |
| 85 | + |
| 86 | + field:char prev_comm[16]; offset:8; size:16; signed:0; |
| 87 | + field:pid_t prev_pid; offset:24; size:4; signed:1; |
| 88 | + field:int prev_prio; offset:28; size:4; signed:1; |
| 89 | + field:long prev_state; offset:32; size:8; signed:1; |
| 90 | + field:char next_comm[16]; offset:40; size:16; signed:0; |
| 91 | + field:pid_t next_pid; offset:56; size:4; signed:1; |
| 92 | + field:int next_prio; offset:60; size:4; signed:1; |
| 93 | + |
| 94 | +The first four fields are common to all events and can not be limited. But the |
| 95 | +rest of the event has 60 bytes of information. It records the names of the |
| 96 | +previous and next tasks being scheduled out and in, as well as their pids and |
| 97 | +priorities. It also records the state of the previous task. If only the pids |
| 98 | +of the tasks are of interest, why waste the ring buffer with all the other |
| 99 | +fields? |
| 100 | + |
| 101 | +An eprobe can limit what gets recorded. Note, it does not help in performance, |
| 102 | +as all the fields are recorded in a temporary buffer to process the eprobe. |
| 103 | +:: |
| 104 | + |
| 105 | + # echo 'e:sched/switch sched.sched_switch prev=$prev_pid:u32 next=$next_pid:u32' >> /sys/kernel/tracing/dynamic_events |
| 106 | + # echo 1 > /sys/kernel/tracing/events/sched/switch/enable |
| 107 | + # cat /sys/kernel/tracing/trace |
| 108 | + |
| 109 | + # tracer: nop |
| 110 | + # |
| 111 | + # entries-in-buffer/entries-written: 2721/2721 #P:8 |
| 112 | + # |
| 113 | + # _-----=> irqs-off/BH-disabled |
| 114 | + # / _----=> need-resched |
| 115 | + # | / _---=> hardirq/softirq |
| 116 | + # || / _--=> preempt-depth |
| 117 | + # ||| / _-=> migrate-disable |
| 118 | + # |||| / delay |
| 119 | + # TASK-PID CPU# ||||| TIMESTAMP FUNCTION |
| 120 | + # | | | ||||| | | |
| 121 | + sshd-session-1082 [004] d..4. 5041.239906: switch: (sched.sched_switch) prev=1082 next=0 |
| 122 | + bash-1085 [001] d..4. 5041.240198: switch: (sched.sched_switch) prev=1085 next=141 |
| 123 | + kworker/u34:5-141 [001] d..4. 5041.240259: switch: (sched.sched_switch) prev=141 next=1085 |
| 124 | + <idle>-0 [004] d..4. 5041.240354: switch: (sched.sched_switch) prev=0 next=1082 |
| 125 | + bash-1085 [001] d..4. 5041.240385: switch: (sched.sched_switch) prev=1085 next=141 |
| 126 | + kworker/u34:5-141 [001] d..4. 5041.240410: switch: (sched.sched_switch) prev=141 next=1085 |
| 127 | + bash-1085 [001] d..4. 5041.240478: switch: (sched.sched_switch) prev=1085 next=0 |
| 128 | + sshd-session-1082 [004] d..4. 5041.240526: switch: (sched.sched_switch) prev=1082 next=0 |
| 129 | + <idle>-0 [001] d..4. 5041.247524: switch: (sched.sched_switch) prev=0 next=90 |
| 130 | + <idle>-0 [002] d..4. 5041.247545: switch: (sched.sched_switch) prev=0 next=16 |
| 131 | + kworker/1:1-90 [001] d..4. 5041.247580: switch: (sched.sched_switch) prev=90 next=0 |
| 132 | + rcu_sched-16 [002] d..4. 5041.247591: switch: (sched.sched_switch) prev=16 next=0 |
| 133 | + <idle>-0 [002] d..4. 5041.257536: switch: (sched.sched_switch) prev=0 next=16 |
| 134 | + rcu_sched-16 [002] d..4. 5041.257573: switch: (sched.sched_switch) prev=16 next=0 |
| 135 | + |
| 136 | +Note, without adding the "u32" after the prev_pid and next_pid, the values |
| 137 | +would default showing in hexadecimal. |
| 138 | + |
| 139 | +Example 2 |
| 140 | +--------- |
| 141 | + |
| 142 | +If a specific system call is to be recorded but the syscalls events are not |
| 143 | +enabled, the raw_syscalls can still be used (syscalls are system call |
| 144 | +events are not normal events, but are created from the raw_syscalls events |
| 145 | +within the kernel). In order to trace the openat system call, one can create |
| 146 | +an event probe on top of the raw_syscalls event: |
| 147 | +:: |
| 148 | + |
| 149 | + # cd /sys/kernel/tracing |
| 150 | + # cat events/raw_syscalls/sys_enter/format |
| 151 | + name: sys_enter |
| 152 | + ID: 395 |
| 153 | + format: |
| 154 | + field:unsigned short common_type; offset:0; size:2; signed:0; |
| 155 | + field:unsigned char common_flags; offset:2; size:1; signed:0; |
| 156 | + field:unsigned char common_preempt_count; offset:3; size:1; signed:0; |
| 157 | + field:int common_pid; offset:4; size:4; signed:1; |
| 158 | + |
| 159 | + field:long id; offset:8; size:8; signed:1; |
| 160 | + field:unsigned long args[6]; offset:16; size:48; signed:0; |
| 161 | + |
| 162 | + print fmt: "NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)", REC->id, REC->args[0], REC->args[1], REC->args[2], REC->args[3], REC->args[4], REC->args[5] |
| 163 | + |
| 164 | +From the source code, the sys_openat() has: |
| 165 | +:: |
| 166 | + |
| 167 | + int sys_openat(int dirfd, const char *path, int flags, mode_t mode) |
| 168 | + { |
| 169 | + return my_syscall4(__NR_openat, dirfd, path, flags, mode); |
| 170 | + } |
| 171 | + |
| 172 | +The path is the second parameter, and that is what is wanted. |
| 173 | +:: |
| 174 | + |
| 175 | + # echo 'e:openat raw_syscalls.sys_enter nr=$id filename=+8($args):ustring' >> dynamic_events |
| 176 | + |
| 177 | +This is being run on x86_64 where the word size is 8 bytes and the openat |
| 178 | +system call __NR_openat is set at 257. |
| 179 | +:: |
| 180 | + |
| 181 | + # echo 'nr == 257' > events/eprobes/openat/filter |
| 182 | + |
| 183 | +Now enable the event and look at the trace. |
| 184 | +:: |
| 185 | + |
| 186 | + # echo 1 > events/eprobes/openat/enable |
| 187 | + # cat trace |
| 188 | + |
| 189 | + # tracer: nop |
| 190 | + # |
| 191 | + # entries-in-buffer/entries-written: 4/4 #P:8 |
| 192 | + # |
| 193 | + # _-----=> irqs-off/BH-disabled |
| 194 | + # / _----=> need-resched |
| 195 | + # | / _---=> hardirq/softirq |
| 196 | + # || / _--=> preempt-depth |
| 197 | + # ||| / _-=> migrate-disable |
| 198 | + # |||| / delay |
| 199 | + # TASK-PID CPU# ||||| TIMESTAMP FUNCTION |
| 200 | + # | | | ||||| | | |
| 201 | + cat-1298 [003] ...2. 2060.875970: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault) |
| 202 | + cat-1298 [003] ...2. 2060.876197: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault) |
| 203 | + cat-1298 [003] ...2. 2060.879126: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault) |
| 204 | + cat-1298 [003] ...2. 2060.879639: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault) |
| 205 | + |
| 206 | +The filename shows "(fault)". This is likely because the filename has not been |
| 207 | +pulled into memory yet and currently trace events cannot fault in memory that |
| 208 | +is not present. When an eprobe tries to read memory that has not been faulted |
| 209 | +in yet, it will show the "(fault)" text. |
| 210 | + |
| 211 | +To get around this, as the kernel will likely pull in this filename and make |
| 212 | +it present, attaching it to a synthetic event that can pass the address of the |
| 213 | +filename from the entry of the event to the end of the event, this can be used |
| 214 | +to show the filename when the system call returns. |
| 215 | + |
| 216 | +Remove the old eprobe:: |
| 217 | + |
| 218 | + # echo 1 > events/eprobes/openat/enable |
| 219 | + # echo '-:openat' >> dynamic_events |
| 220 | + |
| 221 | +This time make an eprobe where the address of the filename is saved:: |
| 222 | + |
| 223 | + # echo 'e:openat_start raw_syscalls.sys_enter nr=$id filename=+8($args):x64' >> dynamic_events |
| 224 | + |
| 225 | +Create a synthetic event that passes the address of the filename to the |
| 226 | +end of the event:: |
| 227 | + |
| 228 | + # echo 's:filename u64 file' >> dynamic_events |
| 229 | + # echo 'hist:keys=common_pid:f=filename if nr == 257' > events/eprobes/openat_start/trigger |
| 230 | + # echo 'hist:keys=common_pid:file=$f:onmatch(eprobes.openat_start).trace(filename,$file) if id == 257' > events/raw_syscalls/sys_exit/trigger |
| 231 | + |
| 232 | +Now that the address of the filename has been passed to the end of the |
| 233 | +system call, create another eprobe to attach to the exit event to show the |
| 234 | +string:: |
| 235 | + |
| 236 | + # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events |
| 237 | + # echo 1 > events/eprobes/openat/enable |
| 238 | + # cat trace |
| 239 | + |
| 240 | + # tracer: nop |
| 241 | + # |
| 242 | + # entries-in-buffer/entries-written: 4/4 #P:8 |
| 243 | + # |
| 244 | + # _-----=> irqs-off/BH-disabled |
| 245 | + # / _----=> need-resched |
| 246 | + # | / _---=> hardirq/softirq |
| 247 | + # || / _--=> preempt-depth |
| 248 | + # ||| / _-=> migrate-disable |
| 249 | + # |||| / delay |
| 250 | + # TASK-PID CPU# ||||| TIMESTAMP FUNCTION |
| 251 | + # | | | ||||| | | |
| 252 | + cat-1331 [001] ...5. 2944.787977: openat: (synthetic.filename) filename="/etc/ld.so.cache" |
| 253 | + cat-1331 [001] ...5. 2944.788480: openat: (synthetic.filename) filename="/lib/x86_64-linux-gnu/libc.so.6" |
| 254 | + cat-1331 [001] ...5. 2944.793426: openat: (synthetic.filename) filename="/usr/lib/locale/locale-archive" |
| 255 | + cat-1331 [001] ...5. 2944.831362: openat: (synthetic.filename) filename="trace" |
| 256 | + |
| 257 | +Example 3 |
| 258 | +--------- |
| 259 | + |
| 260 | +If syscall trace events are available, the above would not need the first |
| 261 | +eprobe, but it would still need the last one:: |
| 262 | + |
| 263 | + # echo 's:filename u64 file' >> dynamic_events |
| 264 | + # echo 'hist:keys=common_pid:f=filename' > events/syscalls/sys_enter_openat/trigger |
| 265 | + # echo 'hist:keys=common_pid:file=$f:onmatch(syscalls.sys_enter_openat).trace(filename,$file)' > events/syscalls/sys_exit_openat/trigger |
| 266 | + # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events |
| 267 | + # echo 1 > events/eprobes/openat/enable |
| 268 | + |
| 269 | +And this would produce the same result as Example 2. |
0 commit comments