|
| 1 | +============================== |
| 2 | +Using the tracer for debugging |
| 3 | +============================== |
| 4 | + |
| 5 | +Copyright 2024 Google LLC. |
| 6 | + |
| 7 | +:Author: Steven Rostedt < [email protected]> |
| 8 | +:License: The GNU Free Documentation License, Version 1.2 |
| 9 | + (dual licensed under the GPL v2) |
| 10 | + |
| 11 | +- Written for: 6.12 |
| 12 | + |
| 13 | +Introduction |
| 14 | +------------ |
| 15 | +The tracing infrastructure can be very useful for debugging the Linux |
| 16 | +kernel. This document is a place to add various methods of using the tracer |
| 17 | +for debugging. |
| 18 | + |
| 19 | +First, make sure that the tracefs file system is mounted:: |
| 20 | + |
| 21 | + $ sudo mount -t tracefs tracefs /sys/kernel/tracing |
| 22 | + |
| 23 | + |
| 24 | +Using trace_printk() |
| 25 | +-------------------- |
| 26 | + |
| 27 | +trace_printk() is a very lightweight utility that can be used in any context |
| 28 | +inside the kernel, with the exception of "noinstr" sections. It can be used |
| 29 | +in normal, softirq, interrupt and even NMI context. The trace data is |
| 30 | +written to the tracing ring buffer in a lockless way. To make it even |
| 31 | +lighter weight, when possible, it will only record the pointer to the format |
| 32 | +string, and save the raw arguments into the buffer. The format and the |
| 33 | +arguments will be post processed when the ring buffer is read. This way the |
| 34 | +trace_printk() format conversions are not done during the hot path, where |
| 35 | +the trace is being recorded. |
| 36 | + |
| 37 | +trace_printk() is meant only for debugging, and should never be added into |
| 38 | +a subsystem of the kernel. If you need debugging traces, add trace events |
| 39 | +instead. If a trace_printk() is found in the kernel, the following will |
| 40 | +appear in the dmesg:: |
| 41 | + |
| 42 | + ********************************************************** |
| 43 | + ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE ** |
| 44 | + ** ** |
| 45 | + ** trace_printk() being used. Allocating extra memory. ** |
| 46 | + ** ** |
| 47 | + ** This means that this is a DEBUG kernel and it is ** |
| 48 | + ** unsafe for production use. ** |
| 49 | + ** ** |
| 50 | + ** If you see this message and you are not debugging ** |
| 51 | + ** the kernel, report this immediately to your vendor! ** |
| 52 | + ** ** |
| 53 | + ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE ** |
| 54 | + ********************************************************** |
| 55 | + |
| 56 | +Debugging kernel crashes |
| 57 | +------------------------ |
| 58 | +There is various methods of acquiring the state of the system when a kernel |
| 59 | +crash occurs. This could be from the oops message in printk, or one could |
| 60 | +use kexec/kdump. But these just show what happened at the time of the crash. |
| 61 | +It can be very useful in knowing what happened up to the point of the crash. |
| 62 | +The tracing ring buffer, by default, is a circular buffer than will |
| 63 | +overwrite older events with newer ones. When a crash happens, the content of |
| 64 | +the ring buffer will be all the events that lead up to the crash. |
| 65 | + |
| 66 | +There are several kernel command line parameters that can be used to help in |
| 67 | +this. The first is "ftrace_dump_on_oops". This will dump the tracing ring |
| 68 | +buffer when a oops occurs to the console. This can be useful if the console |
| 69 | +is being logged somewhere. If a serial console is used, it may be prudent to |
| 70 | +make sure the ring buffer is relatively small, otherwise the dumping of the |
| 71 | +ring buffer may take several minutes to hours to finish. Here's an example |
| 72 | +of the kernel command line:: |
| 73 | + |
| 74 | + ftrace_dump_on_oops trace_buf_size=50K |
| 75 | + |
| 76 | +Note, the tracing buffer is made up of per CPU buffers where each of these |
| 77 | +buffers is broken up into sub-buffers that are by default PAGE_SIZE. The |
| 78 | +above trace_buf_size option above sets each of the per CPU buffers to 50K, |
| 79 | +so, on a machine with 8 CPUs, that's actually 400K total. |
| 80 | + |
| 81 | +Persistent buffers across boots |
| 82 | +------------------------------- |
| 83 | +If the system memory allows it, the tracing ring buffer can be specified at |
| 84 | +a specific location in memory. If the location is the same across boots and |
| 85 | +the memory is not modified, the tracing buffer can be retrieved from the |
| 86 | +following boot. There's two ways to reserve memory for the use of the ring |
| 87 | +buffer. |
| 88 | + |
| 89 | +The more reliable way (on x86) is to reserve memory with the "memmap" kernel |
| 90 | +command line option and then use that memory for the trace_instance. This |
| 91 | +requires a bit of knowledge of the physical memory layout of the system. The |
| 92 | +advantage of using this method, is that the memory for the ring buffer will |
| 93 | +always be the same:: |
| 94 | + |
| 95 | + memmap==12M$0x284500000 trace_instance=boot_map@0x284500000:12M |
| 96 | + |
| 97 | +The memmap above reserves 12 megabytes of memory at the physical memory |
| 98 | +location 0x284500000. Then the trace_instance option will create a trace |
| 99 | +instance "boot_map" at that same location with the same amount of memory |
| 100 | +reserved. As the ring buffer is broke up into per CPU buffers, the 12 |
| 101 | +megabytes will be broken up evenly between those CPUs. If you have 8 CPUs, |
| 102 | +each per CPU ring buffer will be 1.5 megabytes in size. Note, that also |
| 103 | +includes meta data, so the amount of memory actually used by the ring buffer |
| 104 | +will be slightly smaller. |
| 105 | + |
| 106 | +Another more generic but less robust way to allocate a ring buffer mapping |
| 107 | +at boot is with the "reserve_mem" option:: |
| 108 | + |
| 109 | + reserve_mem=12M:4096:trace trace_instance=boot_map@trace |
| 110 | + |
| 111 | +The reserve_mem option above will find 12 megabytes that are available at |
| 112 | +boot up, and align it by 4096 bytes. It will label this memory as "trace" |
| 113 | +that can be used by later command line options. |
| 114 | + |
| 115 | +The trace_instance option creates a "boot_map" instance and will use the |
| 116 | +memory reserved by reserve_mem that was labeled as "trace". This method is |
| 117 | +more generic but may not be as reliable. Due to KASLR, the memory reserved |
| 118 | +by reserve_mem may not be located at the same location. If this happens, |
| 119 | +then the ring buffer will not be from the previous boot and will be reset. |
| 120 | + |
| 121 | +Sometimes, by using a larger alignment, it can keep KASLR from moving things |
| 122 | +around in such a way that it will move the location of the reserve_mem. By |
| 123 | +using a larger alignment, you may find better that the buffer is more |
| 124 | +consistent to where it is placed:: |
| 125 | + |
| 126 | + reserve_mem=12M:0x2000000:trace trace_instance=boot_map@trace |
| 127 | + |
| 128 | +On boot up, the memory reserved for the ring buffer is validated. It will go |
| 129 | +through a series of tests to make sure that the ring buffer contains valid |
| 130 | +data. If it is, it will then set it up to be available to read from the |
| 131 | +instance. If it fails any of the tests, it will clear the entire ring buffer |
| 132 | +and initialize it as new. |
| 133 | + |
| 134 | +The layout of this mapped memory may not be consistent from kernel to |
| 135 | +kernel, so only the same kernel is guaranteed to work if the mapping is |
| 136 | +preserved. Switching to a different kernel version may find a different |
| 137 | +layout and mark the buffer as invalid. |
| 138 | + |
| 139 | +Using trace_printk() in the boot instance |
| 140 | +----------------------------------------- |
| 141 | +By default, the content of trace_printk() goes into the top level tracing |
| 142 | +instance. But this instance is never preserved across boots. To have the |
| 143 | +trace_printk() content, and some other internal tracing go to the preserved |
| 144 | +buffer (like dump stacks), either set the instance to be the trace_printk() |
| 145 | +destination from the kernel command line, or set it after boot up via the |
| 146 | +trace_printk_dest option. |
| 147 | + |
| 148 | +After boot up:: |
| 149 | + |
| 150 | + echo 1 > /sys/kernel/tracing/instances/boot_map/options/trace_printk_dest |
| 151 | + |
| 152 | +From the kernel command line:: |
| 153 | + |
| 154 | + reserve_mem=12M:4096:trace trace_instance=boot_map^traceprintk^traceoff@trace |
| 155 | + |
| 156 | +If setting it from the kernel command line, it is recommended to also |
| 157 | +disable tracing with the "traceoff" flag, and enable tracing after boot up. |
| 158 | +Otherwise the trace from the most recent boot will be mixed with the trace |
| 159 | +from the previous boot, and may make it confusing to read. |
0 commit comments