---
title: More powerful Go execution traces
date: 2024-03-14
by:
- Michael Knyszek
tags:
- debug
- technical
- tracing
summary: "New features and improvements to execution traces from the last year."
---

The [runtime/trace](/pkg/runtime/trace) package contains a powerful tool for understanding and
troubleshooting Go programs.
The functionality within allows one to produce a trace of each goroutine's execution over some
time period.
With the [`go tool trace` command](/pkg/cmd/trace) (or the excellent open source
[gotraceui tool](https://gotraceui.dev/)), one may then visualize and explore the data within these
traces.

The magic of a trace is that it can easily reveal things about a program that are hard to see in
other ways.
For example, a concurrency bottleneck where lots of goroutines block on the same channel might be
quite difficult to see in a CPU profile, because there's no execution to sample.
But in an execution trace, the _lack_ of execution will show up with amazing clarity, and the stack
traces of blocked goroutines will quickly point at the culprit.

{{image "execution-traces-2024/gotooltrace.png"}}

Go developers are even able to instrument their own programs with [tasks](/pkg/runtime/trace#Task),
[regions](/pkg/runtime/trace#WithRegion), and [logs](/pkg/runtime/trace#Log) that
they can use to correlate their higher-level concerns with lower-level execution details.

## Issues

Unfortunately, the wealth of information in execution traces can often be out of reach.
Four big issues with traces have historically gotten in the way.

- Traces had high overheads.
- Traces didn't scale well, and could become too big to analyze.
- It was often unclear when to start tracing to capture a specific bad behavior.
- Only the most adventurous gophers could programmatically analyze traces, given the lack of a
  public package for parsing and interpreting execution traces.

If you've used traces in the last few years, you've likely been frustrated by one or more of these
problems.
But we're excited to share that over the last two Go releases we've made big progress in all four
of these areas.

## Low-overhead tracing

Prior to Go 1.21, the run-time overhead of tracing was somewhere between 10–20% CPU for many
applications, limiting tracing to situational usage rather than continuous usage like CPU
profiling.
It turned out that much of the cost of tracing came down to tracebacks.
Many events produced by the runtime have stack traces attached, which are invaluable for actually
identifying what goroutines were doing at key moments in their execution.

Thanks to work by Felix Geisendörfer and Nick Ripley on optimizing the efficiency of tracebacks,
the run-time CPU overhead of execution traces has been cut dramatically, down to 1–2% for many
applications.
You can read more about the work done here in [Felix's great blog
post](https://blog.felixge.de/reducing-gos-execution-tracer-overhead-with-frame-pointer-unwinding/)
on the topic.

## Scalable traces

The trace format and its events were designed around relatively efficient emission, but required
tooling to parse and keep around the state of the entirety of a trace.
A few hundred MiB trace could require several GiB of RAM to analyze!

This issue is unfortunately fundamental to how traces are generated.
To keep run-time overheads low, all events are written to the equivalent of thread-local buffers.
But this means events appear out of their true order, and the burden is placed on the trace
tooling to figure out what really happened.

The key insight to making traces scale while keeping overheads low was to occasionally split the
trace being generated.
Each split point would behave a bit like simultaneously disabling and reenabling tracing in one
go.
All the trace data so far would represent a complete and self-contained trace, while the new trace
data would seamlessly pick up from where it left off.

As you might imagine, fixing this required [rethinking and rewriting a lot of the foundation of
the trace implementation](/issue/60773) in the runtime.
We're happy to say that the work landed in Go 1.22 and is now generally available.
[A lot of nice improvements](/doc/go1.22#runtime/trace) came with the rewrite, including some
improvements to the [`go tool trace` command](/doc/go1.22#trace) as well.
The gritty details are all in the [design
document](https://github.com/golang/proposal/blob/master/design/60773-execution-tracer-overhaul.md),
if you're curious.

(Note: `go tool trace` still loads the full trace into memory, but [removing this
limitation](/issue/65315) for traces produced by Go 1.22+ programs is now feasible.)

## Flight recording

Suppose you work on a web service and an RPC took a very long time.
You couldn't start tracing at the point you knew the RPC was already taking a while, because the
root cause of the slow request had already happened and wasn't recorded.

There's a technique that can help with this called flight recording, which you may already be
familiar with from other programming environments.
The insight with flight recording is to have tracing on continuously and always keep the most
recent trace data around, just in case.
Then, once something interesting happens, the program can just write out whatever it has!

Before traces could be split, this was pretty much a non-starter.
But now that continuous tracing is viable thanks to low overheads, and the runtime can split
traces any time it needs to, it turned out to be straightforward to implement flight recording.

As a result, we're happy to announce a flight recorder experiment, available in the
[golang.org/x/exp/trace package](/pkg/golang.org/x/exp/trace#FlightRecorder).

Please try it out!
Below is an example that sets up flight recording to capture a long HTTP request to get you started.

{{code "execution-traces-2024/flightrecorder.go" `/START/` `/END/`}}

If you have any feedback, positive or negative, please share it on the [proposal
issue](/issue/63185)!

## Trace reader API

Along with the trace implementation rewrite came an effort to clean up the other trace internals,
like `go tool trace`.
This spawned an attempt to create a trace reader API that was good enough to share and that could
make traces more accessible.

Just like the flight recorder, we're happy to announce that we also have an experimental trace reader
API that we'd like to share.
It's available in the [same package as the flight recorder,
golang.org/x/exp/trace](/pkg/golang.org/x/exp/trace#Reader).

We think it's good enough to start building things on top of, so please try it out!
Below is an example that measures the proportion of goroutine block events that blocked to wait on
the network.

{{code "execution-traces-2024/reader.go" `/START/` `/END/`}}

And just like the flight recorder, there's a [proposal issue](/issue/62627) that would
be a great place to leave feedback!

We'd like to quickly call out Dominik Honnef as someone who tried it out early, provided great
feedback, and has contributed support for older trace versions to the API.

## Thank you!

This work was completed, in no small part, thanks to the help of those in the [diagnostics
working group](/issue/57175), started over a year ago as a collaboration between stakeholders from
across the Go community, and open to the public.

We'd like to take a moment to thank those community members who have attended the diagnostic
meetings regularly over the last year: Felix Geisendörfer, Nick Ripley, Rhys Hiltner, Dominik
Honnef, Bryan Boreham, thepudds.

The discussions, feedback, and work you all put in have been instrumental to getting us to where we
are today.
Thank you!