Commit c2d58a8

mknyszek authored and gopherbot committed

_content/blog: add execution traces 2024 blog post
Change-Id: Id12e94913263048df46496cc77fb0376edd81a77
Reviewed-on: https://go-review.googlesource.com/c/website/+/570603
LUCI-TryBot-Result: Go LUCI <[email protected]>
Auto-Submit: Eli Bendersky <[email protected]>
Reviewed-by: Eli Bendersky <[email protected]>
Reviewed-by: Michael Pratt <[email protected]>
1 parent 251edba commit c2d58a8

4 files changed: +269 -0 lines changed

Lines changed: 160 additions & 0 deletions

---
title: More powerful Go execution traces
date: 2024-03-14
by:
- Michael Knyszek
tags:
- debug
- technical
- tracing
summary: "New features and improvements to execution traces from the last year."
---

The [runtime/trace](/pkg/runtime/trace) package contains a powerful tool for understanding and
troubleshooting Go programs.
The functionality within allows one to produce a trace of each goroutine's execution over some
time period.
With the [`go tool trace` command](/pkg/cmd/trace) (or the excellent open source
[gotraceui tool](https://gotraceui.dev/)), one may then visualize and explore the data within these
traces.
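
For a taste of what that looks like, here's a minimal sketch of a program that traces itself with
[`trace.Start`](/pkg/runtime/trace#Start) and [`trace.Stop`](/pkg/runtime/trace#Stop); the output
file name `trace.out` is just an arbitrary choice for this example.

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
)

func main() {
	// Create a file to hold the trace; the name is arbitrary.
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Start tracing. Every trace event is written to f until Stop is called.
	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// ... the code to trace goes here ...
}
```

Passing the resulting file to `go tool trace trace.out` opens the trace viewer.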

The magic of a trace is that it can easily reveal things about a program that are hard to see in
other ways.
For example, a concurrency bottleneck where lots of goroutines block on the same channel might be
quite difficult to see in a CPU profile, because there's no execution to sample.
But in an execution trace, the _lack_ of execution will show up with amazing clarity, and the stack
traces of blocked goroutines will quickly point at the culprit.

{{image "execution-traces-2024/gotooltrace.png"}}

Go developers are even able to instrument their own programs with [tasks](/pkg/runtime/trace#Task),
[regions](/pkg/runtime/trace#WithRegion), and [logs](/pkg/runtime/trace#Log) that
they can use to correlate their higher-level concerns with lower-level execution details.
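
As a rough sketch of what that instrumentation can look like, the hypothetical handler below
groups its work under a task, marks a sub-operation as a region, and attaches a log message.
All of the names here are invented for illustration, and the annotations only show up in traces
that are actually being collected.

```go
package main

import (
	"context"
	"runtime/trace"
)

// handleOrder is a hypothetical request handler annotated for tracing.
func handleOrder(ctx context.Context, orderID string) {
	// Group everything done on behalf of this order under one task.
	ctx, task := trace.NewTask(ctx, "handleOrder")
	defer task.End()

	// Attach the order ID so it appears alongside the task in the trace.
	trace.Log(ctx, "orderID", orderID)

	// Mark an interesting sub-operation as a region within the task.
	trace.WithRegion(ctx, "chargeCard", func() {
		// ... charge the customer's card ...
	})
}

func main() {
	handleOrder(context.Background(), "order-42")
}
```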

## Issues

Unfortunately, the wealth of information in execution traces can often be out of reach.
Four big issues with traces have historically gotten in the way.

- Traces had high overheads.
- Traces didn't scale well, and could become too big to analyze.
- It was often unclear when to start tracing to capture a specific bad behavior.
- Only the most adventurous gophers could programmatically analyze traces, given the lack of a
  public package for parsing and interpreting execution traces.

If you've used traces in the last few years, you've likely been frustrated by one or more of these
problems.
But we're excited to share that over the last two Go releases we've made big progress in all four
of these areas.

## Low-overhead tracing

Prior to Go 1.21, the run-time overhead of tracing was somewhere between 10–20% CPU for many
applications, which limited tracing to situational usage, rather than continuous usage like CPU
profiling.
It turned out that much of the cost of tracing came down to tracebacks.
Many events produced by the runtime have stack traces attached, which are invaluable to actually
identifying what goroutines were doing at key moments in their execution.

Thanks to work by Felix Geisendörfer and Nick Ripley on optimizing the efficiency of tracebacks,
the run-time CPU overhead of execution traces has been cut dramatically, down to 1–2% for many
applications.
You can read more about the work done here in [Felix's great blog
post](https://blog.felixge.de/reducing-gos-execution-tracer-overhead-with-frame-pointer-unwinding/)
on the topic.
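
(As an aside: if your service already serves [net/http/pprof](/pkg/net/http/pprof) endpoints, you
can grab an execution trace on demand over HTTP, for example with
`curl -o trace.out http://localhost:6060/debug/pprof/trace?seconds=5`; the host and port here are
placeholders for wherever your service listens.)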

## Scalable traces

The trace format and its events were designed around relatively efficient emission, but required
tooling to parse and keep around the state of the entirety of a trace.
A few hundred MiB trace could require several GiB of RAM to analyze!

This issue is unfortunately fundamental to how traces are generated.
To keep run-time overheads low, all events are written to the equivalent of thread-local buffers.
But this means events appear out of their true order, and the burden is placed on the trace
tooling to figure out what really happened.

The key insight to making traces scale while keeping overheads low was to occasionally split the
trace being generated.
Each split point would behave a bit like simultaneously disabling and reenabling tracing in one
go.
All the trace data so far would represent a complete and self-contained trace, while the new trace
data would seamlessly pick up from where it left off.

As you might imagine, fixing this required [rethinking and rewriting a lot of the foundation of
the trace implementation](/issue/60773) in the runtime.
We're happy to say that the work landed in Go 1.22 and is now generally available.
[A lot of nice improvements](/doc/go1.22#runtime/trace) came with the rewrite, including some
improvements to the [`go tool trace` command](/doc/go1.22#trace) as well.
The gritty details are all in the [design
document](https://github.com/golang/proposal/blob/master/design/60773-execution-tracer-overhaul.md),
if you're curious.

(Note: `go tool trace` still loads the full trace into memory, but [removing this
limitation](/issue/65315) for traces produced by Go 1.22+ programs is now feasible.)

## Flight recording

Suppose you work on a web service and an RPC took a very long time.
You couldn't start tracing at the point you knew the RPC was already taking a while, because the
root cause of the slow request already happened and wasn't recorded.

There's a technique that can help with this called flight recording, which you may already be
familiar with from other programming environments.
The insight with flight recording is to have tracing on continuously and always keep the most
recent trace data around, just in case.
Then, once something interesting happens, the program can just write out whatever it has!

Before traces could be split, this was pretty much a non-starter.
But now that low overheads make continuous tracing viable, and the runtime can split traces any
time it needs, flight recording turned out to be straightforward to implement.

As a result, we're happy to announce a flight recorder experiment, available in the
[golang.org/x/exp/trace package](/pkg/golang.org/x/exp/trace#FlightRecorder).

Please try it out!
Below is an example that sets up flight recording to capture a long HTTP request to get you started.

{{code "execution-traces-2024/flightrecorder.go" `/START/` `/END/`}}
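
(A snapshot written this way is an ordinary execution trace, so once `trace.out` exists you can
explore it with `go tool trace trace.out`.)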

If you have any feedback, positive or negative, please share it to the [proposal
issue](/issue/63185)!

## Trace reader API

Along with the trace implementation rewrite came an effort to clean up the other trace internals,
like `go tool trace`.
This spawned an attempt to create a trace reader API that was good enough to share and that could
make traces more accessible.

Just like the flight recorder, we're happy to announce that we also have an experimental trace reader
API that we'd like to share.
It's available in the [same package as the flight recorder,
golang.org/x/exp/trace](/pkg/golang.org/x/exp/trace#Reader).

We think it's good enough to start building things on top of, so please try it out!
Below is an example that measures the proportion of goroutine block events that blocked to wait on
the network.

{{code "execution-traces-2024/reader.go" `/START/` `/END/`}}

And just like the flight recorder, there's a [proposal issue](/issue/62627) that would
be a great place to leave feedback!

We'd like to quickly call out Dominik Honnef as someone who tried it out early, provided great
feedback, and has contributed support for older trace versions to the API.

## Thank you!

This work was completed, in no small part, thanks to the help of those in the [diagnostics
working group](/issue/57175), started over a year ago as a collaboration between stakeholders from
across the Go community, and open to the public.

We'd like to take a moment to thank those community members who have attended the diagnostic
meetings regularly over the last year: Felix Geisendörfer, Nick Ripley, Rhys Hiltner, Dominik
Honnef, Bryan Boreham, and thepudds.

The discussions, feedback, and work you all put in have been instrumental to getting us to where we
are today.
Thank you!

execution-traces-2024/flightrecorder.go: 54 additions & 0 deletions

//go:build OMIT

package main

import (
	"bytes"
	"log"
	"net/http"
	"os"
	"sync"
	"time"

	"golang.org/x/exp/trace"
)

func main() {
	// START OMIT
	// Set up the flight recorder.
	fr := trace.NewFlightRecorder()
	fr.Start()

	// Set up and run an HTTP server.
	var once sync.Once
	http.HandleFunc("/my-endpoint", func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()

		// Do the work...
		doWork(w, r)

		// We saw a long request. Take a snapshot!
		if time.Since(start) > 300*time.Millisecond {
			// Do it only once for simplicity, but you can take more than one.
			once.Do(func() {
				// Grab the snapshot.
				var b bytes.Buffer
				_, err := fr.WriteTo(&b)
				if err != nil {
					log.Print(err)
					return
				}
				// Write it to a file.
				if err := os.WriteFile("trace.out", b.Bytes(), 0o755); err != nil {
					log.Print(err)
					return
				}
			})
		}
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
	// END OMIT
}

func doWork(_ http.ResponseWriter, _ *http.Request) {
}

execution-traces-2024/gotooltrace.png: 466 KB (the image referenced by the {{image}} template above)

execution-traces-2024/reader.go: 55 additions & 0 deletions

//go:build OMIT

package main

import (
	"fmt"
	"io"
	"log"
	"os"
	"strings"

	"golang.org/x/exp/trace"
)

func main() {
	// START OMIT
	// Start reading from STDIN.
	r, err := trace.NewReader(os.Stdin)
	if err != nil {
		log.Fatal(err)
	}

	var blocked int
	var blockedOnNetwork int
	for {
		// Read the event.
		ev, err := r.ReadEvent()
		if err == io.EOF {
			break
		} else if err != nil {
			log.Fatal(err)
		}

		// Process it.
		if ev.Kind() == trace.EventStateTransition {
			st := ev.StateTransition()
			if st.Resource.Kind == trace.ResourceGoroutine {
				_ = st.Resource.Goroutine() // The goroutine's ID, unused in this example.
				from, to := st.GoroutineTransition()

				// Look for goroutines blocking, and count them.
				if from.Executing() && to == trace.GoWaiting {
					blocked++
					if strings.Contains(st.Reason, "network") {
						blockedOnNetwork++
					}
				}
			}
		}
	}
	// Print what we found.
	p := 100 * float64(blockedOnNetwork) / float64(blocked)
	fmt.Printf("%2.3f%% instances of goroutines blocking were to block on the network\n", p)
	// END OMIT
}
