Skip to content

Conversation

@dt
Copy link
Member

@dt dt commented Aug 9, 2025

This adds a new goroutine profile mode (debug=26257) that emits one entry per goroutine in a format identical to debug=2 except that additionally includes any pprof labels in the header, following the [state] segment, e.g:

goroutine 123 [select, 2 minutes]{"svc": "users"}:
pkg.FuncA(...)
...

The implementation of this mode is however significantly different from debug=2, and instead based on the same underlying mechanism as debug=1. Unlike the collection of debug=2 profiles, which walks all stacks while the world is stopped, the implementation of this mode uses the concurrent collection integrated with the scheduler that backs debug=1. As a result, this new mode reduces the duration of the stop-the-world phase, particularly when profiling processes with many goroutines, compared to debug=2, and demonstrated by the included benchmark:

                                              │       2        │                 26257                 │
                                              │ max_latency_ns │ max_latency_ns  vs base               │
StackCollectionLatency/goroutines=100x3-14      1013.17k ± 47%   84.06k ±   27%  -91.70% (p=0.002 n=6)
StackCollectionLatency/goroutines=100x10-14      769.23k ±  7%   80.29k ±   22%  -89.56% (p=0.002 n=6)
StackCollectionLatency/goroutines=100x50-14      2172.4k ±  9%   181.8k ±   46%  -91.63% (p=0.002 n=6)
StackCollectionLatency/goroutines=1000x3-14      7133.9k ±  3%   195.7k ±   42%  -97.26% (p=0.002 n=6)
StackCollectionLatency/goroutines=1000x10-14    11787.6k ± 48%   494.4k ±   77%  -95.81% (p=0.002 n=6)
StackCollectionLatency/goroutines=1000x50-14    20234.0k ± 87%   174.8k ±  137%  -99.14% (p=0.002 n=6)
StackCollectionLatency/goroutines=10000x3-14    68611.0k ± 49%   168.5k ± 2768%  -99.75% (p=0.002 n=6)
StackCollectionLatency/goroutines=10000x10-14    60.261M ± 95%   3.460M ±  166%  -94.26% (p=0.002 n=6)
StackCollectionLatency/goroutines=10000x50-14   284.144M ± 40%   4.672M ±   89%  -98.36% (p=0.002 n=6)
StackCollectionLatency/goroutines=25000x3-14    171.290M ± 48%   4.287M ±  394%  -97.50% (p=0.002 n=6)
StackCollectionLatency/goroutines=25000x10-14   150.827M ± 92%   6.424M ±  158%  -95.74% (p=0.002 n=6)
StackCollectionLatency/goroutines=25000x50-14   708.238M ± 34%   2.249M ±  410%  -99.68% (p=0.002 n=6)
geomean                                           25.08M         624.2k          -97.51%

This concurrent collection approach and its relaxed consistency compared to keeping the world stopped doesm mean the behavior of this new mode is not exactly identical to debug=2. Thus despite the performance benefits, the new implementation is not used for the existing debug=2 mode, but rather only for the new mode making it an opt-in alternative.

This comment was marked as outdated.

@dt dt force-pushed the crdb-1245-stk branch from 1f006c4 to abf8ca5 Compare August 9, 2025 02:19
@dt dt requested a review from Copilot August 9, 2025 02:20

This comment was marked as outdated.

@dt dt force-pushed the crdb-1245-stk branch from abf8ca5 to 2b942ff Compare August 9, 2025 02:36
@dt dt requested a review from Copilot August 9, 2025 02:37

This comment was marked as outdated.

@dt dt force-pushed the crdb-1245-stk branch from 2b942ff to 8555ffb Compare August 9, 2025 02:56
@dt dt requested a review from Copilot August 9, 2025 02:57
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a new goroutine profile mode (debug=26257) that provides detailed goroutine information including labels while using a concurrent collection approach to reduce stop-the-world latency compared to debug=2. The implementation leverages the same underlying mechanism as debug=1 for concurrent stack collection while maintaining compatibility with the debug=2 output format.

  • Adds a new debug mode (26257) that includes pprof labels in goroutine profiles with reduced STW time
  • Implements concurrent goroutine profile collection with individual goroutine metadata tracking
  • Provides comprehensive test coverage including latency benchmarks demonstrating significant performance improvements

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
src/runtime/pprof/pprof.go Adds routing for new debug mode and implements stack formatting with labels
src/runtime/mprof.go Extends concurrent profile collection to support goroutine metadata capture
src/internal/profilerecord/profilerecord.go Defines new GoroutineInfo struct for metadata storage
src/runtime/pprof/pprof_test.go Adds comprehensive tests and benchmarks for the new functionality

for i, r := range records {
goroutineID := infos[i].ID
state := "unknown"
if i < len(infos) && infos[i].State != "" {
Copy link

Copilot AI Aug 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition 'i < len(infos)' is redundant since the function parameter ensures 'infos' has the same length as 'records'. This check can be simplified to just 'infos[i].State != ""'.

Suggested change
if i < len(infos) && infos[i].State != "" {
if infos[i].State != "" {

Copilot uses AI. Check for mistakes.

// Get label information using the same format as debug=1.
var labelStr string
if i < len(labels) && labels[i] != nil {
Copy link

Copilot AI Aug 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the infos check, 'i < len(labels)' is redundant since labels has the same length as records. This can be simplified to 'labels[i] != nil'.

Suggested change
if i < len(labels) && labels[i] != nil {
if labels[i] != nil {

Copilot uses AI. Check for mistakes.
}

// Print "created by" information if available (skip main goroutine)
if i < len(infos) && goroutineID != 1 && infos[i].CreationPC != 0 {
Copy link

Copilot AI Aug 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition 'i < len(infos)' is redundant as infos has the same length as records. This can be simplified to 'goroutineID != 1 && infos[i].CreationPC != 0'.

Suggested change
if i < len(infos) && goroutineID != 1 && infos[i].CreationPC != 0 {
if goroutineID != 1 && infos[i].CreationPC != 0 {

Copilot uses AI. Check for mistakes.
@dt dt force-pushed the crdb-1245-stk branch from 8555ffb to e80d5e3 Compare August 9, 2025 19:32
…ced STW

This adds a new goroutine profile mode (debug=26257) that emits one
entry per goroutine in a format nearly identical to debug=2 that
additionally includes any pprof labels in the  header, following the [state]
segment, e.g:

    goroutine 123 [select, 2 minutes]{"svc": "users"}:

The implementation of this mode is however significantly different from
debug=2, and instead based on the same underlying mechanism as debug=1.
Unlike the collection of debug=2 profiles, which walks all stacks while
the world is stopped, the implementation of this mode uses the
concurrent collection integrated with the scheduler that backs debug=1.
As a result, this new mode reduces the duration of the stop-the-world
phase, particularly when profiling processes with many goroutines,
compared to debug=2, and demonstrated by the included benchmark:

                       │    debug=2     │          debug=26257
                       │ max_latency_ns │ max_latency_ns  vs base
goroutines=100x3-14      1013.17k ± 47%   84.06k ±   27%  -91.70% (p=0.002 n=6)
goroutines=100x10-14      769.23k ±  7%   80.29k ±   22%  -89.56% (p=0.002 n=6)
goroutines=100x50-14      2172.4k ±  9%   181.8k ±   46%  -91.63% (p=0.002 n=6)
goroutines=1000x3-14      7133.9k ±  3%   195.7k ±   42%  -97.26% (p=0.002 n=6)
goroutines=1000x10-14    11787.6k ± 48%   494.4k ±   77%  -95.81% (p=0.002 n=6)
goroutines=1000x50-14    20234.0k ± 87%   174.8k ±  137%  -99.14% (p=0.002 n=6)
goroutines=10000x3-14    68611.0k ± 49%   168.5k ± 2768%  -99.75% (p=0.002 n=6)
goroutines=10000x10-14    60.261M ± 95%   3.460M ±  166%  -94.26% (p=0.002 n=6)
goroutines=10000x50-14   284.144M ± 40%   4.672M ±   89%  -98.36% (p=0.002 n=6)
goroutines=25000x3-14    171.290M ± 48%   4.287M ±  394%  -97.50% (p=0.002 n=6)
goroutines=25000x10-14   150.827M ± 92%   6.424M ±  158%  -95.74% (p=0.002 n=6)
goroutines=25000x50-14   708.238M ± 34%   2.249M ±  410%  -99.68% (p=0.002 n=6)
geomean                    25.08M         624.2k          -97.51%

This concurrent collection approach and its relaxed consistency compared
to keeping the world stopped doesm mean the behavior of this new mode is
not exactly identical to debug=2. Additionally, currently this mode
always elides argument values when printing stacks, whereas debug=2
includes them most of the time. These behavior differences mean that
despite the performance benefits, the new implementation is not used for
the existing debug=2 mode, but rather only for the new mode making it an
opt-in alternative.
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not an expert on the Go runtime, so take that with a grain of salt, but the approach looks good to me. What I don't understand is what the plan would be for this after this merges. Ideally, we can find a team that "nominally" owns this in case it ever causes issues (Obs? DB Server?). We would also need follow-up changes in CRDB to actually cash in on this functionality, and again, there should be some ownership there.

goroutineProfile.records = p
goroutineProfile.labels = labels
goroutineProfile.infos = infos

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't you need to also update doRecordGoroutineProfile so that it updates goroutineProfile.infos?


// printGoroutineStacksWithLabels formats goroutine records in a format similar
// to runtime.Stack, but also includes labels as well.
func printGoroutineStacksWithLabels(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a smaller diff that takes the existing print code and adds an argument for it?

func writeGoroutineStacksWithLabels(w io.Writer) error {
n, ok := pprof_goroutineStacksWithLabels(nil, nil, nil)

for {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you confident in the robustness of this loop? Maybe worth to additionally put in a growth factor of (say) 2 on each iteration.

}
}

fmt.Fprintf(w, "goroutine %d [%s%s]%s:\n", goroutineID, state, blockedStr, labelStr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one side-effect of the format change is that panicparse and friends won't accept this any more. Can you think of a way to squeeze the labels in in a way that doesn't "nominally" change the format? I can't - but thought I'd ask.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we wanted to use backtrace/debug=2 format we could inject them inside the []; I previously had assumed that was strictly state[, waitSince] but looking at the 1.25 changes that add bubble info there, and the scanning states, I realized that it is expected that any number of comma separated fields can be in the [] so we could put labels there. But we can never get the args in a resume-the-world safe way to I think maybe trying to make debug=2 without STW is a fool's errand and we should focus on getting what we want (state, creator, waitSince) in debug=0/1, ala golang#74954

@dt
Copy link
Member Author

dt commented Sep 2, 2025

I've been convinced on my upstream proposal to pursue only adding a binary pprof format debug=3, that injects synthesized, per-goroutine labels with the extra runtime info. The collection is more or less identical to what I have here but the printing is different.

dt@342d90c

@dt
Copy link
Member Author

dt commented Sep 2, 2025

I made https://davidt.io/stackgazer support binary debug=0 profiles yesterday so I'm going to rework this branch to follow the suggestion on the upstream proposal to stick to only extending the binary format instead of making new text one. The core collection logic stays the same but the format function is pretty different. One question is how much I want to mess with printCountProfile which handles formatting debug0/debug1: it currently makes a map of counts, so I could just add goid to its map keys in debug=3 mode, which would work but is a bit wasteful -- the map keys are unique at that point making the whole grouping pass pointless. But the alternative is to really duplicate a bunch of its logic, or seriously rip up printCountProfile. Hmmm.

@dt
Copy link
Member Author

dt commented Oct 7, 2025

#9

@dt dt closed this Oct 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants