Skip to content

Commit c6d599f

Browse files
committed
sql: add plan gist matching to stmt diagnostics feature
This commit extends the stmt diagnostics feature to add optional plan-gist-based matching. Previously, we filtered statements based only on the fingerprint but now we can optionally ask for a particular plan (by specifying the target plan gist). All other aspects of the feature (minimum execution latency, sampling probability) are unaffected. The caveat to the implementation is that the plan gist of the running statement is available after the optimizer has done its part, so whenever plan-gist-based matching is desired, the trace will not include the optimizer part as well as the plan string won't be available. This commit also made a minor change to always store the memo and the opt planning catalog in `planTop`. Previously, this was stored only when the bundle collection is enabled, but we now can enable it after the optimizer, at which point the memo and the catalog might be lost. The optimizer now stores it unconditionally, but then if we choose to not collect the bundle once the plan gist is available, we release these things. This allows us to still get `opt` files in the bundle. Release note (sql change): Statement diagnostics feature has been extended to support collecting a bundle for a particular plan. Namely, the existing fingerprint-based matching has been extended to also include plan-gist-based matching. Such bundle will miss a couple of things: `plan.txt` file as well as the tracing of the optimizer. At the moment, the feature is only exposed via an overload to `crdb_internal.request_statement_bundle` builtin function. We now also support "anti-match" - i.e. collecting a bundle for any plan other than the provided plan gist.
1 parent 24d8278 commit c6d599f

33 files changed

+1034
-386
lines changed

docs/generated/http/full.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4388,6 +4388,8 @@ Support status: [reserved](#support-status)
43884388
| min_execution_latency | [google.protobuf.Duration](#cockroach.server.serverpb.CreateStatementDiagnosticsReportRequest-google.protobuf.Duration) | | MinExecutionLatency, when non-zero, indicates the minimum execution latency of a query for which to collect the diagnostics report. In other words, if a query executes faster than this threshold, then the diagnostics report is not collected on it, and we will try to get a bundle the next time we see the query fingerprint.<br><br>NB: if MinExecutionLatency is non-zero, then all queries that match the fingerprint will be traced until a slow enough query comes along. This tracing might have some performance overhead. | [reserved](#support-status) |
43894389
| expires_after | [google.protobuf.Duration](#cockroach.server.serverpb.CreateStatementDiagnosticsReportRequest-google.protobuf.Duration) | | ExpiresAfter, when non-zero, sets the expiration interval of this request. | [reserved](#support-status) |
43904390
| sampling_probability | [double](#cockroach.server.serverpb.CreateStatementDiagnosticsReportRequest-double) | | SamplingProbability controls how likely we are to try and collect a diagnostics report for a given execution. The semantics with MinExecutionLatency are worth noting (and perhaps simplifying?): - If SamplingProbability is zero, we're always sampling. This is for compatibility with pre-22.2 versions where this parameter was not available. - If SamplingProbability is non-zero, MinExecutionLatency must be non-zero. We'll sample stmt executions with the given probability until: (a) we capture one that exceeds MinExecutionLatency, or (b) we hit the ExpiresAfter point.<br><br>SamplingProbability lets users control at a per-stmt granularity how much collection overhead is acceptable to try an capture an outlier execution for further analysis (are high p99.9s due to latch waits? racing with split transfers?). A high sampling rate can capture a trace sooner, but the added overhead may also cause the trace to be non-representative if the tracing overhead across all requests is causing resource saturation (network, memory) and resulting in slowdown.<br><br>TODO(irfansharif): Wire this up to the UI code. When selecting the latency threshold, we should want to force specifying a sampling probability.<br><br>TODO(irfansharif): We could do better than a hard-coded default value for probability (100% could be too high-overhead so probably not the right one). Strawman: could consider the recent request rate for the fingerprint (say averaged over the last 10m? 30m?), consider what %-ile the latency target we're looking to capture is under, and suggest a sampling probability that gets you at least one trace in the next T seconds with 95% likelihood? Or provide a hint for how long T is for the currently chosen sampling probability. | [reserved](#support-status) |
4391+
| plan_gist | [string](#cockroach.server.serverpb.CreateStatementDiagnosticsReportRequest-string) | | PlanGist, when set, indicates a particular plan that we want collect diagnostics for. This can be useful when a single fingerprint can result in multiple plans.<br><br>There is a caveat to using this filtering: since the plan gist for a running query is only available after the optimizer has done its part, the trace will only include things after the optimizer is done. | [reserved](#support-status) |
4392+
| anti_plan_gist | [bool](#cockroach.server.serverpb.CreateStatementDiagnosticsReportRequest-bool) | | AntiPlanGist, when set, indicates that any plan not matching PlanGist will do. | [reserved](#support-status) |
43914393

43924394

43934395

docs/generated/settings/settings-for-tenants.txt

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -304,4 +304,4 @@ trace.opentelemetry.collector string address of an OpenTelemetry trace collecto
304304
trace.snapshot.rate duration 0s if non-zero, interval at which background trace snapshots are captured tenant-rw
305305
trace.span_registry.enabled boolean true if set, ongoing traces can be seen at https://<ui>/#/debug/tracez tenant-rw
306306
trace.zipkin.collector string the address of a Zipkin instance to receive traces, as <host>:<port>. If no port is specified, 9411 will be used. tenant-rw
307-
version version 1000023.1-16 set the active cluster version in the format '<major>.<minor>' tenant-rw
307+
version version 1000023.1-18 set the active cluster version in the format '<major>.<minor>' tenant-rw

docs/generated/settings/settings.html

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -260,6 +260,6 @@
260260
<tr><td><div id="setting-trace-span-registry-enabled" class="anchored"><code>trace.span_registry.enabled</code></div></td><td>boolean</td><td><code>true</code></td><td>if set, ongoing traces can be seen at https://&lt;ui&gt;/#/debug/tracez</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
261261
<tr><td><div id="setting-trace-zipkin-collector" class="anchored"><code>trace.zipkin.collector</code></div></td><td>string</td><td><code></code></td><td>the address of a Zipkin instance to receive traces, as &lt;host&gt;:&lt;port&gt;. If no port is specified, 9411 will be used.</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
262262
<tr><td><div id="setting-ui-display-timezone" class="anchored"><code>ui.display_timezone</code></div></td><td>enumeration</td><td><code>etc/utc</code></td><td>the timezone used to format timestamps in the ui [etc/utc = 0, america/new_york = 1]</td><td>Dedicated/Self-Hosted</td></tr>
263-
<tr><td><div id="setting-version" class="anchored"><code>version</code></div></td><td>version</td><td><code>1000023.1-16</code></td><td>set the active cluster version in the format &#39;&lt;major&gt;.&lt;minor&gt;&#39;</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
263+
<tr><td><div id="setting-version" class="anchored"><code>version</code></div></td><td>version</td><td><code>1000023.1-18</code></td><td>set the active cluster version in the format &#39;&lt;major&gt;.&lt;minor&gt;&#39;</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
264264
</tbody>
265265
</table>

pkg/cli/clisqlclient/statement_diag.go

Lines changed: 52 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,7 @@ package clisqlclient
1313
import (
1414
"context"
1515
"database/sql/driver"
16+
"fmt"
1617
"io"
1718
"os"
1819
"strconv"
@@ -101,8 +102,13 @@ func stmtDiagListBundlesInternal(ctx context.Context, conn Conn) ([]StmtDiagBund
101102
type StmtDiagActivationRequest struct {
102103
ID int64
103104
// Statement is the SQL statement fingerprint.
104-
Statement string
105-
RequestedAt time.Time
105+
Statement string
106+
// If empty then any plan will do.
107+
PlanGist string
108+
// If true and PlanGist is not empty, then any plan not matching the gist
109+
// will do.
110+
AntiPlanGist bool
111+
RequestedAt time.Time
106112
// Zero value indicates that there is no sampling probability set on the
107113
// request.
108114
SamplingProbability float64
@@ -126,9 +132,38 @@ func StmtDiagListOutstandingRequests(
126132
return result, nil
127133
}
128134

135+
func isAtLeast23dot2ClusterVersion(ctx context.Context, conn Conn) (bool, error) {
136+
// Check whether the upgrade to add the plan_gist and anti_plan_gist columns
137+
// to the statement_diagnostics_requests system table has already been run.
138+
row, err := conn.QueryRow(ctx, `
139+
SELECT
140+
count(*)
141+
FROM
142+
[SHOW COLUMNS FROM system.statement_diagnostics_requests]
143+
WHERE
144+
column_name = 'plan_gist';`)
145+
if err != nil {
146+
return false, err
147+
}
148+
c, ok := row[0].(int64)
149+
if !ok {
150+
return false, nil
151+
}
152+
return c == 1, nil
153+
}
154+
129155
func stmtDiagListOutstandingRequestsInternal(
130156
ctx context.Context, conn Conn,
131157
) ([]StmtDiagActivationRequest, error) {
158+
var extraColumns string
159+
atLeast23dot2, err := isAtLeast23dot2ClusterVersion(ctx, conn)
160+
if err != nil {
161+
return nil, err
162+
}
163+
if atLeast23dot2 {
164+
extraColumns = ", plan_gist, anti_plan_gist"
165+
}
166+
132167
// Converting an INTERVAL to a number of milliseconds within that interval
133168
// is a pain - we extract the number of seconds and multiply it by 1000,
134169
// then we extract the number of milliseconds and add that up to the
@@ -138,16 +173,16 @@ func stmtDiagListOutstandingRequestsInternal(
138173
EXTRACT(millisecond FROM min_execution_latency)::INT8 -
139174
EXTRACT(second FROM min_execution_latency)::INT8 * 1000`
140175
rows, err := conn.Query(ctx,
141-
"SELECT id, statement_fingerprint, requested_at, "+getMilliseconds+`, expires_at, sampling_probability
176+
fmt.Sprintf("SELECT id, statement_fingerprint, requested_at, "+getMilliseconds+`, expires_at, sampling_probability%s
142177
FROM system.statement_diagnostics_requests
143178
WHERE NOT completed
144-
ORDER BY requested_at DESC`,
179+
ORDER BY requested_at DESC`, extraColumns),
145180
)
146181
if err != nil {
147182
return nil, err
148183
}
149184
var result []StmtDiagActivationRequest
150-
vals := make([]driver.Value, 6)
185+
vals := make([]driver.Value, 8)
151186
for {
152187
if err := rows.Next(vals); err == io.EOF {
153188
break
@@ -157,6 +192,8 @@ func stmtDiagListOutstandingRequestsInternal(
157192
var minExecutionLatency time.Duration
158193
var expiresAt time.Time
159194
var samplingProbability float64
195+
var planGist string
196+
var antiPlanGist bool
160197

161198
if ms, ok := vals[3].(int64); ok {
162199
minExecutionLatency = time.Millisecond * time.Duration(ms)
@@ -167,9 +204,19 @@ func stmtDiagListOutstandingRequestsInternal(
167204
if sp, ok := vals[5].(float64); ok {
168205
samplingProbability = sp
169206
}
207+
if atLeast23dot2 {
208+
if gist, ok := vals[6].(string); ok {
209+
planGist = gist
210+
}
211+
if antiGist, ok := vals[7].(bool); ok {
212+
antiPlanGist = antiGist
213+
}
214+
}
170215
info := StmtDiagActivationRequest{
171216
ID: vals[0].(int64),
172217
Statement: vals[1].(string),
218+
PlanGist: planGist,
219+
AntiPlanGist: antiPlanGist,
173220
RequestedAt: vals[2].(time.Time),
174221
SamplingProbability: samplingProbability,
175222
MinExecutionLatency: minExecutionLatency,

pkg/cli/statement_diag.go

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -85,7 +85,7 @@ func runStmtDiagList(cmd *cobra.Command, args []string) (resErr error) {
8585
} else {
8686
fmt.Printf("Outstanding activation requests:\n")
8787
w := tabwriter.NewWriter(&buf, 4, 0, 2, ' ', 0)
88-
fmt.Fprint(w, " ID\tActivation time\tStatement\tSampling probability\tMin execution latency\tExpires at\n")
88+
fmt.Fprint(w, " ID\tActivation time\tStatement\tPlan gist\tAnti plan gist\tSampling probability\tMin execution latency\tExpires at\n")
8989
for _, r := range reqs {
9090
minExecLatency := "N/A"
9191
if r.MinExecutionLatency != 0 {
@@ -102,8 +102,8 @@ func runStmtDiagList(cmd *cobra.Command, args []string) (resErr error) {
102102
samplingProbability = fmt.Sprintf("%0.4f", r.SamplingProbability)
103103
}
104104
fmt.Fprintf(
105-
w, " %d\t%s\t%s\t%s\t%s\t%s\n",
106-
r.ID, r.RequestedAt.UTC().Format(timeFmt), r.Statement, samplingProbability, minExecLatency, expiresAt,
105+
w, " %d\t%s\t%s\t%s\t%t\t%s\t%s\t%s\n",
106+
r.ID, r.RequestedAt.UTC().Format(timeFmt), r.Statement, r.PlanGist, r.AntiPlanGist, samplingProbability, minExecLatency, expiresAt,
107107
)
108108
}
109109
_ = w.Flush()

pkg/cli/statement_diag_test.go

Lines changed: 24 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -31,13 +31,15 @@ func Example_statement_diag() {
3131
(20, 'SELECT _ FROM _ WHERE _ > _', 'SELECT a FROM t WHERE b > 1', '2010-01-02 03:04:06', ARRAY[1001,1002,1003]),
3232
(30, 'SELECT _ FROM _ WHERE _ > _', 'SELECT a FROM t WHERE b > 1', '2010-01-02 03:04:07', ARRAY[1001])`,
3333

34-
`INSERT INTO system.statement_diagnostics_requests(id, completed, statement_fingerprint, statement_diagnostics_id, requested_at, sampling_probability, min_execution_latency, expires_at)
35-
VALUES (1, TRUE, 'SELECT _ FROM _', 10, '2010-01-02 03:04:00', NULL, NULL, NULL),
36-
(2, TRUE, 'SELECT _ FROM _ WHERE _ > _', 20, '2010-01-02 03:04:02', 0.5, NULL, NULL),
37-
(3, TRUE, 'SELECT _ FROM _ WHERE _ > _', 30, '2010-01-02 03:04:05', 1.0, NULL, NULL),
38-
(4, FALSE, 'SELECT _ + _', NULL, '2010-01-02 03:04:10', 0.8, '1d 2h 3m 4s 5ms 6us', NULL),
39-
(5, FALSE, 'SELECT _ - _', NULL, '2010-01-02 03:04:11', 1.0, NULL, '2030-01-02 03:04:12'),
40-
(6, FALSE, 'SELECT _ / _', NULL, '2010-01-02 03:04:12', NULL, '0s', NULL)`,
34+
`INSERT INTO system.statement_diagnostics_requests(
35+
id, completed, statement_fingerprint, plan_gist, anti_plan_gist, statement_diagnostics_id,
36+
requested_at, sampling_probability, min_execution_latency, expires_at)
37+
VALUES (1, TRUE, 'SELECT _ FROM _', '', NULL, 10, '2010-01-02 03:04:00', NULL, NULL, NULL),
38+
(2, TRUE, 'SELECT _ FROM _ WHERE _ > _', '', NULL, 20, '2010-01-02 03:04:02', 0.5, NULL, NULL),
39+
(3, TRUE, 'SELECT _ FROM _ WHERE _ > _', '', NULL, 30, '2010-01-02 03:04:05', 1.0, NULL, NULL),
40+
(4, FALSE, 'SELECT _ + _', '', NULL, NULL, '2010-01-02 03:04:10', 0.8, '1d 2h 3m 4s 5ms 6us', NULL),
41+
(5, FALSE, 'SELECT _ - _', 'foo', false, NULL, '2010-01-02 03:04:11', 1.0, NULL, '2030-01-02 03:04:12'),
42+
(6, FALSE, 'SELECT _ / _', 'bar', true, NULL, '2010-01-02 03:04:12', NULL, '0s', NULL)`,
4143
}
4244

4345
for _, cmd := range commands {
@@ -94,10 +96,10 @@ func Example_statement_diag() {
9496
// 10 2010-01-02 03:04:05 UTC SELECT _ FROM _
9597
//
9698
// Outstanding activation requests:
97-
// ID Activation time Statement Sampling probability Min execution latency Expires at
98-
// 6 2010-01-02 03:04:12 UTC SELECT _ / _ 1.0000 N/A never
99-
// 5 2010-01-02 03:04:11 UTC SELECT _ - _ 1.0000 N/A 2030-01-02 03:04:12 +0000 UTC
100-
// 4 2010-01-02 03:04:10 UTC SELECT _ + _ 0.8000 26h3m4.005s never
99+
// ID Activation time Statement Plan gist Anti plan gist Sampling probability Min execution latency Expires at
100+
// 6 2010-01-02 03:04:12 UTC SELECT _ / _ bar true 1.0000 N/A never
101+
// 5 2010-01-02 03:04:11 UTC SELECT _ - _ foo false 1.0000 N/A 2030-01-02 03:04:12 +0000 UTC
102+
// 4 2010-01-02 03:04:10 UTC SELECT _ + _ false 0.8000 26h3m4.005s never
101103
// statement-diag download 13
102104
// ERROR: failed to download statement diagnostics bundle 13 to 'stmt-bundle-13.zip': no statement diagnostics bundle with ID 13
103105
// statement-diag download 20 tempfile.zip
@@ -120,18 +122,18 @@ func Example_statement_diag() {
120122
// 20 2010-01-02 03:04:06 UTC SELECT _ FROM _ WHERE _ > _
121123
//
122124
// Outstanding activation requests:
123-
// ID Activation time Statement Sampling probability Min execution latency Expires at
124-
// 6 2010-01-02 03:04:12 UTC SELECT _ / _ 1.0000 N/A never
125-
// 5 2010-01-02 03:04:11 UTC SELECT _ - _ 1.0000 N/A 2030-01-02 03:04:12 +0000 UTC
126-
// 4 2010-01-02 03:04:10 UTC SELECT _ + _ 0.8000 26h3m4.005s never
125+
// ID Activation time Statement Plan gist Anti plan gist Sampling probability Min execution latency Expires at
126+
// 6 2010-01-02 03:04:12 UTC SELECT _ / _ bar true 1.0000 N/A never
127+
// 5 2010-01-02 03:04:11 UTC SELECT _ - _ foo false 1.0000 N/A 2030-01-02 03:04:12 +0000 UTC
128+
// 4 2010-01-02 03:04:10 UTC SELECT _ + _ false 0.8000 26h3m4.005s never
127129
// statement-diag delete --all
128130
// statement-diag list
129131
// No statement diagnostics bundles available.
130132
// Outstanding activation requests:
131-
// ID Activation time Statement Sampling probability Min execution latency Expires at
132-
// 6 2010-01-02 03:04:12 UTC SELECT _ / _ 1.0000 N/A never
133-
// 5 2010-01-02 03:04:11 UTC SELECT _ - _ 1.0000 N/A 2030-01-02 03:04:12 +0000 UTC
134-
// 4 2010-01-02 03:04:10 UTC SELECT _ + _ 0.8000 26h3m4.005s never
133+
// ID Activation time Statement Plan gist Anti plan gist Sampling probability Min execution latency Expires at
134+
// 6 2010-01-02 03:04:12 UTC SELECT _ / _ bar true 1.0000 N/A never
135+
// 5 2010-01-02 03:04:11 UTC SELECT _ - _ foo false 1.0000 N/A 2030-01-02 03:04:12 +0000 UTC
136+
// 4 2010-01-02 03:04:10 UTC SELECT _ + _ false 0.8000 26h3m4.005s never
135137
// statement-diag cancel xx
136138
// ERROR: invalid ID
137139
// statement-diag cancel 5 6
@@ -142,9 +144,9 @@ func Example_statement_diag() {
142144
// statement-diag list
143145
// No statement diagnostics bundles available.
144146
// Outstanding activation requests:
145-
// ID Activation time Statement Sampling probability Min execution latency Expires at
146-
// 6 2010-01-02 03:04:12 UTC SELECT _ / _ 1.0000 N/A never
147-
// 5 2010-01-02 03:04:11 UTC SELECT _ - _ 1.0000 N/A 2030-01-02 03:04:12 +0000 UTC
147+
// ID Activation time Statement Plan gist Anti plan gist Sampling probability Min execution latency Expires at
148+
// 6 2010-01-02 03:04:12 UTC SELECT _ / _ bar true 1.0000 N/A never
149+
// 5 2010-01-02 03:04:11 UTC SELECT _ - _ foo false 1.0000 N/A 2030-01-02 03:04:12 +0000 UTC
148150
// statement-diag cancel 123
149151
// ERROR: no outstanding activation request with ID 123
150152
// statement-diag cancel --all

pkg/cli/zip_table_registry.go

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1232,6 +1232,8 @@ var zipSystemTables = DebugZipTableRegistry{
12321232
"min_execution_latency",
12331233
"expires_at",
12341234
"sampling_probability",
1235+
"plan_gist",
1236+
"anti_plan_gist",
12351237
},
12361238
},
12371239
"system.table_statistics": {

pkg/clusterversion/cockroach_versions.go

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -553,6 +553,10 @@ const (
553553
// FormatVirtualSSTables, allowing use of virtual sstables in Pebble.
554554
V23_2_PebbleFormatVirtualSSTables
555555

556+
// V23_2_StmtDiagForPlanGist enables statement diagnostic feature to collect
557+
// the bundle for particular plan gist.
558+
V23_2_StmtDiagForPlanGist
559+
556560
// *************************************************
557561
// Step (1) Add new versions here.
558562
// Do not add new versions to a patch release.
@@ -962,6 +966,10 @@ var rawVersionsSingleton = keyedVersions{
962966
Key: V23_2_PebbleFormatVirtualSSTables,
963967
Version: roachpb.Version{Major: 23, Minor: 1, Internal: 16},
964968
},
969+
{
970+
Key: V23_2_StmtDiagForPlanGist,
971+
Version: roachpb.Version{Major: 23, Minor: 1, Internal: 18},
972+
},
965973

966974
// *************************************************
967975
// Step (2): Add new versions here.

pkg/server/serverpb/status.proto

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1800,6 +1800,17 @@ message CreateStatementDiagnosticsReportRequest {
18001800
// likelihood? Or provide a hint for how long T is for the currently chosen
18011801
// sampling probability.
18021802
double sampling_probability = 4;
1803+
// PlanGist, when set, indicates a particular plan that we want collect
1804+
// diagnostics for. This can be useful when a single fingerprint can result in
1805+
// multiple plans.
1806+
//
1807+
// There is a caveat to using this filtering: since the plan gist for a
1808+
// running query is only available after the optimizer has done its part, the
1809+
// trace will only include things after the optimizer is done.
1810+
string plan_gist = 5;
1811+
// AntiPlanGist, when set, indicates that any plan not matching PlanGist will
1812+
// do.
1813+
bool anti_plan_gist = 6;
18031814
}
18041815

18051816
message CreateStatementDiagnosticsReportResponse {

0 commit comments

Comments
 (0)