
Commit f0afff0

Committed by craig[bot] with ZhouXing19, xinhaoz, wenyihu6, and rickystewart
156307: sql: introduce canary stats settings r=ZhouXing19 a=ZhouXing19

Informs: #150015

This PR introduces two key configuration settings for the Canary Statistics Rollout feature. Note that this PR only introduces the settings; the core implementation of canary stats rollout will land in #156385.

### Table storage parameter `sql_stats_canary_window` (duration)

```sql
CREATE TABLE t (x int) WITH (sql_stats_canary_window = '20s')
```

This duration specifies how long newly collected statistics remain eligible for selection, alongside the most recent full statistics, by the optimizer. It is required for the canary statistics rollout feature: only tables with a non-zero canary window have canary statistics rollout enabled.

Release note (sql change): A new table storage parameter `sql_stats_canary_window` has been introduced to enable gradual rollout of newly collected table statistics. It takes a duration string as its value. When set to a non-zero duration, new statistics remain in a "canary" state for the specified duration before being promoted to stable. This allows for controlled exposure and intervention opportunities before statistics are fully deployed across all queries.

----

### Cluster setting `sql.stats.canary_fraction` (float in [0, 1])

```sql
SET CLUSTER SETTING sql.stats.canary_fraction = 0.2
```

This fraction controls the probabilistic sampling rate for queries participating in the canary statistics rollout feature. It determines what fraction of queries use "canary statistics" (newly collected stats within their canary window) versus "stable statistics" (previously proven stats). For example, a value of 0.2 means 20% of queries will test canary stats while 80% use stable stats. The selection is atomic per query: if a query is chosen for canary evaluation, it will use canary statistics for ALL tables it references (where available). A query never uses a mix of canary and stable statistics.
Since this "dice roll" happens for every non-internal query, the memo would otherwise flip frequently, negating the benefits of the query plan cache and causing performance regressions. To mitigate this, queries selected for the canary path bypass the query plan cache entirely: they neither look up existing cached memos nor invalidate them. Instead, we create a one-time memo used only for that single query execution. This approach assumes `sql.stats.canary_fraction` will be set to a small value, ensuring that canary queries remain a small fraction of total queries and minimizing the performance impact of recomputation. One exception: we don't roll the dice when preparing a statement. During statement preparation, `UseCanaryStats` is always false, so the memo cache remains enabled. The rule of thumb is: cached memos, whether in the query cache or a prepared statement, are always built with stable stats.

### Session variable `canary_stats_mode` (enum: {auto, off, on})

- `on`: All queries in the session use canary stats for planning
- `off`: All queries in the session use stable stats for planning
- `auto`: The system decides based on `sql.stats.canary_fraction` for each query execution

Release note (sql change): We introduce two new settings to control the use of canary statistics in query planning:

1. Cluster setting `sql.stats.canary_fraction` (float, range [0, 1]): Controls what fraction of queries use "canary statistics" (newly collected stats within their canary window) versus "stable statistics" (previously proven stats). For example, a value of 0.2 means 20% of queries will use canary stats while 80% use stable stats. The selection is atomic per query: if a query is chosen for canary evaluation, it uses canary statistics for ALL tables it references (where available), and it won't use the query cache. A query never uses a mix of canary and stable statistics.
2. Session variable `canary_stats_mode` (enum: {auto, off, on}, default: auto):
   - `on`: All queries in the session use canary stats for planning
   - `off`: All queries in the session use stable stats for planning
   - `auto`: The system decides based on `sql.stats.canary_fraction` for each query execution

157146: db-console: add metrics workspace to debug page r=xinhaoz a=xinhaoz

This debug page is similar to `Custom Time Series` but also allows exporting and loading custom time series dashboards.

Epic: none
Release note: None

157862: decommission: retry on errors for AllocatorCheckRange r=wenyihu6 a=wenyihu6

Fixes: #156849

Release note: The decommission pre-check could fail on transient errors; this is now fixed with a retry loop.

---

**decommission: retry on errors for AllocatorCheckRange**

Previously, the decommission pre-check would fail for a range if evalStore.AllocatorCheckRange returned an error. However, transient errors, such as throttled stores, are only expected to last about 5 seconds (FailedReservationsTimeout) and can cause the pre-check to fail. This commit adds a retry loop around AllocatorCheckRange to retry on any error. Alternatively, we could check for throttling errors specifically and retry only on throttled stores, but that would require string or error comparisons, which complicates the code. So we retry on all errors here, given this only affects the decommission pre-check.

---

**kv: add TestDecommissionPreCheckRetryThrottledStores**

Previously, we made decommission pre-checks retry on errors, since some transient issues resolve quickly and shouldn't cause the pre-check to fail. This commit adds a test that verifies the pre-check retries when it encounters transient throttled errors.
157927: roachtest: link on `large` pool r=rail a=rickystewart

Release note: none
Epic: none

Co-authored-by: ZhouXing19 <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: wenyihu6 <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
5 parents: 33ebee8 + 9f48101 + 1dd9858 + 37f7b9a + 6c13079

File tree

53 files changed (+1870, -177 lines)


docs/generated/settings/settings-for-tenants.txt

Lines changed: 1 addition & 0 deletions
```diff
@@ -361,6 +361,7 @@ sql.stats.automatic_full_collection.enabled boolean true automatic full statisti
 sql.stats.automatic_partial_collection.enabled boolean true automatic partial statistics collection mode application
 sql.stats.automatic_partial_collection.fraction_stale_rows float 0.05 target fraction of stale rows per table that will trigger a partial statistics refresh application
 sql.stats.automatic_partial_collection.min_stale_rows integer 100 target minimum number of stale rows per table that will trigger a partial statistics refresh application
+sql.stats.canary_fraction float 0 probability that table statistics will use canary mode instead of stable mode for query planning [0.0-1.0] application
 sql.stats.cleanup.recurrence string @hourly cron-tab recurrence for SQL Stats cleanup job application
 sql.stats.detailed_latency_metrics.enabled boolean false label latency metrics with the statement fingerprint. Workloads with tens of thousands of distinct query fingerprints should leave this setting false. (experimental, affects performance for workloads with high fingerprint cardinality) application
 sql.stats.error_on_concurrent_create_stats.enabled boolean false set to true to error on concurrent CREATE STATISTICS jobs, instead of skipping them application
```

docs/generated/settings/settings.html

Lines changed: 1 addition & 0 deletions
```diff
@@ -316,6 +316,7 @@
 <tr><td><div id="setting-sql-stats-automatic-partial-collection-enabled" class="anchored"><code>sql.stats.automatic_partial_collection.enabled</code></div></td><td>boolean</td><td><code>true</code></td><td>automatic partial statistics collection mode</td><td>Basic/Standard/Advanced/Self-Hosted</td></tr>
 <tr><td><div id="setting-sql-stats-automatic-partial-collection-fraction-stale-rows" class="anchored"><code>sql.stats.automatic_partial_collection.fraction_stale_rows</code></div></td><td>float</td><td><code>0.05</code></td><td>target fraction of stale rows per table that will trigger a partial statistics refresh</td><td>Basic/Standard/Advanced/Self-Hosted</td></tr>
 <tr><td><div id="setting-sql-stats-automatic-partial-collection-min-stale-rows" class="anchored"><code>sql.stats.automatic_partial_collection.min_stale_rows</code></div></td><td>integer</td><td><code>100</code></td><td>target minimum number of stale rows per table that will trigger a partial statistics refresh</td><td>Basic/Standard/Advanced/Self-Hosted</td></tr>
+<tr><td><div id="setting-sql-stats-canary-fraction" class="anchored"><code>sql.stats.canary_fraction</code></div></td><td>float</td><td><code>0</code></td><td>probability that table statistics will use canary mode instead of stable mode for query planning [0.0-1.0]</td><td>Basic/Standard/Advanced/Self-Hosted</td></tr>
 <tr><td><div id="setting-sql-stats-cleanup-recurrence" class="anchored"><code>sql.stats.cleanup.recurrence</code></div></td><td>string</td><td><code>@hourly</code></td><td>cron-tab recurrence for SQL Stats cleanup job</td><td>Basic/Standard/Advanced/Self-Hosted</td></tr>
 <tr><td><div id="setting-sql-stats-detailed-latency-metrics-enabled" class="anchored"><code>sql.stats.detailed_latency_metrics.enabled</code></div></td><td>boolean</td><td><code>false</code></td><td>label latency metrics with the statement fingerprint. Workloads with tens of thousands of distinct query fingerprints should leave this setting false. (experimental, affects performance for workloads with high fingerprint cardinality)</td><td>Basic/Standard/Advanced/Self-Hosted</td></tr>
 <tr><td><div id="setting-sql-stats-error-on-concurrent-create-stats-enabled" class="anchored"><code>sql.stats.error_on_concurrent_create_stats.enabled</code></div></td><td>boolean</td><td><code>false</code></td><td>set to true to error on concurrent CREATE STATISTICS jobs, instead of skipping them</td><td>Basic/Standard/Advanced/Self-Hosted</td></tr>
```

pkg/ccl/logictestccl/tests/3node-tenant/generated_test.go

Lines changed: 7 additions & 0 deletions

pkg/ccl/logictestccl/tests/local-read-committed/generated_test.go

Lines changed: 7 additions & 0 deletions

pkg/ccl/logictestccl/tests/local-repeatable-read/generated_test.go

Lines changed: 7 additions & 0 deletions

pkg/cli/testdata/doctor/test_recreate_zipdir

Lines changed: 1 addition & 1 deletion
```diff
@@ -14,5 +14,5 @@ SELECT crdb_internal.unsafe_upsert_namespace_entry(100, 0, 'public', 101, true);
 SELECT crdb_internal.unsafe_upsert_descriptor(102, decode('125a0a08706f73746772657310661a300a0b0a0561646d696e100218020a0d0a067075626c696310801018000a0a0a04726f6f74100218021204726f6f741803220028013a0c0a067075626c69631202086740004a005a0210007000', 'hex'), true);
 SELECT crdb_internal.unsafe_upsert_namespace_entry(0, 0, 'postgres', 102, true);
 SELECT crdb_internal.unsafe_upsert_descriptor(103, decode('2249086612067075626c6963186722310a0b0a0561646d696e100218020a0d0a067075626c696310840418000a0a0a04726f6f7410021802120561646d696e18032a00300140004a007000', 'hex'), true);
-SELECT crdb_internal.unsafe_upsert_descriptor(104, decode('0a8d030a01741868206428013a0042280a016910011a0e0801104018002a0030005014600020013000680070007800800100880100980100423c0a05726f77696410021a0e0801104018002a0030005014600020002a0e756e697175655f726f77696428293001680070007800800100880100980100480352780a06745f706b6579100118012205726f7769642a0169300240004a10080010001a00200028003000380040005a0070017a0408002000800100880100900104980101a20106080012001800a80100b20100ba0100c00100c801d88aed86ddb7c5e517d00101e00100e9010000000000000000f20100f8010060026a210a0b0a0561646d696e100218020a0a0a04726f6f74100218021204726f6f741803800101880103980100b2011b0a077072696d61727910001a01691a05726f776964200120022801b80101c20100e80100f2010408001200f801008002009202009a0200b20200b80200c00265c80200e00200800300880302a80300b00300d00300d80300e00300f80300880400980400a00400a80400b00400', 'hex'), true);
+SELECT crdb_internal.unsafe_upsert_descriptor(104, decode('0a90030a01741868206428013a0042280a016910011a0e0801104018002a0030005014600020013000680070007800800100880100980100423c0a05726f77696410021a0e0801104018002a0030005014600020002a0e756e697175655f726f77696428293001680070007800800100880100980100480352780a06745f706b6579100118012205726f7769642a0169300240004a10080010001a00200028003000380040005a0070017a0408002000800100880100900104980101a20106080012001800a80100b20100ba0100c00100c801d88aed86ddb7c5e517d00101e00100e9010000000000000000f20100f8010060026a210a0b0a0561646d696e100218020a0a0a04726f6f74100218021204726f6f741803800101880103980100b2011b0a077072696d61727910001a01691a05726f776964200120022801b80101c20100e80100f2010408001200f801008002009202009a0200b20200b80200c00265c80200e00200800300880302a80300b00300d00300d80300e00300f80300880400980400a00400a80400b00400b80400', 'hex'), true);
 COMMIT;
```
