148753: changefeedccl: gate Kafka v2 message too large error detail behind cluster setting r=asg0451 a=elizaMkraule
A recent change added detailed logging for Kafka v2 changefeed
messages that exceed the broker's size limit. These logs now
include the message key, size, and MVCC timestamp to aid in
debugging.
To make this safe for backporting, the behavior is now gated
behind the cluster setting:
changefeed.kafka_v2_error_details.enabled
In the main branch, this setting defaults to true to preserve
the enhanced observability.
In release branch backports, it will default to false.
When enabled, the log will include:
- The key of the offending message
- Combined key + value size
- MVCC timestamp
When disabled, the log reverts to the previous, minimal format.
Related to:
Jira issue: [CRDB-49646](https://cockroachlabs.atlassian.net/browse/CRDB-49646)
See also #144994
Release note (general change): Kafka v2 changefeed sinks now
support a cluster setting that enables detailed error logging
for messages exceeding the Kafka message size limit.
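The gating described above can be sketched as follows. This is a minimal, hypothetical illustration, not CockroachDB's actual code: the identifiers (`tooLargeError`, `includeErrorDetails`, `maxMessageBytes`) and the exact log format are invented for the example; only the behavior (detail included when the setting is on, minimal message when off) comes from the PR description.

```go
package main

import "fmt"

// maxMessageBytes stands in for the broker's configured size limit;
// the real changefeed reads it from the Kafka client configuration.
const maxMessageBytes = 1 << 20

// includeErrorDetails mirrors the cluster setting
// changefeed.kafka_v2_error_details.enabled.
var includeErrorDetails = true

// tooLargeError builds the "message too large" error, including the
// key, combined key+value size, and MVCC timestamp only when the
// setting is enabled.
func tooLargeError(key []byte, size int, mvcc string) error {
	if includeErrorDetails {
		return fmt.Errorf("message too large: key=%s size=%d mvcc=%s (max %d)",
			key, size, mvcc, maxMessageBytes)
	}
	// Minimal format used when the setting is disabled (e.g. backports).
	return fmt.Errorf("message too large")
}

func main() {
	fmt.Println(tooLargeError([]byte("k1"), 2<<20, "1712000000.000000001,0"))
	includeErrorDetails = false
	fmt.Println(tooLargeError([]byte("k1"), 2<<20, "1712000000.000000001,0"))
}
```

Gating the extra detail behind a single boolean keeps the backport diff small: release branches flip only the default value, not the logging code path.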
149538: sql: don't throw errors for skipped auto stats jobs r=mw5h a=mw5h
Previously, auto stats jobs would throw errors and increase failed jobs
counters if they attempted to start while a stats collection was already
in progress on the table. For large clusters with
'sql.stats.automatic_job_check_before_creating_job.enabled' set to true,
this could create quite a few failed jobs. These failed jobs don't seem
to cause any performance issues, but they clutter logs, potentially
obscuring real problems and alarming customers, who then file tickets
with support to figure out why their jobs are failing.
This patch:
* refactors the autostats checks to reduce code duplication.
* swallows the error for concurrent auto stats creation, logging at
INFO level instead.
* changes the create stats jobs test so that it no longer expects these
  job creations to fail and instead expects the stats not to be
  collected.
* fixes a bug in the create stats jobs test that would cause it to hang
instead of exiting on error.
* adds a cluster setting,
sql.stats.error_on_concurrent_create_stats.enabled, which controls
this new behavior. By default the old behavior is maintained.
Fixes: #148413
Release note (ops change): CockroachDB now has a cluster setting,
sql.stats.error_on_concurrent_create_stats.enabled, which controls how
it reacts to concurrent auto stats jobs. The default, true, maintains
the previous behavior. Setting this to false causes concurrent
auto stats jobs to be skipped with only a log entry and without
incrementing error counters.
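The skip-instead-of-error behavior can be sketched like this. This is an illustrative stand-in, not the actual patch: `maybeStartAutoStats`, `errConcurrentCreateStats`, and `errorOnConcurrent` are invented names; only the control flow (swallow the concurrency error and log at INFO when the setting is false) reflects the PR description.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errConcurrentCreateStats stands in for the error returned when a
// stats collection is already running on the table.
var errConcurrentCreateStats = errors.New("another CREATE STATISTICS job is already running")

// errorOnConcurrent mirrors
// sql.stats.error_on_concurrent_create_stats.enabled
// (default true preserves the old behavior).
var errorOnConcurrent = true

// maybeStartAutoStats attempts to start an auto stats job. When the
// setting is disabled, a concurrency conflict is swallowed and logged
// at INFO level instead of surfacing as a failed job.
func maybeStartAutoStats(tryStart func() error) error {
	err := tryStart()
	if errors.Is(err, errConcurrentCreateStats) && !errorOnConcurrent {
		log.Printf("skipping auto stats collection: %v", err)
		return nil
	}
	return err
}

func main() {
	busy := func() error { return errConcurrentCreateStats }
	fmt.Println(maybeStartAutoStats(busy) != nil) // old behavior: error surfaces
	errorOnConcurrent = false
	fmt.Println(maybeStartAutoStats(busy) != nil) // new behavior: skipped, nil
}
```

Because the error never reaches the job system in the skip path, the failed-jobs counters stay flat and the logs carry only an INFO entry.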
149699: asim: update range usage info and store capacity r=tbg a=wenyihu6
**asim: remove unused range usage info**
This commit removes the unused field RangeUsageInfo
from TransferLeaseOp.
Epic: none
Release note: none
---
**asim: support request_cpu_per_access and raft_cpu_per_write**
This commit adds support for the request_cpu_per_access and raft_cpu_per_write
options in the gen_load command. It only adds the options to the data-driven
framework and workload generator; no changes have been made to LoadEvent yet,
so they currently have no effect on range usage or applied load.
Future commits will implement the actual impact.
Epic: none
Release note: none
---
**asim: add impact from request_cpu_per_access and raft_cpu_per_write**
Previously, request_cpu_per_access and raft_cpu_per_write were added to the
workload generator in data-driven tests, but they had no actual effect on the
cluster yet. This commit makes them take effect by applying the impact from
LoadEvent, including CPUPerSecond, to store capacity and by recording range
load stats.
Epic: none
Release note: none
---
**asim: add store capacity cpu stats to storemetrics**
Previously, store capacity CPU was populated but had no corresponding stats
in StoreMetrics. This commit adds those stats to StoreMetrics.
Epic: none
Release note: none
---
**asim: removes redundant size assignment**
This commit removes a redundant size assignment for rangeInfo in
LoadRangeInfo, since the caller already populates rangesInfo.
Epic: none
Release note: none
---
**asim: add a comment for RangeUsageInfo.WritesPerSecond**
This commit adds a comment clarifying a nuance of WritesPerSecond in
RangeUsageInfo: despite its name, it is the sum of writes rather than a rate.
It is currently unused outside of two unit tests, TestWorkloadApply and
TestCapacityOverride, both of which abuse the field to verify that writes
reach the replicas and that the total matches expectations. Since it is
tricky to assert on an exact per-second rate, we leave it as is for now,
but it should be fixed later.
Epic: none
Release note: none
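The sum-versus-rate pitfall the comment documents can be shown in a few lines. This is a toy illustration, not asim's actual types: the struct and function names are invented; it only demonstrates why a field named like a rate that accumulates a sum is easy to misread.

```go
package main

import "fmt"

// writeStats illustrates the naming pitfall: the field accumulates a
// running sum of writes, while a true per-second rate would divide a
// delta by the elapsed interval.
type writeStats struct {
	writesPerSecond int64 // misnomer: actually the running sum of writes
}

func (w *writeStats) record(n int64) { w.writesPerSecond += n }

// ratePerSec is what the name promises but the field does not hold.
func ratePerSec(deltaWrites int64, elapsedSec float64) float64 {
	return float64(deltaWrites) / elapsedSec
}

func main() {
	var w writeStats
	for i := 0; i < 10; i++ {
		w.record(3) // 3 writes per tick, 10 ticks over 10 seconds
	}
	fmt.Println(w.writesPerSecond)                 // sum of writes: 30
	fmt.Println(ratePerSec(w.writesPerSecond, 10)) // true rate: 3
}
```

Asserting on the sum (as the two tests do) is deterministic, while asserting on a rate depends on simulated elapsed time, which is why the field is left as-is for now.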
---
**asim: account for follower replica load**
Previously, asim only accounted for load on the leaseholder, ignoring
non-leaseholder replicas. This commit updates it to consider all replicas for
RangeUsageInfo and store capacity aggregation. RangeUsageInfo handles
leaseholder checks and clears request CPU and QPS stats for non-leaseholder
replicas.
Epic: none
Release note: none
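The leaseholder check described above can be sketched as follows. This is a simplified stand-in, not asim's actual RangeUsageInfo: the field and function names are illustrative; it only captures the rule that request-driven stats (request CPU, QPS) belong to the leaseholder, while replication load (raft CPU, write bytes) applies to every replica.

```go
package main

import "fmt"

// rangeUsageInfo is a simplified stand-in for asim's RangeUsageInfo.
type rangeUsageInfo struct {
	RequestCPUPerSec float64 // client-request work; leaseholder-only
	RaftCPUPerSec    float64 // replication work; every replica
	QueriesPerSec    float64 // leaseholder-only
	WriteBytesPerSec float64 // every replica
}

// usageForReplica returns the load a replica contributes to its store's
// capacity. Non-leaseholder replicas see only replication load, so the
// request-driven stats are cleared.
func usageForReplica(u rangeUsageInfo, isLeaseholder bool) rangeUsageInfo {
	if !isLeaseholder {
		u.RequestCPUPerSec = 0
		u.QueriesPerSec = 0
	}
	return u
}

func main() {
	u := rangeUsageInfo{RequestCPUPerSec: 5, RaftCPUPerSec: 2, QueriesPerSec: 100, WriteBytesPerSec: 4096}
	fmt.Printf("leaseholder: %+v\n", usageForReplica(u, true))
	fmt.Printf("follower:    %+v\n", usageForReplica(u, false))
}
```

Aggregating this per-replica usage into store capacity is what lets the simulator account for followers instead of attributing all range load to the leaseholder's store.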
---
**asim: add range info String method**
This commit adds a String method to RangeInfo.
Epic: none
Release note: none
---
**asim: add write bytes per sec to capacity & range usage**
Previously, we added WriteBytesPerSecond to roachpb.StoreCapacity. This commit
plumbs it through store capacity aggregation, range load usage, and
StoreMetrics. MMA will later use it to track write bandwidth usage across stores.
Epic: none
Release note: none
---
**asim: support node_cpu_rate_capacity with gen_cluster**
Previously, we added roachpb.NodeCapacity for MMA to compute resource
utilization across stores. This commit integrates it into the asim setup,
enabling gen_cluster to use the node_cpu_rate_capacity option. Note that no
functions currently access node capacity; future MMA integration commits will
utilize it.
Epic: none
Release note: none
---
**asim: add comments for datadriven**
This commit updates comments for a few options we added recently.
Epic: none
Release note: none
149717: logictest: add back assertion that was rewritten accidentally r=rafiss a=rafiss
ee263e2 rewrote this test so that it expects no spanconfig. This was likely a mistake caused by rewriting before retrying for long enough.
This patch adds back the assertion, and adds another one that should prevent accidental rewrites.
Fixes: #148603
Release note: None
Co-authored-by: Eliza Kraule <[email protected]>
Co-authored-by: Matt White <[email protected]>
Co-authored-by: wenyihu6 <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
docs/generated/settings/settings-for-tenants.txt (+2 lines: 2 additions & 0 deletions)
@@ -18,6 +18,7 @@ changefeed.event_consumer_worker_queue_size integer 16 if changefeed.event_consu
 changefeed.event_consumer_workers integer 0 the number of workers to use when processing events: <0 disables, 0 assigns a reasonable default, >0 assigns the setting value. for experimental/core changefeeds and changefeeds using parquet format, this is disabled application
 changefeed.fast_gzip.enabled boolean true use fast gzip implementation application
 changefeed.span_checkpoint.lag_threshold (alias: changefeed.frontier_highwater_lag_checkpoint_threshold) duration 10m0s the amount of time a changefeed's lagging (slowest) spans must lag behind its leading (fastest) spans before a span-level checkpoint to save leading span progress is written; if 0, span-level checkpoints due to lagging spans is disabled application
+changefeed.kafka_v2_error_details.enabled boolean true if enabled, Kafka v2 sinks will include the message key, size, and MVCC timestamp in message too large errors application
 changefeed.memory.per_changefeed_limit byte size 512 MiB controls amount of data that can be buffered per changefeed application
 changefeed.resolved_timestamp.min_update_interval (alias: changefeed.min_highwater_advance) duration 0s minimum amount of time that must have elapsed since the last time a changefeed's resolved timestamp was updated before it is eligible to be updated again; default of 0 means no minimum interval is enforced but updating will still be limited by the average time it takes to checkpoint progress application
 changefeed.node_throttle_config string specifies node level throttling configuration for all changefeeeds application
@@ -355,6 +356,7 @@ sql.stats.automatic_partial_collection.fraction_stale_rows float 0.05 target fra
 sql.stats.automatic_partial_collection.min_stale_rows integer 100 target minimum number of stale rows per table that will trigger a partial statistics refresh application
 sql.stats.detailed_latency_metrics.enabled boolean false label latency metrics with the statement fingerprint. Workloads with tens of thousands of distinct query fingerprints should leave this setting false. (experimental, affects performance for workloads with high fingerprint cardinality) application
+sql.stats.error_on_concurrent_create_stats.enabled boolean true set to true to error on concurrent CREATE STATISTICS jobs, instead of skipping them application
 sql.stats.flush.enabled boolean true if set, SQL execution statistics are periodically flushed to disk application
 sql.stats.flush.interval duration 10m0s the interval at which SQL execution statistics are flushed to disk, this value must be less than or equal to 1 hour application
 sql.stats.forecasts.enabled boolean true when true, enables generation of statistics forecasts by default for all tables application
docs/generated/settings/settings.html (+2 lines: 2 additions & 0 deletions)
@@ -23,6 +23,7 @@
 <tr><td><div id="setting-changefeed-event-consumer-workers" class="anchored"><code>changefeed.event_consumer_workers</code></div></td><td>integer</td><td><code>0</code></td><td>the number of workers to use when processing events: <0 disables, 0 assigns a reasonable default, >0 assigns the setting value. for experimental/core changefeeds and changefeeds using parquet format, this is disabled</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
 <tr><td><div id="setting-changefeed-fast-gzip-enabled" class="anchored"><code>changefeed.fast_gzip.enabled</code></div></td><td>boolean</td><td><code>true</code></td><td>use fast gzip implementation</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
 <tr><td><div id="setting-changefeed-frontier-highwater-lag-checkpoint-threshold" class="anchored"><code>changefeed.span_checkpoint.lag_threshold<br/>(alias: changefeed.frontier_highwater_lag_checkpoint_threshold)</code></div></td><td>duration</td><td><code>10m0s</code></td><td>the amount of time a changefeed's lagging (slowest) spans must lag behind its leading (fastest) spans before a span-level checkpoint to save leading span progress is written; if 0, span-level checkpoints due to lagging spans is disabled</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
+<tr><td><div id="setting-changefeed-kafka-v2-error-details-enabled" class="anchored"><code>changefeed.kafka_v2_error_details.enabled</code></div></td><td>boolean</td><td><code>true</code></td><td>if enabled, Kafka v2 sinks will include the message key, size, and MVCC timestamp in message too large errors</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
 <tr><td><div id="setting-changefeed-memory-per-changefeed-limit" class="anchored"><code>changefeed.memory.per_changefeed_limit</code></div></td><td>byte size</td><td><code>512 MiB</code></td><td>controls amount of data that can be buffered per changefeed</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
 <tr><td><div id="setting-changefeed-min-highwater-advance" class="anchored"><code>changefeed.resolved_timestamp.min_update_interval<br/>(alias: changefeed.min_highwater_advance)</code></div></td><td>duration</td><td><code>0s</code></td><td>minimum amount of time that must have elapsed since the last time a changefeed's resolved timestamp was updated before it is eligible to be updated again; default of 0 means no minimum interval is enforced but updating will still be limited by the average time it takes to checkpoint progress</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
 <tr><td><div id="setting-changefeed-node-throttle-config" class="anchored"><code>changefeed.node_throttle_config</code></div></td><td>string</td><td><code></code></td><td>specifies node level throttling configuration for all changefeeeds</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
@@ -310,6 +311,7 @@
 <tr><td><div id="setting-sql-stats-automatic-partial-collection-min-stale-rows" class="anchored"><code>sql.stats.automatic_partial_collection.min_stale_rows</code></div></td><td>integer</td><td><code>100</code></td><td>target minimum number of stale rows per table that will trigger a partial statistics refresh</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
 <tr><td><div id="setting-sql-stats-cleanup-recurrence" class="anchored"><code>sql.stats.cleanup.recurrence</code></div></td><td>string</td><td><code>@hourly</code></td><td>cron-tab recurrence for SQL Stats cleanup job</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
 <tr><td><div id="setting-sql-stats-detailed-latency-metrics-enabled" class="anchored"><code>sql.stats.detailed_latency_metrics.enabled</code></div></td><td>boolean</td><td><code>false</code></td><td>label latency metrics with the statement fingerprint. Workloads with tens of thousands of distinct query fingerprints should leave this setting false. (experimental, affects performance for workloads with high fingerprint cardinality)</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
+<tr><td><div id="setting-sql-stats-error-on-concurrent-create-stats-enabled" class="anchored"><code>sql.stats.error_on_concurrent_create_stats.enabled</code></div></td><td>boolean</td><td><code>true</code></td><td>set to true to error on concurrent CREATE STATISTICS jobs, instead of skipping them</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
 <tr><td><div id="setting-sql-stats-flush-enabled" class="anchored"><code>sql.stats.flush.enabled</code></div></td><td>boolean</td><td><code>true</code></td><td>if set, SQL execution statistics are periodically flushed to disk</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
 <tr><td><div id="setting-sql-stats-flush-interval" class="anchored"><code>sql.stats.flush.interval</code></div></td><td>duration</td><td><code>10m0s</code></td><td>the interval at which SQL execution statistics are flushed to disk, this value must be less than or equal to 1 hour</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
 <tr><td><div id="setting-sql-stats-forecasts-enabled" class="anchored"><code>sql.stats.forecasts.enabled</code></div></td><td>boolean</td><td><code>true</code></td><td>when true, enables generation of statistics forecasts by default for all tables</td><td>Serverless/Dedicated/Self-Hosted</td></tr>