Skip to content

Commit ff1b46d

Browse files
committed
sql: move job check back out of txn creating autostats job
In #119383 we moved the find-running-jobs-of-type query into the same transaction that creates the autostats jobs, in order to avoid a race between different nodes trying to start an autostats job. In practice, however, we've seen that this combined read-write transaction suffers from many retries due to RETRY_SERIALIZABLE errors, when racing transactions write to system.jobs and then fail to refresh the find-running-jobs read. One the one hand, the fact that these transactions fail to commit does mean fewer rows are visible in system.jobs. On the other hand, each aborted transaction leaves behind an intent to clean up, so I don't think the load on system.jobs is any lighter than it was before. These types of races, where writing a new row depends on the absence of other rows, is kind of the worst case for optimistic Serializable isolation. If we could materialize the conflict to a single existing row that the transaction locks, we'd avoid both the serialization failures and the race to start a job. But we don't have such a row. (Maybe one day we could create a "NEXT AUTOSTATS JOB" sentinel row in system.jobs that the winner locks before claiming it and starting the job? But that seems like a big change.) For now, this PR partially reverts #119383 so that find-running-jobs is back in a separate read-only transaction before creating the job. Using a read-only transaction to check avoids the serializable errors, at the cost of sometimes creating too many jobs which will then immediately fail the repeated check within the job. We think this is going to reduce the load on system.jobs during times of high contention. Cluster setting `sql.stats.automatic_job_check_before_creating_job` can be used to get back the old behavior. In the future when we allow more than one autostats job, we hope to improve the situation by changing the mechanism for preventing concurrent stats collections on the same table. Maybe then we can have some kind of sentinel row per table to materialize the conflict. Informs: #108435 Release note: None
1 parent 6d76d83 commit ff1b46d

File tree

1 file changed

+21
-5
lines changed

1 file changed

+21
-5
lines changed

pkg/sql/create_stats.go

Lines changed: 21 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -76,6 +76,13 @@ var nonIndexJSONHistograms = settings.RegisterBoolSetting(
7676
false,
7777
settings.WithPublic)
7878

79+
var automaticJobCheckBeforeCreatingJob = settings.RegisterBoolSetting(
80+
settings.ApplicationLevel,
81+
"sql.stats.automatic_job_check_before_creating_job.enabled",
82+
"set to true to perform the autostats job check before creating the job, instead of in the same "+
83+
"transaction as creating the job",
84+
true)
85+
7986
const nonIndexColHistogramBuckets = 2
8087

8188
// StubTableStats generates "stub" statistics for a table which are missing
@@ -148,18 +155,27 @@ func (n *createStatsNode) runJob(ctx context.Context) error {
148155
}
149156
details := record.Details.(jobspb.CreateStatsDetails)
150157

151-
if n.Name != jobspb.AutoStatsName && n.Name != jobspb.AutoPartialStatsName {
158+
jobCheckBefore := automaticJobCheckBeforeCreatingJob.Get(n.p.ExecCfg().SV())
159+
if (n.Name == jobspb.AutoStatsName || n.Name == jobspb.AutoPartialStatsName) && jobCheckBefore {
160+
// Don't start the job if there is already a CREATE STATISTICS job running.
161+
// (To handle race conditions we check this again after the job starts,
162+
// but this check is used to prevent creating a large number of jobs that
163+
// immediately fail).
164+
if err := checkRunningJobs(
165+
ctx, nil /* job */, n.p, n.Name == jobspb.AutoPartialStatsName, n.p.ExecCfg().JobRegistry,
166+
details.Table.ID,
167+
); err != nil {
168+
return err
169+
}
170+
} else {
152171
telemetry.Inc(sqltelemetry.CreateStatisticsUseCounter)
153172
}
154173

155174
var job *jobs.StartableJob
156175
jobID := n.p.ExecCfg().JobRegistry.MakeJobID()
157176
if err := n.p.ExecCfg().InternalDB.Txn(ctx, func(ctx context.Context, txn isql.Txn) (err error) {
158-
if n.Name == jobspb.AutoStatsName || n.Name == jobspb.AutoPartialStatsName {
177+
if (n.Name == jobspb.AutoStatsName || n.Name == jobspb.AutoPartialStatsName) && !jobCheckBefore {
159178
// Don't start the job if there is already a CREATE STATISTICS job running.
160-
// (To handle race conditions we check this again after the job starts,
161-
// but this check is used to prevent creating a large number of jobs that
162-
// immediately fail).
163179
if err := checkRunningJobsInTxn(
164180
ctx, n.p.EvalContext().Settings, jobspb.InvalidJobID, txn,
165181
); err != nil {

0 commit comments

Comments
 (0)