Skip to content

Commit 9f07b51

Browse files
contention: Increase RetryBudgetForMissingResult
Previously the retry budget was set to 1, however this budget lead to a significant amount of failed resolutions. To see why a retry budget of 1 is not sufficient consider the case where an in progress transaction is in the writer buffer when resolution is attempted. The in progress txn is then ingested into the cache after the txn resolution endpoint drains the write buffer - i.e. it is stored in the cache with the appstatspb.InvalidTransactionFingerprintID value. Now the transaction finishes and its respective fingerprint ID is recorded. However, it is in the writer buffer of the txn id cache. When resolution is attempted again, the lookup gets the invalid / in-progress value that is stored in the cache. The subsequent flush then gets the cache to ingest the actual fingerprint ID value for the txn. But we've run out of budget, and don't retry resolution. This commit increases the budget to 2. In addition to handling the case above, experimentally it shows to lower the number of failed resolutions (see issue linked). Lastly, this commit removes dead code in the TxnID resolution endpoint. A map was being created and never added to. The logic resulted in the RPC flushing the TxnID Cache on every invocation, that behaviour is preserved and made more explicit. Fixes: #148686 Release note: None
1 parent 2f606a6 commit 9f07b51

File tree

2 files changed

+5
-13
lines changed

2 files changed

+5
-13
lines changed

pkg/server/status.go

Lines changed: 4 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -86,7 +86,6 @@ import (
8686
"github.com/cockroachdb/cockroach/pkg/util/timeutil"
8787
"github.com/cockroachdb/cockroach/pkg/util/tracing/tracingpb"
8888
"github.com/cockroachdb/cockroach/pkg/util/uint128"
89-
"github.com/cockroachdb/cockroach/pkg/util/uuid"
9089
"github.com/cockroachdb/errors"
9190
"github.com/cockroachdb/redact"
9291
"github.com/google/pprof/profile"
@@ -416,11 +415,6 @@ func (b *baseStatusServer) localTxnIDResolution(
416415
) *serverpb.TxnIDResolutionResponse {
417416
txnIDCache := b.sqlServer.pgServer.SQLServer.GetTxnIDCache()
418417

419-
unresolvedTxnIDs := make(map[uuid.UUID]struct{}, len(req.TxnIDs))
420-
for _, txnID := range req.TxnIDs {
421-
unresolvedTxnIDs[txnID] = struct{}{}
422-
}
423-
424418
resp := &serverpb.TxnIDResolutionResponse{
425419
ResolvedTxnIDs: make([]contentionpb.ResolvedTxnID, 0, len(req.TxnIDs)),
426420
}
@@ -434,12 +428,10 @@ func (b *baseStatusServer) localTxnIDResolution(
434428
}
435429
}
436430

437-
// If we encounter any transaction ID that we cannot resolve, we tell the
438-
// txnID cache to drain its write buffer (note: The .DrainWriteBuffer() call
439-
// is asynchronous). The client of this RPC will perform retries.
440-
if len(unresolvedTxnIDs) > 0 {
441-
txnIDCache.DrainWriteBuffer()
442-
}
431+
// Note(alyshan): TxnIDResolution is only called by the contention event resolver today.
432+
// The resolver relies on these resolution calls to trigger a drain of the writer buffer
433+
// on the txn id cache.
434+
txnIDCache.DrainWriteBuffer()
443435

444436
return resp
445437
}

pkg/sql/contention/resolver.go

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -73,7 +73,7 @@ const (
7373
// in-memory data corruption (shouldn't happen in normal circumstances,
7474
// since access to txnID cache is all synchronized). In this case, no
7575
// amount of retries will be able to resolveLocked the txnID.
76-
retryBudgetForMissingResult = uint32(1)
76+
retryBudgetForMissingResult = uint32(2)
7777

7878
// retryBudgetForRPCFailure is the number of times the resolverQueue will
7979
// retry resolving until giving up. This needs to be a finite number to handle

0 commit comments

Comments
 (0)