-
Notifications
You must be signed in to change notification settings - Fork 76
Flexible retry backoff policies #182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
3034fdf to
d7039d1
Compare
|
We haven’t implemented this because cockroachdb-go uses save point based retries and because of savepoint-based retries, a transaction can continue holding locks in between retries. So that means with exponential backoff, you'll be holding locks for longer, thus putting the app at even greater risk of contention. If you could show us that this is not the case with backoffs, we will consider adding this! Thanks! |
I just pushed 9dda97b. Could you take a look at that? I'm not sure what all the implications are of separating the database transaction lifecycle from the |
| } | ||
|
|
||
| // Vargo converts a go-retry style Delay provider into a RetryPolicy | ||
| func Vargo(fn func() VargoBackoff) RetryPolicy { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we could think of a different name to go with here. Users of cockroach-go who don't already know about the sethvargo/go-retry library may be confused to see this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm open to suggestions. I was trying to build an explicit integration without adding a transitive dependency.
| if restartErr := tx.Exec(ctx, "ROLLBACK"); restartErr != nil { | ||
| return newTxnRestartError(restartErr, err) | ||
| } | ||
| if restartErr := tx.Exec(ctx, "BEGIN"); restartErr != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like it could have issues. Some libraries may not like it if we use the transaction after rolling it back. For example, I see this in the pgx code where it returns an error if you try to use a transaction that was already rolled back: https://github.com/jackc/pgx/blob/ecc9203ef42fbba50507e773901b5aead75288ef/tx.go#L205-L224
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It returns an error if you use a transaction object that has the closed flag set. Since we're not doing that, it ought to be ok, I would expect.
Extends the existing retry limit mechanism to allow for delays between retries. Includes implementations for fixed delays and exponential backoff.
crdb/tx.go
Outdated
| // WithMaxRetries configures context so that ExecuteTx retries tx specified | ||
| // number of times when encountering retryable errors. | ||
| // Setting retries to 0 will retry indefinitely. | ||
| // Setting retries to 0 will not retry: the transaction will be tried only once. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should try to avoid breaking the API here. This changes the behavior of WithMaxRetries(ctx, 0) from "retry indefinitely" to "no retries," which would affect any existing code relying on this.
Maybe we could add a separate WithNoRetries() function, and change something about the internal representation so that the outward facing API remains the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. Done.
| if restartErr := tx.Exec(ctx, "ROLLBACK"); restartErr != nil { | ||
| return newTxnRestartError(restartErr, err) | ||
| } | ||
| if restartErr := tx.Exec(ctx, "BEGIN"); restartErr != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still doesn't seem valid to me -- if we reuse the same tx object here, it will not succeed when sending BEGIN right after a ROLLBACK. (see https://github.com/jackc/pgx/blob/ecc9203ef42fbba50507e773901b5aead75288ef/tx.go#L205-L224)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that we not calling tx.Rollback, but rather tx.Exec(ctx, "ROLLBACK"). Please check the comments in the code, and let me know if that makes sense.
| return newTxnRestartError(rollbackErr, err) | ||
| // We have a retryable error. Check the retry policy. | ||
| delay, retryErr := retryFunc(err) | ||
| if delay > 0 && retryErr == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow this. Why are we doing a retry if retryErr is nil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added documentation on the RetryFunc function definition to clarify the semantics. Implementation of RetryFunc may return errors (typically MaxRetriesExceededError) to stop the retry process.
crdb/common.go
Outdated
| if delay > 0 && retryErr == nil { | ||
| // We don't want to hold locks while waiting for a backoff, so restart the entire transaction | ||
| if restartErr := tx.Exec(ctx, "ROLLBACK"); restartErr != nil { | ||
| return newTxnRestartError(restartErr, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the wrong error to return here -- as per this line, when this error is displayed, it will say that there was an issue with ROLLBACK TO SAVEPOINT, which is not what this line is doing:
Line 70 in c787987
| const msgPattern = "restarting txn failed. ROLLBACK TO SAVEPOINT " + |
(This is also the wrong error to return on lines 98 and 101.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Passing the operation to newTxnRestartError.
| } | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should also have tests for edge cases like zero BaseDelay, negative RetryLimit, and MaxDelay < BaseDelay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| if rollbackErr := tx.Exec(ctx, "ROLLBACK TO SAVEPOINT cockroach_restart"); rollbackErr != nil { | ||
| return newTxnRestartError(rollbackErr, err) | ||
| // We have a retryable error. Check the retry policy. | ||
| delay, retryErr := retryFunc(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before we enter the retry logic, we should check for context cancellation:
if err := ctx.Err(); err != nil {
return err
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| NewRetry() RetryFunc | ||
| } | ||
|
|
||
| type LimitBackoffRetryPolicy struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is public-facing so should have a comment explaining what it is and how to use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
crdb/retry.go
Outdated
| } | ||
|
|
||
| // ExpBackoffRetryPolicy implements RetryPolicy using an exponential backoff with optional | ||
| // saturation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow what "saturation" means here. Can we write this comment to explain more how to use this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, reworded comment.
| actualDelays := make([]time.Duration, 0, len(expectedDelays)) | ||
| rf := policy.NewRetry() | ||
| for { | ||
| delay, err := rf(nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're always passing a nil error here -- can we add tests that check the behavior with a non-nil error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Document LimitBackoffRetryPolicy, ExpBackoffRetryPolicy, and Vargo adapter with detailed examples. Preserve backward compatibility by making WithMaxRetries(ctx, 0) mean unlimited retries (original behavior). Add WithNoRetries() for disabling retries and introduce sentinel constants. Enhance RetryFunc documentation to clarify return value semantics and add additional test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review. Please take another look (will squash the commits once we are done).
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @sean-)
| if rollbackErr := tx.Exec(ctx, "ROLLBACK TO SAVEPOINT cockroach_restart"); rollbackErr != nil { | ||
| return newTxnRestartError(rollbackErr, err) | ||
| // We have a retryable error. Check the retry policy. | ||
| delay, retryErr := retryFunc(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| return newTxnRestartError(rollbackErr, err) | ||
| // We have a retryable error. Check the retry policy. | ||
| delay, retryErr := retryFunc(err) | ||
| if delay > 0 && retryErr == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added documentation on the RetryFunc function definition to clarify the semantics. Implementation of RetryFunc may return errors (typically MaxRetriesExceededError) to stop the retry process.
crdb/common.go
Outdated
| if delay > 0 && retryErr == nil { | ||
| // We don't want to hold locks while waiting for a backoff, so restart the entire transaction | ||
| if restartErr := tx.Exec(ctx, "ROLLBACK"); restartErr != nil { | ||
| return newTxnRestartError(restartErr, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Passing the operation to newTxnRestartError.
| if restartErr := tx.Exec(ctx, "ROLLBACK"); restartErr != nil { | ||
| return newTxnRestartError(restartErr, err) | ||
| } | ||
| if restartErr := tx.Exec(ctx, "BEGIN"); restartErr != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that we not calling tx.Rollback, but rather tx.Exec(ctx, "ROLLBACK"). Please check the comments in the code, and let me know if that makes sense.
| NewRetry() RetryFunc | ||
| } | ||
|
|
||
| type LimitBackoffRetryPolicy struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
crdb/retry.go
Outdated
| } | ||
|
|
||
| // ExpBackoffRetryPolicy implements RetryPolicy using an exponential backoff with optional | ||
| // saturation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, reworded comment.
| actualDelays := make([]time.Duration, 0, len(expectedDelays)) | ||
| rf := policy.NewRetry() | ||
| for { | ||
| delay, err := rf(nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| } | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
crdb/tx.go
Outdated
| // WithMaxRetries configures context so that ExecuteTx retries tx specified | ||
| // number of times when encountering retryable errors. | ||
| // Setting retries to 0 will retry indefinitely. | ||
| // Setting retries to 0 will not retry: the transaction will be tried only once. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. Done.
Extends the existing retry limit mechanism to allow for delays between retries. Includes implementations for fixed delays and exponential backoff.