Generalize retry code#380

Merged
ggreer merged 4 commits into main from ggreer/retry
May 12, 2025
Conversation

@ggreer
Contributor

@ggreer ggreer commented May 12, 2025

Syncing, granting, and revoking all have retry logic, and they're all pretty similar. This unifies them into a single retryer package.

Summary by CodeRabbit

  • New Features

    • Introduced a standardized retry mechanism with configurable backoff for improved reliability during operations that may encounter transient errors.
  • Refactor

    • Unified and simplified retry logic across grant, revoke, and sync operations, reducing code duplication and enhancing maintainability.
  • Tests

    • Added tests to validate the new retry behavior and ensure correct handling of retryable and non-retryable errors.

@coderabbitai

coderabbitai bot commented May 12, 2025

Walkthrough

The changes introduce a new retry package to centralize and standardize retry logic with configurable backoff. Existing custom retry mechanisms in connector builder and syncer implementations are refactored to use this new package, removing duplicated logic and internal state management. Associated tests validate the new retry behavior.

Changes

  • pkg/connectorbuilder/connectorbuilder.go: Refactored Grant and Revoke methods to use the new retry.Retryer for retry logic; removed the custom retry helper and related imports.
  • pkg/retry/retry.go: Introduced new retry package with Retryer, RetryConfig, and standardized retry/backoff logic for gRPC operations.
  • pkg/retry/retry_test.go: Added unit test TestBasicRetry to validate retry behavior and timing in the new retry package.
  • pkg/sync/syncer.go: Replaced internal retry logic and state with the new retry.Retryer; removed the custom retry method and related fields/imports.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant Retryer
    participant Provisioner

    Caller->>Retryer: ShouldWaitAndRetry(ctx, err)
    alt Error is retryable and attempts remain
        Retryer->>Caller: Wait and return true
        loop While ShouldWaitAndRetry returns true
            Caller->>Provisioner: Grant/Revoke/Sync operation
            Provisioner-->>Caller: Response or Error
            Caller->>Retryer: ShouldWaitAndRetry(ctx, err)
        end
    else Error is not retryable or attempts exhausted
        Retryer->>Caller: Return false
    end

Suggested reviewers

  • jirwin
  • laurenleach

Poem

In the warren of code, retries did sprawl,
But now a new package brings order to all.
No more math or custom loops to maintain,
Just one Retryer to handle the pain.
With each hop and delay, we gracefully try,
Until success or carrots run dry! 🥕



📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f268889 and 535ed56.

📒 Files selected for processing (1)
  • pkg/connectorbuilder/connectorbuilder.go (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/connectorbuilder/connectorbuilder.go
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: go-test (1.23.x, windows-latest)
✨ Finishing Touches
  • 📝 Generate Docstrings

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (4)
pkg/retry/retry.go (2)

67-74: Linear back-off without jitter can create thundering-herd issues and overflows.

  1. A fleet of workers will all retry at the same deterministic interval, amplifying load.
  2. time.Duration(int64(r.attempts))*r.initialDelay can overflow before you reach math.MaxInt64 nanoseconds on 32-bit platforms.

Consider exponential back-off with ±15 % jitter and a saturating multiply to avoid overflow.

Also applies to: 99-101


105-112: Unnecessary for { select { … }} loop – single wait is enough.

The loop executes at most once because both branches return. Removing the loop simplifies reasoning and avoids future bugs if new cases are added.

-	for {
-		select {
-		case <-time.After(wait):
-			return true
-		case <-ctx.Done():
-			return false
-		}
-	}
+	select {
+	case <-time.After(wait):
+		return true
+	case <-ctx.Done():
+		return false
+	}
pkg/retry/retry_test.go (1)

38-44: Tests rely on real wall-clock sleeps → slow & flaky.

ShouldWaitAndRetry blocks for ~100 ms and the assertion assumes <300 ms.
On contended CI agents this can overshoot and fail spuriously.

Use a tiny InitialDelay (e.g. 1 ms) or inject a clock interface so the test can fast-forward time deterministically.

pkg/sync/syncer.go (1)

365-370: Retryer is hard-coded; make it configurable for flexibility and testing.

Creating a retryer with unlimited attempts & fixed delays inside Sync:

retryer := retry.NewRetryer(ctx, retry.RetryConfig{ … })

ties production and unit tests to these constants. Consider:

  • Accepting a RetryConfig or retry.Retryer via WithRetryConfig/WithRetryer option.
  • Falling back to the current defaults when the option is not provided.

This allows tighter retry budgets in tests and different back-off strategies for callers.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3055c58 and f2e1816.

📒 Files selected for processing (4)
  • pkg/connectorbuilder/connectorbuilder.go (3 hunks)
  • pkg/retry/retry.go (1 hunks)
  • pkg/retry/retry_test.go (1 hunks)
  • pkg/sync/syncer.go (7 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
pkg/sync/syncer.go (2)
pkg/retry/retry.go (2)
  • NewRetryer (31-45)
  • RetryConfig (25-29)
pkg/sync/state.go (3)
  • SyncResourcesOp (110-110)
  • SyncExternalResourcesOp (114-114)
  • SyncAssetsOp (115-115)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: go-test (1.23.x, windows-latest)
🔇 Additional comments (5)
pkg/connectorbuilder/connectorbuilder.go (5)

25-25: Clean import addition.

The code correctly adds the new retry package import needed for the refactored retry functionality.


918-943: Well-structured refactoring of the Grant method.

This refactoring effectively abstracts provisioner behavior for both v1 and v2 implementations using a common function type while centralizing retry logic. The creation of grantFunc to handle the differences between provisioner versions is especially clean.

The retry configuration parameters (3 max attempts, 30s initial delay, 120s max delay) appear reasonable for this use case.


948-960: Clean retry loop implementation.

The retry loop has been simplified to use the centralized retry mechanism. The code now:

  1. Attempts the operation
  2. Returns on success
  3. Uses the retryer's ShouldWaitAndRetry method to determine whether to retry
  4. Properly handles and reports errors

This is more maintainable than the previous custom retry implementation.


972-990: Good reuse of retry pattern in Revoke method.

This refactoring applies the same pattern used in the Grant method. The code sets up a consistent retryer configuration and abstracts over different provisioner versions, promoting code consistency between the two methods.


996-1008: Consistent error handling in Revoke method.

The retry implementation follows the same pattern as in the Grant method, ensuring consistency across the codebase. The error handling, logging, and metric recording are all properly maintained during the refactoring.

Comment on lines +75 to +96
// If error contains rate limit data, use that instead
if st, ok := status.FromError(err); ok {
details := st.Details()
for _, detail := range details {
if rlData, ok := detail.(*v2.RateLimitDescription); ok {
waitResetAt := time.Until(rlData.ResetAt.AsTime())
if waitResetAt <= 0 {
continue
}
duration := time.Duration(rlData.Limit)
if duration <= 0 {
continue
}
waitResetAt /= duration
// Round up to the nearest second to make sure we don't hit the rate limit again
waitResetAt = time.Duration(math.Ceil(waitResetAt.Seconds())) * time.Second
if waitResetAt > 0 {
wait = waitResetAt
break
}
}
}


⚠️ Potential issue

Rate-limit wait calculation appears incorrect.

waitResetAt /= duration divides a duration by the numeric limit, yielding an arbitrary nanosecond value.
Usually you want time.Until(reset) + smallBuffer or time.Until(reset)/remaining, not “duration ÷ limit”.

Please revisit the algorithm; otherwise the retryer may sleep far too short or far too long.

Comment on lines +55 to +57
if status.Code(err) != codes.Unavailable && status.Code(err) != codes.DeadlineExceeded {
return false
}



Enhance retry logic to include additional transient gRPC codes

Currently only Unavailable and DeadlineExceeded are retried. In practice you’ll also see transient throttling surfaced as ResourceExhausted and connection‐reset errors as Internal. Without including these, short‐lived failures will fall through and not be retried.

Suggested change in pkg/retry/retry.go around line 55:

-   if status.Code(err) != codes.Unavailable && status.Code(err) != codes.DeadlineExceeded {
-       return false
-   }
+   switch status.Code(err) {
+   case codes.Unavailable,
+        codes.DeadlineExceeded,
+        codes.ResourceExhausted,
+        codes.Internal:
+       // retryable
+   default:
+       return false
+   }

This ensures you’ll back off and retry on throttling (ResourceExhausted) and connection resets (Internal) as well.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (3)
pkg/retry/retry.go (3)

59-65: Off-by-one semantics make MaxAttempts act as "max retries − 1".

r.attempts++ happens before the check, so a MaxAttempts of 3 allows only 2 retries.
If the intent is "retry at most N times", move the increment after the guard (or change the comparison to >).
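A minimal sketch of the proposed fix, with a stripped-down Retryer carrying only the fields needed to show the guard-before-increment ordering (the real type has more state):

```go
package main

// Retryer is a minimal stand-in for the package's type; only the
// fields needed to demonstrate the fix are included.
type Retryer struct {
	attempts    int
	maxAttempts int
}

// shouldRetry checks the budget before incrementing, so maxAttempts == 3
// really permits three retries. The original incremented first, which
// made MaxAttempts behave as "max retries minus one".
func (r *Retryer) shouldRetry() bool {
	if r.maxAttempts > 0 && r.attempts >= r.maxAttempts {
		return false
	}
	r.attempts++
	return true
}
```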


55-57: Enhance retry logic to include additional transient gRPC codes

Currently only Unavailable and DeadlineExceeded are retried. In practice you'll also see transient throttling surfaced as ResourceExhausted and connection‐reset errors as Internal. Without including these, short‐lived failures will fall through and not be retried.


75-96: Rate-limit wait calculation appears incorrect.

waitResetAt /= duration divides a duration by the numeric limit, yielding an arbitrary nanosecond value.
Usually you want time.Until(reset) + smallBuffer or time.Until(reset)/remaining, not "duration ÷ limit".

Please revisit the algorithm; otherwise the retryer may sleep far too short or far too long.

🧹 Nitpick comments (3)
pkg/retry/retry.go (3)

25-29: Documentation inconsistency with implementation for MaxDelay.

The comment states "0 means no limit" for MaxDelay, but the implementation at line 42 sets a default of 60 seconds if it's 0. Either update the comment or change the implementation to match the documented behavior.

-	MaxDelay     time.Duration // Default is 60 seconds. 0 means no limit.
+	MaxDelay     time.Duration // Default is 60 seconds. 0 will be set to default.

31-45: Context parameter is unused in NewRetryer.

The function accepts a context parameter but doesn't use it anywhere within the function. Consider removing it to avoid confusion.

-func NewRetryer(ctx context.Context, config RetryConfig) *Retryer {
+func NewRetryer(config RetryConfig) *Retryer {

67-73: Handling overflow check is unnecessary.

The check for r.attempts > math.MaxInt64 is incredibly unlikely to be true in practice, as it would require billions of years of constant retrying at reasonable intervals. Consider simplifying this.

-	if r.attempts > math.MaxInt64 {
-		wait = r.maxDelay
-	} else {
-		wait = time.Duration(int64(r.attempts)) * r.initialDelay
-	}
+	wait = time.Duration(int64(r.attempts)) * r.initialDelay
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f2e1816 and f268889.

📒 Files selected for processing (2)
  • pkg/retry/retry.go (1 hunks)
  • pkg/retry/retry_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/retry/retry_test.go
🔇 Additional comments (1)
pkg/retry/retry.go (1)

103-112: LGTM! Good implementation of cancellable delay.

The wait loop pattern correctly handles both waiting for the specified duration and respecting context cancellation, which is important for clean shutdown behavior.

annos, err := provisioner.Grant(ctx, request.Principal, request.Entitlement)
if err != nil {
l.Error("error: grant failed", zap.Error(err))
if !b.shouldWaitAndRetry(ctx, err, baseDelay) || attempt >= 2 {
Contributor

this change I think increased the attempt count fwiw

Contributor Author

Yes, the old code tried twice before erroring. I changed it to three because that seems like a good number of attempts before giving up.


// use linear backoff by default
var wait time.Duration
if r.attempts > math.MaxInt64 {
Contributor

that's a lot of attempts

Contributor Author

It was to get the linter to shut up.

provisionerV2, v2ok := b.resourceProvisionersV2[rt]
if !v1ok && !v2ok {
retryer := retry.NewRetryer(ctx, retry.RetryConfig{
MaxAttempts: 3,
Contributor

MaxAttempts should be 2.
The Temporal activity has a timeout of 5 minutes, so the time from start to close shouldn't exceed that limit. c1 ticket would end up in a weird state if we retry 3 times.

For example: [screenshot of ticket status]

Customers are not aware of the current status of the ticket.

Contributor Author

Three retries would wait 30, 60, then 90 seconds, which is 3 minutes total.

Contributor

Three minutes seems to be the ideal timing. I haven’t looked into the code to see how C1 tickets change status, but based on my testing, three attempts don’t work — the C1 ticket seems to be stuck.

Contributor

[screenshot] I am not sure if it's because of the dev env. I retried twice, which is 30 + 60 seconds, but the c1 ticket still takes around 4 minutes to respond, so I don't think it would work if the last retry takes 90 seconds.

@ggreer ggreer merged commit f99253c into main May 12, 2025
5 checks passed
@ggreer ggreer deleted the ggreer/retry branch May 12, 2025 23:30