@KodaiD KodaiD commented Jun 16, 2025

Description

This PR resolves a ResourceNotFoundException encountered during upsert operations on DynamoAdmin metadata tables.

The exception occurs when data is inserted immediately after table creation: the table may not be fully ready even though DescribeTable already reports TableStatus.ACTIVE. According to AWS, table status can take time to propagate.

This PR adds a retry mechanism to wait until the table becomes available.

Related issues and/or PRs

N/A

Changes made

  • Introduced a retry mechanism in the following methods to handle ResourceNotFoundException during upsert operations:
    • upsertTableMetadata
    • upsertIntoNamespacesTable
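
As a rough illustration of the retry mechanism described above, here is a minimal, self-contained sketch. It is not the code from this PR: `UpsertRetrySketch`, `TableNotReadyException` (a stand-in for the AWS SDK's ResourceNotFoundException), `putWithRetry`, and the retry and wait parameters are all hypothetical names and values chosen for the example, and plain `Thread.sleep` is used in place of Guava's `Uninterruptibles`.

```java
class UpsertRetrySketch {
  /** Hypothetical stand-in for the AWS SDK's ResourceNotFoundException. */
  static class TableNotReadyException extends RuntimeException {
    TableNotReadyException(String message) {
      super(message);
    }
  }

  /**
   * Runs putItem, retrying with a fixed delay while the table is reported as not found.
   * Returns the number of attempts made (1 means the first attempt succeeded).
   */
  static int putWithRetry(Runnable putItem, int maxRetryCount, long waitMillis) {
    int retryCount = 0;
    while (true) {
      try {
        putItem.run();
        return retryCount + 1;
      } catch (TableNotReadyException e) {
        // Give up once the initial attempt plus maxRetryCount retries have all failed.
        if (retryCount >= maxRetryCount) {
          throw new IllegalStateException(
              "Table still not found after " + (retryCount + 1) + " attempts", e);
        }
        retryCount++;
      }
      try {
        Thread.sleep(waitMillis);
      } catch (InterruptedException ie) {
        // Restore the interrupt flag so callers can still observe the interruption.
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while waiting for the table", ie);
      }
    }
  }
}
```

Any other exception thrown by `putItem` propagates immediately, so only the "table not found" case is retried.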

Checklist

  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes.
  • I have considered whether similar issues could occur in other products, components, or modules if this PR is for bug fixes.
  • Any remaining open issues linked to this PR are documented and up-to-date (Jira, GitHub, etc.).
  • Tests (unit, integration, etc.) have been added for the changes.
  • My changes generate no new warnings.
  • Any dependent changes in other PRs have been merged and published.

Additional notes (optional)

N/A

Release notes

N/A

@KodaiD KodaiD self-assigned this Jun 16, 2025
@KodaiD KodaiD added the bugfix label Jun 16, 2025
@KodaiD KodaiD requested a review from Copilot June 16, 2025 07:20

Copilot AI left a comment

Pull Request Overview

This PR adds retry logic to handle ResourceNotFoundException during metadata upsert operations when a newly created DynamoDB table isn’t immediately available.

  • Introduces DEFAULT_MAX_RETRY_COUNT constant for retry limits.
  • Wraps putItem calls in upsertIntoNamespacesTable and upsertTableMetadata with retry loops.
Comments suppressed due to low confidence (2)

core/src/main/java/com/scalar/db/storage/dynamo/DynamoAdmin.java:242

  • New retry behavior for ResourceNotFoundException isn’t covered by existing tests. Add unit or integration tests that simulate table-creation delays to verify the retry and timeout logic.
int retryCount = 0;

core/src/main/java/com/scalar/db/storage/dynamo/DynamoAdmin.java:253

  • [nitpick] The error message for retry exhaustion is identical to the general Exception catch. Consider differentiating the message to clarify that the retries were exhausted versus other failures.
throw new ExecutionException("Inserting the " + namespace + " namespace into the namespaces table failed", e);

} catch (Exception e) {
throw new ExecutionException(
"Inserting the " + namespace + " namespace into the namespaces table failed", e);
int retryCount = 0;

Copilot AI Jun 16, 2025

[nitpick] The retry logic in upsertIntoNamespacesTable and upsertTableMetadata is duplicated. Consider extracting it into a shared helper method to reduce code duplication and improve maintainability.
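
One possible shape for such a shared helper is sketched below. This is an assumption-laden illustration, not the refactoring adopted in this PR: `RetryHelperSketch`, `runWithRetry`, `TableNotReadyException`, and the `description` parameter are hypothetical names. The helper also gives retry exhaustion its own message, which is one way to tell it apart from other failures.

```java
import java.util.function.Predicate;

class RetryHelperSketch {
  /** Hypothetical stand-in for the AWS SDK's ResourceNotFoundException. */
  static class TableNotReadyException extends RuntimeException {
    TableNotReadyException(String message) {
      super(message);
    }
  }

  /**
   * Runs the action, retrying only failures accepted by isRetriable.
   * Returns the number of attempts made. Non-retriable failures are rethrown unchanged.
   */
  static int runWithRetry(
      Runnable action,
      Predicate<RuntimeException> isRetriable,
      int maxRetryCount,
      long waitMillis,
      String description) {
    int attempt = 0;
    while (true) {
      attempt++;
      try {
        action.run();
        return attempt;
      } catch (RuntimeException e) {
        if (!isRetriable.test(e)) {
          throw e;
        }
        // attempt counts the initial try, so this allows maxRetryCount retries.
        if (attempt > maxRetryCount) {
          throw new IllegalStateException(
              description + " failed: retries exhausted after " + attempt + " attempts", e);
        }
      }
      try {
        Thread.sleep(waitMillis);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(description + " was interrupted", ie);
      }
    }
  }
}
```

Both upsert methods could then pass their own putItem call and a description string to the same helper instead of each carrying its own loop.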

throw new ExecutionException(
"Inserting the " + namespace + " namespace into the namespaces table failed", e);
}
Uninterruptibles.sleepUninterruptibly(waitingDurationSecs, TimeUnit.SECONDS);

Copilot AI Jun 16, 2025

[nitpick] Using a fixed delay may lead to unnecessary wait times or throttling. Consider implementing exponential backoff (with jitter) to more efficiently handle table readiness.

Suggested change
- Uninterruptibles.sleepUninterruptibly(waitingDurationSecs, TimeUnit.SECONDS);
+ long backoffDelay = calculateExponentialBackoffWithJitter(retryCount);
+ Uninterruptibles.sleepUninterruptibly(backoffDelay, TimeUnit.MILLISECONDS);
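
The suggestion calls `calculateExponentialBackoffWithJitter`, which is not defined anywhere in this conversation. A minimal sketch of what such a method might look like follows, using the "full jitter" variant (a uniformly random delay between zero and an exponentially growing, capped ceiling). The class name `BackoffSketch` and the base and maximum delay values are assumptions made for the example.

```java
import java.util.concurrent.ThreadLocalRandom;

class BackoffSketch {
  // Assumed values for illustration only.
  static final long BASE_DELAY_MILLIS = 100;
  static final long MAX_DELAY_MILLIS = 10_000;

  /** Returns a random delay in [0, min(MAX_DELAY_MILLIS, BASE_DELAY_MILLIS * 2^retryCount)]. */
  static long calculateExponentialBackoffWithJitter(int retryCount) {
    // Clamp the exponent so the shift cannot overflow a long.
    int exponent = Math.min(Math.max(retryCount, 0), 30);
    long ceiling = Math.min(MAX_DELAY_MILLIS, BASE_DELAY_MILLIS << exponent);
    // nextLong's bound is exclusive, so add 1 to make the ceiling reachable.
    return ThreadLocalRandom.current().nextLong(ceiling + 1);
  }
}
```

With these assumed constants the ceiling doubles from 100 ms on the first retry until it reaches the 10-second cap at the seventh, and the random component spreads concurrent callers apart.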

@Torch3333 Torch3333 left a comment

LGTM, thank you!

komamitsu commented Jun 17, 2025

@KodaiD I think it would be great if this PR includes new unit tests for the change.

Also, this is just an idea, but using https://github.com/failsafe-lib/failsafe or https://github.com/resilience4j/resilience4j (a bit heavy?) might be helpful for this kind of retry. This is not a requirement but just FYI.

KodaiD commented Jun 17, 2025

> @KodaiD I think it would be great if this PR includes new unit tests for the change.

@komamitsu Thank you for your review! I added a unit test in 67c5e94. PTAL!

> Also, this is just an idea, but using https://github.com/failsafe-lib/failsafe or https://github.com/resilience4j/resilience4j (a bit heavy?) might be helpful for this kind of retry. This is not a requirement but just FYI.

Thank you for this comment as well! DynamoAdmin already contains several retry loops besides the one I added, so I kept my changes consistent with their style.

@komamitsu komamitsu left a comment

LGTM, thank you!

@brfrn169 brfrn169 left a comment

Left a comment. Other than that, LGTM!

@feeblefakie feeblefakie left a comment

LGTM! Thank you!
