Add configuration for consumer handling of retry on container creation#26698
Conversation
Pull request overview
Adds a configurable cap on retries for retriable errors during container creation/attach flows, addressing scenarios where hosts have a bounded time window for creation.
Changes:
- Extend `runWithRetry` to support an optional `maxRetries` limit and throw a non-retriable wrapped error when it is exceeded.
- Plumb a new `maxCreateRetries` option through `Container.attach()` (with config override via `Fluid.Container.CreateMaxRetries`) into the attach-time `runWithRetry` call.
- Add unit tests for the new retry-limit behavior and update the legacy beta API report.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| packages/loader/driver-utils/src/test/runWithRetry.spec.ts | Adds coverage for bounded retries, success-before-limit, unlimited default, and 0 retries behavior. |
| packages/loader/driver-utils/src/runWithRetry.ts | Implements maxRetries support and telemetry when the retry cap is exceeded. |
| packages/loader/container-loader/src/container.ts | Adds maxCreateRetries attach option and wires config/attach override into attach-time retries. |
| packages/common/container-definitions/src/loader.ts | Updates IContainer.attach() typing/docs to expose maxCreateRetries. |
| packages/common/container-definitions/api-report/container-definitions.legacy.beta.api.md | Updates generated legacy beta API report to reflect the new attach option. |
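For context, the retry cap described in the changes above can be sketched roughly as follows. This is a simplified illustration, not the actual `runWithRetry` implementation; the `canRetry` error property check and the loop shape are assumptions:

```typescript
// Simplified sketch of a bounded retry loop: retry retriable errors until
// maxRetries is exceeded, then surface the error to the caller.
// (Illustrative only -- the real runWithRetry also handles delays/telemetry.)
async function runWithRetrySketch<T>(
	api: () => Promise<T>,
	maxRetries?: number, // undefined = unlimited, matching the current default
): Promise<T> {
	let numRetries = 0;
	for (;;) {
		try {
			return await api();
		} catch (error: unknown) {
			const canRetry = (error as { canRetry?: boolean })?.canRetry === true;
			numRetries++;
			if (!canRetry || (maxRetries !== undefined && numRetries > maxRetries)) {
				// Cap exceeded or error not retriable: stop retrying.
				throw error;
			}
		}
	}
}
```

With `maxRetries: 2`, the API is attempted at most three times (the initial call plus two retries) before the error is rethrown.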
```ts
// Check if max retries limit has been reached
if (progress.maxRetries !== undefined && numRetries > progress.maxRetries) {
	logger.sendTelemetryEvent(
```
progress.maxRetries is used without validation. If a caller passes NaN, Infinity, a negative value, or a non-integer, the comparison numRetries > progress.maxRetries can behave unexpectedly (e.g., NaN makes the limit never trigger, reintroducing infinite retries). Consider normalizing the option (e.g., require a finite, non-negative integer; otherwise treat as undefined or throw a non-retriable UsageError).
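One way to do the suggested normalization (a sketch only; `normalizeMaxRetries` is a hypothetical helper, not code from the PR, and throwing a `UsageError` would be the stricter alternative):

```typescript
// Hypothetical helper: normalize a caller-supplied maxRetries so that NaN,
// Infinity, negative, or non-integer values cannot silently disable the cap.
// Lenient option shown here: treat invalid values as undefined (unlimited).
function normalizeMaxRetries(maxRetries: number | undefined): number | undefined {
	if (maxRetries === undefined) {
		return undefined; // unlimited retries, the current default
	}
	if (!Number.isInteger(maxRetries) || maxRetries < 0) {
		// Number.isInteger rejects NaN, Infinity, and fractional values.
		return undefined;
	}
	return maxRetries;
}
```

Applying this before the `numRetries > progress.maxRetries` comparison would keep the limit from being bypassed by a malformed option.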
We built the ability to allow customers to retry attach if it should fail. Did we consider moving retry for attach/create failures to the consumer when retriable attach is enabled, e.g. "Fluid.Container.RetryOnAttachFailure", with the eventual goal of enabling it by default, since the current behavior of just retrying forever isn't great for customers? Basically, we can just set retries to 0 when Fluid.Container.RetryOnAttachFailure is on, and then the consumer owns retry, with whatever policy they want. What I like about this approach is that it lets us move away from exposing complex policy in our APIs and lets consumers have control.
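A consumer-owned retry policy of the kind proposed here could look roughly like this. This is an illustrative sketch under stated assumptions: `attachWithConsumerRetry` is a hypothetical consumer helper, the `canRetry` error property and the deadline policy are assumptions, not PR code:

```typescript
// Sketch of a consumer-owned attach retry, assuming errors surface from
// attach() (rather than being retried internally) when the flag is enabled.
async function attachWithConsumerRetry(
	container: { attach: (request: unknown) => Promise<void> },
	request: unknown,
	deadlineMs: number, // consumer policy: bounded time window for attach
): Promise<void> {
	const start = Date.now();
	for (;;) {
		try {
			await container.attach(request);
			return;
		} catch (error: unknown) {
			const retriable = (error as { canRetry?: boolean })?.canRetry === true;
			if (!retriable || Date.now() - start > deadlineMs) {
				throw error; // give up: non-retriable error or deadline passed
			}
		}
	}
}
```

The point of the proposal is exactly that this loop (or a serialize-and-save, or a dispose) lives in consumer code, not in the container.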
shlevari
left a comment
Looks good in general, just some light comments about the naming and exposure surface for controlling the retries.
```ts
// When RetryOnAttachFailure is enabled, use no internal retries
// The consumer will own the retry policy
const retryOnAttachFailure =
	this.mc.config.getBoolean("Fluid.Container.RetryOnAttachFailure") === true;
```
Is this flag name already locked in? To me its naming is unintuitive -- it sounds like it's telling the Fluid Container to do retries, but it's actually saying the caller will do the retries.
Also, does it make sense to give the caller control over how many retries to do? If they want to disable it, they set it to zero; that way it gives them the opportunity to optimize for their specific startup time requirements.
+1, MaxRetriesOnAttachFailure seems nice, otherwise DisableRetryOnAttachFailure maybe
@anthony-murphy Thoughts on changing the name to something like Fluid.Container.DisableRetryOnAttachFailure?
> Also, does it make sense to give the caller control over how many retries to do?
The important thing here is that the errors are bubbled up to the consumer of attach. I don't have any strong preference on how consumers can implement a retry mechanism, but retries are not necessary for this ask.
The flag primarily allows attach to be retried by the consumer; without it, the container will close when attach fails.
The goal is to get out of the container-level retry game here and let the consumer decide what they want to do: retry calling attach, serialize the container and save it somewhere, or dispose the container and throw it away.
I changed the flag name to `DisableCloseOnAttachFailure` to be extra clear about what it does.
Currently, we continue to retry retriable errors indefinitely on container creation. However, certain consumers have a limited time window (30/60 seconds) for the create, so we should not retry infinitely even on retriable errors.
If the consumer of `attach` wants to handle the retry mechanism, they can enable the `"Fluid.Container.DisableCloseOnAttachFailure"` flag. This flag causes no internal retries to occur and lets any error surface to the consumer of `attach`.

AB#57593
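A rough sketch of how a consumer might enable the flag through a config provider. The `getRawConfig(name)` shape follows the common Fluid `IConfigProviderBase` pattern, but treat the exact wiring as an assumption rather than PR code:

```typescript
// Minimal config provider sketch that turns on the flag from the PR
// description. Only the key lookup is shown; passing this provider to the
// loader/container is out of scope here.
const configProvider = {
	getRawConfig: (name: string): boolean | undefined =>
		name === "Fluid.Container.DisableCloseOnAttachFailure" ? true : undefined,
};
```

With this in place, attach failures are surfaced to the caller instead of closing the container, and the consumer owns whatever retry (or serialize/dispose) policy it wants.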