fix: Wait for MinIO Server before creating bucket and catch errors #12

phoevos · 2025-02-27T12:26:51Z

Wait for the MinIO Server to be healthy before attempting to create a bucket as part of the model-bucket-init service. This is necessary because the MinIO Server may not be ready to accept requests when the service starts (which is the only thing the default depends_on condition guarantees), resulting in errors like the following:

Unable to initialize new alias from the provided credentials. Server not initialized, please try again.

The above can be seen in failed integration tests on the Gateway side, while it looks like the change introduced here resolves the issue (subsequent CI runs using this commit to set up CMS succeed).

On top of that, chain the commands involved in creating a MinIO bucket to correctly catch errors and trigger service restarts. Previously, if the first command (i.e. adding a MinIO host) failed, the second one would still run, overwriting the error code and preventing the service from restarting. Using the && operator to chain the commands ensures
that the second command will only run if the first one is successful and that the error code is correctly propagated, allowing Docker to retry in the event of a failure. Note that the MinIO client commands can fail due to a variety of reasons, such as network issues (some of which might be transient), the MinIO Server not being ready, or incorrect credentials.

Wait for the MinIO Server to be healthy before attempting to create a bucket as part of the 'model-bucket-init' service. This is necessary because the MinIO Server may not be ready to accept requests when the service starts, resulting in errors like the following: ``` Unable to initialize new alias from the provided credentials. Server not initialized, please try again. ``` Signed-off-by: Phoevos Kalemkeris <[email protected]>

baixiac

Good to have minio liveness as a dependency. As discussed, please add the fix to the init command which currently swallows the error code and suppresses the container restarting. Even if the added liveness check passes, the command may still fail silently to intermittent issues such as network instability or misconfiguration on minio. Exposing the error code ensures the retries on bucket creation until it succeeds.

Chain the commands involved in creating a MinIO bucket run as part of the 'model-bucket-init' service in the 'docker-compose-mlflow.yml' file to correctly catch errors and trigger service restarts. Previously, if the first command (i.e. adding a MinIO host) failed, the second one would still run, overwriting the error code and preventing the service from restarting. Using the '&&' operator to chain the commands ensures that the second command will only run if the first one is successful and that the error code is correctly propagated, allowing Docker to retry in the event of a failure. Note that the MinIO client commands can fail due to a variety of reasons, such as network issues (some of which might be transient), the MinIO Server not being ready, or incorrect credentials. Signed-off-by: Phoevos Kalemkeris <[email protected]>

phoevos · 2025-03-04T14:00:30Z

@baixiac, I've added a commit to chain the commands and properly propagate errors. Note that the 2 fixes address different kinds of issues. The first one, waiting until the MinIO service is ready, is something that should be there when services depend on each other's functionality, regardless of the specific logic running in the dependent. The second one is specific to the bash script currently running there and even though it happens to fix the issue that triggered this discussion in the first place, it alters the service's behaviour, especially when it comes to non-transient failures. For instance, if incorrect credentials are provided, we wouldn't really want to keep retrying but rather fail gracefully. This is the reason why I suggested that it should be part of a different discussion and PR, rather than squashed together with a shallow fix to a broader issue. That said, I'm eager to see this merged, only adding the context for future reference. Perhaps we could add a simple change like the following to arbitrarily limit the number of restarts or a restart policy if we also want to add a delay between attempts.

restart: on-failure:10

baixiac

For instance, if incorrect credentials are provided, we wouldn't really want to keep retrying but rather fail gracefully.

That means more logic needs to be added to handle different failing cases. Besides, the user may or may not notice if the container/job ran once and exited before carrying on inspecting its logs.

The fix looks good to me, and you have carte blanche to add retry with max attempts and backoff jitter.

phoevos added the bug Something isn't working label Feb 27, 2025

phoevos requested a review from baixiac February 27, 2025 12:26

baixiac reviewed Mar 3, 2025

View reviewed changes

phoevos changed the title ~~fix: Wait for MinIO Server to be healthy before creating bucket~~ fix: Wait for MinIO Server before creating bucket and catch errors Mar 4, 2025

baixiac approved these changes Mar 4, 2025

View reviewed changes

phoevos merged commit 1372b87 into master Mar 4, 2025
7 checks passed

phoevos deleted the fix-phoevos-model-bucket-init-wait branch March 4, 2025 16:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: Wait for MinIO Server before creating bucket and catch errors #12

fix: Wait for MinIO Server before creating bucket and catch errors #12

Uh oh!

phoevos commented Feb 27, 2025 •

edited

Loading

Uh oh!

baixiac left a comment

Uh oh!

phoevos commented Mar 4, 2025 •

edited

Loading

Uh oh!

baixiac left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fix: Wait for MinIO Server before creating bucket and catch errors #12

fix: Wait for MinIO Server before creating bucket and catch errors #12

Uh oh!

Conversation

phoevos commented Feb 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

baixiac left a comment

Choose a reason for hiding this comment

Uh oh!

phoevos commented Mar 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

baixiac left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

phoevos commented Feb 27, 2025 •

edited

Loading

phoevos commented Mar 4, 2025 •

edited

Loading