Skip to content

Conversation

@phoevos
Copy link
Member

@phoevos phoevos commented Feb 27, 2025

Wait for the MinIO Server to be healthy before attempting to create a bucket as part of the model-bucket-init service. This is necessary because the MinIO Server may not be ready to accept requests when the service starts (which is the only thing the default depends_on condition guarantees), resulting in errors like the following:

Unable to initialize new alias from the provided credentials. Server not initialized, please try again.

The above can be seen in failed integration tests on the Gateway side, while it looks like the change introduced here resolves the issue (subsequent CI runs using this commit to set up CMS succeed).

On top of that, chain the commands involved in creating a MinIO bucket to correctly catch errors and trigger service restarts. Previously, if the first command (i.e. adding a MinIO host) failed, the second one would still run, overwriting the error code and preventing the service from restarting. Using the && operator to chain the commands ensures
that the second command will only run if the first one is successful and that the error code is correctly propagated, allowing Docker to retry in the event of a failure. Note that the MinIO client commands can fail due to a variety of reasons, such as network issues (some of which might be transient), the MinIO Server not being ready, or incorrect credentials.

Wait for the MinIO Server to be healthy before attempting to create a
bucket as part of the 'model-bucket-init' service. This is necessary
because the MinIO Server may not be ready to accept requests when the
service starts, resulting in errors like the following:

```
Unable to initialize new alias from the provided credentials. Server
not initialized, please try again.
```

Signed-off-by: Phoevos Kalemkeris <[email protected]>
@phoevos phoevos added the bug Something isn't working label Feb 27, 2025
@phoevos phoevos requested a review from baixiac February 27, 2025 12:26
Copy link
Member

@baixiac baixiac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to have minio liveness as a dependency. As discussed, please add the fix to the init command which currently swallows the error code and suppresses the container restarting. Even if the added liveness check passes, the command may still fail silently to intermittent issues such as network instability or misconfiguration on minio. Exposing the error code ensures the retries on bucket creation until it succeeds.

Chain the commands involved in creating a MinIO bucket run as part of
the 'model-bucket-init' service in the 'docker-compose-mlflow.yml' file
to correctly catch errors and trigger service restarts. Previously, if
the first command (i.e. adding a MinIO host) failed, the second one
would still run, overwriting the error code and preventing the service
from restarting. Using the '&&' operator to chain the commands ensures
that the second command will only run if the first one is successful and
that the error code is correctly propagated, allowing Docker to retry in
the event of a failure. Note that the MinIO client commands can fail due
to a variety of reasons, such as network issues (some of which might be
transient), the MinIO Server not being ready, or incorrect credentials.

Signed-off-by: Phoevos Kalemkeris <[email protected]>
@phoevos
Copy link
Member Author

phoevos commented Mar 4, 2025

@baixiac, I've added a commit to chain the commands and properly propagate errors. Note that the 2 fixes address different kinds of issues. The first one, waiting until the MinIO service is ready, is something that should be there when services depend on each other's functionality, regardless of the specific logic running in the dependent. The second one is specific to the bash script currently running there and even though it happens to fix the issue that triggered this discussion in the first place, it alters the service's behaviour, especially when it comes to non-transient failures. For instance, if incorrect credentials are provided, we wouldn't really want to keep retrying but rather fail gracefully. This is the reason why I suggested that it should be part of a different discussion and PR, rather than squashed together with a shallow fix to a broader issue. That said, I'm eager to see this merged, only adding the context for future reference. Perhaps we could add a simple change like the following to arbitrarily limit the number of restarts or a restart policy if we also want to add a delay between attempts.

restart: on-failure:10

@phoevos phoevos changed the title fix: Wait for MinIO Server to be healthy before creating bucket fix: Wait for MinIO Server before creating bucket and catch errors Mar 4, 2025
Copy link
Member

@baixiac baixiac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For instance, if incorrect credentials are provided, we wouldn't really want to keep retrying but rather fail gracefully.

That means more logic needs to be added to handle different failing cases. Besides, the user may or may not notice if the container/job ran once and exited before carrying on inspecting its logs.

The fix looks good to me, and you have carte blanche to add retry with max attempts and backoff jitter.

@phoevos phoevos merged commit 1372b87 into master Mar 4, 2025
7 checks passed
@phoevos phoevos deleted the fix-phoevos-model-bucket-init-wait branch March 4, 2025 16:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants