fix: upgrade Intel MPI to 2021.14 and fix CI race condition#757
fix: upgrade Intel MPI to 2021.14 and fix CI race condition#757Syed-Suhaan wants to merge 4 commits intokubeflow:masterfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
9bc2924 to
c50b4fa
Compare
tenzen-y
left a comment
There was a problem hiding this comment.
Thank you for working on this issue 👍
build/base/entrypoint.sh
Outdated
| # Add a delay to allow worker's sshd to be fully ready in slow CI environments | ||
| sleep 5 |
There was a problem hiding this comment.
Instead of this sleep, what about increasing the backoff to 1 or more?
There was a problem hiding this comment.
Replaced sleep with backoff=1 as suggested.
There was a problem hiding this comment.
Could you use the previous indent?
There was a problem hiding this comment.
Fixed the indentation to match the previous style.
There was a problem hiding this comment.
Fixed the indentation to match the previous style.
c50b4fa to
c62e651
Compare
|
I've applied the feedback: replaced sleep with backoff logic and fixed Dockerfile indentation. Verified locally |
tenzen-y
left a comment
There was a problem hiding this comment.
Let's see if the errors have gone away in CI.
build/base/intel-builder.Dockerfile
Outdated
| libstdc++-12-dev binutils procps clang \ | ||
| intel-oneapi-compiler-dpcpp-cpp \ | ||
| intel-oneapi-mpi-devel-2021.14 \ |
There was a problem hiding this comment.
It seems that you forgot to push the new commit to this PR?
| libstdc++-12-dev binutils procps clang \ | |
| intel-oneapi-compiler-dpcpp-cpp \ | |
| intel-oneapi-mpi-devel-2021.14 \ | |
| libstdc++-12-dev binutils procps clang \ | |
| intel-oneapi-compiler-dpcpp-cpp \ | |
| intel-oneapi-mpi-devel-2021.14 \ |
build/base/intel.Dockerfile
Outdated
| dnsutils \ | ||
| intel-oneapi-mpi-2021.14 \ |
There was a problem hiding this comment.
| dnsutils \ | |
| intel-oneapi-mpi-2021.14 \ | |
| dnsutils \ | |
| intel-oneapi-mpi-2021.14 \ |
3cdc928 to
f4b36f1
Compare
Upgrades Intel MPI version from 2021.13 to 2021.14. Adds a 5-second initialization delay in entrypoint.sh to prevent SSH handshake failures (kex_exchange_identification) observed in CI environments. Fixes Issue kubeflow#678. Signed-off-by: Syed-Suhaan <suhaansyed2004@gmail.com>
f4b36f1 to
fbe4d12
Compare
…etic, and fix Dockerfile indentation Signed-off-by: Syed-Suhaan <suhaansyed2004@gmail.com>
b67fb9b to
c05d2af
Compare
|
Fixed entrypoint.sh race condition by switching to native bash backoff (removing awk dependency), and corrected Dockerfile indentation. |
build/base/entrypoint.sh
Outdated
| max_retry=10 | ||
| counter=0 | ||
| backoff=0.1 | ||
| backoff=1 |
There was a problem hiding this comment.
| backoff=1 | |
| backoff=3 |
@Syed-Suhaan It seems 1 sec is not enough: https://github.com/kubeflow/mpi-operator/actions/runs/20128945073/job/57765870629?pr=757
Let's increase it to 3 sec.
There was a problem hiding this comment.
@Syed-Suhaan Uhm, 3 sec seems still not enough time. Did 3 sec work well in your local?
If not, we might need to increase it to 5 sec.
There was a problem hiding this comment.
Yes, 3s works locally, but my local environment resolves instantly so it doesn't reproduce the race condition. Since the CI logs show the environment is noticeably slower, I've updated it to 5s as suggested to ensure we clear that race condition reliably.
Signed-off-by: Syed-Suhaan <suhaansyed2004@gmail.com>
Signed-off-by: Syed-Suhaan <suhaansyed2004@gmail.com>
20d76f5 to
4e29773
Compare
Upgrades Intel MPI version from 2021.13 to 2021.14.
Adds a 5-second initialization delay in entrypoint.sh to prevent SSH handshake failures (kex_exchange_identification) observed in CI environments.
Fixes Issue #678.