Conversation
Note: Gemini is unable to generate a summary for this pull request because the file types involved are not currently supported.
Assigning reviewers: R: @Abacn for label build. Note: If you would like to opt out of this review, comment. Available commands:
The PR bot will only process comments in the main thread (not review comments).
Force-pushed 2bd7b78 to f611cf8
Force-pushed f611cf8 to 0f85fec
  TOX_TESTENV_PASSENV: "DOCKER_*,TESTCONTAINERS_*,TC_*,BEAM_*,GRPC_*,OMP_*,OPENBLAS_*,PYTHONHASHSEED,PYTEST_*"
  # Aggressive retry and timeout settings for flaky CI
- PYTEST_ADDOPTS: "-v --tb=short --maxfail=5 --durations=30 --reruns=5 --reruns-delay=15 --timeout=600 --disable-warnings"
+ PYTEST_ADDOPTS: "-v --tb=short --maxfail=5 --durations=30 --reruns=5 --reruns-delay=15 --timeout=900 --disable-warnings"
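As a minimal sketch (not the actual Beam workflow), the diff above works by feeding retry/timeout flags to every pytest invocation through `PYTEST_ADDOPTS` rather than editing each tox command; the values below are copied from the diff, with the timeout kept at 600s since the 900s bump was reverted later in this thread:

```python
import os
import shlex

# Hypothetical repro of the CI setting; values copied from the diff above.
ADDOPTS = ("-v --tb=short --maxfail=5 --durations=30 "
           "--reruns=5 --reruns-delay=15 --timeout=600 --disable-warnings")
os.environ["PYTEST_ADDOPTS"] = ADDOPTS

# pytest splits this variable shell-style and prepends the flags to its
# command line; --reruns requires the pytest-rerunfailures plugin and
# --timeout requires pytest-timeout.
flags = shlex.split(os.environ["PYTEST_ADDOPTS"])
```

With pytest-timeout's default method, `--timeout` kills a hung test via a signal-based alarm, which is exactly why (as noted below) raising it can mask hangs rather than fix them.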
I'm not sure further increasing the timeout and/or reruns will help. We already see the test suite stuck for >5h, e.g. https://github.com/apache/beam/actions/runs/23779768061
Ahh, I didn't see this run. Increasing the pytest timeout can mask hangs and extend stuck runs, so I'll revert the timeout bump and keep this PR focused on the gRPC stability env vars only, as a short-term mitigation. @Abacn
  TC_MAX_TRIES: "15"
  TC_SLEEP_TIME: "5"
  # Additional gRPC stability for flaky environment
  GRPC_ARG_KEEPALIVE_TIME_MS: "60000"
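For a local repro attempt, the same variables can be set programmatically. A minimal sketch (not Beam code), assuming testcontainers-python reads the `TC_*` values for its retry logic; note that gRPC core normally takes keepalive as a channel option (`grpc.keepalive_time_ms`) rather than an environment variable, so client code may need to forward the value explicitly:

```python
import os

# Stability env vars mirroring the workflow change above (a sketch,
# not the Beam workflow file itself).
STABILITY_ENV = {
    "TC_MAX_TRIES": "15",    # testcontainers connection retries
    "TC_SLEEP_TIME": "5",    # seconds between retries
    "GRPC_ARG_KEEPALIVE_TIME_MS": "60000",
}
os.environ.update(STABILITY_ENV)

# Code that creates a gRPC channel would pass this as the channel
# option ("grpc.keepalive_time_ms", keepalive_ms).
keepalive_ms = int(os.environ["GRPC_ARG_KEEPALIVE_TIME_MS"])
```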
Wondering if there is some race condition for tests run in parallel cc: @tvalentyn. Nevertheless, we can add these environment variables for the moment.
If we can repro this stuckness locally, or SSH into a stuck GHA worker, then we could spy on the Python process with tools like pystack and examine the stack traces.
that could give some clues.
so should I run with SSH access on a stuck self-hosted runner and collect pystack traces there, or add temporary CI instrumentation (faulthandler + periodic pystack dumps as artifacts) so the next stuck run captures stack traces automatically? @tvalentyn
> so should I run with SSH access on a stuck self-hosted runner
is it an option? is it easy?
how often does this issue reproduce?
overall, i'd pick whatever option is more straightforward.
From my side, SSH into a stuck self-hosted runner is probably only possible if we have runner access granted, so it's not always the easiest path to start with. Locally I couldn't reproduce it yet due to WSL network limits (can't reach plugins.gradle.org / PyPI), so I don't have a solid repro rate from local runs. Given that, the most straightforward option is to add temporary CI instrumentation (faulthandler + periodic pystack dumps uploaded as artifacts) so every stuck run automatically gives us stack traces.
Sure, we can try; it might also be possible to run that instrumentation on a separate fork, if you can repro reliably on a fork and for whatever reason we don't want to merge the change (e.g. it makes tests much slower or unstable).
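The in-process half of the instrumentation discussed above could look like the following hypothetical `conftest.py` snippet (a sketch, not Beam's actual code):

```python
import faulthandler
import signal
import sys

# Dump tracebacks for all threads if the interpreter crashes hard.
faulthandler.enable()

# Let an operator SSH'd into a stuck runner trigger a dump with
# `kill -USR1 <pid>` (SIGUSR1 is POSIX-only, hence the guard).
if hasattr(signal, "SIGUSR1"):
    faulthandler.register(signal.SIGUSR1, all_threads=True)

# Watchdog: write a stack snapshot to stderr every 5 minutes until
# cancelled, so a hung run leaves periodic traces in the CI log.
faulthandler.dump_traceback_later(300, repeat=True, file=sys.stderr)
```

pystack complements this from outside the process (attaching to the stuck PID without any in-process cooperation), which is what a periodic-dump artifact step would wrap.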