increase timeout via posargs #37989
base: master
Changes from 1 commit
```diff
@@ -111,12 +111,14 @@ jobs:
         env:
           TOX_TESTENV_PASSENV: "DOCKER_*,TESTCONTAINERS_*,TC_*,BEAM_*,GRPC_*,OMP_*,OPENBLAS_*,PYTHONHASHSEED,PYTEST_*"
           # Aggressive retry and timeout settings for flaky CI
-          PYTEST_ADDOPTS: "-v --tb=short --maxfail=5 --durations=30 --reruns=5 --reruns-delay=15 --timeout=600 --disable-warnings"
+          PYTEST_ADDOPTS: "-v --tb=short --maxfail=5 --durations=30 --reruns=5 --reruns-delay=15 --timeout=900 --disable-warnings"
           # Container stability - much more generous timeouts
           TC_TIMEOUT: "300"
           TC_MAX_TRIES: "15"
           TC_SLEEP_TIME: "5"
           # Additional gRPC stability for flaky environment
           GRPC_ARG_KEEPALIVE_TIME_MS: "60000"
```
Contributor

Wondering if there is some race condition for tests run in parallel cc: @tvalentyn. Nevertheless, we can add these environment variables for now.
Contributor

If we can repro this stuckness locally, or SSH into a GHA worker that is stuck, then we could spy on the Python process with tools like pystack; that could give some clues.
Contributor
Author

So should I run with SSH access on a stuck self-hosted runner and collect pystack traces there, or add temporary CI instrumentation (faulthandler + periodic pystack dumps as artifacts) so the next stuck run captures stack traces automatically? @tvalentyn
Contributor

Is SSH an option? Is it easy? Overall, I'd pick whichever option is more straightforward.
Contributor
Author

From my side, SSH into a stuck self-hosted runner is probably only possible if we have runner access granted, so it's not always the easiest path to start with. Locally I couldn't reproduce it yet due to WSL network limits (I can't reach plugins.gradle.org or PyPI), so I don't have a solid repro rate from local runs. Given that, the most straightforward option is to add temporary CI instrumentation (faulthandler + periodic pystack dumps uploaded as artifacts) so every stuck run automatically gives us stack traces.
Contributor

Sure, we can try. It might also be possible to run that instrumentation on a separate fork, if you can repro reliably on a fork and for whatever reason we don't want to merge the change (e.g. it makes tests much slower or unstable).
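The faulthandler half of that instrumentation can be sketched with the standard library alone. This is a minimal sketch, not the actual conftest.py change: the 2-second interval and temp-file path are placeholders for illustration (in CI the interval would be minutes and the log would be uploaded as an artifact):

```python
import faulthandler
import os
import tempfile
import time

# Hypothetical log location; in CI this file would be uploaded as an artifact.
log_path = os.path.join(tempfile.gettempdir(), "hang-traces.log")
log = open(log_path, "w")

# Every 2 seconds (minutes in real CI), dump the stacks of all threads to
# the log. A run that hangs still leaves tracebacks behind for triage.
faulthandler.dump_traceback_later(2, repeat=True, file=log)

time.sleep(5)  # stands in for a test that is stuck

faulthandler.cancel_dump_traceback_later()
log.close()

# Each periodic dump starts with a "Timeout (0:00:02)!" header line.
with open(log_path) as f:
    n_dumps = f.read().count("Timeout")
print(n_dumps)
```

Because faulthandler writes from a C-level watchdog thread, the dumps land even while the main thread is blocked, which is exactly the property needed when the suite is wedged.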
```diff
           GRPC_ARG_KEEPALIVE_TIMEOUT_MS: "60000"
           GRPC_ARG_MAX_CONNECTION_IDLE_MS: "60000"
           GRPC_ARG_HTTP2_BDP_PROBE: "1"
           GRPC_ARG_SO_REUSEPORT: "1"
```
```diff
@@ -125,6 +127,7 @@ jobs:
           # Additional gRPC settings
           GRPC_ARG_MAX_RECONNECT_BACKOFF_MS: "120000"
           GRPC_ARG_INITIAL_RECONNECT_BACKOFF_MS: "2000"
+          BEAM_RUNNER_BUNDLE_TIMEOUT_MS: "600000"
         uses: ./.github/actions/gradle-command-self-hosted-action
         with:
           gradle-command: :sdks:python:test-suites:tox:py${{steps.set_py_ver_clean.outputs.py_ver_clean}}:preCommitPy${{steps.set_py_ver_clean.outputs.py_ver_clean}}
```
Not sure further increasing the timeout and/or reruns could help. We already see the test suite stuck for >5h, e.g. https://github.com/apache/beam/actions/runs/23779768061
Ah, I didn't see this run. Increasing the pytest timeout can mask hangs and extend stuck runs, so I'll revert the timeout bump and keep this PR focused on the gRPC stability env vars only, as a short-term mitigation. @Abacn
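For context on why a bigger `--timeout` bounds a hang rather than fixing it: pytest-timeout's signal method (on POSIX) arms SIGALRM around each test, so a stuck call is interrupted after the budget, not diagnosed. A minimal sketch of that mechanism follows; `run_with_timeout` and `hung_test` are illustrative names, not Beam or pytest APIs:

```python
import signal
import time

class TestTimeout(Exception):
    pass

def _on_alarm(signum, frame):
    # Raised inside whatever frame the test was blocked in.
    raise TestTimeout("test exceeded its time budget")

def run_with_timeout(fn, seconds):
    """Run fn, aborting with TestTimeout after `seconds` (POSIX only)."""
    old = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        return fn()
    finally:
        signal.alarm(0)                   # cancel any pending alarm
        signal.signal(signal.SIGALRM, old)

def hung_test():
    time.sleep(60)  # stands in for a test stuck on a container or gRPC call

try:
    run_with_timeout(hung_test, 1)
    outcome = "passed"
except TestTimeout:
    outcome = "timed out"
print(outcome)  # → timed out
```

The timeout only tells you *that* the test overran, not *where* it was stuck, which is why the thread above leans toward pystack/faulthandler dumps instead of a larger budget.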
SGTM