-
Notifications
You must be signed in to change notification settings - Fork 6
OHOS: Robustify the runners #46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Upgrade Dependencies - Have a new function `check_and_inc_retries` that will panic if the retries is too much without success. We will exit the process and crash. If we succeed we set the retries to zero. - Unify processor timeout throughout main loop. Some paths through the main loop had some timeouts while some had none. We now have a unified timeout at the end (before killing offline runners) of 5 seconds. - Read the env variable "RUNNER_SUFFIX" to add a suffix to the runner name, so we do not have to hardcode it. - Introduce a new function `call_github_runner_api` which encapsulates simple github api calls with gh. `spawn_runner` now uses this new functionality for spawning. Previously some - Introduce `kill_offline_runners` which calsl the github api for current runners, takes runners that are offline and have a name starting with `dresden-hos`. Then it removes these runners. This happens at the end and 5 seconds after we started the runners. The runners need some time to start and connect to github properly. The timeout seems ok tbut we can probably increase it. Sorry for the massive changes but I think the codebase is so small that it is allowed :D Signed-off-by: Narfinger <[email protected]>
jschwe
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you also have a look at the github runner run.sh and check if we can determine if the runner spawned successfully by wrapping run.sh?
|
|
||
| thread::sleep(Duration::from_secs(5)); | ||
| // Check if some still running images are listed as offline from github api point of view | ||
| if let Err(e) = kill_offline_runners(&servo_ci_scope) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to do this all the time, only when we detected a problem. github API calls are rate-limited, and the limit is not even too high.
Signed-off-by: Narfinger <[email protected]>
87b9cb4 to
2f00564
Compare
This involves a couple of changes:
check_and_inc_retriesthat will panic if theretries is too much without success. We will exit the process and
crash. If we succeed we set the retries to zero.
main loop had some timeouts while some had none. We now have a unified
timeout at the end (before killing offline runners) of 5 seconds.
name, so we do not have to hardcode it.
call_github_runner_apiwhich encapsulatessimple github api calls with gh.
spawn_runnernow uses this newfunctionality for spawning. Previously some
kill_offline_runnerswhich calsl the github api forcurrent runners, takes runners that are offline and have a name
starting with
dresden-hos. Then it removes these runners. This happensat the end and 5 seconds after we started the runners. The runners need
some time to start and connect to github properly. The timeout seems ok
but we can probably increase it.
Testing: Tested on runner CI2. It succesfully removed old offline runners on github and started new ones.
I plan to let CI2 run this new code for a bit to see if there are any regressions without changing CI1.
Hopefully this fixes: #44
Sorry for the massive changes but I think the codebase is so small that
it is allowed :D