Description
Expected behavior
As an operator
In order to avoid crash loops that go unnoticed and mask the root cause of an error, such as https://github.com/orange-cloudfoundry/paas-templates/issues/2398
I need k3s-wrapper-boshrelease to back off when entering a crash loop
Observed behavior
tail -f -n 200 /var/vcap/monit/monit.log
#> [UTC Aug 1 10:41:31] info : 'k3s-server' start: /var/vcap/jobs/k3s-server/bin/ctl
#> [UTC Aug 1 10:41:41] info : 'k3s-server' process is running with pid 366216
#> [UTC Aug 1 10:42:41] error : 'k3s-server' process is not running
#> [UTC Aug 1 10:42:41] info : 'k3s-server' trying to restart
#> [UTC Aug 1 10:42:41] info : 'k3s-server' start: /var/vcap/jobs/k3s-server/bin/ctl
#> [UTC Aug 1 10:42:52] info : 'k3s-server' process is running with pid 366278
#> [UTC Aug 1 10:43:42] error : 'k3s-server' process is not running
#> [UTC Aug 1 10:43:42] info : 'k3s-server' trying to restart
#> [UTC Aug 1 10:43:42] info : 'k3s-server' start: /var/vcap/jobs/k3s-server/bin/ctl
#> [UTC Aug 1 10:43:52] info : 'k3s-server' process is running with pid 366344
#> [UTC Aug 1 10:44:12] error : 'k3s-server' process is not running
#> [UTC Aug 1 10:44:12] info : 'k3s-server' trying to restart
Possible fix
Use monit support for slow process start
https://web.archive.org/web/20110816041503/https://mmonit.com/monit/documentation/monit.html
if 2 restarts within 3 cycles then timeout
SERVICE TIMEOUT
monit provides a service timeout mechanism for situations where a service simply refuses to start or respond over a longer period.
The timeout mechanism is based on the number of service restarts and the number of poll-cycles. For example, if a service had x restarts within y poll-cycles (where x <= y), then Monit will perform an action (for example, unmonitor the service). If a timeout occurs, Monit will send an alert message if you have registered interest in this event.
The syntax for the timeout statement is as follows (keywords are in capital):
IF <number> RESTART <number> CYCLE(S) THEN <action>
Here is an example where Monit will unmonitor the service if it was restarted 2 times within 3 cycles:
if 2 restarts within 3 cycles then unmonitor
To have Monit check the service again after monitoring was disabled, run 'monit monitor <servicename>' from the command line.
Example for setting custom exec on timeout:
if 5 restarts within 5 cycles then exec "/foo/bar"
Example for stopping the service:
if 7 restarts within 10 cycles then stop
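Applied to this release, the timeout statement could look roughly like the sketch below. This is only an illustration, not the release's actual monit file: the pidfile path, the `start`/`stop` arguments passed to `ctl`, and the `group vcap` line are assumptions that do not come from this issue. The thresholds are illustrative too. The log above shows roughly one restart per minute, while monit appears to poll about every 10 seconds (the process is reported running 10 seconds after each start), so a window as narrow as 3 cycles would probably never contain 2 restarts. The window needs to span several restart intervals.

```
# Hypothetical monit file for the k3s-server job (sketch only).
check process k3s-server
  # Assumed pidfile location; use whatever the job's ctl script writes.
  with pidfile /var/vcap/sys/run/k3s-server/k3s-server.pid
  # The ctl path comes from the monit log above; the start/stop arguments are assumed.
  start program "/var/vcap/jobs/k3s-server/bin/ctl start"
  stop program "/var/vcap/jobs/k3s-server/bin/ctl stop"
  group vcap
  # Illustrative thresholds: at about 10 s per cycle, 30 cycles is about 5 minutes,
  # wide enough to contain 3 of the once-per-minute restarts seen in the log.
  if 3 restarts within 30 cycles then timeout
```

Once the timeout fires, monit stops restarting the process, so the failure stays visible instead of being hidden by the next restart. After fixing the root cause, an operator would need to run 'monit monitor k3s-server' to resume monitoring.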
For inspiration, see how Cloud Foundry releases use this monit statement: https://github.com/search?q=org%3Acloudfoundry+if+restart+cycles+within+then+path%3A%2F%28%5E%7C%5C%2F%29monit%24%2F&type=code
https://github.com/cloudfoundry/healthchecker-release
This repository is a BOSH release for healthchecker that is a go executable designed to perform TCP/HTTP based health checks of processes managed by monit in BOSH releases. Since the version of monit included in BOSH does not support specific tcp/http health checks, we designed this utility to perform health checking and restart processes if they become unreachable.