Add crashloop back off for k3s-server release #63

@gberche-orange

Expected behavior

As an operator
In order to avoid crash loops that go unnoticed and mask the root cause of errors such as https://github.com/orange-cloudfoundry/paas-templates/issues/2398
I need k3s-wrapper-boshrelease to back off when k3s-server enters a crash loop

Observed behavior

tail -f -n 200 /var/vcap/monit/monit.log

#> [UTC Aug  1 10:41:31] info     : 'k3s-server' start: /var/vcap/jobs/k3s-server/bin/ctl
#> [UTC Aug  1 10:41:41] info     : 'k3s-server' process is running with pid 366216
#> [UTC Aug  1 10:42:41] error    : 'k3s-server' process is not running
#> [UTC Aug  1 10:42:41] info     : 'k3s-server' trying to restart
#> [UTC Aug  1 10:42:41] info     : 'k3s-server' start: /var/vcap/jobs/k3s-server/bin/ctl
#> [UTC Aug  1 10:42:52] info     : 'k3s-server' process is running with pid 366278
#> [UTC Aug  1 10:43:42] error    : 'k3s-server' process is not running
#> [UTC Aug  1 10:43:42] info     : 'k3s-server' trying to restart
#> [UTC Aug  1 10:43:42] info     : 'k3s-server' start: /var/vcap/jobs/k3s-server/bin/ctl
#> [UTC Aug  1 10:43:52] info     : 'k3s-server' process is running with pid 366344
#> [UTC Aug  1 10:44:12] error    : 'k3s-server' process is not running
#> [UTC Aug  1 10:44:12] info     : 'k3s-server' trying to restart

Possible fix

Use monit's service timeout support, which covers processes that repeatedly fail to start:

https://web.archive.org/web/20110816041503/https://mmonit.com/monit/documentation/monit.html

if 2 restarts within 3 cycles then timeout

SERVICE TIMEOUT

monit provides a service timeout mechanism for situations where a service simply refuses to start or respond over a longer period.

The timeout mechanism is based on the number of service restarts and the number of poll-cycles. For example, if a service had x restarts within y poll-cycles (where x <= y) then Monit will perform an action (for example unmonitor the service). If a timeout occurs Monit will send an alert message if you have registered interest for this event.

The syntax for the timeout statement is as follows (keywords are in capital):

IF <number> RESTART <number> CYCLE(S) THEN <action>

Here is an example where Monit will unmonitor the service if it was restarted 2 times within 3 cycles:

if 2 restarts within 3 cycles then unmonitor

To have Monit check the service again after monitoring was disabled, run 'monit monitor <servicename>' from the command line.

Example for setting custom exec on timeout:

if 5 restarts within 5 cycles then exec "/foo/bar"

Example for stopping the service:

if 7 restarts within 10 cycles then stop
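
Applied to this release, the timeout statement could be added to the k3s-server job's monit file roughly as sketched below. This is only an illustration: the pidfile path, the `start`/`stop` arguments to `ctl`, the `group vcap` line, and the 3-restarts-within-5-cycles thresholds are assumptions, not taken from the actual release, and would need to be checked against the job's existing monit file.

```
# Hypothetical monit file for the k3s-server job (paths and thresholds are assumptions)
check process k3s-server
  with pidfile /var/vcap/sys/run/k3s-server/k3s-server.pid
  start program "/var/vcap/jobs/k3s-server/bin/ctl start"
  stop program "/var/vcap/jobs/k3s-server/bin/ctl stop"
  # Give up restarting once the process has been restarted 3 times within 5 poll cycles
  if 3 restarts within 5 cycles then timeout
  group vcap
```

Per the documentation excerpt above, a timed-out service stays unmonitored until an operator runs 'monit monitor k3s-server', so the crash loop stops and the failure becomes visible instead of being masked by endless restarts.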

For inspiration, see how cloudfoundry releases use this statement in their monit files: https://github.com/search?q=org%3Acloudfoundry+if+restart+cycles+within+then+path%3A%2F%28%5E%7C%5C%2F%29monit%24%2F&type=code

https://github.com/cloudfoundry/healthchecker-release

This repository is a BOSH release for healthchecker that is a go executable designed to perform TCP/HTTP based health checks of processes managed by monit in BOSH releases. Since the version of monit included in BOSH does not support specific tcp/http health checks, we designed this utility to perform health checking and restart processes if they become unreachable.
