|
| 1 | +# Meta |
| 2 | +[meta]: #meta |
| 3 | +- Name: Readiness Healthchecks |
| 4 | +- Start Date: 2023-06-26 |
| 5 | +- Author(s): @ameowlia, @mariash |
| 6 | +- Status: Draft |
| 7 | +- RFC Pull Request: https://github.com/cloudfoundry/community/pull/630 |
| 8 | + |
| 9 | + |
| 10 | +## Summary |
| 11 | + |
| 12 | +Add a readiness healthcheck option for apps. When the readiness healthcheck |
| 13 | +passes, the app instance (AI) is marked "ready" and the AI will be routable. |
| 14 | +When the readiness healthcheck fails, the AI is marked as "not ready" and its |
| 15 | +route will be removed from gorouter's route table. |
| 16 | + |
| 17 | +## Problem |
| 18 | + |
| 19 | +With the current implementation of application healthchecks, when the |
| 20 | +application healthcheck detects that an AI is unhealthy, then Diego will stop |
| 21 | +the AI, delete the AI, and reschedule a new AI. |
| 22 | + |
| 23 | +This is too aggressive for some apps. A single request can fail for many |
| 24 | +reasons even though the app is otherwise running fine. Additionally, many |
| 25 | +applications have a warm up period where they are not ready to receive requests |
| 26 | +immediately. For example, apps might need to populate caches, load data, or wait |
| 27 | +for external services before they mark themselves as routable. In these cases, |
| 28 | +the app should be kept alive, but in a non-routable state. |
| 29 | + |
| 30 | +## Proposal |
| 31 | + |
| 32 | +### Summary |
| 33 | +We intend to support readiness healthchecks. (This was requested previously in |
| 34 | +this [issue](https://github.com/cloudfoundry/cloud_controller_ng/issues/1706).) |
| 35 | +This would be an additional healthcheck that app developers could configure. |
| 36 | +When the readiness healthcheck passes, the AI is marked "ready" and the AI will |
| 37 | +be routable. When the readiness healthcheck fails, the AI is marked as "not |
| 38 | +ready" and its route will be removed from gorouter's route table. This new |
| 39 | +readiness healthcheck will give users a healthcheck option that is less drastic |
| 40 | +than the current option. |
| 41 | + |
| 42 | +### Types of readiness healthcheck |
| 43 | + |
| 44 | +[Similar to liveness healthchecks](https://docs.cloudfoundry.org/devguide/deploy-apps/healthchecks.html), a readiness healthcheck can be of type "http", "port", or "process". |
| 45 | +However, when a user selects the "process" type, no readiness check is passed to the LRP, because once a process exits, the AI |
| 46 | +is marked as crashed and Diego attempts a restart. The default readiness healthcheck type is "process", which keeps existing behavior and is therefore backwards compatible. |
| 47 | + |
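The type rules above can be sketched as a small translation function. This is an illustrative sketch only, not Cloud Controller code: the function name `readiness_checks_for`, its parameters, and the default path `/` are assumptions, while the `http_check` and `tcp_check` field names are taken from the LRP example in this RFC.

```python
from typing import Any, Dict, List, Optional


def readiness_checks_for(
    check_type: str,
    port: int = 8080,
    endpoint: Optional[str] = None,
    timeout_ms: int = 10000,
) -> List[Dict[str, Any]]:
    """Translate a readiness healthcheck type into LRP-style check entries.

    Hypothetical helper: only the field names mirror the RFC's LRP example.
    """
    if check_type == "process":
        # Per the RFC, nothing is passed to the LRP for the "process" type.
        return []
    if check_type == "port":
        return [{"tcp_check": {"port": port, "connect_timeout_ms": timeout_ms}}]
    if check_type == "http":
        # Falling back to "/" when no endpoint is given is an assumption.
        path = endpoint or "/"
        return [
            {
                "http_check": {
                    "port": port,
                    "path": path,
                    "request_timeout_ms": timeout_ms,
                }
            }
        ]
    raise ValueError(f"unknown readiness healthcheck type: {check_type!r}")
```

For example, `readiness_checks_for("process")` yields an empty list, so the desired LRP is unchanged for apps that never set a readiness healthcheck.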
| 48 | +### Rolling deploys |
| 49 | + |
| 50 | +Rolling deploys should take the AI's routable status into account. An old AI |
| 51 | +should be replaced by a new AI only once the new AI is running and routable. |
| 52 | + |
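The replacement rule for rolling deploys can be sketched as a single predicate. This is a hedged illustration, not the Cloud Controller deployment logic: the `Instance` type and the function name are hypothetical, and the state string `"starting"` is an assumption (only `"running"` appears in this RFC).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical view of an app instance during a rolling deploy."""

    index: int
    state: str       # "running" appears in this RFC; other values are assumed
    routable: bool   # mirrors the proposed `routable` process stat


def can_replace_old_instance(new_instance: Instance) -> bool:
    """An old AI may be removed only once its replacement is running AND routable."""
    return new_instance.state == "running" and new_instance.routable
```

A new AI that is running but still warming up (`routable=False`) therefore does not yet allow the old AI to be removed.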
| 53 | +### Architecture Overview |
| 54 | +This feature will require changes in the following releases: |
| 55 | + |
| 56 | +* CF CLI |
| 57 | +* Cloud Controller |
| 58 | +* Diego |
| 59 | +* Routing |
| 60 | + |
| 61 | +1. The Cloud Controller will store the readiness healthcheck configuration and |
| 62 | + pass it on to the BBS as part of the desired LRP. |
| 63 | +2. The Diego executor will see these new readiness healthchecks on the desired |
| 64 | + LRP and will run the healthchecker binary in the app container with |
| 65 | + the provided configuration. |
| 66 | +3. When the readiness healthcheck succeeds, the actual LRP will be marked as |
| 67 | + "ready". When the readiness healthcheck fails, the actual LRP will be marked |
| 68 | + as "not ready". |
| 69 | +4. When the route emitter gets route information, it will check whether the AI |
| 70 | + is ready. It will emit registration or unregistration messages as |
| 71 | + appropriate for the gorouter to consume. |
| 72 | + |
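Step 4 above can be sketched as a function that splits actual LRPs into registrations and unregistrations based on readiness. This is a simplified model under stated assumptions, not the route emitter's code: the `ActualLRP` fields and the function name `plan_route_messages` are hypothetical, and real route messages carry more data than an address.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ActualLRP:
    """Hypothetical, minimal view of an actual LRP as seen by the route emitter."""

    instance_guid: str
    address: str  # "host:port" the gorouter would forward to
    ready: bool   # set from the readiness healthcheck result


def plan_route_messages(lrps: List[ActualLRP]) -> Tuple[List[str], List[str]]:
    """Return (addresses to register, addresses to unregister)."""
    register = [lrp.address for lrp in lrps if lrp.ready]
    unregister = [lrp.address for lrp in lrps if not lrp.ready]
    return register, unregister
```

An AI that fails its readiness healthcheck stays in the list of actual LRPs (it is not restarted); it only moves from the registration list to the unregistration list.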
| 73 | +### CC Design |
| 74 | +Users will be able to set the readiness healthcheck via the app manifest. |
| 75 | + |
| 76 | +``` |
| 77 | +applications: |
| 78 | +- name: test-app |
| 79 | +  processes: |
| 80 | +  - type: web |
| 81 | +    health-check-http-endpoint: /health |
| 82 | +    health-check-invocation-timeout: 2 |
| 83 | +    health-check-type: http |
| 84 | +    timeout: 80 |
| 85 | +    readiness-health-check-http-endpoint: /ready # 👈 new property |
| 86 | +    readiness-health-check-invocation-timeout: 2 # 👈 new property |
| 87 | +    readiness-health-check-type: http # 👈 new property |
| 88 | +    readiness-health-check-interval: 5 # 👈 new property |
| 89 | +``` |
| 90 | + |
| 91 | +A new `routable` field in the CC API [process stats |
| 92 | +object](https://v3-apidocs.cloudfoundry.org/version/3.141.0/index.html#the-process-stats-object), |
| 93 | +with the value `true` or `false`, will show whether the process instance is routable. |
| 94 | + |
| 95 | +### LRP Design |
| 96 | + |
| 97 | +The readiness healthcheck data will be a part of the desired LRP object. |
| 98 | + |
| 99 | +```json |
| 100 | +"check_definition": { |
| 101 | +  "checks": [ |
| 102 | +    { |
| 103 | +      "http_check": { |
| 104 | +        "port": 8080, |
| 105 | +        "path": "/health", |
| 106 | +        "request_timeout_ms": 10000 |
| 107 | +      } |
| 108 | +    } |
| 109 | +  ], |
| 110 | +  "readiness_checks": [ # 👈 new property |
| 111 | +    { |
| 112 | +      "tcp_check": { |
| 113 | +        "port": 8080, |
| 114 | +        "connect_timeout_ms": 10000 |
| 115 | +      } |
| 116 | +    } |
| 117 | +  ], |
| 118 | +  "log_source": "" |
| 119 | +}, |
| 120 | +``` |
| 121 | + |
| 122 | +### CLI Changes |
| 123 | + |
| 124 | +The `routable` field of the process stats API object will be shown in the |
| 125 | +CF CLI `cf app` output. |
| 126 | + |
| 127 | +``` |
| 128 | +     state     routable   since                  cpu    memory          disk         logging            details |
| 129 | +#0   running   yes        2023-06-27T15:07:14Z   0.6%   46.8M of 192M   179M of 1G   0/s of unlimited |
| 130 | +#1   running   no         2023-06-27T15:11:43Z   0.0%   0 of 0          0 of 0       0/s of 0/s |
| 131 | +#2   running   yes        2023-06-27T15:11:43Z   0.0%   0 of 192M       0 of 1G      0/s of 0/s |
| 132 | +``` |
| 133 | + |
| 134 | +New CLI options will be added to the `cf push` command: |
| 135 | + |
| 136 | +* `--readiness-endpoint` will set the endpoint for http readiness |
| 137 | + checks. |
| 138 | +* `--readiness-health-check-type` will set the type of the readiness |
| 139 | + check: "http", "port", or "process". |
| 140 | +* `--readiness-health-check-interval` will set the interval of the readiness |
| 141 | + check. Must be an integer greater than 0. |
| 142 | + |
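The constraints on the new options can be sketched as two validators. This is an illustration of the stated rules only, not CF CLI code (the CLI itself is not written in Python); the function names and error messages are hypothetical.

```python
VALID_READINESS_TYPES = ("http", "port", "process")


def validate_readiness_type(check_type: str) -> str:
    """Accept only the three readiness healthcheck types named in this RFC."""
    if check_type not in VALID_READINESS_TYPES:
        raise ValueError(
            f"readiness health check type must be one of {VALID_READINESS_TYPES}"
        )
    return check_type


def parse_readiness_interval(raw: str) -> int:
    """Parse the interval option: it must be an integer greater than 0."""
    try:
        value = int(raw)
    except ValueError:
        raise ValueError("readiness health check interval must be an integer") from None
    if value <= 0:
        raise ValueError("readiness health check interval must be greater than 0")
    return value
```

With these rules, `parse_readiness_interval("5")` is accepted, while `"0"`, `"-1"`, and `"2.5"` are rejected.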
| 143 | +New CLI commands will be added: |
| 144 | + |
| 145 | +* `get-readiness-health-check` shows the readiness health check performed on an |
| 146 | + app instance |
| 147 | +* `set-readiness-health-check` updates the readiness health check performed on |
| 148 | + an app instance |
| 149 | + |
| 150 | +### Logging and Metrics |
| 151 | + |
| 152 | +#### App logs |
| 153 | + |
| 154 | +When an AI's readiness healthcheck succeeds, a log line is printed to the AI's logs: |
| 155 | +"Container passed the readiness health check. Container marked ready and added |
| 156 | +to route pool". When an AI's readiness healthcheck fails, a log line is printed to the AI's |
| 157 | +logs: "Container failed the readiness health check. Container marked not ready |
| 158 | +and removed from route pool". |
| 159 | + |
| 160 | +#### App events |
| 161 | + |
| 162 | +When an AI's readiness healthcheck succeeds, a new application event is emitted: |
| 163 | +"app.ready". When an AI's readiness healthcheck fails, a new event is emitted: |
| 164 | +"app.notready". |
| 165 | + |
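The log lines and events above can be sketched as a mapping from a readiness result to an (event, log line) pair. The event names and log text are quoted from this RFC; the function name is hypothetical, and emitting only when the state changes (rather than on every check) is an assumption made for this sketch.

```python
from typing import Optional, Tuple

READY_LOG = (
    "Container passed the readiness health check. "
    "Container marked ready and added to route pool"
)
NOT_READY_LOG = (
    "Container failed the readiness health check. "
    "Container marked not ready and removed from route pool"
)


def readiness_transition(was_ready: bool, is_ready: bool) -> Optional[Tuple[str, str]]:
    """Return (event name, app log line) for a readiness change, else None."""
    if was_ready == is_ready:
        # Assumption: repeated identical results produce no new log or event.
        return None
    if is_ready:
        return ("app.ready", READY_LOG)
    return ("app.notready", NOT_READY_LOG)
```

For instance, an AI that finishes warming up goes from not ready to ready and yields `("app.ready", READY_LOG)`.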
| 166 | +### Open Questions |
| 167 | +* What metrics would be helpful for app devs and operators? |
| 168 | + |
| 169 | +This work is ongoing. All comments and concerns from the community are welcome. |
| 170 | +Either add a comment here or reach out in Slack in #wg-app-runtime-platform. |
| 171 | + |
| 172 | + |