Commit 923e897

Merge pull request #630 from cloudfoundry/rfc-readiness-healthchecks
[RFC 630] add readiness healthchecks for apps
2 parents 747a7b4 + 380127d commit 923e897

1 file changed: 172 additions, 0 deletions
@@ -0,0 +1,172 @@
# Meta
[meta]: #meta
- Name: Readiness Healthchecks
- Start Date: 2023-06-26
- Author(s): @ameowlia, @mariash
- Status: Draft
- RFC Pull Request: https://github.com/cloudfoundry/community/pull/630

## Summary

Add a readiness healthcheck option for apps. When the readiness healthcheck
passes, the app instance (AI) is marked "ready" and the AI will be routable.
When the readiness healthcheck fails, the AI is marked as "not ready" and its
route will be removed from gorouter's route table.

## Problem

With the current implementation of application healthchecks, when the
application healthcheck detects that an AI is unhealthy, Diego will stop
the AI, delete the AI, and reschedule a new AI.

This is too aggressive for some apps. A single request can fail for many
reasons while the app is actually running fine. Additionally, many
applications have a warm-up period during which they are not ready to receive
requests. For example, apps might need to populate caches, load data, or wait
for external services before they mark themselves as routable. In these cases,
the app should be kept alive, but in a non-routable state.

## Proposal

### Summary
We intend to support readiness healthchecks. (This was requested previously in
this [issue](https://github.com/cloudfoundry/cloud_controller_ng/issues/1706).)
This would be an additional healthcheck that app developers could configure.
When the readiness healthcheck passes, the AI is marked "ready" and the AI will
be routable. When the readiness healthcheck fails, the AI is marked as "not
ready" and its route will be removed from gorouter's route table. This new
readiness healthcheck will give users a healthcheck option that is less drastic
than the current option.
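
To make the difference between the two healthchecks concrete, here is a minimal Python sketch of the intended outcomes. The `Instance` class and `apply_check_results` function are hypothetical names used only for illustration; they are not Diego's data model or API.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical model of an app instance (AI), for illustration only."""
    running: bool = True
    routable: bool = False


def apply_check_results(instance, liveness_ok, readiness_ok):
    """Return the AI state after one round of healthchecks.

    A failed liveness check stops the AI (Diego would then reschedule a new one).
    A failed readiness check keeps the AI running but makes it unroutable.
    """
    if not liveness_ok:
        return Instance(running=False, routable=False)
    return Instance(running=True, routable=readiness_ok)
```

In this sketch, `apply_check_results(Instance(), True, False)` yields an instance that is still running but not routable, which is the "less drastic" outcome described above.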

## Types of readiness healthcheck

[Similar to liveness healthchecks](https://docs.cloudfoundry.org/devguide/deploy-apps/healthchecks.html), a readiness healthcheck can be of type "http", "port", or "process".
However, when a user selects the "process" healthcheck type, nothing will be passed to the LRP, because once a process exits the AI
is marked as crashed and Diego will attempt a restart. The default readiness healthcheck type is "process", which is backwards compatible because no readiness check is added to existing apps.

## Rolling deploys

Rolling deploys should take the AI's routable status into account. An old AI
should be replaced only once its new AI is running and routable.
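
One way to read this rule is sketched below in Python. The function name and the dictionary keys (`state`, `routable`, mirroring the proposed `cf app` columns) are assumptions for illustration; the actual deployment logic in Cloud Controller may differ.

```python
def replaceable_old_instances(old_count, new_instances):
    """Return how many old AIs may be stopped so far in a rolling deploy.

    old_count: number of old AIs at the start of the deploy.
    new_instances: list of dicts with hypothetical keys "state" and "routable".
    An old AI is only replaceable once a new AI is both running and routable.
    """
    ready_new = sum(
        1 for ai in new_instances
        if ai["state"] == "running" and ai["routable"]
    )
    return min(old_count, ready_new)
```

For example, with three old AIs and three new AIs of which only one is running and routable, this sketch allows exactly one old AI to be stopped.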

### Architecture Overview
This feature will require changes in the following releases:

* CF CLI
* Cloud Controller
* Diego
* Routing

1. The Cloud Controller will store this new data before passing it on to the BBS
   as part of the desired LRP.
2. The Diego executor will see these new readiness healthchecks on the desired
   LRP and will run the healthchecker binary in the app container with
   the configuration provided.
3. When the readiness healthcheck succeeds, the actual LRP will be marked as
   "ready". When the readiness healthcheck fails, the actual LRP will be marked
   as "not ready".
4. When the route emitter gets route information, it will inspect whether the AI is
   ready or not ready. It will emit registration or unregistration messages as
   appropriate for the gorouter to consume.
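
Step 4 can be sketched roughly as follows. This is a simplified Python illustration, not the route emitter's implementation: the function name, the instance ids, and the tuple-shaped messages are all hypothetical, and the sketch only emits a message when the ready state changes.

```python
def route_messages(registered, actual_lrps):
    """Decide which route messages to emit for the gorouter.

    registered: set of instance ids that currently have a registered route.
    actual_lrps: dict mapping instance id to its "ready" flag (True or False).
    Returns a list of (action, instance_id) tuples, ordered by instance id.
    """
    messages = []
    for instance_id, ready in sorted(actual_lrps.items()):
        if ready and instance_id not in registered:
            messages.append(("register", instance_id))
        elif not ready and instance_id in registered:
            messages.append(("unregister", instance_id))
    return messages
```

With `registered = {"a"}` and `actual_lrps = {"a": False, "b": True, "c": False}`, the sketch unregisters `a`, registers `b`, and emits nothing for `c`, which was never routable.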

### CC Design
Users will be able to set the readiness healthcheck via the app manifest.

```
applications:
- name: test-app
  processes:
  - type: web
    health-check-http-endpoint: /health
    health-check-invocation-timeout: 2
    health-check-type: http
    timeout: 80
    readiness-health-check-http-endpoint: /ready # 👈 new property
    readiness-health-check-invocation-timeout: 2 # 👈 new property
    readiness-health-check-type: http # 👈 new property
    readiness-health-check-interval: 5 # 👈 new property
```

A new `routable` field in the CC API [process stats
object](https://v3-apidocs.cloudfoundry.org/version/3.141.0/index.html#the-process-stats-object)
with values `true` or `false` will display the routable status of the process.
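
As an illustration of how a client might consume the proposed field, the Python sketch below lists the instances that are not routable. The sample payload is abridged and hypothetical: it assumes the stats response keeps a `resources` list with `type`, `index`, and `state` entries and simply gains the new `routable` field.

```python
import json

# Abridged, hypothetical stats payload; only "routable" is the proposed field.
SAMPLE_STATS = """
{
  "resources": [
    {"type": "web", "index": 0, "state": "RUNNING", "routable": true},
    {"type": "web", "index": 1, "state": "RUNNING", "routable": false}
  ]
}
"""


def unroutable_indexes(stats_json):
    """Return the indexes of instances whose "routable" field is not true."""
    stats = json.loads(stats_json)
    # Treating a missing field as "not routable" is a choice made for this sketch.
    return [r["index"] for r in stats["resources"] if not r.get("routable", False)]
```

Calling `unroutable_indexes(SAMPLE_STATS)` on the sample above picks out instance 1.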

### LRP Design

The readiness healthcheck data will be a part of the desired LRP object. The
`readiness_checks` array is the new property.

```json
"check_definition": {
  "checks": [
    {
      "http_check": {
        "port": 8080,
        "path": "/health",
        "request_timeout_ms": 10000
      }
    }
  ],
  "readiness_checks": [
    {
      "tcp_check": {
        "port": 8080,
        "connect_timeout_ms": 10000
      }
    }
  ],
  "log_source": ""
},
```
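
The mapping from the readiness healthcheck type to the `readiness_checks` entries could look like the Python sketch below. The field names come from the example above; the function name, the default port, and the default timeout are assumptions for illustration, not Cloud Controller behavior.

```python
def readiness_checks(check_type, port=8080, endpoint="/", timeout_ms=10000):
    """Build the "readiness_checks" list for a desired LRP.

    "process" yields an empty list, since nothing is passed to the LRP for that
    type. The defaults here are illustrative, not actual platform defaults.
    """
    if check_type == "process":
        return []
    if check_type == "http":
        return [{"http_check": {
            "port": port,
            "path": endpoint,
            "request_timeout_ms": timeout_ms,
        }}]
    if check_type == "port":
        return [{"tcp_check": {
            "port": port,
            "connect_timeout_ms": timeout_ms,
        }}]
    raise ValueError(f"unknown readiness healthcheck type: {check_type!r}")
```

For instance, `readiness_checks("port")` produces a single `tcp_check` entry on port 8080, matching the JSON fragment above.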

### CLI Changes

The `routable` field of the process stats API object will be shown in
the CF CLI `cf app` output.

```
     state     routable   since                  cpu    memory          disk         logging             details
#0   running   yes        2023-06-27T15:07:14Z   0.6%   46.8M of 192M   179M of 1G   0/s of unlimited
#1   running   no         2023-06-27T15:11:43Z   0.0%   0 of 0          0 of 0       0/s of 0/s
#2   running   yes        2023-06-27T15:11:43Z   0.0%   0 of 192M       0 of 1G      0/s of 0/s
```

New CLI options will be added to the `cf push` command:

* `--readiness-endpoint` will set the endpoint for http readiness
  checks.
* `--readiness-health-check-type` will set the type of the readiness
  check: "http", "port", or "process".
* `--readiness-health-check-interval` will set the interval of the readiness
  check. Must be an integer greater than 0.
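
The constraints stated for these options can be expressed as a small validation routine. The Python sketch below is only an illustration of those rules; the function name and error messages are hypothetical and do not reflect the CF CLI's code or output.

```python
VALID_TYPES = ("http", "port", "process")


def validate_readiness_options(check_type=None, interval=None):
    """Return a list of error strings for the given readiness options.

    check_type must be one of "http", "port", or "process".
    interval must be an integer greater than 0.
    Options left as None are treated as not provided and are not checked.
    """
    errors = []
    if check_type is not None and check_type not in VALID_TYPES:
        errors.append(f"invalid readiness health check type: {check_type!r}")
    if interval is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(interval, int) and not isinstance(interval, bool)
        if not is_int or interval <= 0:
            errors.append(
                "readiness health check interval must be an integer greater than 0"
            )
    return errors
```

An empty list means the options satisfy the rules listed above.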

New CLI commands will be added:

* `get-readiness-health-check` shows the readiness health check performed on an
  app instance
* `set-readiness-health-check` updates the readiness health check performed on
  an app instance

### Logging and Metrics

#### App logs

When an AI's readiness healthcheck succeeds, a log line is printed to the AI
logs: "Container passed the readiness health check. Container marked ready and
added to route pool". When an AI's readiness healthcheck fails, a log line is
printed to the AI logs: "Container failed the readiness health check. Container
marked not ready and removed from route pool".

#### App events

When an AI's readiness healthcheck succeeds, a new application event is emitted:
"app.ready". When an AI's readiness healthcheck fails, a new event is emitted:
"app.notready".
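
The pairing of log lines and events described above can be summarized in a short Python sketch. The message strings and event names are taken from this section; the function and constant names are hypothetical.

```python
READY_LOG = (
    "Container passed the readiness health check. "
    "Container marked ready and added to route pool"
)
NOT_READY_LOG = (
    "Container failed the readiness health check. "
    "Container marked not ready and removed from route pool"
)


def readiness_notification(check_passed):
    """Return the (log line, event name) pair for a readiness check result."""
    if check_passed:
        return READY_LOG, "app.ready"
    return NOT_READY_LOG, "app.notready"
```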

### Open Questions
* What metrics would be helpful for app devs and operators?

This work is ongoing. All comments and concerns are welcomed from the community.
Either add a comment here or reach out in Slack in #wg-app-runtime-platform.
