Commit 18114ee

healthcheck: A brand new Healthcheck that realizes vip/vip-port/backend 3-tier topology, and supports flexible actioner/checker configurations for each layer.
* healthcheck: main and manager module
* healthcheck: metric server
* healthcheck: comm module
* healthcheck: service-lister
* healthcheck: virtual-address module
* healthcheck: virtual-service module
* healthcheck: checker module
* healthcheck: fix metric server problem
  1. Stop metric server after all checkers done.
  2. Fix state duration update and display problem.
* healthcheck: fix crash problem in stopping
* healthcheck: add more checker method frames, fix thundering herd problem and minus count problem
* healthcheck: sort metric data
* healthcheck: support cache entry deletion in metric server
* healthcheck: implement tcp checker
* healthcheck: align metric server outputs
* healthcheck: implement udp checker
* healthcheck: implement ping checker
* healthcheck: terminate VAs, VSs and Checkers concurrently and asynchronously
* healthcheck: implement udpping checker
* healthcheck: fix some counting problems
  1. VA's VS counting turbulence problem.
  2. VS's Backend counting turbulence problem.
  3. up/down retry counting skew problem.
* healthcheck: implement http checker
* healthcheck: add more testing codes for checker methods
* healthcheck: implement BackendUpdate actioner
* healthcheck: udp checker return Healthy rather than Unknown when state undetermined
* healthcheck2: name more empty actioners
* healthcheck: implement KernelRouteAddDel actioner
* healthcheck: implement Script actioner
* healthcheck: use extra params for actioner creation to avoid contaminating configs from file
* healthcheck: implement DpvsAddrAddDel actioner
* healthcheck: implement DpvsAddrKernelRouteAddDel actioner
* healthcheck: implement blank actioner
* export types for parsing config from yaml file, and provide config sample/template files
* healthcheck: add HTTP APIs for config file lookup and validation
* healthcheck: implement config file reloader
* healthcheck: conf validation support type-proned params
* healthcheck: Fix config validation failure caused by auto check param
* healthcheck2: fix 2 minor problems found in code review by Copilot
* healthcheck2: fix several problems for config update
  1. Merge correct default configs for VAs and VSs.
  2. Call VA actioner creation with DpvsAgentAddr param, which is essential for some actioner types.
  3. Make actioner params config independent of actioner method.
  4. Checkers send health state notice on config updates only when necessary.
* healthcheck2: add checker and va-down-policy annotations to conf metric server outputs
* healthcheck2: fix more problems for config dynamic update
  1. Set VA state to Unhealthy when no VS configured.
  2. Fix update failure problems of VA down-policy and Checker interval.
  3. Reset VA/VS state to default state (Unhealthy/Healthy respectively) before changing actioner.
  4. Do update even if service deployment revision version is unchanged to support dynamic config updates.
  5. Add logs before configs are actually updated.
* healthcheck2: fix problems in checker retry and update config sample file
* healthcheck2: add license banner message
* healthcheck2: reduce config file reload interval from 177s to 53s
* healthcheck2: add doc README.md
* healthcheck2: set backend weight to 0 when unhealthy
* healthcheck2: add script for service backend replace test
* healthcheck2: fix high cpu utilization problem caused by busy polling in running task and metric server
  The healthcheck CPU utilization is up to 300% even if no healthcheck tasks are running. As shown in the pprof-cpu report in ./test/cpu-profile001.svg, the problem is caused by the unnecessary "default" clause which makes the "select" loop busy poll.
* healthcheck2: add stress test doc
* healthcheck: replace healthcheck with healthcheck2
* healthcheck: fix issues reported by Copilot code review

---------

Signed-off-by: ywc689 <ywc689@163.com>
1 parent 8f62a3a commit 18114ee

66 files changed, +8191 -2619 lines changed

tools/healthcheck/.gitignore

Lines changed: 0 additions & 3 deletions
This file was deleted.

tools/healthcheck/Makefile

Lines changed: 10 additions & 0 deletions
@@ -11,6 +11,7 @@ all: $(TARGET)
 
 $(TARGET): go-proxy
 	-$(GO) mod tidy
+	$(GO) vet
 	$(GO_BUILD) -o $@
 
 go-proxy:
@@ -19,6 +20,15 @@ go-proxy:
 clean:
 	$(GO_CLEAN)
 
+MODULE_NAME := $(shell grep '^module ' go.mod | awk '{print $$2}')
+GO_PATH := $(shell $(GO) env GOPATH)
+code-gen: $(GO_PATH)/bin/deepcopy-gen
+	$(GO_PATH)/bin/deepcopy-gen --alsologtostderr -i $(MODULE_NAME)/pkg/manager/ -O zz_deepcopy_generated --go-header-file=license.txt --trim-path-prefix=$(MODULE_NAME)
+	$(GO_PATH)/bin/deepcopy-gen --alsologtostderr -i $(MODULE_NAME)/pkg/comm/ -O zz_deepcopy_generated --go-header-file=license.txt --trim-path-prefix=$(MODULE_NAME)
+
+$(GO_PATH)/bin/deepcopy-gen:
+	go install k8s.io/code-generator/cmd/deepcopy-gen@v0.29.12
+
 license: license.txt
 ifeq ($(shell addlicense 2>&1|grep Usage),)
 $(error "`addlicense` command not found. You can install it with `go install github.com/google/addlicense`")

tools/healthcheck/README.md

Lines changed: 225 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,225 @@
1+
DPVS Healthcheck Program
2+
------
3+
4+
# Introduction
5+
6+
The program implements a 3-tier health check framework for DPVS. It works in cooperation with dpvs-agent, polling loadbalancer services periodically and performing actions according to check results.
7+
From top to botoom, there are 3 check layers tracking health states of `VA`, `VS`, and `Checker` respectively.
8+
9+
* **VA**(Virtual Address): Loadbalancer virtual service address denoted by IP.
10+
* **VS** Virtual Server): Loadbalancer virtual service denoted by triplet <IP,Protocol,Port>.
11+
* **Checker**: Checker runner for a specific backend server, invoking its health check method regularly.
12+
13+
A `VA` comprises of mutiple `VS`s, and a `VS` consists of multiple `Checkers`. The `Checker` checks backend server's health state periodically, and reports the result to its `VS`. The `VS` tracks health states for all its backends, performs UP/DOWN actions on each backend, sums a health state of itself from backends states and reports the result to `VA`. The `VA` collects all its `VS` health states, calculates a health state of itself with respect to `DownPolicy` configured by user, and executes corresponding action when its health state is changed.
14+
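The aggregation step at the `VA` layer can be sketched as follows. This is a minimal illustration, not the program's implementation: the type and function names are hypothetical, the policy values follow the `1-oneOf, 2-allOf` annotations printed by the `/conf` endpoint, and the meaning given to each policy (`oneOf`: down as soon as one `VS` is down; `allOf`: down only when every `VS` is down) is an assumption. The rule that a `VA` without any `VS` is Unhealthy comes from the commit message.

```go
package main

import "fmt"

// State is a simplified health state (hypothetical type).
type State int

const (
	Unhealthy State = iota
	Healthy
)

// DownPolicy values mirror the annotations "1-oneOf, 2-allOf".
type DownPolicy int

const (
	OneOf DownPolicy = 1
	AllOf DownPolicy = 2
)

// vaState derives a VA state from the states of its VSs.
// Assumed semantics: OneOf marks the VA down when at least one VS is down,
// AllOf only when all of them are down. A VA with no VS is Unhealthy.
func vaState(policy DownPolicy, vs []State) State {
	if len(vs) == 0 {
		return Unhealthy
	}
	down := 0
	for _, s := range vs {
		if s == Unhealthy {
			down++
		}
	}
	switch policy {
	case OneOf:
		if down > 0 {
			return Unhealthy
		}
	case AllOf:
		if down == len(vs) {
			return Unhealthy
		}
	}
	return Healthy
}

func main() {
	vs := []State{Healthy, Unhealthy, Healthy}
	fmt.Println(vaState(OneOf, vs) == Healthy) // false
	fmt.Println(vaState(AllOf, vs) == Healthy) // true
}
```

Under these assumptions `allOf` is the more tolerant policy: the address stays announced while at least one of its services still works.
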
The diagram below shows an example of the health check deployment layout.

```
VA:                                                                192.168.88.1
                                                                         |
                                -----------------------------------------------------------------------------------
                                |                                             |                                   |
VS:                    192.168.88.1-TCP-80                          192.168.88.1-TCP-443                 192.168.88.1-UDP-80
                                |                                             |                                   |
                    -------------------------                      ------------------------                       |
                    |                       |                      |                      |                       |
Checker: 192.168.88.30-TCP-8080  192.168.88.68-TCP-8080  192.168.88.30-TCP-443  192.168.88.68-TCP-443  192.168.88.68-UDP-6000
```

Check methods supported are:

* **none**: Do nothing, used as a placeholder.
* **tcp**: Check via TCP probe, including a SYN probe procedure and possible data exchange.
* **udp**: Check via UDP probe, relying on ICMP error messages such as `Destination Unreachable` and possible data exchange.
* **ping**: Check via ICMP/ICMPv6 echo request/reply.
* **udpping**: First perform a ping check and, if it succeeds, perform a udp check.
* **http**: Check via HTTP/HTTPS probe, supporting versatile user configurations.

Action methods supported by `VS` are:
* **BackendUpdate**: Update the backend's weight and `inhibited` flag in DPVS according to the given health state. Also return new service lists if the objects to update have expired.

Action methods supported by `VA` are:
* **Blank**: Do nothing, used as a placeholder.
* **KernelRouteAddDel**: Add/Remove IP address from a specified Linux network interface.
* **DpvsAddrAddDel**: Add/Remove IP address from a specified DPVS interface.
* **DpvsAddrKernelRouteAddDel**: Do both `KernelRouteAddDel` and `DpvsAddrAddDel`.
* **Script**: Run a script provided by the user.

Check/Action methods can be extended easily under the framework of the healthcheck program.

# Configurations
51+
52+
### 1. Application Configurations
53+
54+
Application configurations are provided with commandline parameters. Run `./healthcheck -h` to see the supported configuration items.
55+
56+
```sh
57+
# ./healthcheck -h
58+
Usage of ./healthcheck:
59+
-alsologtostderr
60+
log to standard error as well as files
61+
-checker-notify-channel-size uint
62+
Channel size for checker state change notice and resync. (default 100)
63+
-conf-check-uri string
64+
Http URI for checking if config file valid. (default "/conf/check")
65+
-conf-uri string
66+
Http URI for showing current effective configs. (default "/conf")
67+
-config-file string
68+
File path of healthcheck config file. (default "/etc/healthcheck.conf")
69+
-config-reload-interval duration
70+
Time interval to reload healthcheck config file. (default 7s)
71+
-debug
72+
Enable gops for debug.
73+
-dpvs-agent-addr string
74+
Server address of dpvs-agent. (default ":8082")
75+
-dpvs-service-list-interval duration
76+
Time interval to refetch dpvs services. (default 15s)
77+
-log_backtrace_at value
78+
when logging hits line file:N, emit a stack trace
79+
-log_dir string
80+
If non-empty, write log files in this directory
81+
-log_link string
82+
If non-empty, add symbolic links in this directory to the log files
83+
-logbuflevel int
84+
Buffer log messages logged at this level or lower (-1 means don't buffer; 0 means buffer INFO only; ...). Has limited applicability on non-prod platforms.
85+
-logtostderr
86+
log to standard error instead of files
87+
-metric-notify-channel-size uint
88+
Channel size for metric data sent from checkers to metric server. (default 1000)
89+
-metric-delay duration
90+
Max delayed time to send changed metric to metric server. (default 2s)
91+
-metric-server-addr string
92+
Server address for exporting healthcheck state and statistics. (default ":6601")
93+
-metric-server-uri string
94+
Http URI for exporting healthcheck state and statistics. (default "/metrics")
95+
-stderrthreshold value
96+
logs at or above this threshold go to stderr (default 2)
97+
-v value
98+
log level for V logs
99+
-vmodule value
100+
comma-separated list of pattern=N settings for file-filtered logging
101+
-vs-notify-channel-size uint
102+
Channel size for virtual service state change notice and resync. (default 100)
103+
```
104+
105+
> Notes: The commandline parameters above may evolve with the project iteration. Please refer to the helper information from your program for the supported parameters.
106+
### 2. Checker Configurations

The healthcheck program supports a YAML file for checker configurations. The file layout and all supported configurations are maintained in [healthcheck.conf.template](./conf/healthcheck.conf.template).

A `global` config block for `VS` and `VA` can be included in the file; if it is omitted, the default configurations in the code are used. Besides, you can set config values that differ from the global ones for a specific `VA` or `VS` in the `virtual-addresses` and `virtual-servers` config blocks respectively. If set, they overwrite the global configurations. We provide two config files as examples.

* [healthcheck.conf.simple](./conf/healthcheck.conf.simple): The simplest config file, with all items at their default values.
* [healthcheck.conf.sample](./conf/healthcheck.conf.sample): A config file that specifies a global config block and some object-specific blocks.

An empty configuration file is allowed, in which case the default configurations in the code are used. Note that the healthcheck program must start with an existing config file specified by the `-config-file` command-line parameter.

We can validate the config file with an HTTP API whose path is set by the `-conf-check-uri` command-line parameter (default `/conf/check`). Take [healthcheck.conf.sample](./conf/healthcheck.conf.sample) for example.

```
# curl http://10.61.240.28:6601/conf/check
Config File /etc/healthcheck.conf: VALID
......
```

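The same check can be scripted, for instance before rolling out a new config file. The following sketch assumes only what the curl output above shows, namely that the first line of the reply ends with `: VALID` for a valid file; the reply format for an invalid file is not shown in this document, so anything else is treated as not valid. `confValid` and `stubServer` are hypothetical helpers, and the stub stands in for a running healthcheck instance.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// confValid fetches the conf-check URL and reports whether the first line
// of the reply ends with ": VALID".
func confValid(url string) (bool, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	first, _, _ := strings.Cut(string(body), "\n")
	// Match ": VALID" rather than "VALID" so that "INVALID" is not accepted.
	return strings.HasSuffix(strings.TrimSpace(first), ": VALID"), nil
}

// stubServer stands in for the healthcheck metric server in this demo.
func stubServer(reply string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, reply)
	}))
}

func main() {
	srv := stubServer("Config File /etc/healthcheck.conf: VALID\n......\n")
	defer srv.Close()
	ok, err := confValid(srv.URL + "/conf/check")
	fmt.Println(ok, err) // true <nil>
}
```

Against a real deployment, replace the stub URL with the address given by `-metric-server-addr` plus the `-conf-check-uri` path.
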
Besides, we can retrieve the effective configurations via an HTTP API whose path is set by the `-conf-uri` command-line parameter (default `/conf`). Again, take [healthcheck.conf.sample](./conf/healthcheck.conf.sample) for example.

```
# curl http://10.61.240.28:6601/conf
# Check Method Annotations: 1-none, 2-tcp, 3-udp, 4-ping, 5-udpping, 6-http, 10000-auto, 65535-passive
# VA DownPolicy Annotations: 1-oneOf, 2-allOf

global:
  virtual-address:
    disable: false
    down-policy: 2
    actioner: KernelRouteAddDel
    action-timeout: 2s
    action-sync-time: 1m0s
    action-params:
      ifname: dpdk0.102.kni
  virtual-server:
    method: 10000
    interval: 3s
    down-retry: 1
    up-retry: 1
    timeout: 2s
    method-params: {}
    actioner: BackendUpdate
    action-timeout: 2s
    action-sync-time: 15s
    action-params: {}
virtual-addresses:
  192.168.88.1:
    disable: false
    down-policy: 1
    actioner: DpvsAddrKernelRouteAddDel
    action-timeout: 2s
    action-sync-time: 30s
    action-params:
      dpvs-ifname: dpdk0.102
      ifname: dpdk0.102.kni
  2001::1:
    disable: true
    down-policy: 2
    actioner: KernelRouteAddDel
    action-timeout: 2s
    action-sync-time: 1m0s
    action-params:
      ifname: dpdk0.102.kni
virtual-servers:
  192.168.88.1-TCP-8080:
    method: 4
    interval: 5s
    down-retry: 0
    up-retry: 0
    timeout: 1s
    method-params: {}
    actioner: BackendUpdate
    action-timeout: 1s
    action-sync-time: 10s
    action-params: {}
```

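One detail in this output is worth a note: `192.168.88.1-TCP-8080` reports `down-retry: 0` and `up-retry: 0`, while healthcheck.conf.sample sets both to `999999`, which healthcheck.conf.template documents as the value for zero retry (the default being 1). A plausible reading, sketched below, is that a literal `0` in the file cannot be told apart from an unset item and therefore falls back to the default. That reasoning is an assumption, and `effectiveRetry` is a hypothetical helper rather than a function of the program.

```go
package main

import "fmt"

const (
	defaultRetry  = 1      // default documented in the config template
	zeroRetryMark = 999999 // template value meaning "zero retry"
)

// effectiveRetry maps a raw up-retry/down-retry value from the file
// to the retry count that is actually applied.
// Assumption: 0 is treated as "unset" and yields the default.
func effectiveRetry(raw uint) uint {
	switch raw {
	case 0:
		return defaultRetry
	case zeroRetryMark:
		return 0
	default:
		return raw
	}
}

func main() {
	fmt.Println(effectiveRetry(999999)) // 0
	fmt.Println(effectiveRetry(0))      // 1
	fmt.Println(effectiveRetry(3))      // 3
}
```
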
# Metric Observation

A metric collection mechanism is built into the program. We can get the metric data from the metric server specified by the `-metric-server-addr` and `-metric-server-uri` command-line parameters. The metric data falls into two categories.

* Thread Statistics: The currently running, stopping and finished goroutines for `VA`, `VS`, `Checker` and healthcheck methods.
* Object Statistics: Organized as a three-layer structure; each line shows the current health state and statistic data for a specific item in the layer.

The object statistics are shown in a 6-tuple format. Its meaning varies by layer, as shown in the table below.

|                    | up, down           | up_notices, down_notices         | fail1, fail2               |
| ------------------ | ------------------ | -------------------------------- | -------------------------- |
| Checker            | probe state counts | state change notices             | check timeout, check error |
| VirtualService(VS) | success actions    | received va state change notices | failed up/down actions     |
| VirtualAddress(VA) | success actions    | received vs state change notices | failed up/down actions     |

Below is a metric report from a test environment, showing the health states and statistics of a DPVS server at 2025-05-09 14:31:02.

```
# curl http://10.61.240.28:6601/metrics
2025-05-09 14:31:02.802071741 +0800 CST m=+1026.188881757

Thread Statistics:
                    running     stopping    finished
VirtualAddress      1           0           0
VirtualService      3           0           0
Checker             4           0           0
HealthCheck         1           0           1227

object                              state                   statistics          extra(optional)
---------------------------------------------------------------------------------------------------------------------------
192.168.88.1                        Healthy 17m0s           1,0,3,0,0,0
    TCP 192.168.88.1:80             Healthy 16m58s          1,0,1,0,0,0
        -> 192.168.88.30:80         Healthy 17m1s           341,0,1,0,0,0
    TCP 192.168.88.1:8080           Healthy 16m59s          1,0,1,0,0,0
        -> 192.168.88.130:80        Healthy 16m59s          204,0,1,0,0,0
    UDP 192.168.88.1:80             Healthy 17m0s           1,0,2,0,0,0
        -> 192.168.88.130:7000      Healthy 16m58s          340,0,1,0,0,0
        -> 192.168.88.30:6000       Healthy 17m3s           342,0,1,0,0,0
Notes:
    statistics denotation: up,down,up_notices,down_notices,fail(up,timeout),fail(down,error)
```
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
---
global:
  virtual-address:
    disable: false
    down-policy: 2
    action-sync-time: 60s
    actioner: KernelRouteAddDel
    action-params:
      ifname: lo
  virtual-server:
    method: 10000
    interval: 3s
    down-retry: 1
    up-retry: 1
    timeout: 2s
    action-timeout: 2s
    action-sync-time: 15s
virtual-addresses:
  192.168.88.1:
    disable: false
    action-sync-time: 30s
    down-policy: 1
    actioner: DpvsAddrKernelRouteAddDel
    action-params:
      ifname: lo
      dpvs-ifname: dpdk0.102
  "2001::1":
    disable: true
virtual-servers:
  192.168.88.1-TCP-8080:
    method: 4
    interval: 5s
    down-retry: 999999 ## zero retry
    up-retry: 999999 ## zero retry
    timeout: 1s
    action-timeout: 1s
    action-sync-time: 10s
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
---
global:
  virtual-address:
    actioner: KernelRouteAddDel
    action-params:
      ifname: lo
virtual-addresses:
virtual-servers:
Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
###### Action Parameters
ActionParamsBlank: none
ActionParamsBackendUpdate: none
ActionParamsKernelRouteAddDel:
  ifname: string, lo
ActionParamsDpvsAddrAddDel:
  dpvs-ifname: string, ""
ActionParamsDpvsAddrKernelRouteAddDel:
  ifname: string, lo
  dpvs-ifname: string, ""
ActionParamScript:
  script: string(filepath), ""
  args: string, ""

###### Checker Parameters
CheckParamsNone: none
CheckParamsTCP:
  send: string, ""
  receive: string, ""
  proxy-protocol: string, ""|v1|v2
CheckParamsUDP:
  send: string, ""
  receive: string, ""
  proxy-protocol: string, ""|v2
CheckParamsUDPPing:
  send: string, ""
  receive: string, ""
  proxy-protocol: string, ""|v2
CheckParamsHTTP:
  method: enum(string),GET|PUT|POST|HEAD
  host: string
  uri: string
  https: bool
  tls-verify: bool
  proxy: proxy
  proxy-protocol: ""|v1|v2
  request-header: map[string]string
  request: string
  response-codes: [HttpCodeRange]array
  response: string

###### Virtual Address Configuration
VACONF:
  disable: bool, true|*false
  down-policy: enum(int), VAPolicyOneOf(1)|*VAPolicyAllOf(2)
  action-timeout: duration, 2s
  action-sync-time: duration, 60s
  actioner: enum(string), Blank|*KernelRouteAddDel|DpvsAddrAddDel|DpvsAddrKernelRouteAddDel|Script
  action-params: ActionParamsBlank|ActionParamsKernelRouteAddDel|ActionParamsDpvsAddrAddDel|ActionParamsDpvsAddrKernelRouteAddDel|ActionParamScript

###### Virtual Server Action Configuration
VSACTIONCONF:
  action-timeout: duration, 2s
  action-sync-time: duration, 15s
  actioner: string, *BackendUpdate
  action-params: ActionParamsBackendUpdate

###### Checker Configuration
CHECKERCONF:
  method: enum(string), none(1)|tcp(2)|udp(3)|ping(4)|udpping(5)|http(6)|*auto(10000)
  interval: duration, 3s
  down-retry: uint, 1 (999999 for zero retry)
  up-retry: uint, 1 (999999 for zero retry)
  timeout: duration, 2s
  method-params: CheckParamsNone|CheckParamsTCP|CheckParamsUDP|CheckParamsUDPPing|CheckParamsHTTP


#######################################################################################################
## Overall Configuration Layout
#######################################################################################################
global:
  virtual-address:
    VACONF
  virtual-server:
    VSACTIONCONF
    CHECKERCONF

virtual-addresses:
  VIP:
    VACONF
  VIP:
    VACONF
  ...

virtual-servers:
  VIP-PROTO-PORT
    VSACTIONCONF
    CHECKERCONF
  VIP-PROTO-PORT
    VSACTIONCONF
    CHECKERCONF
  ...

#######################################################################################################
