Commit 9389bc9

add host status checks and states + test/docs updates (#27)
2 parents 607f255 + df5983b commit 9389bc9

15 files changed, +1669 -108 lines changed

README.md

Lines changed: 53 additions & 0 deletions
@@ -9,6 +9,7 @@
 ## Features
 
 - **Extensible Check System**: Modular check types (ping, with more planned) via a Registry/Factory pattern.
+- **Host Status Aggregation**: Each host has an aggregate status (`up`, `down`, `degraded`, `unknown`) computed from all its checks. A check must be alive and have reported within the last 5 minutes to count as healthy.
 - **Ping Monitoring**: Sends ICMP Echo Requests to check host availability.
 - **Latency Logging**: Uses RRD to store latency data over time.
 - **Graphs Generation**: Generates historical latency graphs (15 minutes, 4 hours, 8 hours, etc.) for each host.
@@ -101,6 +102,58 @@ Ensure the following are installed:
 - **Port** (`--port`): Port on which the API and front-end are served.
 - **Logging Level** (`--log-level`): Set the verbosity of logs (e.g., `debug`, `info`, `warn`, `error`, `fatal`, `panic`).
 
+## Host Status
+
+Each host has an aggregate status derived from all its enabled checks:
+
+| Status       | Color  | Meaning                                                      |
+| ------------ | ------ | ------------------------------------------------------------ |
+| **up**       | Green  | All checks are alive and reported within the last 5 minutes. |
+| **degraded** | Yellow | Some checks are healthy, others are down or stale.           |
+| **down**     | Red    | All checks are down (but at least one has reported before).  |
+| **unknown**  | Gray   | No checks configured, or no check has ever reported.         |
+
+A check result is considered **stale** if its last successful RRD update is older than 5 minutes. Stale checks are treated the same as down checks for the purpose of host status aggregation.
+
+## API
+
+### `GET /api`
+
+Returns JSON with the status of all hosts:
+
+```json
+{
+  "google": {
+    "address": "8.8.8.8",
+    "status": "up",
+    "checks": {
+      "ping": {
+        "alive": true,
+        "metrics": {
+          "latency_us": 12345
+        },
+        "lastupdate": 1700000000
+      }
+    }
+  },
+  "router": {
+    "status": "unknown",
+    "checks": {}
+  }
+}
+```
+
+The `status` field is one of `up`, `down`, `degraded`, or `unknown` (see [Host Status](#host-status) above).
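
For readers consuming this endpoint from Go, a minimal decoding sketch might look like the one below. The `Host` and `Check` struct names and `ParseHosts` are hypothetical; only the JSON keys come from the sample response above, and metric values are assumed to be numeric. The sketch decodes an embedded copy of the sample rather than making a network call.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Check mirrors one entry under "checks" in the sample response.
type Check struct {
	Alive      bool               `json:"alive"`
	Metrics    map[string]float64 `json:"metrics"` // assumed numeric values
	LastUpdate int64              `json:"lastupdate"`
}

// Host mirrors one top-level entry; Address stays empty when the
// key is absent, as for "router" in the sample.
type Host struct {
	Address string           `json:"address"`
	Status  string           `json:"status"`
	Checks  map[string]Check `json:"checks"`
}

// ParseHosts decodes a GET /api response body keyed by host name.
func ParseHosts(data []byte) (map[string]Host, error) {
	var hosts map[string]Host
	if err := json.Unmarshal(data, &hosts); err != nil {
		return nil, fmt.Errorf("decode /api response: %w", err)
	}
	return hosts, nil
}

// sample is the example payload from the README.
const sample = `{
  "google": {
    "address": "8.8.8.8",
    "status": "up",
    "checks": {
      "ping": {"alive": true, "metrics": {"latency_us": 12345}, "lastupdate": 1700000000}
    }
  },
  "router": {"status": "unknown", "checks": {}}
}`

func main() {
	hosts, err := ParseHosts([]byte(sample))
	if err != nil {
		panic(err)
	}
	for name, h := range hosts {
		fmt.Printf("%s: status=%s checks=%d\n", name, h.Status, len(h.Checks))
	}
}
```
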
+
+### `GET /metrics`
+
+Exposes Prometheus-formatted metrics:
+
+```
+check_alive{host="google", address="8.8.8.8", check="ping"} 1
+check_metric{host="google", address="8.8.8.8", check="ping", metric="latency_us"} 12345
+```
+
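
As a rough illustration of how lines in this shape could be produced, the sketch below renders the two sample lines with the standard library only. `AliveLine`, `MetricLine`, and `escapeLabel` are hypothetical helpers; the commit may well build its output differently, for example with a Prometheus client library. Label values are escaped for backslash, double quote, and newline, as the Prometheus text exposition format requires.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// labelEscaper escapes the three characters that must be escaped
// inside a label value in the Prometheus text format.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

func escapeLabel(v string) string { return labelEscaper.Replace(v) }

// AliveLine renders a check_alive sample: 1 when alive, 0 otherwise.
func AliveLine(host, address, check string, alive bool) string {
	v := 0
	if alive {
		v = 1
	}
	return fmt.Sprintf(`check_alive{host="%s", address="%s", check="%s"} %d`,
		escapeLabel(host), escapeLabel(address), escapeLabel(check), v)
}

// MetricLine renders a check_metric sample for one named metric.
func MetricLine(host, address, check, metric string, value float64) string {
	return fmt.Sprintf(`check_metric{host="%s", address="%s", check="%s", metric="%s"} %s`,
		escapeLabel(host), escapeLabel(address), escapeLabel(check),
		escapeLabel(metric), strconv.FormatFloat(value, 'g', -1, 64))
}

func main() {
	fmt.Println(AliveLine("google", "8.8.8.8", "ping", true))
	fmt.Println(MetricLine("google", "8.8.8.8", "ping", "latency_us", 12345))
}
```
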
 ## Data Directory Layout
 
 RRD files and graph images are organized into per-host subdirectories:

pkg/rrd/helpers.go

Lines changed: 4 additions & 0 deletions
@@ -1,8 +1,12 @@
 package rrd
 
+// Standard colors for RRD graph elements.
 const RED = "FF0000"
 const GREEN = "00FF00"
 
+// expandTimeLength converts a short time duration code (e.g. "15m", "1h", "4d")
+// into a human-readable string for use in graph titles and comments.
+// Returns the input unchanged if it does not match a known code.
 func expandTimeLength(timeLength string) string {
 	switch timeLength {
 	case "15m":
