106 changes: 98 additions & 8 deletions README.md
@@ -8,14 +8,17 @@

## Features

- **Extensible Check System**: Modular check types (ping, with more planned) via a Registry/Factory pattern.
- **Extensible Check System**: Modular check types via a Registry/Factory pattern. Each check type implements a common `Check` interface and declares its own metrics through a `Descriptor`.
- **Built-in Check Types**:
- **ping**: ICMP echo requests for host availability and latency.
- **wifi_stations**: Scrapes a Prometheus metrics endpoint for connected WiFi client counts per radio interface.
- **Multi-Metric Checks**: Checks can produce multiple metrics stored as separate data sources in a single RRD file. Multi-metric checks render as stacked area graphs.
- **Host Status Aggregation**: Each host has an aggregate status (`up`, `down`, `degraded`, `unknown`) computed from all its checks. A check must be alive and have reported within the last 5 minutes to count as healthy.
- **Ping Monitoring**: Sends ICMP Echo Requests to check host availability.
- **Latency Logging**: Uses RRD to store latency data over time.
- **Graphs Generation**: Generates historical latency graphs (15 minutes, 4 hours, 8 hours, etc.) for each host.
- **RRD Storage**: Uses Round Robin Databases for time-series data, with configurable archives from 1-minute resolution (1 week) to 8-hour resolution (5 years).
- **Graph Generation**: Generates historical graphs at multiple time scales (15 minutes through 5 years) for each check type on each host.
- **Simple Web Interface**: Serves an HTML/JS front-end to display host status and dynamically loaded graphs. Available in table and flame graph formats.
- **REST API**: Exposes JSON data of all hosts and their status at `GET /api`.
- **Prometheus Support**: Exposes metrics in prometheus format at `GET /metrics`.
- **Prometheus Support**: Exposes metrics in Prometheus format at `GET /metrics`.

## Requirements

@@ -102,6 +105,65 @@ Ensure the following are installed:
- **Port** (`--port`): Port on which the API and front-end are served.
- **Logging Level** (`--log-level`): Set the verbosity of logs (e.g., `debug`, `info`, `warn`, `error`, `fatal`, `panic`).

### Host Configuration

Hosts are defined in a JSON file. Each host can specify an address and a set of checks. Hosts without an explicit `checks` block default to a ping check.

```json
{
  "router": {},
  "google": {
    "address": "8.8.8.8",
    "checks": {
      "ping": { "timeout": "5s" }
    }
  },
  "ap1": {
    "checks": {
      "ping": {},
      "wifi_stations": {
        "radios": ["phy0-ap0", "phy1-ap0"]
      }
    }
  },
  "disabled-example": {
    "checks": {
      "ping": { "enabled": false }
    }
  }
}
```
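As a rough illustration of how a file like this could be loaded, the sketch below decodes the hosts JSON and applies the documented default of a ping check when `checks` is omitted. The `HostConfig` type and the `ParseHosts` and `CheckEnabled` helpers are hypothetical names for this example and are not taken from the project's source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// HostConfig is a hypothetical shape for one entry in the hosts file.
// Per-check options stay as raw JSON so each check type can decode its own.
type HostConfig struct {
	Address string                     `json:"address"`
	Checks  map[string]json.RawMessage `json:"checks"`
}

// ParseHosts decodes the hosts file and applies the documented default:
// a host without a "checks" block gets a single ping check.
func ParseHosts(data []byte) (map[string]HostConfig, error) {
	var hosts map[string]HostConfig
	if err := json.Unmarshal(data, &hosts); err != nil {
		return nil, fmt.Errorf("parse hosts: %w", err)
	}
	for name, h := range hosts {
		if h.Checks == nil {
			h.Checks = map[string]json.RawMessage{"ping": json.RawMessage(`{}`)}
			hosts[name] = h
		}
	}
	return hosts, nil
}

// CheckEnabled reports whether a check's options leave it enabled.
// "enabled" defaults to true when omitted; undecodable options
// are treated as disabled in this sketch.
func CheckEnabled(raw json.RawMessage) bool {
	var opts struct {
		Enabled *bool `json:"enabled"`
	}
	if err := json.Unmarshal(raw, &opts); err != nil {
		return false
	}
	return opts.Enabled == nil || *opts.Enabled
}

func main() {
	data := []byte(`{"router": {}, "disabled-example": {"checks": {"ping": {"enabled": false}}}}`)
	hosts, err := ParseHosts(data)
	if err != nil {
		panic(err)
	}
	for name, h := range hosts {
		for checkType, raw := range h.Checks {
			fmt.Printf("%s/%s enabled=%v\n", name, checkType, CheckEnabled(raw))
		}
	}
}
```

Using a `*bool` for `enabled` distinguishes "omitted" from an explicit `false`, which is what makes the `true` default possible.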

### Check Types

#### ping

Sends ICMP echo requests to check host availability and measure latency.

| Option | Type | Default | Description |
| --------- | ------ | ------- | ------------------------------ |
| `timeout` | string | `"3s"` | Ping timeout (Go duration) |
| `count` | number | `1` | Number of ping packets to send |
| `enabled` | bool | `true` | Set to `false` to disable |

#### wifi_stations

Scrapes a Prometheus metrics endpoint for `wifi_stations{ifname="..."}` gauge values, reporting connected client counts per radio interface. Each configured radio becomes a separate data source in the RRD, rendered as a stacked area graph.

| Option | Type | Default | Description |
| --------- | -------- | ---------------------------- | ------------------------------------------ |
| `radios` | []string | _(required)_ | List of `ifname` label values to monitor |
| `url` | string | `http://{host}:9100/metrics` | Full URL override for the metrics endpoint |
| `timeout` | string | `"5s"` | HTTP scrape timeout (Go duration) |
| `enabled` | bool | `true` | Set to `false` to disable |

The target host must run a Prometheus node exporter (or a compatible exporter) that exposes metrics like:

```
wifi_stations{ifname="phy0-ap0"} 3
wifi_stations{ifname="phy1-ap0"} 7
```
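A scraper for this format might look like the sketch below. It is a simplified line scanner, not a full Prometheus text-format parser, and it assumes `ifname` is the only label on the metric. The `ParseWifiStations` name and the choice to return an error when a configured radio is absent are assumptions made for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// ParseWifiStations extracts wifi_stations{ifname="..."} gauge values for
// the requested radios from Prometheus text-format output.
func ParseWifiStations(body string, radios []string) (map[string]float64, error) {
	want := make(map[string]bool, len(radios))
	for _, r := range radios {
		want[r] = true
	}
	const prefix = `wifi_stations{ifname="`
	found := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, prefix) {
			continue // comments, blank lines, other metrics
		}
		rest := line[len(prefix):]
		end := strings.Index(rest, `"}`)
		if end < 0 {
			continue // more labels than this sketch handles
		}
		ifname := rest[:end]
		if !want[ifname] {
			continue
		}
		fields := strings.Fields(rest[end+2:])
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return nil, fmt.Errorf("radio %s: %w", ifname, err)
		}
		found[ifname] = v
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	// Assumption: a configured radio that never appears is an error.
	for _, r := range radios {
		if _, ok := found[r]; !ok {
			return nil, fmt.Errorf("radio %s not found in metrics", r)
		}
	}
	return found, nil
}

func main() {
	body := "# TYPE wifi_stations gauge\n" +
		"wifi_stations{ifname=\"phy0-ap0\"} 3\n" +
		"wifi_stations{ifname=\"phy1-ap0\"} 7\n"
	m, err := ParseWifiStations(body, []string{"phy0-ap0", "phy1-ap0"})
	if err != nil {
		panic(err)
	}
	fmt.Println(m["phy0-ap0"], m["phy1-ap0"]) // prints "3 7"
}
```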

## Host Status

Each host has an aggregate status derived from all its enabled checks:
@@ -136,6 +198,26 @@ Returns JSON with the status of all hosts:
      }
    }
  },
  "ap1": {
    "status": "up",
    "checks": {
      "ping": {
        "alive": true,
        "metrics": {
          "latency_us": 237
        },
        "lastupdate": 1700000000
      },
      "wifi_stations": {
        "alive": true,
        "metrics": {
          "phy0-ap0": 3,
          "phy1-ap0": 7
        },
        "lastupdate": 1700000000
      }
    }
  },
  "router": {
    "status": "unknown",
    "checks": {}
@@ -152,6 +234,8 @@ Exposes Prometheus-formatted metrics:
```
check_alive{host="google", address="8.8.8.8", check="ping"} 1
check_metric{host="google", address="8.8.8.8", check="ping", metric="latency_us"} 12345
check_alive{host="ap1", address="", check="ping"} 1
check_metric{host="ap1", address="", check="ping", metric="latency_us"} 237
```
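For illustration, lines in this shape could be produced by something like the sketch below. The `CheckState` type and `FormatMetrics` function are hypothetical, and `%q` is used as a simplification of Prometheus label escaping; it matches for plain ASCII values such as those shown above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// CheckState is a hypothetical snapshot of one check's latest result.
type CheckState struct {
	Alive   bool
	Metrics map[string]int64
}

// FormatMetrics renders one check as check_alive and check_metric lines,
// following the label layout shown in the README example.
func FormatMetrics(host, address, checkType string, st CheckState) string {
	var b strings.Builder
	alive := 0
	if st.Alive {
		alive = 1
	}
	fmt.Fprintf(&b, "check_alive{host=%q, address=%q, check=%q} %d\n",
		host, address, checkType, alive)

	// Sort metric names so the output order is stable between scrapes.
	names := make([]string, 0, len(st.Metrics))
	for name := range st.Metrics {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Fprintf(&b, "check_metric{host=%q, address=%q, check=%q, metric=%q} %d\n",
			host, address, checkType, name, st.Metrics[name])
	}
	return b.String()
}

func main() {
	st := CheckState{Alive: true, Metrics: map[string]int64{"latency_us": 237}}
	fmt.Print(FormatMetrics("ap1", "", "ping", st))
}
```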

## Data Directory Layout
@@ -165,20 +249,26 @@ data/
│ │ └── ping.rrd
│ ├── google/
│ │ └── ping.rrd
│ ├── ap1/
│ │ ├── ping.rrd
│ │ └── wifi_stations.rrd
│ └── ...
└── graphs/
└── imgs/
├── router/
│ ├── router_ping_15m.png
│ ├── router_ping_1h.png
│ └── ...
├── google/
│ ├── google_ping_15m.png
├── ap1/
│ ├── ap1_ping_15m.png
│ ├── ap1_ping_1h.png
│ ├── ap1_wifi_stations_15m.png
│ ├── ap1_wifi_stations_1h.png
│ └── ...
└── ...
```

Each check type gets its own RRD file (e.g., `ping.rrd`), making it straightforward to add new check types in the future without filename collisions.
Each check type gets its own RRD file (e.g., `ping.rrd`, `wifi_stations.rrd`). Multi-metric checks store all their data sources in a single RRD file.

## Makefile Targets

13 changes: 10 additions & 3 deletions pkg/check/check.go
@@ -8,9 +8,11 @@
// provides a uniform shape regardless of check type: success/failure,
// a set of named metrics, and an optional error.
//
// Each check type provides a Descriptor that declares what metrics it
// produces, allowing the system to generically wire up storage and
// visualization without per-type knowledge.
// Each check instance provides a Descriptor via Describe() that declares
// what metrics it produces. This allows check instances with config-dependent
// metrics (e.g. wifi_stations with a variable number of radios) to declare
// their own metric shape, while checks with static metrics (e.g. ping) can
// simply return a package-level constant.
//
// The Registry provides type discovery, allowing check types to be
// registered by name and instantiated from configuration at runtime.
@@ -25,6 +27,11 @@ type Check interface {
	// Type returns the registered name of this check type (e.g. "ping", "http").
	Type() string

	// Describe returns the Descriptor for this check instance, declaring
	// what metrics it produces. The descriptor may vary between instances
	// of the same check type depending on configuration.
	Describe() Descriptor

	// Run executes the check and returns a Result.
	// The provided context can be used for cancellation and timeouts.
	Run(ctx context.Context) Result
10 changes: 5 additions & 5 deletions pkg/check/descriptor.go
@@ -21,11 +21,11 @@ type MetricDef struct {
	Scale int
}

// Descriptor declares static metadata about a check type, including
// what metrics it produces. This is registered alongside the Factory
// so the system can generically wire up storage and graphs without
// per-type knowledge.
// Descriptor declares metadata about a check instance, including what
// metrics it produces. Each check instance returns its own Descriptor
// via Check.Describe(), allowing config-dependent metric shapes (e.g.
// a wifi_stations check with a variable number of radios).
type Descriptor struct {
	// Metrics lists the metrics this check type produces.
	// Metrics lists the metrics this check instance produces.
	Metrics []MetricDef
}
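To make the per-instance `Describe()` idea concrete, here is a minimal standalone sketch. `MetricDef` and `Descriptor` below are local stand-ins for the `pkg/check` types, and the `Name` field, `pingDesc` variable, and `WifiStations` struct are assumptions for illustration only; the real `MetricDef` has fields not shown in this diff.

```go
package main

import "fmt"

// MetricDef and Descriptor are simplified stand-ins for the types in
// pkg/check. Only a hypothetical Name field is modelled here.
type MetricDef struct {
	Name string
}

type Descriptor struct {
	Metrics []MetricDef
}

// pingDesc is shared by every ping instance: the metric set is static.
var pingDesc = Descriptor{Metrics: []MetricDef{{Name: "latency_us"}}}

// Ping stands in for a check whose metrics never depend on config.
type Ping struct{}

func (p *Ping) Describe() Descriptor { return pingDesc }

// WifiStations stands in for a check whose metrics depend on config:
// one metric per configured radio.
type WifiStations struct {
	Radios []string
}

func (w *WifiStations) Describe() Descriptor {
	metrics := make([]MetricDef, 0, len(w.Radios))
	for _, r := range w.Radios {
		metrics = append(metrics, MetricDef{Name: r})
	}
	return Descriptor{Metrics: metrics}
}

func main() {
	ap := &WifiStations{Radios: []string{"phy0-ap0", "phy1-ap0"}}
	fmt.Println(len((&Ping{}).Describe().Metrics), len(ap.Describe().Metrics)) // prints "1 2"
}
```

Because storage and graphing only see the returned `Descriptor`, they can treat both checks uniformly even though one has a fixed metric list and the other does not.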
6 changes: 6 additions & 0 deletions pkg/check/ping/ping.go
@@ -92,6 +92,12 @@ func (p *Ping) Type() string {
	return TypeName
}

// Describe returns the Descriptor for this ping check instance.
// Ping always produces the same metrics regardless of configuration.
func (p *Ping) Describe() check.Descriptor {
	return Desc
}

// Run executes the ping check and returns a Result.
func (p *Ping) Run(ctx context.Context) check.Result {
	now := time.Now()