3 files changed +31 -1 lines changed
3232 url : /arch-controller/
3333 - title : Fault Tolerance
3434 url : /arch-fault-tolerance/
35+ - title : Node Monitoring
36+ url : /arch-node-monitoring/
1+ ---
2+ permalink : /arch-node-monitoring/
3+ title : " Node Monitoring"
4+ classes : wide
5+ ---
6+
7+ The AppWrapper controller can optionally monitor Kubernetes Nodes and
8+ dynamically adjust the ` lendingLimits ` on a designated ` ClusterQueue `
9+ to account for dynamically unavailable resources. This capability is
10+ designed to enable cluster admins of an
11+ [ MLBatch cluster] ( https://github.com/project-codeflare/mlbatch ) to fully
12+ automate the small-scale quota adjustments required to maintain full cluster
13+ utilization in the presence of isolated node failures and/or
14+ minor maintenance activities. The monitoring detects both Nodes that
15+ are marked as ` Unschedulable ` via standard Kubernetes mechanisms and Nodes
16+ that have resources that Autopilot has flagged as unhealthy (see [ Fault Tolerance] ( /arch-fault-tolerance ) ).
17+ The ` lendingLimit ` of a designated slack capacity ` ClusterQueue ` is
18+ automatically adjusted to reflect the current dynamically unavailable resources.
19+
20+ Node monitoring is enabled by the following additional configuration:
21+ ```yaml
22+ slackQueueName: "slack-queue"
23+ autopilot:
24+   monitorNodes: true
25+ ```
26+
27+ See [node_health_monitor.go]({{ site.gh_main_url }}/internal/controller/appwrapper/node_health_monitor.go)
28+ for the implementation.
@@ -6,7 +6,7 @@ classes: wide
66
77### Prerequisites
88
9- You'll need ` go ` v1.22.0 + installed on your development machine.
9+ You'll need ` go ` v1.22.4 + installed on your development machine.
1010
1111You'll need a container runtime and cli (eg ` docker ` or ` rancher-desktop ` ).
1212