20 changes: 20 additions & 0 deletions charts/hami/README.md
@@ -137,6 +137,16 @@ This document provides detailed descriptions of all configurable values paramete
| `scheduler.service.monitorPort` | Monitor port | `31993` |
| `scheduler.service.monitorTargetPort` | Monitor target port | `9395` |

### Scheduler ServiceMonitor Configuration

| Parameter | Description | Default Value |
|-----------|-------------|---------------|
| `scheduler.servicemonitor.enabled` | Whether to enable ServiceMonitor for Prometheus monitoring | `false` |
| `scheduler.servicemonitor.labels` | Additional labels for ServiceMonitor | `{}` |
| `scheduler.servicemonitor.annotations` | Additional annotations for ServiceMonitor | `{}` |
| `scheduler.servicemonitor.interval` | Scrape interval for metrics collection | `"15s"` |
| `scheduler.servicemonitor.honorLabels` | Whether to honor labels from the target | `false` |

## Device Plugin Configuration

| Parameter | Description | Default Value |
@@ -158,6 +168,16 @@ This document provides detailed descriptions of all configurable values paramete
| `devicePlugin.monitor.image.pullSecrets` | Monitor image pull secrets | `[]` |
| `devicePlugin.monitor.ctrPath` | Container path | `/usr/local/vgpu/containers` |

### Device Plugin ServiceMonitor Configuration

| Parameter | Description | Default Value |
|-----------|-------------|---------------|
| `devicePlugin.monitor.servicemonitor.enabled` | Whether to enable ServiceMonitor for Prometheus monitoring | `false` |
| `devicePlugin.monitor.servicemonitor.labels` | Additional labels for ServiceMonitor | `{}` |
| `devicePlugin.monitor.servicemonitor.annotations` | Additional annotations for ServiceMonitor | `{}` |
| `devicePlugin.monitor.servicemonitor.interval` | Scrape interval for metrics collection | `"15s"` |
| `devicePlugin.monitor.servicemonitor.honorLabels` | Whether to honor labels from the target | `false` |

### Device Plugin Other Configuration

| Parameter | Description | Default Value |
33 changes: 33 additions & 0 deletions charts/hami/templates/device-plugin/servicemonitor.yaml
@@ -0,0 +1,33 @@
{{- if .Values.devicePlugin.monitor.servicemonitor.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  {{- if .Values.devicePlugin.monitor.servicemonitor.annotations }}
  annotations:
    {{ toYaml .Values.devicePlugin.monitor.servicemonitor.annotations | nindent 4 }}
  {{- end }}
  name: {{ include "hami-vgpu.device-plugin" . }}
  namespace: {{ include "hami-vgpu.namespace" . }}
  labels:
    {{- include "hami-vgpu.labels" . | nindent 4 }}
Review comment: Consider adding a default label `release: prometheus` so that Prometheus can select this ServiceMonitor.
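
The label mentioned in this comment could also be supplied through the `labels` value this PR introduces, without changing the template. The following is only a sketch: it assumes the cluster's Prometheus instance selects ServiceMonitors by `release: prometheus`, which is common for a kube-prometheus-stack release named `prometheus` but depends on how Prometheus was installed.

```yaml
# Values override (illustrative): attach the selector label to the
# device plugin ServiceMonitor via the new `labels` value.
devicePlugin:
  monitor:
    servicemonitor:
      enabled: true
      labels:
        release: prometheus  # assumption: must match the Prometheus serviceMonitorSelector in your cluster
```

Keeping the label in values rather than hard-coding it in the template avoids baking in one particular Prometheus release name.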

    {{- if .Values.devicePlugin.monitor.servicemonitor.labels }}
{{ toYaml .Values.devicePlugin.monitor.servicemonitor.labels | indent 4 }}
    {{- end }}
spec:
  endpoints:
    - path: /metrics
      port: monitorport
      scheme: http
      interval: {{ .Values.devicePlugin.monitor.servicemonitor.interval | default "15s" }}
      honorLabels: {{ .Values.devicePlugin.monitor.servicemonitor.honorLabels | default false }}
  namespaceSelector:
    matchNames:
      - {{ include "hami-vgpu.namespace" . }}
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-device-plugin
      {{- include "hami-vgpu.labels" . | nindent 6 }}
      {{- if .Values.devicePlugin.service.labels }}
{{ toYaml .Values.devicePlugin.service.labels | indent 6 }}
      {{- end }}
{{- end }}
33 changes: 33 additions & 0 deletions charts/hami/templates/scheduler/servicemonitor.yaml
@@ -0,0 +1,33 @@
{{- if .Values.scheduler.servicemonitor.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  {{- if .Values.scheduler.servicemonitor.annotations }}
  annotations:
    {{ toYaml .Values.scheduler.servicemonitor.annotations | nindent 4 }}
  {{- end }}
  name: {{ include "hami-vgpu.scheduler" . }}
  namespace: {{ include "hami-vgpu.namespace" . }}
  labels:
    {{- include "hami-vgpu.labels" . | nindent 4 }}
    {{- if .Values.scheduler.servicemonitor.labels }}
{{ toYaml .Values.scheduler.servicemonitor.labels | indent 4 }}
    {{- end }}
spec:
  endpoints:
    - path: /metrics
      port: monitor
      scheme: http
      interval: {{ .Values.scheduler.servicemonitor.interval | default "15s" }}
      honorLabels: {{ .Values.scheduler.servicemonitor.honorLabels | default false }}
  namespaceSelector:
    matchNames:
      - {{ include "hami-vgpu.namespace" . }}
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-scheduler
      {{- include "hami-vgpu.labels" . | nindent 6 }}
      {{- if .Values.scheduler.service.labels }}
{{ toYaml .Values.scheduler.service.labels | indent 6 }}
      {{- end }}
{{- end }}
14 changes: 14 additions & 0 deletions charts/hami/values.yaml
@@ -235,6 +235,13 @@ scheduler:
    httpTargetPort: 443
    labels: {}
    annotations: {}
  # scheduler ServiceMonitor configuration
  servicemonitor:
    enabled: false
    labels: {}
    annotations: {}
    interval: "15s"
    honorLabels: false

devicePlugin:
  enabled: true
@@ -283,6 +290,13 @@ devicePlugin:
      pullSecrets: []
    ctrPath: /usr/local/vgpu/containers
    resyncInterval: "5m"
    # ServiceMonitor configuration
    servicemonitor:
      enabled: false
      labels: {}
      annotations: {}
      interval: "15s"
      honorLabels: false
  deviceSplitCount: 10
  deviceMemoryScaling: 1
  deviceCoreScaling: 1
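
Both ServiceMonitors are disabled by default, so enabling them requires a values override. The fragment below is a minimal sketch of such an override, using only the keys added in this diff. The file name `monitoring-values.yaml` and the `30s` interval are illustrative assumptions, and the `ServiceMonitor` resource (`monitoring.coreos.com/v1`) is only usable if the Prometheus Operator CRDs are installed in the cluster.

```yaml
# monitoring-values.yaml (hypothetical file name)
scheduler:
  servicemonitor:
    enabled: true
    interval: "30s"      # illustrative; the chart default is "15s"
devicePlugin:
  monitor:
    servicemonitor:
      enabled: true
      honorLabels: false # chart default, shown only for clarity
```

A file like this would typically be passed to `helm install` or `helm upgrade` with the `-f` flag.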