@@ -91,7 +91,109 @@ oc adm policy add-scc-to-user hostmount-anyuid system:serviceaccount:nfs-provisi
91 91
92 92 ### Prometheus Setup
93 93
94- TODO
94+ We follow the setup provided by the `prometheus-community/kube-prometheus-stack` Helm chart.
95+
96+ ``` bash
97+ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && helm repo update
98+ ```
99+
100+ The chart installs Prometheus, Grafana, Alertmanager, Prometheus Node Exporter, and Kube State Metrics. We configure the chart to:
101+
102+ - Use persistent storage for Prometheus, Grafana, and Alertmanager;
103+ - Override the Prometheus Node Exporter port;
104+ - Disable CRD creation, as the CRDs are already present.
105+
106+ You may leave CRD creation enabled and keep the default Node Exporter port. These two changes are needed only when deploying a separate Prometheus instance on OpenShift.
107+
108+ ``` bash
109+ cat << EOF > config.yaml
110+ crds:
111+   enabled: false
112+
113+ prometheus-node-exporter:
114+   service:
115+     port: 9110
116+
117+ alertmanager:
118+   alertmanagerSpec:
119+     persistentVolumeClaimRetentionPolicy:
120+       whenDeleted: Retain
121+       whenScaled: Retain
122+     storage:
123+       volumeClaimTemplate:
124+         spec:
125+           storageClassName: nfs-client-pokprod
126+           accessModes: ["ReadWriteOnce"]
127+           resources:
128+             requests:
129+               storage: 50Gi
130+
131+ prometheus:
132+   prometheusSpec:
133+     persistentVolumeClaimRetentionPolicy:
134+       whenDeleted: Retain
135+       whenScaled: Retain
136+     storageSpec:
137+       volumeClaimTemplate:
138+         spec:
139+           storageClassName: nfs-client-pokprod
140+           accessModes: ["ReadWriteOnce"]
141+           resources:
142+             requests:
143+               storage: 50Gi
144+       emptyDir:
145+         medium: Memory
146+
147+ grafana:
148+   persistence:
149+     enabled: true
150+     type: sts
151+     storageClassName: "nfs-client-pokprod"
152+     accessModes:
153+       - ReadWriteOnce
154+     size: 20Gi
155+     finalizers:
156+       - kubernetes.io/pvc-protection
157+ EOF
158+
159+ helm upgrade -i kube-prometheus-stack -n prometheus prometheus-community/kube-prometheus-stack --create-namespace -f config.yaml
160+ ```
161+
162+ On OpenShift-based systems, you need to grant the `privileged` security context constraint (SCC) to the service accounts created by the Helm chart.
163+
164+ ``` bash
165+ oc adm policy add-scc-to-user privileged system:serviceaccount:prometheus:kube-prometheus-stack-admission system:serviceaccount:prometheus:kube-prometheus-stack-alertmanager system:serviceaccount:prometheus:kube-prometheus-stack-grafana system:serviceaccount:prometheus:kube-prometheus-stack-kube-state-metrics system:serviceaccount:prometheus:kube-prometheus-stack-operator system:serviceaccount:prometheus:kube-prometheus-stack-prometheus system:serviceaccount:prometheus:kube-prometheus-stack-prometheus-node-exporter
166+ ```
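The service account names passed to `oc adm policy` above all follow the pattern `system:serviceaccount:<namespace>:<release>-<component>`. As a minimal sketch, assuming the `prometheus` namespace and the `kube-prometheus-stack` release name used above, the list can be generated instead of typed by hand. The `scc_subjects` helper is hypothetical and is not part of `oc` or the chart.

```bash
#!/usr/bin/env bash
# Assumed values, taken from the install command above.
NAMESPACE="prometheus"
RELEASE="kube-prometheus-stack"
# Component suffixes of the service accounts listed in the oc command above.
COMPONENTS="admission alertmanager grafana kube-state-metrics operator prometheus prometheus-node-exporter"

# Print one fully qualified service account subject per line.
scc_subjects() {
  local component
  for component in $COMPONENTS; do
    printf 'system:serviceaccount:%s:%s-%s\n' "$NAMESPACE" "$RELEASE" "$component"
  done
}

scc_subjects
```

The output can then be passed to the same command, for example `oc adm policy add-scc-to-user privileged $(scc_subjects)`.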
167+
168+ You should expect the following pods:
169+
170+ ``` bash
171+ kubectl --namespace prometheus get pods
172+ ```
173+ ``` bash
174+ NAME                                                        READY   STATUS    RESTARTS   AGE
175+ alertmanager-kube-prometheus-stack-alertmanager-0           2/2     Running   0          16m
176+ kube-prometheus-stack-grafana-0                             3/3     Running   0          16m
177+ kube-prometheus-stack-kube-state-metrics-6f76b98d89-pxs69   1/1     Running   0          16m
178+ kube-prometheus-stack-operator-7fbfc985bb-mm9bk             1/1     Running   0          16m
179+ kube-prometheus-stack-prometheus-node-exporter-44llp        1/1     Running   0          16m
180+ kube-prometheus-stack-prometheus-node-exporter-95gp8        1/1     Running   0          16m
181+ kube-prometheus-stack-prometheus-node-exporter-dxf5f        1/1     Running   0          16m
182+ kube-prometheus-stack-prometheus-node-exporter-f45dx        1/1     Running   0          16m
183+ kube-prometheus-stack-prometheus-node-exporter-pfrzk        1/1     Running   0          16m
184+ kube-prometheus-stack-prometheus-node-exporter-zpfzb        1/1     Running   0          16m
185+ prometheus-kube-prometheus-stack-prometheus-0               2/2     Running   0          16m
186+ ```
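Rather than checking the listing by eye, the pod table can be filtered for anything that is not fully ready. This is a minimal sketch that parses the default `kubectl get pods` column layout shown above (name, ready count, status). The `not_ready_pods` helper is hypothetical, and it will also flag pods in a `Completed` state.

```bash
#!/usr/bin/env bash
# Reads `kubectl get pods` output on stdin, prints the name of every pod that
# is not Running or whose READY column is not n/n, and returns non-zero if
# at least one such pod is found.
not_ready_pods() {
  awk 'NR > 1 {
         split($2, ready, "/")
         if ($3 != "Running" || ready[1] != ready[2]) { print $1; bad = 1 }
       }
       END { exit (bad ? 1 : 0) }'
}
```

For example, `kubectl --namespace prometheus get pods | not_ready_pods` prints nothing and returns zero when every pod in the listing is ready.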
187+
188+ To access the Grafana dashboard on `localhost:3000`, retrieve the admin password, then forward port 3000 from the Grafana pod:
189+
190+ ``` bash
191+ kubectl --namespace prometheus get secrets kube-prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 -d ; echo
192+ ```
193+ ``` bash
194+ export POD_NAME=$(kubectl --namespace prometheus get pod -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=kube-prometheus-stack" -oname)
195+ kubectl --namespace prometheus port-forward "$POD_NAME" 3000
196+ ```
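Kubernetes stores Secret values base64-encoded, which is why the password command above pipes the `jsonpath` output through `base64 -d`. A minimal illustration of that decoding step, using a sample encoded string rather than a value read from a cluster:

```bash
#!/usr/bin/env bash
# Sample value for illustration only; a real cluster returns whatever
# admin password the Grafana secret actually holds.
encoded="cHJvbS1vcGVyYXRvcg=="

# Decode exactly as the kubectl pipeline above does.
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "$decoded"   # prints: prom-operator
```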
95 197
96 198 ### MLBatch Cluster Setup
97 199
@@ -180,7 +282,23 @@ We reserve 8 GPUs out of 24 for MLBatch's slack queue.
180 282
181 283 ### Autopilot Extended Setup
182 284
183- TODO
285+ Autopilot can be configured to test PVC creation and deletion for a given storage class name.
286+
287+ ``` bash
288+ cat << EOF > autopilot-extended.yaml
289+ env:
290+   - name: "PERIODIC_CHECKS"
291+     value: "pciebw,remapped,dcgm,ping,gpupower,pvc"
292+   - name: "PVC_TEST_STORAGE_CLASS"
293+     value: "nfs-client-pokprod"
294+ EOF
295+ ```
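`PERIODIC_CHECKS` is a comma-separated list, so enabling the PVC test amounts to appending `pvc` to whatever checks are already configured. A minimal sketch of doing that without duplicating an entry follows. The `add_check` helper is hypothetical and not part of Autopilot, and the starting list is simply the one from the values above minus `pvc`.

```bash
#!/usr/bin/env bash
# Append a check name to a comma-separated list unless it is already present.
add_check() {
  local checks="$1" new="$2"
  case ",${checks}," in
    *",${new},"*) printf '%s\n' "$checks" ;;                      # already listed
    *)            printf '%s\n' "${checks:+${checks},}${new}" ;;  # append
  esac
}

add_check "pciebw,remapped,dcgm,ping,gpupower" "pvc"
# prints: pciebw,remapped,dcgm,ping,gpupower,pvc
```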
296+
297+ Then reapply the Helm chart; this triggers a rolling update.
298+
299+ ``` bash
300+ helm upgrade autopilot autopilot/autopilot --install --namespace=autopilot --create-namespace -f autopilot-extended.yaml
301+ ```
184 302
185 303 ### MLBatch Teams Setup
186 304