ensure they are in a state consistent with the terraform IaC definitions.
[Strategies for Upgrading ARC](https://www.kenmuse.com/blog/strategies-for-upgrading-arc/)
outlines how ARC should be upgraded and why.

## Grafana tokens

The cluster has multiple services communicating with Grafana Cloud:
- the metrics container
- per-node monitoring (Grafana Alloy, Prometheus node exporter)
- per-cluster monitoring (OpenCost, Alloy)

The full description of the services can be found on the [k8s-monitoring Helm
chart repository](https://github.com/grafana/k8s-monitoring-helm).

Authentication to Grafana Cloud is handled through `Cloud access policies`.
Currently, the cluster uses two kinds of tokens:

- `llvm-premerge-metrics-grafana-api-key`
  Used by: metrics container
  Scopes: `metrics:write`

- `llvm-premerge-grafana-token`
  Used by: Alloy, Prometheus node exporter, and other services
  Scopes: `metrics:read`, `metrics:write`, `logs:write`

We've set up two cloud access policies with matching names, so the scopes are
already configured. To rotate a token:

1. Log in to Grafana Cloud.
2. Navigate to `Home > Administration > Users and Access > Cloud Access Policies`.
3. Create a new token in the desired cloud access policy.
4. Open `GCP > Security > Secret Manager`.
5. Click on the secret to update.
6. Click on `New version`.
7. Paste the token displayed in Grafana and tick `Disable all past versions`.
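
The GCP steps (4-7) can also be done from the command line; a sketch, assuming the Secret Manager secret shares the token's name (check the actual secret name first):

```bash
# Add the new Grafana token as a new secret version.
# The secret name below is an assumption; use the real name from Secret Manager.
printf '%s' "$NEW_GRAFANA_TOKEN" | \
  gcloud secrets versions add llvm-premerge-grafana-token --data-file=-

# Disable the previous version (replace 1 with the old version's number,
# shown by `gcloud secrets versions list llvm-premerge-grafana-token`).
gcloud secrets versions disable 1 --secret=llvm-premerge-grafana-token
```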

At this stage, you should have a **single** enabled secret version on GCP. If
you display the value, you should see the new Grafana token.

Then, go to the `llvm-zorg` repository. Make sure you have pulled the latest
changes on `main`, and then, as usual, run `terraform apply`.
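
For example (a sketch; the `premerge/` path to the Terraform root is an assumption, adjust it to wherever the configuration actually lives in `llvm-zorg`):

```bash
cd llvm-zorg
git checkout main && git pull

cd premerge        # hypothetical path to the Terraform root
terraform init     # no-op if the working directory is already initialized
terraform apply
```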

At this stage, newly created services will use the new token, but existing
deployments still rely on the old tokens. You need to manually restart the
deployments on both the `us-west1` and `us-central1-a` clusters.

Run:

```bash
gcloud container clusters get-credentials llvm-premerge-cluster-us-west --location us-west1
kubectl scale --replicas=0 --namespace grafana deployments \
  grafana-k8s-monitoring-opencost \
  grafana-k8s-monitoring-kube-state-metrics \
  grafana-k8s-monitoring-alloy-events

gcloud container clusters get-credentials llvm-premerge-cluster-us-central --location us-central1-a
kubectl scale --replicas=0 --namespace grafana deployments \
  grafana-k8s-monitoring-opencost \
  grafana-k8s-monitoring-kube-state-metrics \
  grafana-k8s-monitoring-alloy-events
kubectl scale --replicas=0 --namespace metrics deployment metrics
```

:warning: The `metrics` namespace only exists in the `us-central1-a` cluster.
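
Instead of polling by hand, a small loop can wait for the scale-down to finish; a sketch, assuming the three deployment names above:

```bash
# Wait until the scaled-down deployments report no remaining replicas.
# (.status.replicas disappears from the Deployment status once the last
# pod has terminated, so an empty jsonpath result means "fully scaled down".)
for d in grafana-k8s-monitoring-opencost \
         grafana-k8s-monitoring-kube-state-metrics \
         grafana-k8s-monitoring-alloy-events; do
  until [ -z "$(kubectl get deployment "$d" --namespace grafana \
      -o jsonpath='{.status.replicas}')" ]; do
    sleep 5
  done
done
```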

Wait until the command `kubectl get deployments --namespace grafana` shows
that all deployments have been scaled down to zero. Then run:

```bash
gcloud container clusters get-credentials llvm-premerge-cluster-us-west --location us-west1
kubectl scale --replicas=1 --namespace grafana deployments \
  grafana-k8s-monitoring-opencost \
  grafana-k8s-monitoring-kube-state-metrics \
  grafana-k8s-monitoring-alloy-events

gcloud container clusters get-credentials llvm-premerge-cluster-us-central --location us-central1-a
kubectl scale --replicas=1 --namespace grafana deployments \
  grafana-k8s-monitoring-opencost \
  grafana-k8s-monitoring-kube-state-metrics \
  grafana-k8s-monitoring-alloy-events
kubectl scale --replicas=1 --namespace metrics deployment metrics
```

You can check the restarted services' logs for errors. If the token is invalid
or a scope is wrong, you should see `401` error codes:

```bash
kubectl logs -n metrics deployment/metrics
kubectl logs -n grafana deployment/grafana-k8s-monitoring-opencost
```
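
To scan for authentication failures directly, a sketch (the `401` pattern is an assumption about the log format and may over-match other lines):

```bash
# Count log lines containing "401" over the last hour; 0 suggests the
# new token is being accepted.
kubectl logs -n metrics deployment/metrics --since=1h \
  | { grep -c '401' || true; }
```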

At this stage, all long-lived services should be using the new tokens.
**DO NOT DELETE THE OLD TOKENS YET.**
The existing CI jobs can be quite long-lived; we need to wait for them to
finish. New CI jobs will pick up the new tokens.

After 24 hours, log back in to
`Administration > Users and Access > Cloud Access Policies` and expand the
token lists.
The new tokens' `Last used at` timestamps should be at most a dozen minutes
old, while the old tokens should have remained unused for several hours.
If this is the case, congratulations, you've successfully rotated the
security tokens! You can now safely delete the old, unused tokens.