Commit a0946e7

[CI] Update documentation about token rotation (#539)
Adding documentation about usage of Grafana Cloud tokens in the infra, and how to rotate them if required.
1 parent 1d27130 commit a0946e7

1 file changed, +100 -0 lines changed
premerge/cluster-management.md

Lines changed: 100 additions & 0 deletions
@@ -237,3 +237,103 @@ ensure they are in a state consistent with the terraform IaC definitions.
[Strategies for Upgrading ARC](https://www.kenmuse.com/blog/strategies-for-upgrading-arc/)
outlines how ARC should be upgraded and why.

## Grafana tokens

The cluster has multiple services communicating with Grafana Cloud:
- the metrics container
- per-node monitoring (Grafana Alloy, Prometheus node exporter)
- per-cluster monitoring (Opencost, Alloy)

The full description of the services can be found on the [k8s-monitoring Helm
chart repository](https://github.com/grafana/k8s-monitoring-helm).

Authentication to Grafana Cloud is handled through `Cloud access policies`.
Currently, the cluster uses two kinds of tokens:

- `llvm-premerge-metrics-grafana-api-key`
  Used by: metrics container
  Scopes: `metrics:write`

- `llvm-premerge-grafana-token`
  Used by: Alloy, Prometheus node exporter & other services.
  Scopes: `metrics:read`, `metrics:write`, `logs:write`

We've set up two cloud access policies with matching names, so the scopes are
already configured. If you need to rotate tokens:

1. Log in to Grafana Cloud.
2. Navigate to `Home > Administration > Users and Access > Cloud Access Policies`.
3. Create a new token in the desired cloud access policy.
4. Log in to GCP and navigate to `Security > Secret Manager`.
5. Click on the secret to update.
6. Click on `New version`.
7. Paste the token displayed in Grafana and tick `Disable all past versions`.

At this stage, you should have a **single** enabled secret version on GCP. If
you display its value, you should see the Grafana token.
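
The same check can also be done from a terminal. The sketch below is only an illustration: it assumes the Secret Manager secret is named after the token it stores (the name is not confirmed by this document) and that `gcloud` already points at the right project. The helper names are hypothetical.

```bash
# Assumption: the secret is named after the token it stores; adjust as needed.
SECRET="llvm-premerge-grafana-token"

# Count the lines on stdin that read exactly "enabled" (case-insensitive).
# grep exits non-zero when nothing matches, hence the "|| true".
count_enabled_versions() {
  grep -c -i -x 'enabled' || true
}

# Print the number of enabled versions of the secret given as $1.
# After a rotation, the expected value is 1.
enabled_version_count() {
  gcloud secrets versions list "$1" --format='value(state)' | count_enabled_versions
}
```

Running `enabled_version_count "$SECRET"` should then print `1`. To display the stored value, `gcloud secrets versions access latest --secret="$SECRET"` can be used, keeping in mind that it prints the token to the terminal.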

Then, go to the `llvm-zorg` repository. Make sure you have pulled the latest
changes in `main`, and then, as usual, run `terraform apply`.

At this stage, newly created services will use the new token, but existing
deployments still rely on the old tokens. You need to manually restart the
deployments on both the `us-west1` and `us-central1-a` clusters.

Run:

```bash
# us-west1 cluster: scale the Grafana monitoring deployments down to zero.
gcloud container clusters get-credentials llvm-premerge-cluster-us-west --location us-west1
kubectl scale --replicas=0 --namespace grafana deployments \
  grafana-k8s-monitoring-opencost \
  grafana-k8s-monitoring-kube-state-metrics \
  grafana-k8s-monitoring-alloy-events

# us-central1-a cluster: same deployments, plus the metrics container.
gcloud container clusters get-credentials llvm-premerge-cluster-us-central --location us-central1-a
kubectl scale --replicas=0 --namespace grafana deployments \
  grafana-k8s-monitoring-opencost \
  grafana-k8s-monitoring-kube-state-metrics \
  grafana-k8s-monitoring-alloy-events
kubectl scale --replicas=0 --namespace metrics deployments metrics
```

:warning: The `metrics` namespace only exists in the `us-central1-a` cluster.

Wait until `kubectl get deployments --namespace grafana` shows that all
deployments have been scaled down to zero. Then run:

```bash
# us-west1 cluster: scale the Grafana monitoring deployments back up.
gcloud container clusters get-credentials llvm-premerge-cluster-us-west --location us-west1
kubectl scale --replicas=1 --namespace grafana deployments \
  grafana-k8s-monitoring-opencost \
  grafana-k8s-monitoring-kube-state-metrics \
  grafana-k8s-monitoring-alloy-events

# us-central1-a cluster: same deployments, plus the metrics container.
gcloud container clusters get-credentials llvm-premerge-cluster-us-central --location us-central1-a
kubectl scale --replicas=1 --namespace grafana deployments \
  grafana-k8s-monitoring-opencost \
  grafana-k8s-monitoring-kube-state-metrics \
  grafana-k8s-monitoring-alloy-events
kubectl scale --replicas=1 --namespace metrics deployments metrics
```
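
The wait between the scale-down and scale-up steps can also be scripted rather than checked by hand. The following is a minimal sketch, not part of the documented procedure: the helper names are hypothetical, and it assumes the replica counts reported in `.status.replicas` are a good enough signal that the old pods are gone.

```bash
# Succeed when every replica count read from stdin is zero.
# Empty input also counts as "scaled down": Kubernetes may omit
# .status.replicas entirely once a deployment has no replicas left.
all_scaled_down() {
  for n in $(cat); do
    [ "$n" -eq 0 ] || return 1
  done
  return 0
}

# Poll the given namespace (default: grafana) of the current kubectl
# context every 5 seconds until no deployment reports running replicas.
wait_scaled_down() {
  ns="${1:-grafana}"
  while :; do
    counts=$(kubectl get deployments --namespace "$ns" \
      -o jsonpath='{.items[*].status.replicas}') || return 1
    if printf '%s' "$counts" | all_scaled_down; then
      return 0
    fi
    sleep 5
  done
}
```

Calling `wait_scaled_down grafana` after each scale-down, while the matching cluster credentials are active, would block until that cluster is ready to be scaled back up.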

You can check the logs of the restarted services for errors. If the token is
invalid or its scopes are wrong, you should see some `401` error codes.

```bash
kubectl logs -n metrics deployment/metrics
kubectl logs -n grafana deployment/grafana-k8s-monitoring-opencost
```
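
Scanning the logs by eye can be tedious, so the output can be filtered for `401` codes. This is a rough sketch: the helper name is hypothetical, and since the exact log format of these services is not described here, it simply looks for a standalone `401` anywhere on a line.

```bash
# Print the lines on stdin that contain 401 as a standalone number
# (not as part of a longer number such as 14012).
# The exit status is 0 if at least one such line was found.
find_auth_errors() {
  grep -E '(^|[^0-9])401([^0-9]|$)'
}
```

For example, `kubectl logs -n metrics deployment/metrics | find_auth_errors` prints any suspicious lines and prints nothing when the logs look clean.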

At this stage, all long-lived services should be using the new tokens.
**DO NOT DELETE THE OLD TOKEN YET**.
Existing CI jobs can be quite long-lived, so we need to wait for them to
finish. New CI jobs will pick up the new tokens.

After 24 hours, log back in to Grafana Cloud, navigate to
`Administration > Users and Access > Cloud Access Policies`, and expand the
token lists.
The `Last used at` value of the new tokens should be a dozen minutes old at
most, while the old tokens should have been unused for several hours.
If this is the case, congratulations, you've successfully rotated security
tokens! You can now safely delete the old unused tokens.