[CI] Update documentation about token rotation #539

Merged
merged 4 commits into from
Aug 11, 2025
100 changes: 100 additions & 0 deletions premerge/cluster-management.md
@@ -237,3 +237,103 @@ ensure they are in a state consistent with the terraform IaC definitions.

[Strategies for Upgrading ARC](https://www.kenmuse.com/blog/strategies-for-upgrading-arc/)
outlines how ARC should be upgraded and why.

## Grafana tokens

The cluster has multiple services communicating with Grafana Cloud:
- the metrics container
- per-node monitoring (Grafana Alloy, Prometheus node exporter)
- per-cluster monitoring (Opencost, Alloy)

The full description of the services can be found on the [k8s-monitoring Helm
chart repository](https://github.com/grafana/k8s-monitoring-helm).

Authentication to Grafana Cloud is handled through `Cloud access policies`.
Currently, the cluster uses two kinds of tokens:

- `llvm-premerge-metrics-grafana-api-key`
  - Used by: metrics container
  - Scopes: `metrics:write`

- `llvm-premerge-grafana-token`
  - Used by: Alloy, Prometheus node exporter & other services.
  - Scopes: `metrics:read`, `metrics:write`, `logs:write`

We've set up two cloud access policies with matching names, so the scopes are
already configured. To rotate a token:

1. Log in to Grafana Cloud.
2. Navigate to `Home > Administration > Users and Access > Cloud Access Policies`.
3. Create a new token in the desired cloud access policy.
4. Log in to GCP and navigate to `Security > Secret Manager`.
5. Click on the secret to update.
6. Click on `New version`.
7. Paste the token displayed in Grafana and tick `Disable all past versions`.

At this stage, the secret should have a **single** enabled version on GCP. If
you display its value, you should see the Grafana token.
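
The Secret Manager steps can also be done from the command line. The sketch below is a minimal, unofficial alternative to the console flow: the helper names are hypothetical, and the secret ID is passed as an argument because this document does not state the Secret Manager secret IDs. It only adds a version and counts the enabled ones; disabling past versions is still left to the console checkbox described above.

```bash
# Hypothetical helpers, not part of the llvm-zorg repository.

# Add a new version to a secret, reading the token from stdin so that it
# does not end up in the shell history. Example (SECRET_ID is a placeholder):
#   read -rs token && printf '%s' "$token" | add_token_version SECRET_ID
# printf '%s' avoids storing a trailing newline as part of the token.
add_token_version() {
  local secret="$1"
  # --data-file=- tells gcloud to read the payload from stdin.
  gcloud secrets versions add "$secret" --data-file=-
}

# Print how many versions of the secret are currently enabled.
# After a rotation this is expected to print 1.
count_enabled_versions() {
  local secret="$1"
  gcloud secrets versions list "$secret" \
    --filter='state=ENABLED' --format='value(name)' | wc -l | tr -d ' '
}
```

If `count_enabled_versions` prints anything other than `1`, an older version is probably still enabled and should be disabled before continuing.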

Then, go to the `llvm-zorg` repository. Make sure you have pulled the latest
changes on `main`, then run `terraform apply` as usual.
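
As a rough sketch, the update-then-apply step could be wrapped as follows. The function name is hypothetical, and the directory holding the Terraform configuration is passed as an argument because this section does not name it.

```bash
# Hypothetical helper, not part of the llvm-zorg repository.
# Usage: apply_latest_main <directory containing the terraform configuration>
apply_latest_main() {
  local tf_dir="$1"
  (
    # Run in a subshell so the caller's working directory is left unchanged.
    cd "$tf_dir" || exit 1
    # Refuse to continue if main cannot be fast-forwarded cleanly.
    git switch main &&
      git pull --ff-only &&
      terraform apply
  )
}
```

`terraform apply` prints the planned changes and asks for confirmation before touching anything, so the plan can still be reviewed at that prompt.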

At this stage, newly created services will use the new token, but existing
deployments still rely on the old tokens. You need to manually restart the
deployments on both the `us-west1` and `us-central1-a` clusters.

Run:

```bash
# us-west1 cluster: stop the monitoring deployments.
gcloud container clusters get-credentials llvm-premerge-cluster-us-west --location us-west1
kubectl scale --replicas=0 --namespace grafana deployments \
  grafana-k8s-monitoring-opencost \
  grafana-k8s-monitoring-kube-state-metrics \
  grafana-k8s-monitoring-alloy-events

# us-central1-a cluster: stop the monitoring deployments and the metrics container.
gcloud container clusters get-credentials llvm-premerge-cluster-us-central --location us-central1-a
kubectl scale --replicas=0 --namespace grafana deployments \
  grafana-k8s-monitoring-opencost \
  grafana-k8s-monitoring-kube-state-metrics \
  grafana-k8s-monitoring-alloy-events
kubectl scale --replicas=0 --namespace metrics deployments metrics
```

:warning: The `metrics` namespace only exists in the `us-central1-a` cluster.

Wait until the command `kubectl get deployments --namespace grafana` shows
all deployments have been scaled down to zero. Then run:

```bash
# us-west1 cluster: start the monitoring deployments again.
gcloud container clusters get-credentials llvm-premerge-cluster-us-west --location us-west1
kubectl scale --replicas=1 --namespace grafana deployments \
  grafana-k8s-monitoring-opencost \
  grafana-k8s-monitoring-kube-state-metrics \
  grafana-k8s-monitoring-alloy-events

# us-central1-a cluster: start the monitoring deployments and the metrics container.
gcloud container clusters get-credentials llvm-premerge-cluster-us-central --location us-central1-a
kubectl scale --replicas=1 --namespace grafana deployments \
  grafana-k8s-monitoring-opencost \
  grafana-k8s-monitoring-kube-state-metrics \
  grafana-k8s-monitoring-alloy-events
kubectl scale --replicas=1 --namespace metrics deployments metrics
```
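
The wait between scaling down and scaling back up can be scripted instead of watching `kubectl get deployments` by hand. This is a minimal sketch under a few assumptions: the function name is hypothetical, it polls only the deployments passed to it, and it treats a missing `.status.replicas` field as zero, since that field may be omitted once no pods remain.

```bash
# Hypothetical helper, not part of the llvm-zorg repository.
# Usage: wait_scaled_down <namespace> <deployment>...
# Blocks until every listed deployment reports zero replicas.
wait_scaled_down() {
  local namespace="$1"
  shift
  local name replicas pending
  while true; do
    pending=0
    for name in "$@"; do
      # An empty result (field absent) is treated as 0 replicas.
      replicas="$(kubectl get deployment "$name" --namespace "$namespace" \
        -o jsonpath='{.status.replicas}')"
      if [ "${replicas:-0}" -ne 0 ]; then
        pending=$((pending + 1))
      fi
    done
    if [ "$pending" -eq 0 ]; then
      return 0
    fi
    echo "waiting: $pending deployment(s) still have pods"
    sleep 5
  done
}
```

It would be called once per cluster after the scale-down commands, with the same credentials and the same deployment names used there.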

You can check the logs of the restarted services for errors. If the token is
invalid or its scopes are wrong, you should see some `401` error codes.

```bash
kubectl logs -n metrics deployment/metrics
kubectl logs -n grafana deployment/grafana-k8s-monitoring-opencost
```
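
To avoid reading the logs by eye, the check can be reduced to a search for `401`. The helper below is a hypothetical sketch: it looks only at the last ten minutes of logs, and it matches any occurrence of the string `401`, so a hit should be confirmed by reading the surrounding lines.

```bash
# Hypothetical helper, not part of the llvm-zorg repository.
# Usage: check_for_401 <namespace> <deployment>
# Returns non-zero if the recent logs contain the string "401".
check_for_401() {
  local namespace="$1" deployment="$2"
  # --since=10m limits the output to logs written after the restart.
  if kubectl logs --namespace "$namespace" "deployment/$deployment" --since=10m |
    grep -q '401'; then
    echo "$namespace/$deployment: found 401 in recent logs"
    return 1
  fi
  echo "$namespace/$deployment: no 401 in recent logs"
}
```

For example, `check_for_401 metrics metrics` covers the metrics container on the `us-central1-a` cluster.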

At this stage, all long-lived services should be using the new tokens.
**DO NOT DELETE THE OLD TOKEN YET**.
The existing CI jobs can be quite long-lived. We need to wait for them to
finish. New CI jobs will pick up the new tokens.

After 24 hours, log back in to Grafana Cloud, navigate to
`Administration > Users and Access > Cloud Access Policies`, and expand the
token lists.
The `Last used at` value of each new token should be no more than about a
dozen minutes old, while the old tokens should have been unused for several
hours.
If this is the case, congratulations, you've successfully rotated security
tokens! You can now safely delete the old unused tokens.