[CI] Update documentation about token rotation #539

Merged
merged 4 commits into from
Aug 11, 2025
67 changes: 67 additions & 0 deletions premerge/cluster-management.md
@@ -237,3 +237,70 @@ ensure they are in a state consistent with the terraform IaC definitions.

[Strategies for Upgrading ARC](https://www.kenmuse.com/blog/strategies-for-upgrading-arc/)
outlines how ARC should be upgraded and why.

## Grafana tokens

The cluster has multiple services communicating with Grafana Cloud:
- the metrics container
- per-node monitoring (Grafana Alloy, Prometheus node exporter)
- per-cluster monitoring (Opencost, Alloy)

Authentication to Grafana Cloud is handled through `Cloud access policies`.
Currently, the cluster uses 2 kinds of tokens:

- `llvm-premerge-metrics-grafana-api-key`
  Used by: metrics container
  Scopes: `metrics:write`

- `llvm-premerge-testing-grafana-token`
  Used by: Alloy, Prometheus node exporter & other services.
  Scopes: `metrics:read`, `metrics:write`, `logs:write`

We've set up 2 cloud access policies with matching names, so the scopes are
already configured. To rotate a token:

1. Log in to Grafana Cloud.
2. Navigate to `Home > Administration > Users and Access > Cloud Access Policies`.
3. Create a new token in the desired cloud access policy.
4. Log in to `GCP > Security > Secret Manager`.
5. Click on the secret to update.
6. Click on `New version`.
7. Paste the token displayed in Grafana and tick `Disable all past versions`.

At this stage, the secret should have a **single** enabled version on GCP. If
you display its value, you should see the Grafana token.
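
For readers who prefer the command line, steps 4 to 7 can probably be
approximated with `gcloud`. The sketch below only builds the command lines and
checks the resulting version states; it does not run anything. The secret ID
is an assumption (this document does not state how the Secret Manager secrets
are named), and the helper names are made up for illustration.

```python
from typing import List


def add_version_cmd(secret_id: str) -> List[str]:
    """Command creating a new secret version; the token is read from stdin."""
    return ["gcloud", "secrets", "versions", "add", secret_id, "--data-file=-"]


def disable_version_cmd(secret_id: str, version: str) -> List[str]:
    """Command disabling one past version of the secret."""
    return ["gcloud", "secrets", "versions", "disable", version,
            f"--secret={secret_id}"]


def single_enabled(states: List[str]) -> bool:
    """True when exactly one version is ENABLED, as expected after rotation."""
    return sum(1 for s in states if s.upper() == "ENABLED") == 1


if __name__ == "__main__":
    secret = "my-grafana-secret"  # hypothetical secret ID, not from this doc
    print(" ".join(add_version_cmd(secret)))
    print(" ".join(disable_version_cmd(secret, "1")))
    # States as they might be reported for each version of the secret.
    print(single_enabled(["ENABLED", "DISABLED"]))
```

Building the argument lists separately keeps the token itself out of the
command line: it would be piped to the first command through stdin.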

Then, go to the `llvm-zorg` repository. Make sure you have pulled the latest
changes from `main`, and then, as usual, run `terraform apply`.

At this stage, newly created services will use the new token, but existing
deployments still rely on the old one.
To solve this:

1. Log in to `GCP > Kubernetes Engine > Workloads`.
2. Filter by `Type:Deployment`
3. You should see multiple deployments starting with `grafana-`, and one
`metrics` deployment.
Depending on the token you updated, you need to update either the
`metrics`, or the `grafana-*` deployments.
4. For each service to update, click on the service name.
5. Click on `Actions > Scale > Edit Replicas`.
6. The number of replicas should be `1`. Set it to `0`.
7. Press `Scale`, and refresh until all pods are deleted.
8. Click again on `Actions > Scale > Edit Replicas`, and scale up to `1`.
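
The same restart could be scripted. The sketch below picks the deployments
matching the rotated token and builds the equivalent `kubectl scale` commands
without running them. The token-to-deployment mapping is inferred from step 3
and the token descriptions above; the namespace and the example deployment
names are hypothetical.

```python
from fnmatch import fnmatch
from typing import List

# Inferred mapping: rotated token -> deployment name pattern (see step 3).
TOKEN_TO_DEPLOYMENTS = {
    "llvm-premerge-metrics-grafana-api-key": "metrics",
    "llvm-premerge-testing-grafana-token": "grafana-*",
}


def deployments_to_restart(token: str, deployments: List[str]) -> List[str]:
    """Return the deployments whose name matches the pattern for this token."""
    pattern = TOKEN_TO_DEPLOYMENTS[token]
    return [d for d in deployments if fnmatch(d, pattern)]


def restart_cmds(deployment: str, namespace: str) -> List[List[str]]:
    """Scale down to 0, scale back up to 1, then wait for the rollout.

    Between the first two commands, wait until all pods are deleted,
    as in step 7.
    """
    base = ["kubectl", "--namespace", namespace]
    target = f"deployment/{deployment}"
    return [
        base + ["scale", target, "--replicas=0"],
        base + ["scale", target, "--replicas=1"],
        base + ["rollout", "status", target],
    ]


if __name__ == "__main__":
    # Hypothetical deployment names and namespace, for illustration only.
    found = ["metrics", "grafana-alloy", "grafana-opencost"]
    token = "llvm-premerge-testing-grafana-token"
    for name in deployments_to_restart(token, found):
        for cmd in restart_cmds(name, "example-namespace"):
            print(" ".join(cmd))
```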

You can check the logs of the restarted services for errors. If the token is
invalid or its scopes are wrong, you should see some `401` error codes.

At this stage, all long-lived services should be using the new tokens.
**DO NOT DELETE THE OLD TOKEN YET**.
The existing CI jobs can be quite long-lived. We need to wait for them to
finish. New CI jobs will pick up the new tokens.

After 24 hours, go back to
`Administration > Users and Access > Cloud Access Policies` and expand the
token lists.
The `Last used at` value of the new tokens should be about a dozen minutes old
at most, while the old tokens should have been unused for several hours.
If this is the case, congratulations, you've successfully rotated security
tokens! You can now safely delete the old unused tokens.
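
The final check can be written down as a small predicate, which may help if
the rotation is ever automated. This is only a sketch: the thresholds are
assumptions standing in for "about a dozen minutes" and "several hours", and
the `Last used at` timestamps are expected to be copied by hand from the
Grafana UI.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed thresholds; the procedure above does not give exact values.
NEW_TOKEN_MAX_AGE = timedelta(minutes=15)
OLD_TOKEN_MIN_IDLE = timedelta(hours=3)


def old_token_is_deletable(
    new_last_used: Optional[datetime],
    old_last_used: Optional[datetime],
    now: datetime,
) -> bool:
    """True if the new token is in active use and the old one has gone idle.

    `None` means the token has never been used.
    """
    if new_last_used is None:
        return False  # nothing has picked up the new token yet
    new_is_active = now - new_last_used <= NEW_TOKEN_MAX_AGE
    old_is_idle = (
        old_last_used is None or now - old_last_used >= OLD_TOKEN_MIN_IDLE
    )
    return new_is_active and old_is_idle


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(old_token_is_deletable(
        new_last_used=now - timedelta(minutes=5),
        old_last_used=now - timedelta(hours=20),
        now=now,
    ))
```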