Skip to content

Conversation

ndk
Copy link
Contributor

@ndk ndk commented Jun 23, 2025

The proposal has been reworked. Closed in favour of #2125

#1469 · feat: declarative Grafana Service Account management

Design proposal: #003


Why

The Grafana Operator lets you manage Grafana through Kubernetes CRs, but service accounts were still a manual step (GUI or HTTP API). This PR lets you declare SAs in the Grafana CR so the operator can:

  • Manage service accounts in Grafana
  • Generate API tokens
  • Store each token in a Kubernetes Secret
  • Clean everything up when the CR changes or is removed

What's inside

  • New field spec.grafanaServiceAccounts in the Grafana CR.
  • GrafanaServiceAccountReconciler runs after Grafana is ready.
  • Full managing flow for service accounts, tokens, and secrets.
  • Operator records managed items in status.serviceAccounts and exposes conditions.
  • Chainsaw e2e: tests/e2e/grafanaserviceaccount/chainsaw-test.yaml.

Design notes

  • Grafana's API has no labels, so we track ownership only via status.serviceAccounts. User‑managed SAs are never touched.
  • Controllers patch the CR's status (no full updates) to lower conflict risk.
  • The SA reconciler reads the live CR (not the cached client) to avoid stale reads during fast test loops (WATCH gap issue). This is worth paying close attention to because even though it's a valid way, it seems like "the last resort"
  • Permissions is out of the scope
  • orgId is default

Out of scope (for now)

  • Multi‑org support. All calls target the default Grafana org.
  • Cross‑namespace Secret writes.
  • Automatic token rotation (expires) and Enterprise‑only permission rules.
  • Permissions (it's a Enterprise/Cloud only thing).

Known limitations

  • If the controller creates an SA but crashes before patching status, the next run will try to create the SA again and hit a 409 conflict.
  • Status patching reduces write conflicts, but doesn't eliminate. Using single responsibility approach and/or SSA is a possible future improvement.
  • Token secrets are always written to the Grafana namespace.

TODO

CR example

apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: mygrafana
spec:
  grafanaServiceAccounts:
    accounts:
    - id: myaccount
      name: myaccount
      role: Viewer
      isDisabled: false
      tokens:
      - name: tokena
      - name: tokenb
    - id: myaccount2
      name: myaccount2
      role: Admin
      isDisabled: true
      tokens:
      - name: tokenc
        expires: "2025-12-31T23:59:59Z"
      - name: tokend

@github-actions github-actions bot added documentation Issues relating to documentation, missing, non-clear etc. feature this PR introduces a new feature labels Jun 23, 2025
@ndk ndk force-pushed the feat_grafana_sa_opt_step branch 3 times, most recently from 48731e2 to dc10936 Compare June 25, 2025 12:55
@ndk ndk force-pushed the feat_grafana_sa_opt_step branch from 5256a5e to ff42690 Compare July 2, 2025 08:09
Copy link
Collaborator

@Baarsgaard Baarsgaard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of these I don't expect to have fixed and should be considered notes for further discussion.
When we're aligned on a final implementation I will review more thoroughly :)

Comment on lines +97 to +100
// GenerateTokenSecret, if true, will create one default API token in a Secret if no Tokens are specified.
// If false, no token is created unless explicitly listed in Tokens.
// +kubebuilder:default=true
GenerateTokenSecret bool `json:"generateTokenSecret,omitempty"`
Copy link
Collaborator

@Baarsgaard Baarsgaard Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot imagine a scenario where a default service account is wanted as opposed to fully defining the spec?
The GrafanaServiceAccounts struct is nice to have for potential future configuration values, if we can think of extra aside from GenerateTokenSecret.

Copy link
Contributor Author

@ndk ndk Jul 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, maybe it's time to revise the proposal? :)
https://github.com/grafana/grafana-operator/blob/master/docs/docs/proposals/003-grafanaserviceaccount-crd.md

apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana-sa
  namespace: grafana-namespace
spec:
  grafanaServiceAccounts: #Not sure if this is the right place to place it but thats easily fixed when implementing.
    createServiceAccount:
        generateTokenSecret: [true/false] #Will create the k8s-secret with a default name if true. Defaults to true.
    accounts: #Since its possible today to have multiple service accounts it should be a list of accounts.
        - id: grafana-sa
          name: grafana-service-account
          roles: [Viewer/Editor/Admin]
          tokens: #This is a list of the tokens that belongs to this GSA and that the operator should create k8s-secrets with tokens for with the names specified. If not specified it would default to creating a token in a k8s-secret with a default name if spec.createServiceAccount.generateTokenSecret is true.
              - Name: grafana-sa-token-<name-of-GSA>
              expires: <Absolute date for expiration, defaults to Never>
          permissions:    #This is to try and match what values can be set when creating GSA in the GUI where you can set different permissions for users and groups.
              - user: <users in the cluster/root user etc>
              permission: [Edit/Admin]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the proposal is just a general idea on how we thought it'd be good to implement it at the time of writing. We're not bound to implement it as-is if we find areas to improve upon

Comment on lines +148 to +155
{
sar := newGrafanaServiceAccountReconciler(r.Client, r.Scheme)
err := sar.reconcile(ctx, cr)
if err != nil {
log.Error(err, "Failed to reconcile grafana service accounts")
return ctrl.Result{RequeueAfter: RequeueDelay}, nil
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot figure out how you rotate expiring secrets when the Grafana CR is not requeued?
From what I know, unless we explicitly return a RequeueAfter, no reconcile will happen unless the spec is changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Token rotation is in TODO list yet. And yes, RequeueAfter is the main option.

Comment on lines +56 to +64
type GrafanaServiceAccountTokenSpec struct {
// Name is the name of the Kubernetes Secret (and token identifier in Grafana). The secret will contain the token value.
// +kubebuilder:validation:Required
Name string `json:"name"`

// Expires is the optional expiration time for the token. After this time, the operator may rotate the token.
// +kubebuilder:validation:Optional
Expires *metav1.Time `json:"expires,omitempty"`
}
Copy link
Collaborator

@Baarsgaard Baarsgaard Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect some users would want to have secrets created outside the current namespace.
Example: allowing teams access to a Grafana instance by creating External instances with limited permissions on the GrafanaServiceAccount.

@ndk ndk force-pushed the feat_grafana_sa_opt_step branch from ff42690 to 10b8a27 Compare July 6, 2025 08:52
@theSuess
Copy link
Collaborator

theSuess commented Jul 9, 2025

As discussed in the weekly, I took a look at the broad strokes of this PR today.

All in all, I'm happy with how it looks and the overall flow of the implementation so good job on that!

One thing that stood out to me was the complexity that is introduced by keeping track of existing/removed service accounts in the status. This would be way easier when if we had a resource for each service account as we could use finalizers and wouldn't have to keep track of state in the CR.

The reason why we went with service accounts in the Grafana spec instead is based in security concerns, as everyone, with the permission to create service accounts, would be able to take over arbitrary Grafana instances.

An idea on how we could get the best of both worlds could be:

  • Implement a separate resource for GrafanaServiceAccount
  • When reconciling the instance:
    • Create GrafanaServiceAccount based on entries in the grafana spec
    • Delete all GrafanaServiceAccount resources pointing to the current instance & created by the operator (identifiable through owner references)

This has the following benefits:

Happy to discuss this further here or in the next weekly. Would also appreciate comments on this from the other maintainers (maybe even @nissessenap as he was very involved in the original proposal 👀 )

@Baarsgaard
Copy link
Collaborator

Baarsgaard commented Jul 10, 2025

Just making sure I understand your suggestions, the flow would be something like:

  1. Grafana CR is reconciled.
  • Creates Ingress/Route, PVC, Deployment, Etc.
  • As a last step it now also creates GrafanaServiceAccount CRs
  1. A dedicated controller picks up the GrafanaServiceAccount CRs and reconciles the accounts using most of the current implementation.

This would solve a lot of challenges and remove some of the uniqueness.

  • Would simplify handling Token rotation as the controller can return a RequeueAfter: Expiry-10*time.Second and ensure tokens never fully expire by rotating them a bit early.
  • The Service accounts can be tracked using a NamespacedResourceList and individually have a complex status and conditions.
  • The Grafana controller reconciles the difference between the Grafana CR and the created GrafanaServiceAccount CRs.
    This will likely require special handling if a user attempts to delete a GrafanaServiceAccount CRs directly without updated the Grafana CR
  • The GrafanaServiceAccount CRD would be similar to GrafanaNotificationPolicyRoute in that it does not implement GrafanaCommonSpec

@theSuess
Copy link
Collaborator

Yes, with one small trick when reconciling the difference between the accounts in the Grafana CR and the GrafanaServiceAccount CRs: we don't need to do any set matching but can just apply all service accounts + delete all resources with owner-reference set to grafana instance but not present in the list. When a user directly deletes a service account resource, it'd trigger a reconcile of the instance which will immediately recreate it

@nissessenap
Copy link
Collaborator

Hi all, sounds like an interesting idea @theSuess , thanks for the explanation of the flow @Baarsgaard .

I can't remember where, but there were some arguments that we shouldn't have to be forced in to use OPA or similar tools to be able to have a "secure" grafana-operator instance.

It sounds like this solution would solve the concerns raced in #1469 (comment), but still use the majority of the implementation done by @ndk in this PR.

How do you feel about it @ndk?

I assume we would use the GrafanaServiceAccount CR would use the allowCrossNamespaceImport and instanceSelector just like all other CRs.

So I think it sounds like a simple workaround but still having the same base functionality as discussed in the design.

@ndk
Copy link
Contributor Author

ndk commented Jul 10, 2025

Let me know once you've settled on the final approach. I've already switched from a standalone CR to the embedded one once. :)

FYI I'm a backend engineer, not a DevOps specialist, so I'm keen to learn about any pitfalls you see here but don't have a strong preference myself.

@Baarsgaard
Copy link
Collaborator

Baarsgaard commented Jul 13, 2025

Spent some time drafting what Gateway AllowedRoutes might look like on a Grafana CR.

apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: usermirror
  namespace: userspace
spec:
  # allowMatcher:
  # allowedResources:
  # resourceMatcher:
  allowedMatchers: # First match wins
    - namespaces:
        from: Selector # All, Selector, Same, None. No default
        selector: # labelSelector
          matchExpressions:
            - key: kubernetes.io/metadata.name # Example from Kong ingress. Default?
              operator: In # NotIn
              values: ["data", "alerts"]
          matchLabels: {} # Default is {}. Matches all labels
    - kinds:
        # group is In AllowedRoutes, should be Default or omitted entirely
        # If omitted, `kind:` could be removed as well, or be kept as future proofing
        - group: "grafana.integreatly.org/v1beta1"
          kind: GrafanaContactPoint
---
# New Default?
# Maintains current behaviour but only allows GrafanaServiceAccounts within the namespace
spec:
  allowedMatchers:
    - namespaces:
        from: Same
        kinds: # Optional, matches all if undefined/empty
          - kind: "GrafanaServiceAccount"
    - namespaces:
        from: All
---
spec:
  allowedMatchers:
    # Matches all and allows None
    # If set as the first in the list, no resources can be applied, is pretty useless.
    - namespaces:
        from: None
---
spec:
  allowedMatchers:
    # Matches all and allows resources within the namespace.
    # Blocks any resource outside the namespace with allowCrossNamespaceImport: true
    - namespaces:
        from: Same

As I don't expect Matchers to be immutable, the only "unknown" is if there's a better way to handle finalizing in cases where a resource no longer matches.
We could with the existing method of attempt removal anyways, and if users have overlapping UIDs we expect resyncronizing will fix it eventually.

@ndk
Copy link
Contributor Author

ndk commented Aug 4, 2025

The proposal has been reworked. Closed in favour of #2125

@ndk ndk closed this Aug 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Issues relating to documentation, missing, non-clear etc. feature this PR introduces a new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants