-
Notifications
You must be signed in to change notification settings - Fork 520
OPRUN-4172: Deprecate serviceAccount field in OLMv1 ClusterExtension API #1860
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
OPRUN-4172: Deprecate serviceAccount field in OLMv1 ClusterExtension API #1860
Conversation
Signed-off-by: Rashmi Gottipati <[email protected]>
Signed-off-by: Rashmi Gottipati <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@rashmigottipati: This pull request references OPRUN-4172 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Signed-off-by: Rashmi Gottipati <[email protected]>
Signed-off-by: Rashmi Gottipati <[email protected]>
This looks like it makes changes to APIs on OpenShift but doesn't include an API reviewer in the set of approvers. Please reach out on the #forum-api-review channel in slack to request an API reviewer be assigned to review this enhancement. Thanks! |
Thanks @everettraven. I’ve reached out in #forum-api-review on Slack to request an API reviewer for this enhancement. I’ll update the PR’s approvers list once a reviewer is assigned. |
- "@grokspawn" | ||
approvers: | ||
- "@trgeiger" | ||
- "@grokspawn" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- "@grokspawn" | |
- "@grokspawn" | |
- "@joelanford" |
|
||
Deprecate the `.spec.serviceAccount` field from the ClusterExtension API in the operator-controller. This field was originally introduced to enforce least privilege by requiring users to provide a ServiceAccount with the necessary RBAC permissions to manage extension content. This proposal removes that requirement and simplifies controller logic and behavior. | ||
|
||
**Note**: This deprecation is an interim step. While the intent is to eventually remove the `.spec.serviceAccount` field, we will keep it as optional and mark it as deprecated during this transition period spanning multiple releases. This approach allows users time to adapt without disruption before the field is fully removed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the graduation plan is a good place to discuss this, and maybe I'm just imagining that we're going to end up having a lively discussion on the path forward and not wanting to have to update two places.
Maybe it's just me... it seems like a good callout aside from that. 😄
## Motivation | ||
|
||
The original intent of `.spec.serviceAccount` was to support a least-privilege model by allowing users to provide custom `ServiceAccount` with fine-grained permissions. In practice, this design introduced considerable operational and technical complexity, including: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd understood that it also arose from the need to implement a rollout revision approach (using boxcutter) which required cluster admin permissions anyway, so we couldn't reasonably reduce the privileges of a "package manager" from cluster admin levels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you specifically need to use boxcutter for a revision rollout approach?
What are the limitations in doing revision rollouts, regardless of underlying implementation, that makes it impossible for you to continue to adhere to the least-privilege principle?
|
||
- Make `.spec.serviceAccount` field **optional** in the `ClusterExtension` API. | ||
- Update the controller to **ignore** the field during reconciliation. | ||
- Log a warning when `.spec.serviceAccount` is set. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to adopt an approach similar to how k8s tracks deprecation (see: https://kubernetes.io/blog/2020/09/03/warnings/#metrics). For example, we could implement a clusterextension_requested_deprecated_api
metric, or an admission webhook to warn when we try to create a ClusterExtension with a serviceaccount.
// Deprecated: This field is ignored and will be removed in a future release. | ||
// +optional | ||
// +kubebuilder:validation:XValidation:message="serviceAccount is deprecated and ignored" | ||
ServiceAccount *ClusterExtensionServiceAccount `json:"serviceAccount,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We considered moving this to a pointer type to be a potentially breaking change we could avoid, so we're using go1.24's omitzero
capability to omit this field from the struct if it is set to only 'default' values: https://github.com/operator-framework/operator-controller/pull/2242/files#diff-387659018fbee6e27fcf5ab561b45489e14b62ae73ecf00fed3a0f4c017714d1R75
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beforehand, was an empty string (""
) a valid value here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. It would've failed the DNS1123 label check.
- The controller now uses a static `rest.Config` created from its own identity (its default ServiceAccount). | ||
- The RestConfigMapper will now be directly set to `ClusterAdminRestConfigMapper(mgr.GetConfig())`, i.e. the controller always uses its own identity (cluster-admin level config) for all reconciliation and watching operations. | ||
This config is passed to the helm.ActionConfigGetter, ensuring that all Helm operations (install, upgrade, uninstall) use the same identity. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to include the removal of the preflight permissions checks here, or is that not required because those are not GA'd?
|
||
### Graduation Criteria / Deprecation Plan | ||
|
||
We will deprecate the .spec.serviceAccount field over the course of three OpenShift releases following the kubernetes' deprecation policy: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's where we really need arch guidance. We've been informed that GA'd OCP APIs have functionally never been removed, even though nominally we could -- after a reasonable period -- up-version the API and remove the field, including a conversion webhook to handle upgrades.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure what our process here for an up-version of an API is in OpenShift, but if there is one, I imagine it wouldn't be too dissimilar from Kubernetes (https://kubernetes.io/docs/reference/using-api/deprecation-policy/#rest-resources-aka-api-objects).
You cannot remove the API version until it has "aged out" - in which case you are confident that all users have switched over to the newer, non-deprecated API. In the upstream case it looks like release N+9. You also must have a field in the next version of the API to map the old field value to handle round-tripping without data loss.
As far as I know, OpenShift has never removed a stable API version so I would default to not being able to do this.
@JoelSpeed may have more familiarity with whether or not this is something we might have a way to allow.
I know OLM is an upstream project that we sync the APIs downstream, and we generally aren't as restrictive there because we don't entirely control the API surface, but IIUC we consider OLM a core component of the OpenShift payload so I would expect that we end up treating it with a bit more caution.
- The controller’s ServiceAccount needs broader permissions, potentially increasing risk if compromised. | ||
- Cluster administrators must carefully manage the controller’s permissions. | ||
- Possible security concerns when consolidating privileges into a single SA as opposed to separate, per-extension identities. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lost signal if an operator upgrade requires permissions not provided by the existing, configured serviceaccount (since OLMv1 can do anything, we might perpetrate the upgrade and then have the operator fail).
Signed-off-by: Rashmi Gottipati <[email protected]>
@rashmigottipati: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
- "@everettraven" | ||
- "@trgeiger" | ||
- "@grokspawn" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note, there should generally be one approver that is responsible for helping drive to consensus amongst the set of stakeholders (generally represented in the reviewers section).
With multiple approvers it gets difficult to know who is responsible for actually helping to get this "over the line".
The approver should generally be someone from your team, usually a team lead or a staff engineer.
I should also be placed under a specific subgroup of approvers called api-approvers
- Custom `rest.Config` generation and dynamic HTTP clients | ||
- Token expiration edge cases | ||
|
||
Given the limited benefit and high complexity of this approach, we propose simplifying the model: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be helpful if you could clarify what the "limited benefit" here is.
You start this section talking about how the intention was originally for following a least-privilege principle which, on the surface, sounds like it would actually be a pretty big benefit.
After all, looking at the documentation for OLMv1 it seems like "secure by default" is a core tenant tenet: https://github.com/operator-framework/operator-controller/blob/292c0dbc236747e57f77fd914830a865e87fad6b/docs/project/olmv1_design_decisions.md?plain=1#L179-L186
As a reviewer, I'd like to better understand:
- Why this
tenanttenet was originally seen as a win for end users - Why this
tenanttenet turned out to not be a win for end users
## Motivation | ||
|
||
The original intent of `.spec.serviceAccount` was to support a least-privilege model by allowing users to provide custom `ServiceAccount` with fine-grained permissions. In practice, this design introduced considerable operational and technical complexity, including: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you specifically need to use boxcutter for a revision rollout approach?
What are the limitations in doing revision rollouts, regardless of underlying implementation, that makes it impossible for you to continue to adhere to the least-privilege principle?
- Token acquisition, rotation, and error handling | ||
- Custom `rest.Config` generation and dynamic HTTP clients | ||
- Token expiration edge cases |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you actively experiencing issues with being able to handle these scenarios?
From what I remember this logic is all already implemented and is flexed during testing. I'm curious how many times the logic to do these things have been problematic from both an end-user perspective and a maintenance perspective.
- Optional | ||
- Deprecated via struct tags and documentation | ||
- Marked as deprecated in the CRD schema using OpenAPI extensions: | ||
- x-kubernetes-deprecated: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this exists?
As far as I am aware, there is not yet a way to mark a specific field in a CRD schema as deprecated. https://pkg.go.dev/k8s.io/apiextensions-apiserver/pkg/apis/apiextensions#JSONSchemaProps would be your canonical source of things you can set for a particular schema property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like maybe it hasn't landed yet: kubernetes/kubernetes#131817
So the only option is to comment that these are deprecated (in addition to programmatic capture of "a deprecated thing is being used").
// Deprecated: This field is ignored and will be removed in a future release. | ||
// +optional | ||
// +kubebuilder:validation:XValidation:message="serviceAccount is deprecated and ignored" | ||
ServiceAccount *ClusterExtensionServiceAccount `json:"serviceAccount,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beforehand, was an empty string (""
) a valid value here?
|
||
## Drawbacks | ||
- Users lose fine-grained permission control per extension via separate ServiceAccounts. | ||
- The controller’s ServiceAccount needs broader permissions, potentially increasing risk if compromised. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, but there are likely other ServiceAccounts on the cluster that already have broad enough permissions to do damage. To do that damage you have to first get access to the cluster.
This could certainly be considered a drawback to increase the permissions associated with the OLM Service Account, but I think a bigger drawback I don't see explicitly mentioned here is the potential for OLM to be the thing that injected a malicious actor into the cluster because it no longer will fail to stamp out permissions it wasn't explicitly told it could.
While it may still be difficult, we've seen in the past malicious actors sneak something into a popular project that is then shipped out en masse. With this change, OLM could be used to get that access to a cluster.
|
||
1. Keep `.spec.serviceAccount` and Improve Token Handling | ||
Instead of removing the field, we could improve token management (like automatic refresh and better error handling). | ||
- However, this would add more complexity and maintenance work without fixing the main issue: impersonation is complicated and rarely helpful. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading through this, I'm not convinced on the "rarely helpful" part. Please do update this document appropriately to explicitly state why this has been found to be rarely helpful.
- Cluster administrators must carefully manage the controller’s permissions. | ||
- Possible security concerns when consolidating privileges into a single SA as opposed to separate, per-extension identities. | ||
|
||
## Alternatives (Not Implemented) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some alternatives I've not seen mentioned as being considered. Curious if you've thought of them:
- Building better client tooling that makes managing permission requirements easier for users looking to install cluster extensions.
- Building a permission requesting system that makes it easier for users to request installation of a cluster extension and OLM creating a "permission request" that an admin can either approve or deny to automatically stamp out the RBAC needed to complete the installation.
- This would make the API more complex, go against common Kubernetes practices, and could confuse users. | ||
|
||
3. Do Nothing – Keep Supporting the Field as Is | ||
- The added complexity and risks from impersonation outweigh the benefits, so leaving it unchanged is not a good option. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading through this, I still don't have a solid understanding of how painful this added complexity and risks of impersonation actually are
- Remove all logic that: | ||
- Token Acquisition: Eliminate use of TokenRequest API to fetch short-lived tokens for user-provided ServiceAccounts. | ||
- Rest Config Mapping: Remove the ServiceAccountRestConfigMapper, which dynamically generated rest.Config objects for impersonation. | ||
- Synthetic Permissions: Remove conditional logic for SyntheticPermissions that depended on impersonated clients. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do "Synthetic Permissions" play a role in service account based permissions? What are "Synthetic Permissions"?
This PR proposes the deprecation of the
serviceAccount
field in the OLMv1ClusterExtension
API.Key points:
serviceAccount
field in theClusterExtension
APIClusterExtension
API