Skip to content

Conversation

wking
Copy link
Member

@wking wking commented Oct 8, 2025

In #1215, the /metrics endpoint began requiring client auth. The only authentication system was Bearer tokens, and the only authorization system was validating that the token belonged to
system:serviceaccount:openshift-monitoring:prometheus-k8s.

That worked well for standalone clusters, where the ServiceMonitor scraper is the Prometheus from the openshift-monitoring namespace. But it broke scraping on HyperShift, where the ServiceMonitor does not request any client authorization. Getting ServiceAccount tokens (and keeping them fresh) from the hosted cluster into a Prometheus scraper running on the management cluster is hard. But for other ServiceMonitors, HyperShift configures keySecret asking the scraper to get its client certificate from a metrics-client Secret that the HostedControlPlane controller maintains in the management cluster namespace.

This commit adds a new --serving-client-certificate-authorities-file option, which HyperShift can set when it configures the CVO Deployment, while mounting the root CA ConfigMap into the Pod.

Now that there are three files (serving key, serving cert, client CAs) that we might be watching for changes, I've taken the opportunity to refactor the checksumming and change-tracking to use a map from filename to checksum. That way we can extend more easily if further configuration files are added in the future, without having to pass around a series of paths and checksums as distinct arguments.

I'm a bit sad about AppendCertsFromPEM only returning a boolean, leaving us unsure about whether all the certificates in the file were parsed, or, if there were parsing issues, what those issues were. But with HyperShift hopefully reliably setting up CA Secrets that do not cause parsing issues, I'm ok not coding that defensively. If the standard library grows a parser that is more transparent about parsing issues, we can pivot to that once it exists.

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Oct 8, 2025
@openshift-ci-robot
Copy link
Contributor

@wking: This pull request references Jira Issue OCPBUGS-62858, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

In #1215, the /metrics endpoint began requiring client auth. The only authentication system was Bearer tokens, and the only authorization system was validating that the token belonged to
system:serviceaccount:openshift-monitoring:prometheus-k8s.

That worked well for standalone clusters, where the ServiceMonitor scraper is the Prometheus from the openshift-monitoring namespace. But it broke scraping on HyperShift, where the ServiceMonitor does not request any client authorization. Getting ServiceAccount tokens (and keeping them fresh) from the hosted cluster into a Prometheus scraper running on the management cluster is hard. But for other ServiceMonitors, HyperShift configures keySecret asking the scraper to get its client certificate from a metrics-client Secret that the HostedControlPlane controller maintains in the management cluster namespace.

This commit adds a new --serving-client-certificate-authorities-file option, which HyperShift can set when it configures the CVO Deployment, while mounting the root CA ConfigMap into the Pod.

Now that there are three files (serving key, serving cert, client CAs) that we might be watching for changes, I've taken the opportunity to refactor the checksumming and change-tracking to use a map from filename to checksum. That way we can extend more easily if further configuration files are added in the future, without having to pass around a series of paths and checksums as distinct arguments.

I'm a bit sad about AppendCertsFromPEM only returning a boolean, leaving us unsure about whether all the certificates in the file were parsed, or, if there were parsing issues, what those issues were. But with HyperShift hopefully reliably setting up CA Secrets that do not cause parsing issues, I'm ok not coding that defensively. If the standard library grows a parser that is more transparent about parsing issues, we can pivot to that once it exists.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Contributor

openshift-ci bot commented Oct 8, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 8, 2025
@wking wking force-pushed the metrics-server-trust-client-CAs branch from 34b5010 to f4f37d7 Compare October 8, 2025 20:59
…ile option

In 313f8fb (CVO protects /metrics with authorization, 2025-07-22, openshift#1215) and
833a491 (CVO protects /metrics with authorization, 2025-07-22, openshift#1215), the
/metrics endpoint began requiring client auth.  The only
authentication system was Bearer tokens, and the only authorization
system was validating that the token belonged to
system:serviceaccount:openshift-monitoring:prometheus-k8s.

That worked well for standalone clusters, where the ServiceMonitor
scraper is the Prometheus from the openshift-monitoring namespace.
But it broke scraping on HyperShift [1], where the ServiceMonitor does
not request any client authorization [2].  Getting ServiceAccount
tokens (and keeping them fresh [3]) from the hosted cluster into a
Prometheus scraper running on the management cluster is hard.  But for
other ServiceMonitors, HyperShift configures keySecret [4] asking the
scraper to get its client certificate from a metrics-client Secret
that the HostedControlPlane controller maintains in the management
cluster namespace [5].

This commit adds a new --serving-client-certificate-authorities-file
option, which HyperShift can set when it configures the CVO Deployment
[6], while mounting the root CA ConfigMap into the Pod.

Now that there are three files (serving key, serving cert, client CAs)
that we might be watching for changes, I've taken the opportunity to
refactor the checksumming and change-tracking to use a map from
filename to checksum.  That way we can extend more easily if further
configuration files are added in the future, without having to pass
around a series of paths and checksums as distinct arguments.

I'm a bit sad about AppendCertsFromPEM only returning a boolean [7],
leaving us unsure about whether all the certificates in the file were
parsed, or, if there were parsing issues, what those issues were.  But
with HyperShift hopefully reliably setting up CA Secrets that do not
cause parsing issues, I'm ok not coding that defensively.  If the
standard library grows a parser that is more transparent about parsing
issues, we can pivot to that once it exists.

[1]: https://issues.redhat.com/browse/OCPBUGS-62858
[2]: https://github.com/openshift/hypershift/blob/2728a4e2899ee58690e3d4e58b319eb5e78643af/control-plane-operator/controllers/hostedcontrolplane/v2/assets/cluster-version-operator/servicemonitor.yaml#L18
[3]: https://kubernetes.io/docs/concepts/configuration/secret/#serviceaccount-token-secrets
[4]: https://github.com/openshift/hypershift/blob/2728a4e2899ee58690e3d4e58b319eb5e78643af/control-plane-operator/controllers/hostedcontrolplane/v2/assets/kube-apiserver/servicemonitor.yaml#L24-L26
[5]: https://github.com/openshift/hypershift/blob/2728a4e2899ee58690e3d4e58b319eb5e78643af/control-plane-operator/controllers/hostedcontrolplane/pki/sa.go#L35-L36
[6]: https://github.com/openshift/hypershift/blob/2728a4e2899ee58690e3d4e58b319eb5e78643af/control-plane-operator/controllers/hostedcontrolplane/v2/assets/cluster-version-operator/deployment.yaml#L25-L35
[7]: https://pkg.go.dev/crypto/x509#CertPool.AppendCertsFromPEM
Copy link
Contributor

openshift-ci bot commented Oct 9, 2025

@wking: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-aws-ovn d7ca459 link false /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Copy link
Contributor

@DavidHurta DavidHurta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic looks solid 👍 I have no objections regarding the core of the PR.

A few nits and notes.

}

func (a *authHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: empty line

Comment on lines 705 to 706
// If no errors occur, returns true if both files have changed and
// neither is an empty file. Otherwise returns false and any error.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description is now outdated - return values, parameters.

for fName := range checksums {
err := requireFileExistsAndNotEmpty(fName)
if err != nil {
return false, nil, fmt.Errorf("Server configuration file must exist and be non-empty. Certificates will not be rotated. %w", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I am not sure whether the error will always contain the filename.

Suggested change
return false, nil, fmt.Errorf("Server configuration file must exist and be non-empty. Certificates will not be rotated. %w", err)
return false, nil, fmt.Errorf("Server configuration file %s must exist and be non-empty. Certificates will not be rotated. %w", fName, err)

// metrics serving options
cmd.PersistentFlags().StringVar(&opts.ServingCertFile, "serving-cert-file", opts.ServingCertFile, "The X.509 certificate file for serving metrics over HTTPS. You must set both --serving-cert-file and --serving-key-file unless you set --listen empty.")
cmd.PersistentFlags().StringVar(&opts.ServingKeyFile, "serving-key-file", opts.ServingKeyFile, "The X.509 key file for serving metrics over HTTPS. You must set both --serving-cert-file and --serving-key-file unless you set --listen empty.")
cmd.PersistentFlags().StringVar(&opts.ServingClientCAFile, "serving-client-certificate-authorities-file", opts.ServingClientCAFile, "The X.509 certificates file containing a series of PEM encoded certificates for Certificate Authorities trusted to sign metrics client certificates. If set, only clients who provide mTLS client certs signed by those CAs will be trusted. If unset, only clients with valid tokens for the prometheus-k8s ServiceAccount in the openshift-monitoring namespace will be trusted.")
Copy link
Contributor

@DavidHurta DavidHurta Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently have a --metrics-ca-bundle-file flag. For consistency, maybe we can rename the new flag to something like --serving-client-ca-bundle-file (if my terminology is correct). Not that important, but any potential renaming of flags is not fun after they are integrated into other systems. I will not block on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants