OCPBUGS-62861: pkg/cvo/metrics: Do not require auth when --hypershift is set #1241
Conversation
@wking: This pull request references Jira Issue OCPBUGS-62861, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: wking. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from f35f20a to e16ed47.
In 313f8fb (CVO protects /metrics with authorization, 2025-07-22, openshift#1215) and 833a491 (CVO protects /metrics with authorization, 2025-07-22, openshift#1215), the /metrics endpoint began requiring client auth. The only authentication system was Bearer tokens, and the only authorization system was validating that the token belonged to system:serviceaccount:openshift-monitoring:prometheus-k8s. That worked well for standalone clusters, where the ServiceMonitor scraper is the Prometheus from the openshift-monitoring namespace. But it broke scraping on HyperShift [1], where the ServiceMonitor does not request any client authorization [2]. Getting ServiceAccount tokens (and keeping them fresh [3]) from the hosted cluster into a Prometheus scraper running on the management cluster is hard. This commit buys time to sort out a HyperShift metrics authentication strategy by wiring the existing --hypershift option to code that disables the authentication requirement in that environment. Standalone clusters will continue to require prometheus-k8s ServiceAccount tokens.
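For illustration only, here is a minimal sketch (not the actual CVO code) of the shape this wiring could take: a --hypershift flag that, when set, skips the bearer-token check in front of the /metrics handler. The names metricsHandler, authorized, and withAuth, the port 9099, and the trivial token check are all assumptions made up for this sketch.

package main

import (
	"flag"
	"log"
	"net/http"
	"strings"
)

// metricsHandler stands in for the real Prometheus /metrics handler.
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("example_metric 1\n"))
}

// authorized is a placeholder for the real check that the bearer token
// belongs to system:serviceaccount:openshift-monitoring:prometheus-k8s
// (which would go through the Kubernetes TokenReview API).
func authorized(token string) bool {
	return token != ""
}

// withAuth rejects requests that do not carry an acceptable bearer token.
func withAuth(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		token := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
		if !authorized(token) {
			http.Error(w, "Unauthorized", http.StatusUnauthorized)
			return
		}
		next(w, r)
	}
}

func main() {
	hypershift := flag.Bool("hypershift", false, "run against a HyperShift hosted control plane")
	flag.Parse()

	handler := withAuth(metricsHandler)
	if *hypershift {
		// HyperShift's ServiceMonitor scrapes without client credentials,
		// so serve /metrics unauthenticated there for now.
		handler = metricsHandler
	}
	http.HandleFunc("/metrics", handler)
	log.Fatal(http.ListenAndServe(":9099", nil))
}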
Force-pushed from e16ed47 to a526efe.
Looking for a signal we can use for verification, #1215 ran e2e-hypershift. It passed (although openshift/hypershift#6965 is in flight to get the tests failing on this kind of issue in the future), but digging into gathered artifacts:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1215/pull-ci-openshift-cluster-version-operator-main-e2e-hypershift/1952739873462947840/artifacts/e2e-hypershift/dump-management-cluster/artifacts/artifacts.tar | tar -xOz logs/artifacts/output/hostedcluster-d44932313dd1be2d3560-mgmt/namespaces/openshift-monitoring/pods/prometheus-k8s-0/prometheus/prometheus/logs/current.log | grep cluster-version-operator
2025-08-05T15:53:48.316469617Z time=2025-08-05T15:53:48.316Z level=ERROR source=manager.go:176 msg="error reloading target set" component="scrape manager" err="invalid config id:serviceMonitor/e2e-clusters-ghd95-node-pool-6dl4k/cluster-version-operator/0"
2025-08-05T15:53:48.316543150Z time=2025-08-05T15:53:48.316Z level=ERROR source=manager.go:176 msg="error reloading target set" component="scrape manager" err="invalid config id:serviceMonitor/e2e-clusters-bmg8g-proxy-jplkn/cluster-version-operator/0"
2025-08-05T15:53:48.316617911Z time=2025-08-05T15:53:48.316Z level=ERROR source=manager.go:176 msg="error reloading target set" component="scrape manager" err="invalid config id:serviceMonitor/e2e-clusters-qnv7p-create-cluster-sxsvl/cluster-version-operator/0"
Not all that clear to me what it thought was invalid about the config. Maybe that's the scraping 401 that we're trying to address? Maybe not?
David's got a more straightforward take on this approach in #1242, so let's pivot to that.
Looks like
Converting from Unix to UTC:
$ date --utc --iso=s --date '@1754409030'
2025-08-05T15:50:30+00:00
Which indeed predates the 2025-08-05T15:53:48 error log. So hooray, I understand what those logs are about. But I'm back to not knowing how to verify if this fix is working or not.
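Lacking a CI signal, one blunt way to check by hand would be to hit the CVO's /metrics endpoint once anonymously and once with a prometheus-k8s ServiceAccount token and compare status codes: with --hypershift both requests should succeed, without it the anonymous one should get a 401. A throwaway probe along those lines follows; the URL, port, and token source are assumptions, and TLS verification is skipped only because this is an ad-hoc check.

package main

import (
	"crypto/tls"
	"fmt"
	"log"
	"net/http"
	"os"
)

// fetch GETs url, optionally with a bearer token, and returns the HTTP status code.
func fetch(url, token string) (int, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return 0, err
	}
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	client := &http.Client{Transport: &http.Transport{
		// Ad-hoc check only; do not skip certificate verification in real tooling.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	url := os.Getenv("CVO_METRICS_URL") // e.g. https://<cvo-pod-ip>:9099/metrics (assumed)
	token := os.Getenv("TOKEN")         // e.g. a prometheus-k8s ServiceAccount token

	anon, err := fetch(url, "")
	if err != nil {
		log.Fatal(err)
	}
	authed, err := fetch(url, token)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("anonymous: %d, with token: %d\n", anon, authed)
}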
@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@DavidHurta: Closed this PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This is addressing the same HyperShift-scraping issue as #1240. While #1240 is trying to find a long-term path, it requires HyperShift-repo changes to wire up, and those haven't been written yet. This pull request buys time by wiring the existing --hypershift option to code that disables the authentication requirement in that environment. Standalone clusters will continue to require prometheus-k8s ServiceAccount tokens.