
Conversation

wking
Member

@wking wking commented Oct 8, 2025

This is addressing the same HyperShift-scraping issue as #1240. While 1240 is trying to find a long-term path, it requires HyperShift-repo changes to wire up, and those haven't been written yet. This pull request buys time by wiring the existing --hypershift option to code that disables the authentication requirement in that environment. Standalone clusters will continue to require prometheus-k8s ServiceAccount tokens.

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Oct 8, 2025
@openshift-ci-robot
Contributor

@wking: This pull request references Jira Issue OCPBUGS-62861, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Contributor

openshift-ci bot commented Oct 8, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 8, 2025
@wking wking force-pushed the disable-metrics-auth-on-hypershift branch from f35f20a to e16ed47 on October 8, 2025 at 21:58
In 313f8fb (CVO protects /metrics with authorization, 2025-07-22, openshift#1215) and
833a491 (CVO protects /metrics with authorization, 2025-07-22, openshift#1215), the
/metrics endpoint began requiring client auth.  The only
authentication system was Bearer tokens, and the only authorization
system was validating that the token belonged to
system:serviceaccount:openshift-monitoring:prometheus-k8s.

That worked well for standalone clusters, where the ServiceMonitor
scraper is the Prometheus from the openshift-monitoring namespace.
But it broke scraping on HyperShift [1], where the ServiceMonitor does
not request any client authorization [2].  Getting ServiceAccount
tokens (and keeping them fresh [3]) from the hosted cluster into a
Prometheus scraper running on the management cluster is hard.

This commit buys time to sort out a HyperShift metrics authentication
strategy by wiring the existing --hypershift option to code that
disables the authentication requirement in that environment.
Standalone clusters will continue to require prometheus-k8s
ServiceAccount tokens.
@wking wking force-pushed the disable-metrics-auth-on-hypershift branch from e16ed47 to a526efe on October 8, 2025 at 22:17
@wking
Member Author

wking commented Oct 8, 2025

Looking for a signal we can use for verification: #1215 ran e2e-hypershift. It passed (although openshift/hypershift#6965 is in flight to make the tests fail on this kind of issue in the future), but digging into the gathered artifacts turned up:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1215/pull-ci-openshift-cluster-version-operator-main-e2e-hypershift/1952739873462947840/artifacts/e2e-hypershift/dump-management-cluster/artifacts/artifacts.tar | tar -xOz logs/artifacts/output/hostedcluster-d44932313dd1be2d3560-mgmt/namespaces/openshift-monitoring/pods/prometheus-k8s-0/prometheus/prometheus/logs/current.log | grep cluster-version-operator
2025-08-05T15:53:48.316469617Z time=2025-08-05T15:53:48.316Z level=ERROR source=manager.go:176 msg="error reloading target set" component="scrape manager" err="invalid config id:serviceMonitor/e2e-clusters-ghd95-node-pool-6dl4k/cluster-version-operator/0"
2025-08-05T15:53:48.316543150Z time=2025-08-05T15:53:48.316Z level=ERROR source=manager.go:176 msg="error reloading target set" component="scrape manager" err="invalid config id:serviceMonitor/e2e-clusters-bmg8g-proxy-jplkn/cluster-version-operator/0"
2025-08-05T15:53:48.316617911Z time=2025-08-05T15:53:48.316Z level=ERROR source=manager.go:176 msg="error reloading target set" component="scrape manager" err="invalid config id:serviceMonitor/e2e-clusters-qnv7p-create-cluster-sxsvl/cluster-version-operator/0"

Not all that clear to me what it thought was invalid about the config. Maybe that's the scraping 401 that we're trying to address? Maybe not?

@wking
Member Author

wking commented Oct 8, 2025

David's got a more straightforward take on this approach in #1242, so let's pivot to that.

@wking
Member Author

wking commented Oct 9, 2025

Looks like error reloading target set...invalid config id: means a scrape pool without a scrape config. I can try to cross-ref the Prometheus errors against the e2e run completing and tearing down the namespace. From the test-case's destroy.log:

{"level":"info","ts":1754409030.0325892,"msg":"Deleting hosted cluster","namespace":"e2e-clusters-qnv7p","name":"create-cluster-sxsvl"}

Converting from Unix to UTC:

$ date --utc --iso=s --date '@1754409030'
2025-08-05T15:50:30+00:00

Which indeed predates the 2025-08-05T15:53:48 error log. So hooray, I understand what those logs are about. But I'm back to not knowing how to verify whether this fix is working or not.

Contributor

openshift-ci bot commented Oct 9, 2025

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
ci/prow/e2e-aws-ovn-techpreview | a526efe | link | true | /test e2e-aws-ovn-techpreview
ci/prow/okd-scos-e2e-aws-ovn | a526efe | link | false | /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@wking
Member Author

wking commented Oct 9, 2025

Closing in favor of the #1242 approach, now picked forward onto main in #1243.

@DavidHurta
Contributor

Closing in favor of the #1242 approach, now picked forward onto main in #1243.

The PR is not closed. Closing. If I misunderstood, please reopen.

/close

@openshift-ci openshift-ci bot closed this Oct 9, 2025
Contributor

openshift-ci bot commented Oct 9, 2025

@DavidHurta: Closed this PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
