Conversation

anik120
Contributor

@anik120 anik120 commented Sep 3, 2025

Secure metrics endpoint access of both controllers using kube-rbac-proxy

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Sep 3, 2025
@openshift-ci-robot

@anik120: This pull request references Jira Issue OCPBUGS-59768, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Problem:
OLM metrics endpoints were exposed without proper authentication, allowing unauthorized access to sensitive operational data.

Root Cause:
kube-rbac-proxy configuration was missing critical authentication arguments:

  • --client-ca-file
  • --auth-mode=rbac

Solution:

  • Add missing authentication arguments to kube-rbac-proxy in package-server-manager
  • Mount client CA certificate from new metrics-client-ca ConfigMap
  • Update ServiceMonitor to use scrapeClass: tls-client-certificate-auth
  • Create ConfigMap with service CA injection for automatic CA bundle sync

Security Impact:

  • Before: Any HTTPS client could access metrics endpoints
  • After: Only authorized Prometheus service account (system:serviceaccount:openshift-monitoring:prometheus-k8s) can access metrics
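
The two pieces described in the Solution above can be sketched roughly as follows (the ConfigMap name and the service-CA annotation come from the PR description; the mount path is a hypothetical placeholder, and the `--auth-mode` flag is quoted as written in the PR, though the thread below later shows kube-rbac-proxy rejects it):

```yaml
# ConfigMap whose CA bundle the OpenShift service CA operator injects and rotates
apiVersion: v1
kind: ConfigMap
metadata:
  name: metrics-client-ca
  namespace: openshift-operator-lifecycle-manager
  annotations:
    service.beta.openshift.io/inject-cabundle: "true"
---
# Excerpt of the kube-rbac-proxy container args in package-server-manager
# (/etc/tls/client-ca is an assumed mount path for the ConfigMap above)
args:
- --client-ca-file=/etc/tls/client-ca/ca.crt
- --auth-mode=rbac   # as written in the PR description; later reported as an unknown flag
```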

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@dtfranz
Contributor

dtfranz commented Sep 4, 2025

/retest

@anik120
Contributor Author

anik120 commented Sep 4, 2025

Looks like the cluster failed to launch in the previous run because of infra issues.

/retest

@anik120
Contributor Author

anik120 commented Sep 8, 2025

/retest

1 similar comment
@perdasilva
Contributor

/retest

@perdasilva
Contributor

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Sep 9, 2025
@openshift-ci-robot

@perdasilva: This pull request references Jira Issue OCPBUGS-59768, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @Xia-Zhao-rh

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from Xia-Zhao-rh September 9, 2025 09:10
@perdasilva
Contributor

/approve
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 9, 2025
Contributor

openshift-ci bot commented Sep 9, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: anik120, perdasilva

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 9, 2025
@jianzhangbjz
Contributor

Hi @Xia-Zhao-rh , could you help verify it? Thanks!
/assign @Xia-Zhao-rh

@tmshort
Contributor

tmshort commented Sep 10, 2025

/retest-required

@Xia-Zhao-rh

Cluster launch failed:

level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
level=error msg=failed to initialize the cluster: Multiple errors are preventing progress:
level=error msg=* Cluster operator ingress is not available
level=error msg=* Could not update configmap "openshift-operator-lifecycle-manager/metrics-client-ca" (686 of 957): resource may have been deleted
level=error msg=* Could not update operatorgroup "openshift-monitoring/openshift-cluster-monitoring" (853 of 957): resource may have been deleted
level=error msg=* Could not update role "openshift-operator-lifecycle-manager/operator-lifecycle-manager-metrics" (933 of 957): resource may have been deleted: timed out waiting for the condition
Installer exit with code 6

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 10, 2025
@anik120
Contributor Author

anik120 commented Sep 10, 2025

@perdasilva @Xia-Zhao-rh apologies, looks like the manifest ordering wasn't correct the first time around. It looks correct now, and tests are passing.

@anik120
Contributor Author

anik120 commented Sep 10, 2025

/test unit-registry

@anik120
Contributor Author

anik120 commented Sep 10, 2025

/retest

@Xia-Zhao-rh

Xia-Zhao-rh commented Sep 11, 2025

Hi, @anik120
The package-server-manager pod is in CrashLoopBackOff.
Error message:

xzha@xzha1-mac openshift-tests-private % oc logs package-server-manager-8655984f8f-p7x6w 
Defaulted container "kube-rbac-proxy" out of: kube-rbac-proxy, package-server-manager
Error: unknown flag: --auth-mode

And the catalog-operator-metrics/olm-operator-metrics endpoints still expose all their information without any authorization:

xzha@xzha1-mac openshift-tests-private % oc get ConfigMap metrics-client-ca
NAME                DATA   AGE
metrics-client-ca   1      86m

xzha@xzha1-mac openshift-tests-private % oc exec catalog-operator-6d57dfc67c-g2sjs  -- curl -k https://10.128.0.37:8443/metrics | head
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0# HELP catalog_source_count Number of catalog sources
# TYPE catalog_source_count gauge
catalog_source_count 4
# HELP catalog_source_snapshots_total The number of times the catalog operator has requested a snapshot of data from a catalog source
# TYPE catalog_source_snapshots_total counter
catalog_source_snapshots_total{name="certified-operators",namespace="openshift-marketplace"} 0
catalog_source_snapshots_total{name="community-operators",namespace="openshift-marketplace"} 0
catalog_source_snapshots_total{name="redhat-marketplace",namespace="openshift-marketplace"} 0
catalog_source_snapshots_total{name="redhat-operators",namespace="openshift-marketplace"} 0
# HELP catalogsource_ready State of a CatalogSource. 1 indicates that the CatalogSource is in a READY state. 0 indicates CatalogSource is in a Non READY state.


xzha@xzha1-mac openshift-tests-private % oc exec olm-operator-648957f75c-s5ps7  -- curl -k https://10.128.0.36:8443/metrics | head
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9167    0  9167    0     0   263k      0 --:--:-- --:--:-- --:--:--  263k
# HELP csv_count Number of CSVs successfully registered
# TYPE csv_count gauge
csv_count 1
# HELP csv_succeeded Successful CSV install
# TYPE csv_succeeded gauge
csv_succeeded{name="packageserver",namespace="openshift-operator-lifecycle-manager",version="0.0.1-snapshot"} 1
# HELP csv_upgrade_count Monotonic count of CSV upgrades
# TYPE csv_upgrade_count counter
csv_upgrade_count 0
# HELP go_gc_duration_seconds A summary of the wall-time pause (stop-the-world) duration in garbage collection cycles.  

@perdasilva
Contributor

/hold on comment above

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 12, 2025
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 12, 2025
Contributor

openshift-ci bot commented Sep 12, 2025

New changes are detected. LGTM label has been removed.

@anik120 anik120 force-pushed the rbac-proxy-auth branch 7 times, most recently from 23458cd to 517e260 Compare September 13, 2025 20:48
@anik120 anik120 changed the title OCPBUGS-59768: metrics endpoints security - add client cert auth to kube-rbac-proxy WIP: OCPBUGS-59768: metrics endpoints security - add client cert auth to kube-rbac-proxy Sep 13, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 13, 2025
@anik120 anik120 force-pushed the rbac-proxy-auth branch 3 times, most recently from 95c7915 to d1dedf5 Compare September 13, 2025 21:29
@anik120 anik120 changed the title WIP: OCPBUGS-59768: metrics endpoints security - add client cert auth to kube-rbac-proxy WIP: OCPBUGS-59768: add kube-rbac-proxy sidecars to olm and catalog operators Sep 13, 2025
@openshift-ci-robot

@anik120: This pull request references Jira Issue OCPBUGS-59768, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @Xia-Zhao-rh

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Secure metrics endpoint access of both controllers using kube-rbac-proxy

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Contributor

openshift-ci bot commented Sep 14, 2025

@anik120: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                      Commit    Required   Rerun command
ci/prow/verify                 dabdd4f   true       /test verify
ci/prow/okd-scos-e2e-aws-ovn   dabdd4f   false      /test okd-scos-e2e-aws-ovn
ci/prow/verify-deps            dabdd4f   true       /test verify-deps

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Comment on lines 79 to 84
 func (sc *serverConfig) getAddress(tlsEnabled bool) string {
 	if tlsEnabled {
-		return ":8443"
+		return "127.0.0.1:8443"
 	}
-	return ":8080"
+	return "127.0.0.1:8080"
 }
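
Standalone, the patched helper behaves as sketched below (a minimal reproduction; the real serverConfig struct carries more fields). Binding to loopback means only processes inside the pod, such as a kube-rbac-proxy sidecar, can reach the raw metrics listener:

```go
package main

import "fmt"

// serverConfig mirrors only the shape needed for getAddress here;
// the actual struct in the operator has additional fields.
type serverConfig struct{}

// getAddress returns the listen address for the metrics server.
// Binding to 127.0.0.1 (instead of all interfaces via ":8443")
// keeps the raw endpoint reachable only from within the pod.
func (sc *serverConfig) getAddress(tlsEnabled bool) string {
	if tlsEnabled {
		return "127.0.0.1:8443"
	}
	return "127.0.0.1:8080"
}

func main() {
	sc := &serverConfig{}
	fmt.Println(sc.getAddress(true))  // prints 127.0.0.1:8443
	fmt.Println(sc.getAddress(false)) // prints 127.0.0.1:8080
}
```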
Contributor Author

Apologies for the churn on this one @perdasilva @Xia-Zhao-rh, but I've finally managed to get some time with it and come up with a solution.

We essentially need to make sure the metrics endpoints are available only on localhost (and not on any and every service endpoint), and then safeguard those endpoints using kube-rbac-proxy, the same way package-server-manager does it (it was easier to set up there because package-server-manager was built with kubebuilder, and therefore has the standard --metrics endpoint setup).

This means we need to make the change upstream first, and then pull it in downstream. With these changes, I was able to confirm the endpoints are secured from unauthorized access:

$ oc get pods -n openshift-operator-lifecycle-manager
NAME                                     READY   STATUS    RESTARTS      AGE
catalog-operator-5d8bbf7ccd-nx2mc        2/2     Running   0             6s
olm-operator-fc4fb8f95-vxlnd             2/2     Running   0             7s
package-server-manager-74cfbb9d4-6b896   2/2     Running   3 (99m ago)   112m

$ oc get svc -n openshift-operator-lifecycle-manager
NAME                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
catalog-operator-metrics   ClusterIP   172.30.19.132   <none>        8443/TCP   108m
olm-operator-metrics       ClusterIP   172.30.57.17    <none>        8443/TCP   108m

Port Mapping Details:

ports:
- name: https-metrics
  port: 8443        # External service port
  protocol: TCP
  targetPort: 9443  # kube-rbac-proxy container port

☝🏽 Services correctly route external port 8443 to kube-rbac-proxy port 9443
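
The sidecar wiring implied by this mapping can be sketched as below (the ports and upstream address match the diff and the kube-rbac-proxy logs in this thread; the certificate paths are assumptions):

```yaml
containers:
- name: kube-rbac-proxy
  args:
  - --secure-listen-address=0.0.0.0:9443   # what the Service's targetPort reaches
  - --upstream=https://127.0.0.1:8443/     # loopback-only metrics listener in the operator container
  - --tls-cert-file=/etc/tls/private/tls.crt
  - --tls-private-key-file=/etc/tls/private/tls.key
  ports:
  - containerPort: 9443
    name: https-metrics
```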

$ oc exec -n openshift-operator-lifecycle-manager catalog-operator-5d8bbf7ccd-nx2mc -c catalog-operator -- curl -k \
    https://catalog-operator-metrics.openshift-operator-lifecycle-manager.svc.cluster.local:8443/metrics
Unauthorized

$ oc exec -n openshift-operator-lifecycle-manager olm-operator-fc4fb8f95-vxlnd -c olm-operator -- curl -k \
    https://olm-operator-metrics.openshift-operator-lifecycle-manager.svc.cluster.local:8443/metrics
Unauthorized

RBAC properly configured to allow prometheus-k8s service account access:

 $ oc logs -n openshift-operator-lifecycle-manager catalog-operator-5d8bbf7ccd-nx2mc -c kube-rbac-proxy --tail=3
 I0914 16:37:40.257036    1 kube-rbac-proxy.go:347] Reading certificate files
 I0914 16:37:40.257404    1 kube-rbac-proxy.go:395] Starting TCP socket on 0.0.0.0:9443
 I0914 16:37:40.257658    1 kube-rbac-proxy.go:402] Listening securely on 0.0.0.0:9443
 
 $ oc get rolebinding operator-lifecycle-manager-metrics -n openshift-operator-lifecycle-manager -o yaml
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: openshift-monitoring

ServiceMonitor correctly configured for authenticated HTTPS scraping:

spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    port: https-metrics
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      serverName: catalog-operator-metrics.openshift-operator-lifecycle-manager.svc

I'll make a PR upstream (that'll need to set up kube-rbac-proxy for upstream too), and once that's downstreamed this PR will be ready to go.

Contributor

spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token

Integration with OCP monitoring shouldn't rely on service account's bearer token and should use mutual TLS instead (see https://rhobs-handbook.netlify.app/products/openshiftmonitoring/collecting_metrics.md/#exposing-metrics-for-prometheus for details).
As always I'm speaking for the ideal implementation and I understand that it may require several iterations to reach the final state :)

Contributor Author

@simonpasquier looks like there previously was an attempt to implement mutual TLS authentication, but it had to be reverted.

This PR did not introduce the ServiceMonitor's reliance on bearer tokens. If we need to revisit that discussion we can do so in a new PR.

anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Sep 15, 2025
Bind metrics server endpoints to 127.0.0.1 instead of all interfaces
to improve security by preventing direct external access to metrics endpoints.
This change supports downstream kube-rbac-proxy integration (see openshift/operator-framework-olm#1061 for more info)
by ensuring metrics are only accessible locally within the pod.

Signed-off-by: Anik Bhattacharjee <[email protected]>
@anik120
Contributor Author

anik120 commented Sep 17, 2025

Fyi looking into another implementation here #1111

@anik120
Contributor Author

anik120 commented Sep 23, 2025

Working on getting the implementation in #1111 ready. Closing this in favor of #1111

@anik120 anik120 closed this Sep 23, 2025
@openshift-ci-robot

@anik120: This pull request references Jira Issue OCPBUGS-59768. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.

In response to this:

Secure metrics endpoint access of both controllers using kube-rbac-proxy

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
