Conversation

anik120
Contributor

@anik120 anik120 commented Sep 3, 2025

Secure metrics endpoint access of both controllers using kube-rbac-proxy

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Sep 3, 2025
@openshift-ci-robot

@anik120: This pull request references Jira Issue OCPBUGS-59768, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Problem:
OLM metrics endpoints were exposed without proper authentication, allowing unauthorized access to sensitive operational data.

Root Cause:
kube-rbac-proxy configuration was missing critical authentication arguments:

  • --client-ca-file
  • --auth-mode=rbac

Solution:

  • Add missing authentication arguments to kube-rbac-proxy in package-server-manager
  • Mount client CA certificate from new metrics-client-ca ConfigMap
  • Update ServiceMonitor to use scrapeClass: tls-client-certificate-auth
  • Create ConfigMap with service CA injection for automatic CA bundle sync

Security Impact:

  • Before: Any HTTPS client could access metrics endpoints
  • After: Only authorized Prometheus service account (system:serviceaccount:openshift-monitoring:prometheus-k8s) can access metrics
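
The two pieces described in the Solution above can be sketched roughly as follows (the ConfigMap name and the service-CA annotation come from the PR description; the mount path is a hypothetical placeholder, and the `--auth-mode` flag is quoted as written in the PR, though the thread below later shows kube-rbac-proxy rejects it):

```yaml
# ConfigMap whose CA bundle the OpenShift service CA operator injects and rotates
apiVersion: v1
kind: ConfigMap
metadata:
  name: metrics-client-ca
  namespace: openshift-operator-lifecycle-manager
  annotations:
    service.beta.openshift.io/inject-cabundle: "true"
---
# Excerpt of the kube-rbac-proxy container args in package-server-manager
# (/etc/tls/client-ca is an assumed mount path for the ConfigMap above)
args:
- --client-ca-file=/etc/tls/client-ca/ca.crt
- --auth-mode=rbac   # as written in the PR description; later reported as an unknown flag
```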

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@dtfranz
Contributor

dtfranz commented Sep 4, 2025

/retest

@anik120
Contributor Author

anik120 commented Sep 4, 2025

Looks like the cluster failed to launch in the previous run because of infra issues.

/retest

@anik120
Contributor Author

anik120 commented Sep 8, 2025

/retest

1 similar comment
@perdasilva
Contributor

/retest

@perdasilva
Contributor

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Sep 9, 2025
@openshift-ci-robot

@perdasilva: This pull request references Jira Issue OCPBUGS-59768, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @Xia-Zhao-rh

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from Xia-Zhao-rh September 9, 2025 09:10
@perdasilva
Contributor

/approve
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 9, 2025
Contributor

openshift-ci bot commented Sep 9, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: anik120, perdasilva

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 9, 2025
@jianzhangbjz
Contributor

Hi @Xia-Zhao-rh , could you help verify it? Thanks!
/assign @Xia-Zhao-rh

@tmshort
Contributor

tmshort commented Sep 10, 2025

/retest-required

@Xia-Zhao-rh

Cluster launch failed:

level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
level=error msg=failed to initialize the cluster: Multiple errors are preventing progress:
level=error msg=* Cluster operator ingress is not available
level=error msg=* Could not update configmap "openshift-operator-lifecycle-manager/metrics-client-ca" (686 of 957): resource may have been deleted
level=error msg=* Could not update operatorgroup "openshift-monitoring/openshift-cluster-monitoring" (853 of 957): resource may have been deleted
level=error msg=* Could not update role "openshift-operator-lifecycle-manager/operator-lifecycle-manager-metrics" (933 of 957): resource may have been deleted: timed out waiting for the condition
Installer exit with code 6

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 10, 2025
@anik120
Contributor Author

anik120 commented Sep 10, 2025

@perdasilva @Xia-Zhao-rh apologies, looks like the manifest ordering wasn't correct the first time around. It looks correct now, and tests are passing.

@anik120
Contributor Author

anik120 commented Sep 10, 2025

/test unit-registry

@anik120
Contributor Author

anik120 commented Sep 10, 2025

/retest

@Xia-Zhao-rh

Xia-Zhao-rh commented Sep 11, 2025

Hi, @anik120
The package-server-manager pod is in CrashLoopBackOff.
Error message:

xzha@xzha1-mac openshift-tests-private % oc logs package-server-manager-8655984f8f-p7x6w 
Defaulted container "kube-rbac-proxy" out of: kube-rbac-proxy, package-server-manager
Error: unknown flag: --auth-mode

And the catalog-operator-metrics/olm-operator-metrics endpoints still expose all their information without any authorization:

xzha@xzha1-mac openshift-tests-private % oc get ConfigMap metrics-client-ca
NAME                DATA   AGE
metrics-client-ca   1      86m

xzha@xzha1-mac openshift-tests-private % oc exec catalog-operator-6d57dfc67c-g2sjs  -- curl -k https://10.128.0.37:8443/metrics | head
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0# HELP catalog_source_count Number of catalog sources
# TYPE catalog_source_count gauge
catalog_source_count 4
# HELP catalog_source_snapshots_total The number of times the catalog operator has requested a snapshot of data from a catalog source
# TYPE catalog_source_snapshots_total counter
catalog_source_snapshots_total{name="certified-operators",namespace="openshift-marketplace"} 0
catalog_source_snapshots_total{name="community-operators",namespace="openshift-marketplace"} 0
catalog_source_snapshots_total{name="redhat-marketplace",namespace="openshift-marketplace"} 0
catalog_source_snapshots_total{name="redhat-operators",namespace="openshift-marketplace"} 0
# HELP catalogsource_ready State of a CatalogSource. 1 indicates that the CatalogSource is in a READY state. 0 indicates CatalogSource is in a Non READY state.


xzha@xzha1-mac openshift-tests-private % oc exec olm-operator-648957f75c-s5ps7  -- curl -k https://10.128.0.36:8443/metrics | head
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9167    0  9167    0     0   263k      0 --:--:-- --:--:-- --:--:--  263k
# HELP csv_count Number of CSVs successfully registered
# TYPE csv_count gauge
csv_count 1
# HELP csv_succeeded Successful CSV install
# TYPE csv_succeeded gauge
csv_succeeded{name="packageserver",namespace="openshift-operator-lifecycle-manager",version="0.0.1-snapshot"} 1
# HELP csv_upgrade_count Monotonic count of CSV upgrades
# TYPE csv_upgrade_count counter
csv_upgrade_count 0
# HELP go_gc_duration_seconds A summary of the wall-time pause (stop-the-world) duration in garbage collection cycles.  

@perdasilva
Contributor

/hold on comment above

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 12, 2025
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 12, 2025
Contributor

openshift-ci bot commented Sep 12, 2025

New changes are detected. LGTM label has been removed.

@anik120 anik120 force-pushed the rbac-proxy-auth branch 7 times, most recently from 23458cd to 517e260 Compare September 13, 2025 20:48
@anik120 anik120 changed the title OCPBUGS-59768: metrics endpoints security - add client cert auth to kube-rbac-proxy WIP: OCPBUGS-59768: metrics endpoints security - add client cert auth to kube-rbac-proxy Sep 13, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 13, 2025
@anik120 anik120 force-pushed the rbac-proxy-auth branch 3 times, most recently from 95c7915 to d1dedf5 Compare September 13, 2025 21:29
@anik120 anik120 changed the title WIP: OCPBUGS-59768: metrics endpoints security - add client cert auth to kube-rbac-proxy WIP: OCPBUGS-59768: add kube-rbac-proxy sidecars to olm and catalog operators Sep 13, 2025
@openshift-ci-robot

@anik120: This pull request references Jira Issue OCPBUGS-59768, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @Xia-Zhao-rh

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Secure metrics endpoint access of both controllers using kube-rbac-proxy

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Contributor

openshift-ci bot commented Sep 14, 2025

@anik120: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                      Commit    Required   Rerun command
ci/prow/verify                 dabdd4f   true       /test verify
ci/prow/okd-scos-e2e-aws-ovn   dabdd4f   false      /test okd-scos-e2e-aws-ovn
ci/prow/verify-deps            dabdd4f   true       /test verify-deps

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Comment on lines 79 to 84
 func (sc *serverConfig) getAddress(tlsEnabled bool) string {
 	if tlsEnabled {
-		return ":8443"
+		return "127.0.0.1:8443"
 	}
-	return ":8080"
+	return "127.0.0.1:8080"
 }
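
Standalone, the patched helper behaves as sketched below (a minimal reproduction; the real serverConfig struct carries more fields). Binding to loopback means only processes inside the pod, such as a kube-rbac-proxy sidecar, can reach the raw metrics listener:

```go
package main

import "fmt"

// serverConfig mirrors only the shape needed for getAddress here;
// the actual struct in the operator has additional fields.
type serverConfig struct{}

// getAddress returns the listen address for the metrics server.
// Binding to 127.0.0.1 (instead of all interfaces via ":8443")
// keeps the raw endpoint reachable only from within the pod.
func (sc *serverConfig) getAddress(tlsEnabled bool) string {
	if tlsEnabled {
		return "127.0.0.1:8443"
	}
	return "127.0.0.1:8080"
}

func main() {
	sc := &serverConfig{}
	fmt.Println(sc.getAddress(true))  // prints 127.0.0.1:8443
	fmt.Println(sc.getAddress(false)) // prints 127.0.0.1:8080
}
```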
Contributor Author

Apologies for the churn on this one @perdasilva @Xia-Zhao-rh, but I've finally managed to get some time with it and come up with a solution.

We essentially need to make sure the metrics endpoints are available only on localhost (and not on any and every service endpoint), and then safeguard those endpoints using kube-rbac-proxy, the same way package-server-manager does it (it was easier to set up there because package-server-manager was built with kubebuilder, and therefore has the standard --metrics endpoint setup).

This means we need to make the change upstream first, and then pull it in downstream. With these changes, I was able to confirm the endpoints are secured from unauthorized access:

$ oc get pods -n openshift-operator-lifecycle-manager
NAME                                     READY   STATUS    RESTARTS      AGE
catalog-operator-5d8bbf7ccd-nx2mc        2/2     Running   0             6s
olm-operator-fc4fb8f95-vxlnd             2/2     Running   0             7s
package-server-manager-74cfbb9d4-6b896   2/2     Running   3 (99m ago)   112m

$ oc get svc -n openshift-operator-lifecycle-manager
NAME                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
catalog-operator-metrics   ClusterIP   172.30.19.132   <none>        8443/TCP   108m
olm-operator-metrics       ClusterIP   172.30.57.17    <none>        8443/TCP   108m

Port Mapping Details:

ports:
- name: https-metrics
  port: 8443        # External service port
  protocol: TCP
  targetPort: 9443  # kube-rbac-proxy container port

☝🏽 Services correctly route external port 8443 to kube-rbac-proxy port 9443
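
The sidecar wiring implied by this mapping can be sketched as below (the ports and upstream address match the diff and the kube-rbac-proxy logs in this thread; the certificate paths are assumptions):

```yaml
containers:
- name: kube-rbac-proxy
  args:
  - --secure-listen-address=0.0.0.0:9443   # what the Service's targetPort reaches
  - --upstream=https://127.0.0.1:8443/     # loopback-only metrics listener in the operator container
  - --tls-cert-file=/etc/tls/private/tls.crt
  - --tls-private-key-file=/etc/tls/private/tls.key
  ports:
  - containerPort: 9443
    name: https-metrics
```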

$ oc exec -n openshift-operator-lifecycle-manager catalog-operator-5d8bbf7ccd-nx2mc -c catalog-operator -- curl -k \
    https://catalog-operator-metrics.openshift-operator-lifecycle-manager.svc.cluster.local:8443/metrics
Unauthorized

$ oc exec -n openshift-operator-lifecycle-manager olm-operator-fc4fb8f95-vxlnd -c olm-operator -- curl -k \
    https://olm-operator-metrics.openshift-operator-lifecycle-manager.svc.cluster.local:8443/metrics
Unauthorized

RBAC properly configured to allow prometheus-k8s service account access:

 $ oc logs -n openshift-operator-lifecycle-manager catalog-operator-5d8bbf7ccd-nx2mc -c kube-rbac-proxy --tail=3
 I0914 16:37:40.257036    1 kube-rbac-proxy.go:347] Reading certificate files
 I0914 16:37:40.257404    1 kube-rbac-proxy.go:395] Starting TCP socket on 0.0.0.0:9443
 I0914 16:37:40.257658    1 kube-rbac-proxy.go:402] Listening securely on 0.0.0.0:9443
 
 $ oc get rolebinding operator-lifecycle-manager-metrics -n openshift-operator-lifecycle-manager -o yaml
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: openshift-monitoring

ServiceMonitor correctly configured for authenticated HTTPS scraping:

spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    port: https-metrics
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      serverName: catalog-operator-metrics.openshift-operator-lifecycle-manager.svc

I'll make a PR upstream (that'll need to set up kube-rbac-proxy for upstream too), and once that's downstreamed this PR will be ready to go.

Contributor

spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token

Integration with OCP monitoring shouldn't rely on service account's bearer token and should use mutual TLS instead (see https://rhobs-handbook.netlify.app/products/openshiftmonitoring/collecting_metrics.md/#exposing-metrics-for-prometheus for details).
As always I'm speaking for the ideal implementation and I understand that it may require several iterations to reach the final state :)

Contributor Author

@simonpasquier looks like there previously was an attempt to implement mutual TLS authentication, but it had to be reverted.

This PR did not introduce the ServiceMonitor's reliance on bearer tokens. If we need to revisit that discussion we can do so in a new PR.

anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Sep 15, 2025
Bind metrics server endpoints to 127.0.0.1 instead of all interfaces
to improve security by preventing direct external access to metrics endpoints.
This change supports downstream kube-rbac-proxy integration (see openshift/operator-framework-olm#1061 for more info)
by ensuring metrics are only accessible locally within the pod.

Signed-off-by: Anik Bhattacharjee <[email protected]>
@anik120
Contributor Author

anik120 commented Sep 17, 2025

Fyi looking into another implementation here #1111

@anik120
Contributor Author

anik120 commented Sep 23, 2025

Working on getting the implementation in #1111 ready. Closing this in favor of #1111

@anik120 anik120 closed this Sep 23, 2025
@openshift-ci-robot

@anik120: This pull request references Jira Issue OCPBUGS-59768. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.

In response to this:

Secure metrics endpoint access of both controllers using kube-rbac-proxy

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
