Skip to content

Conversation

@knobunc
Copy link

@knobunc knobunc commented Aug 11, 2025

Added the framework for network policies for the router.

The operator has a deny all network policy that for the openshift-ingress-operator namespace and an allow policy for egress to the apiserver and dns ports at any IP.

The operator installs a deny all network policy for the openshift-ingress namespace.

Then for each ingresscontroller that it manages it installs an allow policy for ingress for http and https traffic and metrics.

It has to allow ingress from the router pods to any IP because the route endpoints can be at any ip or pod.

It also needs access to the api server, but that is covered by the wildcard allow policy.

https://issues.redhat.com/browse/NE-1476

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Aug 11, 2025
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Aug 11, 2025

@knobunc: This pull request references NE-1476 which is a valid jira issue.

In response to this:

Added the framework for network policies for the router.

The operator has a deny all network policy that for the openshift-ingress-operator namespace and an allow policy for egress to the apiserver and dns ports at any IP.

The operator installs a deny all network policy for the openshift-ingress namespace.

Then for each ingresscontroller that it manages it installs an allow policy for ingress for http and https traffic and metrics.

It has to allow ingress from the router pods to any IP because the route endpoints can be at any ip or pod.

It also needs access to the api server, but that is covered by the wildcard allow policy.

https://issues.redhat.com/browse/NE-1476

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@knobunc knobunc added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 11, 2025
@knobunc knobunc force-pushed the NE-1476-Network-Policies-For-Router branch from 9a40cd0 to b8fc7f7 Compare August 11, 2025 15:36
@openshift-ci openshift-ci bot requested review from Miciah and rfredette August 11, 2025 15:39
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 11, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign rfredette for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knobunc knobunc force-pushed the NE-1476-Network-Policies-For-Router branch 2 times, most recently from b392321 to 777addf Compare August 11, 2025 16:18
@knobunc
Copy link
Author

knobunc commented Aug 11, 2025

/retest-required

@knobunc knobunc force-pushed the NE-1476-Network-Policies-For-Router branch 5 times, most recently from 71e5e0a to abb8394 Compare August 13, 2025 01:54
Added the framework for network policies for the router.

The operator has a deny all network policy that for the openshift-ingress-operator namespace and an allow policy for egress to the apiserver and dns ports at any IP.

The operator installs a deny all network policy for the openshift-ingress namespace.

Then for each ingresscontroller that it manages it installs an allow policy for ingress for http and https traffic and metrics.

It has to allow ingress from the router pods to any IP because the route endpoints can be at any ip or pod.

It also needs access to the api server, but that is covered by the wildcard allow policy.

https://issues.redhat.com/browse/NE-1476
@knobunc knobunc force-pushed the NE-1476-Network-Policies-For-Router branch from abb8394 to cf93ea5 Compare August 13, 2025 14:41
@candita
Copy link
Contributor

candita commented Aug 18, 2025

@knobunc the e2e-gcp-operator CI tests are failing with:

failed to create RBAC objects: could not create role "e2e-gcp-operator": Timeout: request did not complete within requested timeout - context deadline exceeded}

@knobunc
Copy link
Author

knobunc commented Aug 19, 2025

@candita thanks. I've been focusing on the e2e-aws-ovn tests and those are consistently failing on:
[sig-network-edge][OCPFeatureGate:GatewayAPIController][Feature:Router][apigroup:gateway.networking.k8s.io] Ensure HTTPRoute object is created [Suite:openshift/conformance/parallel]

I am pretty sure the issue is that the test is creating a pod and it is unreachable without policy, but I haven't had time to chase it all the way down yet.

@lihongan
Copy link
Contributor

cc @melvinjoseph86

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 26, 2025

@knobunc: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-gatewayapi-conformance 777addf link false /test e2e-aws-gatewayapi-conformance
ci/prow/e2e-aws-operator-techpreview cf93ea5 link false /test e2e-aws-operator-techpreview
ci/prow/e2e-aws-ovn-single-node cf93ea5 link false /test e2e-aws-ovn-single-node
ci/prow/e2e-aws-ovn-techpreview cf93ea5 link false /test e2e-aws-ovn-techpreview
ci/prow/okd-scos-e2e-aws-ovn cf93ea5 link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-gcp-ovn cf93ea5 link false /test e2e-gcp-ovn
ci/prow/e2e-azure-ovn cf93ea5 link false /test e2e-azure-ovn
ci/prow/hypershift-e2e-aks cf93ea5 link true /test hypershift-e2e-aks
ci/prow/e2e-gcp-operator cf93ea5 link true /test e2e-gcp-operator
ci/prow/e2e-aws-ovn-hypershift-conformance cf93ea5 link true /test e2e-aws-ovn-hypershift-conformance
ci/prow/e2e-aws-ovn cf93ea5 link true /test e2e-aws-ovn
ci/prow/e2e-aws-ovn-upgrade cf93ea5 link true /test e2e-aws-ovn-upgrade
ci/prow/e2e-azure-operator cf93ea5 link true /test e2e-azure-operator
ci/prow/e2e-aws-ovn-serial cf93ea5 link true /test e2e-aws-ovn-serial
ci/prow/e2e-aws-operator cf93ea5 link true /test e2e-aws-operator

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@rikatz
Copy link
Member

rikatz commented Oct 7, 2025

/assign
/cc

@openshift-ci openshift-ci bot requested a review from rikatz October 7, 2025 17:33
- ingressclasses
- networkpolicies
verbs:
- '*'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If possible, avoid adding wildcard rules. We have complaints about existing wildcards in RBAC rules, and we should avoid adding more.

- Ingress
- Egress
---
### Allow the operator needs to talk to the apiserver and dns,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean the cluster DNS service (that is, CoreDNS) here?

Suggested change
### Allow the operator needs to talk to the apiserver and dns,
### The operator needs to talk to the apiserver and cluster DNS service,

### Allow the operator needs to talk to the apiserver and dns,
### but it also needs to be able to configure external DNS and the endpoints
### are only known at run time. So it needs wildcard egress.
### Allow access to the metrics ports on the operators
Copy link
Contributor

@Miciah Miciah Oct 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### Allow access to the metrics ports on the operators
### Allow access to the secure metrics port on the operator.

resources:
- networkpolicies
verbs:
- "*"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid using a wildcard if possible.

# Ingress to http on port 80 and https on port 443 (TCP) and to metrics (1936)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
# name, namespace,labels and annotations are set at runtime
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks a bit odd not to set the namespace here when you set it in pkg/manifests/assets/gateway-api/gateway-networkpolicy-allow.yaml and pkg/manifests/assets/router/networkpolicy-deny-all.yaml.

Suppose we did support managed router deployments in other namespaces. Then you would need a deny-all policy created in the same namespace as the allow policy.

I would be all right with hard-coding the "openshift-ingress" namespace here and leaving it to either the cluster-admin or future us to handle networkpolicies for routers in other namespaces if that ever becomes something we support.

Comment on lines +17 to +18
- protocol: TCP
port: 1936
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unfortunate we need to expose the metrics port, but the same port is also the healthcheck endpoint, and so we must expose it to be used with the HostNetwork and NodePortService endpoint publishing strategy types.

Comment on lines +199 to +205
if _, _, err := r.ensureCanaryNamespaceNetworkPolicy(); err != nil {
return result, fmt.Errorf("failed to ensure canary namespace network policy: %v", err)
}

if _, _, err := r.ensureCanaryNetworkPolicy(); err != nil {
return result, fmt.Errorf("failed to ensure canary network policy: %v", err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the ordering important? That is, is it important that we not create the allow policy if we fail to create the default-deny policy?

Actually, maybe it should be the other way around: Don't create the default-deny policy unless you can create the allow policy. Would that be safer?

In any case, please use %w for wrapped errors.

Suggested change
if _, _, err := r.ensureCanaryNamespaceNetworkPolicy(); err != nil {
return result, fmt.Errorf("failed to ensure canary namespace network policy: %v", err)
}
if _, _, err := r.ensureCanaryNetworkPolicy(); err != nil {
return result, fmt.Errorf("failed to ensure canary network policy: %v", err)
}
if _, _, err := r.ensureCanaryNamespaceNetworkPolicy(); err != nil {
return result, fmt.Errorf("failed to ensure canary namespace network policy: %w", err)
}
// Only create the allow policy if the deny-all policy was created successfully.
if _, _, err := r.ensureCanaryNetworkPolicy(); err != nil {
return result, fmt.Errorf("failed to ensure canary network policy: %w", err)
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 (I did a similar question on cluster-dns-operator without realizing Miciah did it here).

One thing to consider is that Ingress Network policies are usually a fail close behavior, so once creating a netpol that selects a pod works, it should automatically deny any traffic not covered by any other rule. It is better to create the specific rule before everything, and fast fail in case the creation fails

Comment on lines +15 to +17
if np.Name != "ingress-canary" || np.Namespace != "openshift-ingress-canary" {
t.Fatalf("unexpected name/namespace: %s/%s", np.Namespace, np.Name)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using testify for these assertions.

Comment on lines +26 to +31
want := map[int32]struct{}{8888: {}, 8443: {}}
got := map[int32]struct{}{}
for _, r := range np.Spec.Ingress {
for _, p := range r.Ports {
if p.Protocol != nil && *p.Protocol == corev1.ProtocolTCP && p.Port != nil && p.Port.IntVal != 0 {
got[p.Port.IntVal] = struct{}{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

struct{} is passé. Consider using "k8s.io/apimachinery/pkg/util/sets".Set[int32].

Comment on lines +72 to +79
// Change in owner label should be detected but preserve other labels.
mutated = original.DeepCopy()
mutated.Labels["ingress.openshift.io/canary"] = "other"
if changed, updated := canaryNetworkPolicyChanged(original, mutated); !changed {
t.Fatalf("expected change detected for label diff")
} else if updated.Labels["ingress.openshift.io/canary"] != "other" {
t.Fatalf("expected owner label updated")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Despite the comment, the test logic doesn't actually check that other labels are preserved.

Comment on lines +152 to +155
// Ensure the allow NetworkPolicy for Gateway API components.
if _, _, err := r.ensureGatewayAPINetworkPolicy(ctx); err != nil {
return reconcile.Result{}, fmt.Errorf("failed to ensure gateway-api network policy: %w", err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic to create the NetworkPolicy for gateway proxies seems like it belongs in the operator's gatewayclass controller with the logic that installs the Istio control plane.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why? I think that the idea here is correct. You are watching for a Gateway and react creating its NetworkPolicy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The gatewayapi controller watches featuregates and customresourcedefinitions, and it reconciles the Gateway API CRDs.

The gatewayclass controller watches gatewayclasses, subscriptions, installplans, istios and customresourcedefinitions, and this controller also creates the Istio CR.

The Istio CR tells the OSSM Operator to install an Istiod control-plane, which is the thing that watches and reconciles gateways.

Come to think of it, it would make some sense to have Istiod create the networkpolicies. However, we are trying to fix the problem in cluster-ingress-operator, so it makes sense to have the controller that installs Istiod (which in turn creates the proxy pods) also create the networkpolicies for Istio's proxy pods.

The controllers are confusingly named, and we should refactor them at some point. Part of the history here is that we treat CRD life-cycle management on the one hand and gateway controller on the other hand as separate features. From that perspective, it makes sense to manage networkpolicies as part of the gateway-controller logic, not the CRD-life-cycle logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah ok, makes sense. That said, wouldn't make more sense to have a separate controller just for Network Policies based on Gateway, as we have for DNS as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is adding logic to create a generic NetworkPolicy that applies to all gateways that are created in the namespace, which are expected to belong to the Istio CR that the gatewayapi controller creates in the namespace. This NetworkPolicy doesn't depend on the definition of any particular Gateway. It therefore makes some sense for the same controller that creates the Istio CR to create the generic NetworkPolicy as well.

However, in the future, it would make sense to have a separate controller, which could create a NetworkPolicy for each Gateway, based off that Gateway's listeners. So I anticipate that this logic will be refactored in the future, but I think it seems reasonable for now.

if np.Labels == nil {
np.Labels = map[string]string{}
}
np.Labels[operatorcontroller.ControllerDeploymentLabel] = operatorcontroller.OpenShiftDefaultGatewayClassName
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is potentially confusing as there is not necessarily any GatewayClass with the default GatewayClass name. See

// Set the default gatewayclass name for the gatewayclass
// controller. We do not enable the gatewayclass controller,
// which otherwise would create the default gatewayclass.
// However, Istio's deployment controller also uses this
// setting, namely by reconciling gateways that specify this
// gatewayclass (irrespective of the controller name that the
// gatewayclass specifies), so we don't want to leave this at
// the default value, which is "istio". If we didn't set this,
// then our Istiod instance might try to reconcile gateways
// belonging to an unrelated Istiod instance.
"PILOT_GATEWAY_API_DEFAULT_GATEWAYCLASS_NAME": controller.OpenShiftDefaultGatewayClassName,
.

If this logic were in the gatewayclass controller, it could use the actual GatewayClass's name. Of course, it could still be confusing because you could create a GatewayClass with the openshift.io/gateway-controller/v1 controller name, and then create a second GatewayClass with the same controller name; presumably the NetworkPolicy would have only the first GatewayClass as the owner, which isn't too confusing. But then what if you delete the first GatewayClass? Do we need to update the label?

Note that we actually have a similar problem with the Istio CR.

Comment on lines +89 to +100
if cmp.Equal(current.Spec, expected.Spec, cmpopts.EquateEmpty()) && cmp.Equal(current.Labels, expected.Labels, cmpopts.EquateEmpty()) {
return false, nil
}
updated := current.DeepCopy()
updated.Spec = expected.Spec
// Merge/overwrite labels from expected; preserve any unrelated labels.
if updated.Labels == nil {
updated.Labels = map[string]string{}
}
for k, v := range expected.Labels {
updated.Labels[k] = v
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is not fixed-point. That is, you can get in an update loop where current has some unexpected label that isn't in expected. Then the function merges updated into expected, but it keeps the unexpected label, and then the cmp.Equal(current.Labels, expected.Labels, ...) detects this as a change, and you get a loop.

}

if _, _, err := r.ensureRouterNamespaceNetworkPolicy(ci); err != nil {
return fmt.Errorf("failed to ensure default namespacenetwork policy: %v", err)
Copy link
Contributor

@Miciah Miciah Oct 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return fmt.Errorf("failed to ensure default namespacenetwork policy: %v", err)
return fmt.Errorf("failed to ensure default namespace network policy: %w", err)

Comment on lines +164 to +168
if !errors.IsNotFound(err) {
return false, nil, fmt.Errorf("failed to get router network policy %s/%s: %v", np.Namespace, np.Name, err)
}
if err := r.client.Create(context.TODO(), np); err != nil {
return false, nil, fmt.Errorf("failed to create router network policy %s/%s: %v", np.Namespace, np.Name, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the "default namespace network policy", not the "router network policy". Anyway, ensureIngressController wraps these errors, so the wrapping here seems redundant.


// ensureRouterNetworkPolicy ensures the per-ingresscontroller NetworkPolicy that
// allows ingress to router pods and egress to the network exists and is up to date.
func (r *reconciler) ensureRouterNetworkPolicy(ic *operatorv1.IngressController) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this ensure function doesn't return the object or Boolean value? It doesn't strictly need to do so, so I don't object if you prefer to keep it simple, but it deviates from the established pattern for ensure methods.

Comment on lines +76 to +83
// Change in owner label should be detected but preserve other labels.
mutated = original.DeepCopy()
mutated.Labels["ingresscontroller.operator.openshift.io/owning-ingresscontroller"] = "other"
if changed, updated := routerNetworkPolicyChanged(original, mutated); !changed {
t.Fatalf("expected change detected for label diff")
} else if updated.Labels["ingresscontroller.operator.openshift.io/owning-ingresscontroller"] != "other" {
t.Fatalf("expected owner label updated")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't checking that other labels are preserved.

return true, nil
}

func gatewayAPINetworkPolicyChanged(current, expected *networkingv1.NetworkPolicy) (bool, *networkingv1.NetworkPolicy) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function name doesn't reflect much on the function behavior. I understand that it checks if 2 policies have changed, but why then it returns a new policy?

Is this the whole case as of "if it is updated, return a version with current + changes so once applying it won't conflict"?

I am wondering if long term we want to start using client.Apply with patch helper more, or state that we will always use Update on these managed objects and users cannot change it (including not adding arbitrary labels).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function name doesn't reflect much on the function behavior. I understand that it checks if 2 policies have changed, but why then it returns a new policy?

Right, current is the actual NetworkPolicy from the API server, desired is the NetworkPolicy that we want to persist, and updated is the result of merging the two.

The operator owns the object, so we use Update. That said, we do sometimes carve out exceptions (for example, for custom labels or annotations) and deliberately preserve those things.

We have a bunch of updateFoo and fooChanged functions that follow this pattern; Ben is just following the established pattern.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool, thanks! I was a bit confused at first.

return fmt.Errorf("failed to ensure cluster role binding: %v", err)
}

if err := r.ensureRouterNetworkPolicy(ci); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment on network policy made above may apply here, we probably want this policy before the default deny all

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants