Conversation

@krunaljain (Contributor) commented Jun 17, 2025

This PR adds a CFP to selectively export CiliumEndpoints, CiliumIdentities, and CiliumEndpointSlices in the Clustermesh API server. Only entities annotated with an internal annotation are exported to etcd when this "scoped-export" mode is enabled in the cilium-config.

Related: cilium/cilium#39876
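
To make the idea concrete, here is a minimal sketch of the export gate described above; the flag name, annotation key, and helper are illustrative assumptions, not the actual clustermesh-apiserver code:

```go
// Minimal sketch of the export gate described in the proposal. The flag,
// annotation key, and helper are illustrative assumptions only.
package example

// Hypothetical annotation that opts a resource into cross-cluster export.
const exportAnnotation = "clustermesh.cilium.io/export"

// Config mirrors the relevant cilium-config option.
type Config struct {
	ScopedExport bool // when false, everything is exported as today
}

// shouldExport decides whether a CiliumEndpoint, CiliumIdentity, or
// CiliumEndpointSlice is written to the kvstore (etcd).
func (c Config) shouldExport(annotations map[string]string) bool {
	if !c.ScopedExport {
		return true
	}
	return annotations[exportAnnotation] == "true"
}
```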

@ysksuzuki (Member)

cc @cilium/sig-clustermesh

@krunaljain (Contributor, Author)

Hi @ysksuzuki. The CFP is targeted at Clustermesh, but it also touches the Cilium Agent and Cilium Operator (specifically the CRD controllers for CiliumIdentity, CiliumEndpoint, and CiliumEndpointSlice). Do you know the reviewers for these components?

@ysksuzuki (Member)

Hi @krunaljain , thank you for the CFP. I think the Cluster Mesh team is sufficient as the initial reviewer. If they evaluate the proposal and determine that it has a significant impact on specific subsystems of the agent or operator, they’ll likely ask the necessary teams for additional review.

@giorio94 self-requested a review on June 18, 2025 at 06:50
@giorio94 (Member) left a comment

Thanks @krunaljain and @vakalapa for this proposal.

I totally agree that having a way to limit the amount of information exchanged via Cluster Mesh is beneficial for all those use-cases in which only a limited subset of workloads needs to communicate cross-cluster.

I've done a first pass on the proposal and left a few comments inline. Let me know your thoughts.

@marseel left a comment

Thanks for the proposal, I think it's definitely worth exploring and I would love to see an improvement in this area.

I've added one larger comment with most of my thoughts on the topic. It covers three major parts:

  • What configurations it would work with
  • What would be the user interface
  • What other pros/cons would it have, including complexity and maintenance cost

I would also be happy to jump on a call if you think it makes sense to discuss :)

@giorio94 (Member)

> I would also be happy to jump on a call if you think it makes sense to discuss :)

Happy to join as well if useful.

@krunaljain (Contributor, Author)

Thanks @marseel @giorio94. I will set up a call to get alignment on the CFP.

@krunaljain force-pushed the krunaljain/filtered-export branch 2 times, most recently from be55711 to 57152bc on June 23, 2025 at 21:17
@krunaljain (Contributor, Author)

Hi @marseel @giorio94 @MrFreezeex, I've updated the CFP as per the discussion. Let me know if you have any questions.

@giorio94 (Member)

> Hi @marseel @giorio94 @MrFreezeex, I've updated the CFP as per the discussion. Let me know if you have any questions.

Two high-level comments before I provide more detailed feedback.

  • The initial parts of the CFP still need to be updated to reflect the global namespace approach.
  • The intended network policy behavior should be formalized outside of the implementation details section. The idea being to clearly state the intended behavior so that it is easy to get a sign-off from the other involved teams. Then, the implementation section can be used to deeply analyze the details. One example of a high-level description (to be refined):

Full cross-cluster communication and network policy support is provided only if both the source (i.e., client) pod and the destination (i.e., server) pod (optionally behind a service) are part of a global namespace. Cross-cluster traffic in which at least either the source or the destination pods do not belong to a global namespace may work in absence of network policies. Cross-cluster traffic from a source pod not part of a global namespace to a destination pod matched by an ingress policy shall be dropped, unless from-world traffic is explicitly allowed by the policy, in which case it may be allowed. Similarly, cross-cluster traffic from a source pod matched by an egress policy to a destination pod not part of a global namespace shall be dropped, unless to-world traffic is explicitly allowed by the policy, in which case it may be allowed.
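
To make the description above easier to review, here is an illustrative truth table of the intended behavior, written as a Go sketch; the names and the simplification to a single boolean result are assumptions for discussion only, not datapath code:

```go
// Illustrative encoding of the intended cross-cluster policy behavior quoted
// above; a truth table for discussion, not an implementation. "World" refers
// to the reserved world entity that a policy may explicitly allow.
package example

type crossClusterTraffic struct {
	srcGlobalNS        bool // source pod is in a global namespace
	dstGlobalNS        bool // destination pod is in a global namespace
	ingressPolicyOnDst bool // an ingress policy selects the destination pod
	egressPolicyOnSrc  bool // an egress policy selects the source pod
	policyAllowsWorld  bool // the matching policy explicitly allows from/to world
}

// allowed returns whether the traffic may be forwarded under the proposal.
func (t crossClusterTraffic) allowed() bool {
	if t.srcGlobalNS && t.dstGlobalNS {
		// Full cross-cluster policy support: regular policy evaluation
		// applies (simplified to "allowed" in this sketch).
		return true
	}
	// Source outside a global namespace hitting an ingress policy on the
	// destination: dropped unless from-world traffic is explicitly allowed.
	if !t.srcGlobalNS && t.ingressPolicyOnDst {
		return t.policyAllowsWorld
	}
	// Destination outside a global namespace hit by an egress policy on the
	// source: dropped unless to-world traffic is explicitly allowed.
	if !t.dstGlobalNS && t.egressPolicyOnSrc {
		return t.policyAllowsWorld
	}
	// No policies involved: traffic may work, but is not guaranteed.
	return true
}
```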

@krunaljain (Contributor, Author)

@giorio94 updated with recommended changes

@MrFreezeex (Member) left a comment

Thanks for this new iteration!

I left a few comments inline about a few points that are not clear to me.

Also, could you include in the CFP how this would interact with the MCS-API? I think that if the namespace is not global/does not export things, the service info should not be exported either, and the ServiceExport status should have a condition to signal that (something similar to this one with an updated message should work: https://github.com/cilium/cilium/blob/53c7424657a088e6e4585ebbc072caf1b010f46c/pkg/clustermesh/mcsapi/serviceimport_controller.go#L376).
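
For illustration, a rough sketch of setting such a ServiceExport condition, assuming the mcs-api v1alpha1 types where Status.Conditions is a []metav1.Condition; the reason and message strings are placeholders, not the ones Cilium uses:

```go
// Rough sketch of signalling on the ServiceExport that a service was not
// exported because its namespace is not global. Assumes mcs-api v1alpha1
// types with Status.Conditions as []metav1.Condition; reason and message
// are placeholders.
package example

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	mcsapiv1alpha1 "sigs.k8s.io/mcs-api/pkg/apis/v1alpha1"
)

func markServiceNotExported(svcExport *mcsapiv1alpha1.ServiceExport) {
	meta.SetStatusCondition(&svcExport.Status.Conditions, metav1.Condition{
		Type:               string(mcsapiv1alpha1.ServiceExportValid),
		Status:             metav1.ConditionFalse,
		ObservedGeneration: svcExport.Generation,
		Reason:             "NamespaceNotGlobal", // placeholder reason
		Message:            "Service not exported: its namespace is not marked as a global namespace",
	})
}
```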

@giorio94 (Member) left a comment

Thanks for the updates. One more round of high-level comments inline.

@krunaljain requested a review from MrFreezeex on July 1, 2025 at 17:05
@krunaljain requested a review from MrFreezeex on July 2, 2025 at 14:16
@MrFreezeex (Member) left a comment

Thanks! This LGTM from my perspective, meaning UX-wise and the Clustermesh logic, although there are still a few threads/suggestions from @giorio94 open about connectivity & policies.

I was also wondering about possibly adding an annotation similar to service.cilium.io/shared: "false" at the namespace level, but I don't think it would be that necessary/convenient on the MCS-API side, and the global namespace annotation already seems like a great start anyway.

@MrFreezeex (Member) commented Jul 2, 2025

Ah, one more thing, but that's more of a nit: since we are talking about global/local namespaces, you could probably rephrase the mentions of "scoped export mode" as "local/global namespace". IMO that would be more understandable for people reading the CFP without the full history of the discussion.

@krunaljain (Contributor, Author)

> Ah, one more thing, but that's more of a nit: since we are talking about global/local namespaces, you could probably rephrase the mentions of "scoped export mode" as "local/global namespace". IMO that would be more understandable for people reading the CFP without the full history of the discussion.

Done

@vakalapa (Contributor) commented Jul 2, 2025

@krunaljain I do not see any mention of the service.cilium.io/shared: "false" behavior we have today. Does this mean this annotation will continue to work and its behavior is respected for Services in global namespaces? And is the service.cilium.io/global: "true" annotation a no-op for a service in either a global NS or a normal NS?

@krunaljain (Contributor, Author) commented Jul 2, 2025

> I do not see any mention of the service.cilium.io/shared: "false" behavior we have today. Does this mean this annotation will continue to work and its behavior is respected for Services in global namespaces? And is the service.cilium.io/global: "true" annotation a no-op for a service in either a global NS or a normal NS?

@vakalapa Yes, that is correct. That annotation is part of the overall global service functionality, which will only be honored in global namespaces. This is mentioned explicitly in the Goals section: "Global service functionality for resources under global namespaces".
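
To spell out how the existing service annotations would compose with the global namespace annotation, here is an illustrative sketch; the namespace annotation key is hypothetical (the CFP defines the actual one), and this is not Cilium's implementation:

```go
// Illustrative sketch of the export decision discussed above. The namespace
// annotation key is hypothetical; the service annotations are the existing
// service.cilium.io/global and service.cilium.io/shared.
package example

const (
	// Hypothetical namespace-level annotation marking a global namespace.
	annoGlobalNamespace = "clustermesh.cilium.io/global"

	annoGlobalService = "service.cilium.io/global"
	annoSharedService = "service.cilium.io/shared"
)

// shouldExportService returns true when a service would be synchronized to
// remote clusters: only services in global namespaces are considered, and
// within a global namespace the existing global/shared annotations keep
// their current meaning (shared defaults to true for global services).
func shouldExportService(nsAnnotations, svcAnnotations map[string]string) bool {
	if nsAnnotations[annoGlobalNamespace] != "true" {
		return false // local namespace: nothing is exported
	}
	if svcAnnotations[annoGlobalService] != "true" {
		return false // not a global service
	}
	// service.cilium.io/shared defaults to "true" when unset.
	return svcAnnotations[annoSharedService] != "false"
}
```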

@giorio94 (Member) left a comment

I still feel that the network policies section needs clarification.

@krunaljain force-pushed the krunaljain/filtered-export branch 4 times, most recently from 3151b80 to 3528774 on July 17, 2025 at 22:03
@joestringer (Member) left a comment

Thanks for the proposal! I can see how sharing less information could help with scalability.

I think it would be worthwhile for us to discuss the key question I suggested at the bottom here around defaults. This key question could significantly influence the "upgrade" impact that I also proposed. Users are already relying on the existing behavior of Cilium, so we'll want to carefully consider the options during upgrade, and any mitigations we might want to put in place to avoid breaking users as they transition to a Cilium version that communicates less information.

I'm also thinking about how a user might debug this configuration, and I think we'll want to put some thought into that. I suggested an impact for that below as well.


- Identity ID is exported ✅
- Labels are NOT exported ❌
- **Result**: Destination cluster knows _who_ the source is (by ID) but not _what_ it is (labels). The policy applied through the labels is not enforced.
A Member left a comment

This is assuming the source cluster knows how to route this specific Pod's traffic to the destination cluster (presumably via the specific Node the destination resides on?). Is that a given?

A Member left a comment

For the case where the destination is in a local namespace in a remote cluster, the traffic wouldn't even get routed to the destination, so the point is somewhat moot for that case.

@joestringer (Member) commented Jul 17, 2025

Throwing some ideas out here, but if the source cluster knows that the source Identity is in a local namespace and the traffic is destined to a Pod in a remote cluster in a global namespace, maybe the source cluster should just drop with a new reason something like DROPPED (Local Pod unroutable on Cluster Mesh). This would act a bit like a reverse path filter.

@krunaljain (Contributor, Author) commented Jul 18, 2025

@joestringer I think the Clustermesh prerequisite ensures there is node-level connectivity between clusters, which ensures network connectivity. The optimization to drop network traffic from local namespaces can be implemented, although it will require changes in the underlying BPF programs to consume namespace events. If there is a marked global namespace, then the logic needs to be executed, which might be a bit complex to achieve. Hence, I directly marked the use case as not supported for v1. Can this be a future goal?

A Member left a comment

> This is assuming the source cluster knows how to route this specific Pod's traffic to the destination cluster (presumably via the specific Node the destination resides on?). Is that a given?

> For the case where the destination is in a local namespace in a remote cluster, the traffic wouldn't even get routed to the destination, so the point is somewhat moot for that case.

AFAIK, routing, strictly speaking, would work in the majority of cases: either the clusters are configured in native routing mode (or ENI), in which case the network knows how to forward the packets, or they operate in tunnel mode, in which case it is likely that the IPAM is node-based and there's a fallback ipcache entry handling the routing to the CIDR associated with that node.

> Throwing some ideas out here, but if the source cluster knows that the source Identity is in a local namespace and the traffic is destined to a Pod in a remote cluster in a global namespace, maybe the source cluster should just drop with a new reason something like DROPPED (Local Pod unroutable on Cluster Mesh)

This should be relatively easy to achieve whenever there's a per-node CIDR, as it would be basically a matter of assigning a different identity to the fallback ipcache entry (currently it is world for backwards compatibility) for the nodes in remote clusters. Ingress policies in tunnel mode would require an extra ipcache lookup, though, as it is not possible to rely on the source identity conveyed via the tunnel metadata (which would be the real ID, not this fallback).

If there's no per-node CIDR, users would need to configure an entry that covers the entire cluster CIDR, which can get tricky in cloud environments if the VPC is reused for e.g., multiple clusters (I don't have much experience there on the typical configurations though).

Overall, I agree with @krunaljain that this is something for a follow-up step though.

A Member left a comment

For the idea I was thinking of, it would only be on egress. Since the destination Pod would be in a global namespace I'm assuming that its info is propagated to the source. A quick check of the upper range of the Security Identity could tell that it's part of a remote cluster. The fact that the destination is known from ipcache would imply it's in a global namespace in that cluster.

But yes, not a hard requirement; maybe a good extension on top. The problem I'm concerned about is that when users encounter problems due to this feature, how will they debug and understand what is misconfigured? There may be some other ideas for how to improve that aspect as well.

A Member left a comment

> For the idea I was thinking of, it would only be on egress. Since the destination Pod would be in a global namespace I'm assuming that its info is propagated to the source. A quick check of the upper range of the Security Identity could tell that it's part of a remote cluster. The fact that the destination is known from ipcache would imply it's in a global namespace in that cluster.

Yeah, that would work, as long as there's a way to let the datapath know that the source is not in a global namespace (e.g., via a load-time config, or through an ipcache flag). The other caveat is that it would not be reliable for special destinations such as node, ingress, and the like, as these are currently not scoped by cluster ID.

The reason I find the usage of an "unknown-clustermesh-endpoint" identity attractive is that it would allow writing a policy that allows traffic from/to unknown clustermesh entities, without conflating that with world traffic.

Good point about troubleshooting, I need to think a bit more about that.

A Member left a comment

I agree the "unknown-clustermesh-endpoint" identity concept is interesting. Even better if there is one of these per cluster. If for instance we could guarantee in many cases that each cluster has distinct Pod IP range(s), and the pod IP range(s) have a "fallback" identity that consists of just the cluster's name / labels / etc., then you could imagine in a case where you have many large clusters, some Endpoints could have a policy like "allow egress to cluster A" or "allow egress to clusters with label B" which would not require propagating knowledge about every endpoint's location to every other cluster.

A Member left a comment

Yep, that would be definitely better 👍

@krunaljain force-pushed the krunaljain/filtered-export branch 2 times, most recently from 4172470 to 3ca8536 on July 18, 2025 at 03:32
@krunaljain force-pushed the krunaljain/filtered-export branch from 3ca8536 to e9f34c6 on July 18, 2025 at 17:02
@joestringer (Member) left a comment

Just a few more minor clarifying questions.

@krunaljain force-pushed the krunaljain/filtered-export branch 2 times, most recently from 1c1c613 to 51cf3d2 on July 21, 2025 at 17:48
@krunaljain requested a review from giorio94 on July 22, 2025 at 16:28
@krunaljain (Contributor, Author)

@joestringer @giorio94 are there any additional questions/action items on this PR?

@joestringer (Member) left a comment

I'm fine with the proposal as is. Other @cilium/sig-policy members might be interested to look over the policy parts as well, but I don't think we need to block on that. Maybe we can aim to merge this around the end of the week if there's no other feedback?

@krunaljain force-pushed the krunaljain/filtered-export branch from 51cf3d2 to 61a9612 on July 22, 2025 at 21:18
@giorio94 (Member) left a comment

The proposal looks good to me as well. Thanks!

@krunaljain force-pushed the krunaljain/filtered-export branch from 61a9612 to 3ac805a on July 25, 2025 at 00:14
…and endpointslices exported to etcd

Signed-off-by: krunaljain <[email protected]>
@krunaljain force-pushed the krunaljain/filtered-export branch from 3ac805a to 64d9050 on July 25, 2025 at 17:27
@joestringer (Member)

Thanks! I'll merge in this proposal as "implementable". If there's subsequent learnings from review & implementation, we can always come back and update the proposal through another PR.
