Skip to content

Conversation

@FelixMcFelix
Copy link
Collaborator

@FelixMcFelix FelixMcFelix commented Dec 31, 2025

This PR reworks certain aspects of how NAT and gateway layers are constructed to enable attached (external) subnets to function. The primary aim here is that traffic to/from these subnets (owned by the host) do not undergo NAT, and bypass spoof detection as transit IPs would. A few wider changes have been necessary to ensure that these can be attached/detached without breaking any existing transit IPs, and to ensure that traffic originated from an external subnet cannot be directed towards a private VPC recipient.

  • To be compatible with add/remove of transit IPs, we have to store both transit/attached subnets consistently in VpcCfg rather than working directly on the port's ruleset. This also gives us the opportunity to set these on port create rather than using a followup ioctl.
  • Reconfiguring the gateway layer requires that we store DhcpCfg in VpcCfg now.
  • Keeping separate copies of VpcCfg in XdeDev and the port itself seemed... misleading, at best. When XDE needs that, we now reach through port.network().
  • StatefulActions should never have had mutable access to a packet's ActionMeta (the string-to-string map), because they may never be rerun after an LFT is created on subsequent slowpath walks. It is now up to the generated ActionDescs to update metadata -- this trait's role is similar to StaticAction, which is allowed mutable access to this metadata.
  • Use of V2B/V2P mappings is now explicitly restricted to sending traffic from external/internal IPs. I put in some rework of ActionMeta ergonomics to make this easier to express and implement.

As far as the control plane is concerned, it should need to only:

  • Attach external/VPC subnets as they arrive.
  • Ensure that for attached VPC subnets, the system router is updated s.t. that subnet has an Instance(<name>) target (in turn resolved to an IP target when reifying rules for the port, as happens with instance targets today).

Answers the functional dataplane requirements of #890. Closes #703.

@FelixMcFelix FelixMcFelix force-pushed the attached-subnets-mvp branch 7 times, most recently from 374d344 to a832697 Compare January 5, 2026 20:38
@FelixMcFelix FelixMcFelix changed the title Attached External Subnets Attached (External) Subnets Jan 5, 2026
This PR reworks certain aspects of how NAT and gateway layers are
constructed to enable attached (external) subnets to function. The
primary aim here is that traffic to/from these subnets (owned by
the host) do not undergo NAT, and bypass spoof detection as transit
IPs would. A few wider changes have been necessary to ensure that
these can be attached/detached without breaking any existing transit
IPs, and to ensure that traffic originated from an external subnet
cannot be directed towards a private VPC recipient.
use opte_api::Protocol;

pub trait Ip: Into<super::IpAddr> {
pub trait Ip: Into<super::IpAddr> + Send + Sync {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Followup from #883 -- this moves the Send + Sync bounds up so that we don't need to explicitly specify them in as many spots.

Comment on lines 563 to 568
pub fn attach_subnet(
port: &Port<VpcNetwork>,
inet_gw_map: Option<&InternetGatewayMap>,
vpc_mappings: &Arc<VpcMappings>,
req: AttachSubnetReq,
) -> Result<(), OpteError> {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a bit of uncertainty as to whether this and detach_subnet should live here, gateway, or some other module since the feature crosses multiple layers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Living in its own module seems to make the most sense to me. I would not think to come to the nat module to look for attached subnet machinery.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, moved out in f488213.

@FelixMcFelix FelixMcFelix marked this pull request as ready for review January 6, 2026 13:44
@rmustacc rmustacc requested a review from bnaecker January 8, 2026 17:10
Copy link
Contributor

@bnaecker bnaecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not looked through the entire PR yet, but this seems solid. Nice interface for attaching subnets, and some helpful cleanup along the way. Thanks!

Copy link
Contributor

@zeeshanlakhani zeeshanlakhani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one Q atm, but this is looking really good. I just want to do some realigning with the RFD before approval.

};

encap_external(inner_pkt, bsvc_phys, guest_phys)
encap_external(inner_pkt, *BSVC_PHYS, guest_phys)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this const being defined now? Or is it generated?

Copy link
Collaborator Author

@FelixMcFelix FelixMcFelix Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's defined in opte_test_utils, here:

pub static BSVC_PHYS: LazyLock<TestIpPhys> = LazyLock::new(|| TestIpPhys {
ip: BS_IP_ADDR,
mac: BS_MAC_ADDR,
vni: Vni::new(BOUNDARY_SERVICES_VNI).unwrap(),
});

@morlandi7 morlandi7 added this to the 18 milestone Jan 15, 2026
Copy link
Contributor

@zeeshanlakhani zeeshanlakhani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good at the MVP level, until we do further integrated testing.

Copy link
Contributor

@rcgoodfellow rcgoodfellow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Kyle! Comments/questions follow

/// be received/sent.
#[derive(Debug, Clone, Serialize, Deserialize, Default, Eq, PartialEq)]
pub struct AttachedSubnetConfig {
/// Denotes whether this attached subnet is an external IP block,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if it's not external, then NAT will be applied? What is that being used for?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, attached VPC subnets. The instance owns, say, 172.30.0.0/22 and sends trafffic from 172.30.2.128 to 1.1.1.1 -- NAT still needs to be applied, since we're in the VPC's view of that private address space.

The mechanics in OPTE are simple enough for this, since we just don't touch the NAT layer. I know we've punted on the control plane logic for attached VPC subnets to the next release, but I figured it'd be better to avoid more churn.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case we'd be implementing an N:1 NAT in that case where N is the size of the subnet.

I think we'll need to make sure this NAT-from-attached-vpc-subnet path is only on VPC subnets attached to the primary interface so if the guest sends non-VPC traffic out a secondary instance for whatever reason, we don't scoop that up into a NAT vortex and replys come back on the primary interface.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That tracks. I know at least today the limit on any kind of normal external IP (SNAT/EIP/FIP) is that those must be bound to the primary interface.

Comment on lines 563 to 568
pub fn attach_subnet(
port: &Port<VpcNetwork>,
inet_gw_map: Option<&InternetGatewayMap>,
vpc_mappings: &Arc<VpcMappings>,
req: AttachSubnetReq,
) -> Result<(), OpteError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Living in its own module seems to make the most sense to me. I would not think to come to the nat module to look for attached subnet machinery.

// strong isolation which some customers require.
//
// In future we want this to be a tunable property of the VPC. In this
// case we would require an extra table/poptrie per VPC, containing all
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works out nicely for attached subnets. But I think regular floating/ephemeral IPs it will be trickier for large pools since we have no subnet aggregation and the routing table will essentially be a bunch of /32s or /128s for v4/v6 respectively.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll put some of this commentary in, you're right that this hypothetical is a bit thornier for individual EIPs. I think it's a little worse in that you can't often aggregate /32s and /128s here -- the value type would be the PhysAddr or (IpAddr, VID) corresponding to the target instance. Although I wouldn't expect the table size to be any larger than the V2P map?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extended this a little in c647128.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's what I was trying to say, is that we cannot really aggregate here. Well ... we could with a clever aggregating routing protocol, but we'd need to disaggregate on migration and would continuously be recomputing aggregations as instances are created/destroyed. It seems like it'd be tricky to get right.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ActionMeta could allow direct storage/retrieval of an ActionMetaValue

6 participants