-
Notifications
You must be signed in to change notification settings - Fork 10
Attached (External) Subnets #911
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
374d344 to
a832697
Compare
a832697 to
5dc70c8
Compare
This PR reworks certain aspects of how NAT and gateway layers are constructed to enable attached (external) subnets to function. The primary aim here is that traffic to/from these subnets (owned by the host) do not undergo NAT, and bypass spoof detection as transit IPs would. A few wider changes have been necessary to ensure that these can be attached/detached without breaking any existing transit IPs, and to ensure that traffic originated from an external subnet cannot be directed towards a private VPC recipient.
5dc70c8 to
2e7cf69
Compare
| use opte_api::Protocol; | ||
|
|
||
| pub trait Ip: Into<super::IpAddr> { | ||
| pub trait Ip: Into<super::IpAddr> + Send + Sync { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Followup from #883 -- this moves the Send + Sync bounds up so that we don't need to explicitly specify them in as many spots.
lib/oxide-vpc/src/engine/nat.rs
Outdated
| pub fn attach_subnet( | ||
| port: &Port<VpcNetwork>, | ||
| inet_gw_map: Option<&InternetGatewayMap>, | ||
| vpc_mappings: &Arc<VpcMappings>, | ||
| req: AttachSubnetReq, | ||
| ) -> Result<(), OpteError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a bit of uncertainty as to whether this and detach_subnet should live here, gateway, or some other module since the feature crosses multiple layers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Living in its own module seems to make the most sense to me. I would not think to come to the nat module to look for attached subnet machinery.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, moved out in f488213.
bnaecker
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've not looked through the entire PR yet, but this seems solid. Nice interface for attaching subnets, and some helpful cleanup along the way. Thanks!
zeeshanlakhani
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one Q atm, but this is looking really good. I just want to do some realigning with the RFD before approval.
| }; | ||
|
|
||
| encap_external(inner_pkt, bsvc_phys, guest_phys) | ||
| encap_external(inner_pkt, *BSVC_PHYS, guest_phys) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this const being defined now? Or is it generated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's defined in opte_test_utils, here:
opte/lib/opte-test-utils/src/lib.rs
Lines 1000 to 1004 in eb15888
| pub static BSVC_PHYS: LazyLock<TestIpPhys> = LazyLock::new(|| TestIpPhys { | |
| ip: BS_IP_ADDR, | |
| mac: BS_MAC_ADDR, | |
| vni: Vni::new(BOUNDARY_SERVICES_VNI).unwrap(), | |
| }); |
zeeshanlakhani
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good at the MVP level, until we do further integrated testing.
rcgoodfellow
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Kyle! Comments/questions follow
| /// be received/sent. | ||
| #[derive(Debug, Clone, Serialize, Deserialize, Default, Eq, PartialEq)] | ||
| pub struct AttachedSubnetConfig { | ||
| /// Denotes whether this attached subnet is an external IP block, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if it's not external, then NAT will be applied? What is that being used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, attached VPC subnets. The instance owns, say, 172.30.0.0/22 and sends trafffic from 172.30.2.128 to 1.1.1.1 -- NAT still needs to be applied, since we're in the VPC's view of that private address space.
The mechanics in OPTE are simple enough for this, since we just don't touch the NAT layer. I know we've punted on the control plane logic for attached VPC subnets to the next release, but I figured it'd be better to avoid more churn.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case we'd be implementing an N:1 NAT in that case where N is the size of the subnet.
I think we'll need to make sure this NAT-from-attached-vpc-subnet path is only on VPC subnets attached to the primary interface so if the guest sends non-VPC traffic out a secondary instance for whatever reason, we don't scoop that up into a NAT vortex and replys come back on the primary interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That tracks. I know at least today the limit on any kind of normal external IP (SNAT/EIP/FIP) is that those must be bound to the primary interface.
lib/oxide-vpc/src/engine/nat.rs
Outdated
| pub fn attach_subnet( | ||
| port: &Port<VpcNetwork>, | ||
| inet_gw_map: Option<&InternetGatewayMap>, | ||
| vpc_mappings: &Arc<VpcMappings>, | ||
| req: AttachSubnetReq, | ||
| ) -> Result<(), OpteError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Living in its own module seems to make the most sense to me. I would not think to come to the nat module to look for attached subnet machinery.
| // strong isolation which some customers require. | ||
| // | ||
| // In future we want this to be a tunable property of the VPC. In this | ||
| // case we would require an extra table/poptrie per VPC, containing all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works out nicely for attached subnets. But I think regular floating/ephemeral IPs it will be trickier for large pools since we have no subnet aggregation and the routing table will essentially be a bunch of /32s or /128s for v4/v6 respectively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll put some of this commentary in, you're right that this hypothetical is a bit thornier for individual EIPs. I think it's a little worse in that you can't often aggregate /32s and /128s here -- the value type would be the PhysAddr or (IpAddr, VID) corresponding to the target instance. Although I wouldn't expect the table size to be any larger than the V2P map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extended this a little in c647128.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's what I was trying to say, is that we cannot really aggregate here. Well ... we could with a clever aggregating routing protocol, but we'd need to disaggregate on migration and would continuously be recomputing aggregations as instances are created/destroyed. It seems like it'd be tricky to get right.
This PR reworks certain aspects of how NAT and gateway layers are constructed to enable attached (external) subnets to function. The primary aim here is that traffic to/from these subnets (owned by the host) do not undergo NAT, and bypass spoof detection as transit IPs would. A few wider changes have been necessary to ensure that these can be attached/detached without breaking any existing transit IPs, and to ensure that traffic originated from an external subnet cannot be directed towards a private VPC recipient.
VpcCfgrather than working directly on the port's ruleset. This also gives us the opportunity to set these on port create rather than using a followup ioctl.DhcpCfginVpcCfgnow.VpcCfginXdeDevand the port itself seemed... misleading, at best. When XDE needs that, we now reach throughport.network().StatefulActions should never have had mutable access to a packet'sActionMeta(the string-to-string map), because they may never be rerun after an LFT is created on subsequent slowpath walks. It is now up to the generatedActionDescs to update metadata -- this trait's role is similar toStaticAction, which is allowed mutable access to this metadata.ActionMetaergonomics to make this easier to express and implement.As far as the control plane is concerned, it should need to only:
Instance(<name>)target (in turn resolved to an IP target when reifying rules for the port, as happens with instance targets today).Answers the functional dataplane requirements of #890. Closes #703.