-
Notifications
You must be signed in to change notification settings - Fork 2
External subnets #177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
External subnets #177
Conversation
FelixMcFelix
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Nils, I've had a look over the P4 and the test machinery. I didn't see anything on the API side to give me pause.
| # Note: this now seems silly since we have maxed out the number of stages, but | ||
| # we want to leave this check and note in place should we ever find a way to | ||
| # reduce our footprint below 20 stages. | ||
| TOFINO_STAGES=20 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yikes 😬
| // This is identical to the more generically named InstanceTarget, which should | ||
| // be used for new APIs. | ||
| #[derive( | ||
| Clone, Copy, Debug, PartialEq, Eq, Deserialize, JsonSchema, Serialize, | ||
| )] | ||
| pub struct NatTarget { | ||
| pub internal_ip: Ipv6Addr, | ||
| pub inner_mac: MacAddr, | ||
| pub vni: Vni, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this type being kept around due to versioned Open API specs? I haven't had too much interaction with that machinery yet.
| // Remove non-existent entry and be sure it fails | ||
| Ok(()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing a test here?
| // On egress, we deencapsulate the geneve packet as we do for NAT, but we don't | ||
| // rewrite the source address before forwarding the packet. | ||
| async fn test_egress(switch: &Switch, test: &ExternalTest) -> TestResult { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment confused me a bit. The first 'we' refers to the switch, but the second 'we' is the test harness I'd assume (since sidecar.p4 just decaps on egress)?
| // On ingress, we wrap the incoming packet with a geneve header and forward to | ||
| // the correct gimlet. Unlike NAT ingress, there is no manipulation of the | ||
| // target address. | ||
| async fn test_ingress(switch: &Switch, test: &ExternalTest) -> TestResult { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think NatIngress adjusts any target address, other than the destination MAC (which external subnets do too).
| pkt_src_ip: "fa00:22::1".to_string(), | ||
| pkt_src_mac: "a1:a2:a3:a4:a5:a6".to_string(), | ||
| pkt_src_port: 3333, | ||
| pkt_dst_ip: "fc00:13::1".to_string(), | ||
| pkt_dst_mac: "b1:b2:b3:b4:b5:b6".to_string(), | ||
| pkt_dst_port: 4444, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src_ip and dst_ip seem swapped here wrt. the diagram, also placing pkt_src_ip outside of the external_subnet for egress.
(I think that sidecar.p4 will just blindly decap traffic directed at the switch address anyhow? Which would explain this test passing.)
| meta.nat_ingress_tgt = target; | ||
| meta.nat_inner_mac = inner_mac; | ||
| meta.nat_geneve_vni = vni; | ||
| meta.encap_needed = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
meta.encap_needed is always equal to meta.nat_ingress_hit. Can we remove nat_ingress_hit and update the last remaining reference it has in table service to use encap_needed instead?
| table external_subnets_v4 { | ||
| key = { hdr.ipv4.dst_addr: lpm; } | ||
| actions = { forward_extsub_v4_to ; } | ||
|
|
||
| const size = EXTERNAL_SUBNETS_V4_SIZE + 1; | ||
| counters = external_subnets_v4_ctr; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly unviable suggestion if we're looking to save on stages -- do we need new tables at all (and thus multiple paths to set meta.encap_needed)? I.e., we could update NatIngress to the form:
control NatIngress (...)
{
// ...
table ingress_ipv4 {
key = {
hdr.ipv4.dst_addr : lpm;
meta.l4_dst_port : range;
}
actions = { forward_ipv4_to; }
const size = IPV4_NAT_TABLE_SIZE;
counters = ipv4_ingress_ctr;
}
// ...
}We'd then program external subnets with port ranges always set to 0–65535, and individual NAT/SNAT IPs as /32s or /128s.
The downside is that we no longer have dedicated counters for external subnets. But the reason I'm thinking of this as an approach is that there's no difference in dataplane operations applied by sidecar for these packets whether they get standard NAT or an external subnet (it's always [ingress] push encap and rewrite inner dst MAC or [egress] strip encap). I also feel as though the control plane shouldn't be allowing the pools for these concepts to overlap, so there shouldn't be any ambiguity on when to remove such table entries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good idea, and it took me a couple of minutes to figure out why it wouldn't work. The problem is that we drop any packets that arrive at the switch that aren't addressed to the switch. This is what we use the Filter control for. All packets arriving for a NATed guest have a switch port's address as its destination IP. The external subnet traffic has an address on that subnet as a destination IP. So, we either need a table containing all the allowed subnet targets or we have to allow any packet into our Ingress pipeline, and hope it's not aimed at one of our underlay addresses.
I probably could have the external_subnet table just contain the allowed subnets in the Filter control and then let the NatIngress control actually plug in the forwarding data. I'm concerned about the complexity (in dpd, omicron, or both) required to work with an external subnet mechanism that involves multiple tables, or a single ingress_ipv4 table that is populated with both external subnets and NAT ranges. I'm also not sure that ends up saving us any space in the end, but it's worth exploring in the future.
I've added both Ry and Levon as reviewers. I'm particularly interested on Ry's thoughts on the P4 (which is implemented slightly differently on sidecar-lite) and Levon's input on the API, but I'll take any feedback you have to offer.