[perf-cleanup] remove redundant tests and update mcast polling#9547
[perf-cleanup] remove redundant tests and update mcast polling#9547zeeshanlakhani merged 34 commits intomainfrom
Conversation
This PR also addresses permission models, object deletion, and error handling questions related to reserved addresses presented in @askfongjojo's testing Google Doc (default IP Pools are covered in a follow-up, stacked PR). In thinking through the *Groups* API, permission scopes, and flexibility, @rcgoodfellow mentioned this consideration: > Do we need an explicit notion of a group object at all? Or can > instances simply allocate/deallocate group IPs from pools, and there is > no explicit management of group objects. With Fleet admins having access control to create pools and link silos to a pool, we arrived at the idea of replacing the current explicit multicast group CRUD with an implicit lifecycle, where groups are created upon the first member join and deleted when the last member leaves. **Note**: Most of the PR's changes are test-related due to moving away from the explicit multicast group(s) lifecycle. Auth Model: - Discovery (fleet-scoped): - Read/list groups and list members: any authenticated user in the same fleet. - Membership (project-scoped): - Join/leave requires Instance::Modify on the specific instance. - Creation control: - Implicit group creation only when the s silo is linked to a suitable multicast pool (by name or by explicit IP in that pool). Behavior: - Implicit lifecycle: - Create on first join (idempotent); delete when last member leaves (atomic mark-for-removal, reconciler schedules cleanup). - Addressing and validation: - Implicit allocation from the s linked multicast pools. - SSM/ASM semantics enforced: - IPv4 SSM 232/8 and IPv6 ff3x::/32 - Error handling: - Reserved/invalid multicast ranges rejected at pool/range add time. API: - Primary flows: - Group-centric member management: POST/DELETE /v1/multicast-groups/{group}/members - Instance-centric join/leave: PUT/DELETE /v1/instances/{instance}/multicast-groups/{group} - Discovery endpoints remain for list/view; there is no explicit group create/update/delete. - This is a *breaking* change, but multicast is not yet enabled or available in production Key changes: - Implicit group model; groups exist while they have members. - IP pool integration for multicast allocation with silo link gating. - Simplified API centered on join/leave flows. - Add multicast_ip to the member table for responses. - For consistency, move to `Instant` type over `SystemTime` for mcast-related caches Follow-ups (stacked PRs) - [ ] Remove MVLAN from group data model. - [ ] Default IP pool support (IPv4/IPv6 Followrequire unicast/multicast). - [ ] Dendrite: use omicron-common constants for validation.
Introduce API version `VERSION_MULTICAST_IMPLICIT_LIFECYCLE_UPDATES`
(v2025120500) to support the transition from explicit to implicit
multicast group lifecycle management.
Changes in new API version:
- Groups are created implicitly when first member joins
- Groups are deleted implicitly when last member leaves
- Instance create/update accept `MulticastGroupIdentifier` (name, UUID,
or multicast IP address) instead of just `NameOrId`
- MulticastGroupMemberAdd now has optional `source_ips` for SSM
Backward compatibility (v20251120):
- Add `v20251120` module with compatibility types using `NameOrId`
- Explicit group create/update/delete endpoints marked deprecated
- Proper base64 validation for user_data via shared UserData serde helper
Also includes:
- Add version_policy to techport server for omdb compatibility
Includes: - Remove GLOP (233/8), admin-scoped (239/8), and specific reserved address (NTP, Cisco Auto-RP, PTP) restrictions from IP pool validation - Only link-local multicast (224.0.0.0/24) is now rejected (not routable) - Add ASM pool fallback when join-by-name with source_ips finds no SSM pool linked - Allow source filtering on ASM addresses (IGMPv3/MLDv2 supports this) - SSM addresses still require sources per RFC 4607 The previous restrictions were overly conservative. Customers may have legitimate use cases for GLOP (AS-based allocations), admin-scoped (organization-local multicast), and protocol-specific addresses.
This update moves source IPs from group to member for per-member source filtering.
Each member can now subscribe to different sources within the same
multicast group, i.e., [(S, G)]. The group's `source_ips` API field now shows the union of
all member source IPs.
Includes:
- Add source_ips column to multicast_group_member table
- Add underlay_salt for XOR-fold collision avoidance when mapping
external multicast IPs to admin-local IPv6 underlay addresses
- Document the mapping algorithm and add more tests
- Schema migration rename: multicast-implicit-lifecycle (v213)
- Update instance-centric join API to accept source_ips
- Remove deprecated group-centric member add/remove endpoints
- Clean up redundant comments and fix typos
…rce TODO wrt Dendrite
Includes:
- Add shared `put_upsert` helper for idempotent PUT+CREATED requests, for 201 responses
- Add pool_selection.rs tests for SSM/ASM fallback behavior
- ASM sources TODO/workaround:
- Only send sources to DPD for SSM groups (232/8 IPv4, ff3x:: IPv6)
- ASM groups get `None` for sources, meaning "any source allowed"
- Temporary fix until dendrite accepts ASM source filtering (upcoming PR)
- Schema
- Bump version 213.0. 214.0.0 (post-merge_
6cd91a1 to
e7d1504
Compare
Combine similar [nexus_tests] for mcast, while being more thoughtful on failure handling and polling.
e7d1504 to
16b8f12
Compare
This update, post-merge, further consolidates multicast integration and datastore tests and cleans up test documentation. Test perf: - We split test_dpd_failure_across_group_states into 3 parallel tests - Add ops::join* parallelization for test setup operations - Add `instance_wait_for_running_with_simulation` Test coverage: - Fix authorization test to actually test list with regular user - Add 404 test for nonexistent IP lookup
2e686e1 to
e97f1b6
Compare
110aee2 to
e87031f
Compare
9a2d686 to
2bfbfd5
Compare
Includes: - Consolidate duplicate ASM/SSM join-by-ip tests - Add activate_multicast_reconciler after restart_dendrite calls - Change POLL_INTERVAL from 80ms to 50ms"
2bfbfd5 to
be1918b
Compare
There was a problem hiding this comment.
Thanks Zeeshan. I think that the consolidated tests cover the same ground as before, I agree on the reduced focus on state machine internals in a few spots versus observable behaviour. It's not obvious to me at first glance that this solves all the test suite runtime cost (since these tests should be elided, post multicast feature flag?), but I agree with #9758 that we should get this in.
| /// - External IP at instance creation | ||
| #[nexus_test] | ||
| async fn test_multicast_with_external_ip_basic( | ||
| async fn test_multicast_external_ip_scenarios( |
There was a problem hiding this comment.
Having not reviewed #9091 so I'm lacking context, we're aiming to show here that two pretty much othogonal features aren't interacting with one another? Is there an expected case in future where they could interfere?
There was a problem hiding this comment.
This test is more of an integration smoke test to ensure these orthogonal features don't interfere with each other. While both external IPs and multicast communicate with DPD, they use separate subsystems (NAT vs multicast forwarding). The test confirms that external IP operations don't accidentally affect multicast state, and vice versa.
More specifically, since these integrations both work atop instance "actions", it verifies that external IP allocation works for instances that are multicast group members, that multicast state persists through external IP attach/detach operations, and that there is no unexpected state corruption when both features are active at the same time.
This removes some test redundancy in the multicast suite and tries to make convergence faster for integration RPW testing.
I was spurred on to tackle this by way of #9536 and #9758. While this doesn't fully restore pre-multicast-introduced test times (likely due to more complete end-to-end network testing), it does reduce tail-end outliers when I run the
histtool against this branch.I do think we need to rethink our nexus testing suite altogether, though. Nonetheless, I think this should help some.