From 896ff1f429ef54f943b384dccbe9d723b9b39a83 Mon Sep 17 00:00:00 2001
From: sivakami
Date: Sat, 22 Nov 2025 15:52:19 -0800
Subject: [PATCH 1/3] Add SwiftV2 long-running pipeline with scheduled tests

- Implemented scheduled pipeline running every 1 hour with persistent infrastructure
- Split test execution into 2 jobs: Create (with 20min wait) and Delete
- Added 8 test scenarios across 2 AKS clusters, 4 VNets, different subnets
- Implemented two-phase deletion strategy to prevent PNI ReservationInUse errors
- Added context timeouts on kubectl commands with force delete fallbacks
- Resource naming uses RG name as BUILD_ID for uniqueness across parallel setups
- Added SkipAutoDeleteTill tags to prevent automatic resource cleanup
- Conditional setup stages controlled by runSetupStages parameter
- Auto-generate RG name from location or allow custom names for parallel setups
- Added comprehensive README with setup instructions and troubleshooting
- Node selection by agentpool labels with usage tracking to prevent conflicts
- Kubernetes naming compliance (RFC 1123) for all resources

fix ginkgo flag. Add datapath tests. Delete old test file.
Add test cases for private endpoint. Ginkgo run specs only on specified files.
update pipeline params. Add ginkgo tags. Add datapath tests. Add ginkgo build tags.
remove wait time. set namespace. update pod image.
Add more nsg rules to block subnets s1 and s2. test change.
Change delegated subnet address range.
Use delegated interface for network connectivity tests. Datapath test between clusters.
test. test private endpoints. fix private endpoint tests.
Set storage account names in output var. set storage account name. fix pn names.
update pe. update pe test. update SAS token generation.
Add node labels for sw2 scenario, cleanup pods on any test failure.
enable nsg tests. update storage. Add rules to nsg.
disable private endpoint negative test.
disable public network access on storage account with private endpoint.
wait for default nsg to be created. disable negative test on private endpoint.
private endpoint depends on aks cluster vnets, change pipeline job dependencies.
Add node labels for each workload type and nic capacity. make sku constant.
Update readme, set schedule for long running cluster on test branch.
--- .pipelines/swiftv2-long-running/README.md | 661 +++++++++++++++++ .pipelines/swiftv2-long-running/pipeline.yaml | 43 +- .../scripts/create_aks.sh | 144 ++-- .../scripts/create_nsg.sh | 184 +++-- .../swiftv2-long-running/scripts/create_pe.sh | 57 +- .../scripts/create_storage.sh | 42 ++ .../scripts/create_vnets.sh | 160 ++-- .../long-running-pipeline-template.yaml | 331 ++++++++- go.mod | 18 +- go.sum | 34 +- hack/aks/Makefile | 15 +- .../swiftv2/long-running-cluster/pod.yaml | 73 ++ .../long-running-cluster/podnetwork.yaml | 15 + .../podnetworkinstance.yaml | 13 + .../integration/swiftv2/helpers/az_helpers.go | 343 +++++++++ .../swiftv2/longRunningCluster/datapath.go | 690 ++++++++++++++++++ .../datapath_connectivity_test.go | 165 +++++ .../datapath_create_test.go | 118 +++ .../datapath_delete_test.go | 117 +++ .../datapath_private_endpoint_test.go | 150 ++++ 20 files changed, 3164 insertions(+), 209 deletions(-) create mode 100644 .pipelines/swiftv2-long-running/README.md mode change 100644 => 100755 .pipelines/swiftv2-long-running/scripts/create_nsg.sh create mode 100644 test/integration/manifests/swiftv2/long-running-cluster/pod.yaml create mode 100644 test/integration/manifests/swiftv2/long-running-cluster/podnetwork.yaml create mode 100644 test/integration/manifests/swiftv2/long-running-cluster/podnetworkinstance.yaml create mode 100644 test/integration/swiftv2/helpers/az_helpers.go create mode 100644 test/integration/swiftv2/longRunningCluster/datapath.go create mode 100644 test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go create mode 100644 test/integration/swiftv2/longRunningCluster/datapath_create_test.go create mode 100644 test/integration/swiftv2/longRunningCluster/datapath_delete_test.go create mode 100644 test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go diff --git a/.pipelines/swiftv2-long-running/README.md b/.pipelines/swiftv2-long-running/README.md new file mode 100644 index 0000000000..b513dcab00 --- /dev/null +++ b/.pipelines/swiftv2-long-running/README.md @@ -0,0 +1,661 @@ +# SwiftV2 Long-Running Pipeline + +This pipeline tests SwiftV2 pod networking in a persistent environment with scheduled test runs. + +## Architecture Overview + +**Infrastructure (Persistent)**: +- **2 AKS Clusters**: aks-1, aks-2 (4 nodes each: 2 low-NIC default pool, 2 high-NIC nplinux pool) +- **4 VNets**: cx_vnet_a1, cx_vnet_a2, cx_vnet_a3 (Customer 1 with PE to storage), cx_vnet_b1 (Customer 2) +- **VNet Peerings**: vnet mesh. +- **Storage Account**: With private endpoint from cx_vnet_a1 +- **NSGs**: Restricting traffic between subnets (s1, s2) in vnet cx_vnet_a1. 
+- **Node Labels**: All nodes labeled with `workload-type` and `nic-capacity` for targeted test execution + +**Test Scenarios (8 total per workload type)**: +- Multiple pods across 2 clusters, 4 VNets, different subnets (s1, s2), and node types (low-NIC, high-NIC) +- Each test run: Create all resources → Wait 20 minutes → Delete all resources +- Tests run automatically every 1 hour via scheduled trigger + +**Multi-Stage Workload Testing**: +- Tests are organized by workload type using node label `workload-type` +- Each workload type runs as a separate stage sequentially +- Current implementation: `swiftv2-linux` (Stage: ManagedNodeDataPathTests) +- Future stages can be added for different workload types (e.g., `swiftv2-l1vhaccelnet`, `swiftv2-linuxbyon`) +- Each stage uses the same infrastructure but targets different labeled nodes + +## Pipeline Modes + +### Resource Group Naming Conventions + +The pipeline uses **strict naming conventions** for resource groups to ensure proper organization and lifecycle management: + +**1. Production Scheduled Runs (Master/Main Branch)**: +``` +Pattern: sv2-long-run- +Examples: sv2-long-run-centraluseuap, sv2-long-run-eastus, sv2-long-run-westus2 +``` +- **When to use**: Creating infrastructure for scheduled automated tests on master/main branch +- **Purpose**: Long-running persistent infrastructure for continuous validation +- **Lifecycle**: Persistent (tagged with `SkipAutoDeleteTill=2032-12-31`) +- **Example**: If running scheduled tests in Central US EUAP region, use `sv2-long-run-centraluseuap` + +**2. Test/Development/PR Validation Runs**: +``` +Pattern: sv2-long-run-$(Build.BuildId) +Examples: sv2-long-run-12345, sv2-long-run-67890 +``` +- **When to use**: Temporary testing, one-time validation, or PR testing +- **Purpose**: Short-lived infrastructure for specific test runs +- **Lifecycle**: Can be cleaned up after testing completes +- **Example**: PR validation run with Build ID 12345 → `sv2-long-run-12345` + +**3. Parallel/Custom Environments**: +``` +Pattern: sv2-long-run-- +Examples: sv2-long-run-centraluseuap-dev, sv2-long-run-eastus-staging +``` +- **When to use**: Parallel environments, feature testing, version upgrades +- **Purpose**: Isolated environment alongside production +- **Lifecycle**: Persistent or temporary based on use case +- **Example**: Development environment in Central US EUAP → `sv2-long-run-centraluseuap-dev` + +**Important Notes**: +- ⚠️ Always follow the naming pattern for scheduled runs on master: `sv2-long-run-` +- ⚠️ Do not use build IDs for production scheduled infrastructure (it breaks continuity) +- ⚠️ Region name should match the `location` parameter for consistency +- ✅ All resource names within the setup use the resource group name as BUILD_ID prefix + +### Mode 1: Scheduled Test Runs (Default) +**Trigger**: Automated cron schedule every 1 hour +**Purpose**: Continuous validation of long-running infrastructure +**Setup Stages**: Disabled +**Test Duration**: ~30-40 minutes per run +**Resource Group**: Static (default: `sv2-long-run-`, e.g., `sv2-long-run-centraluseuap`) + +```yaml +# Runs automatically every 1 hour +# No manual/external triggers allowed +``` + +### Mode 2: Initial Setup or Rebuild +**Trigger**: Manual run with parameter change +**Purpose**: Create new infrastructure or rebuild existing +**Setup Stages**: Enabled via `runSetupStages: true` +**Resource Group**: Must follow naming conventions (see below) + +**To create new infrastructure for scheduled runs on master branch**: +1. 
Go to Pipeline → Run pipeline +2. Set `runSetupStages` = `true` +3. Set `resourceGroupName` = `sv2-long-run-` (e.g., `sv2-long-run-centraluseuap`) + - **Critical**: Use this exact naming pattern for production scheduled tests + - Region should match the `location` parameter +4. Optionally adjust `location` to match your resource group name +5. Run pipeline + +**To create new infrastructure for testing/development**: +1. Go to Pipeline → Run pipeline +2. Set `runSetupStages` = `true` +3. Set `resourceGroupName` = `sv2-long-run-$(Build.BuildId)` or custom name + - For temporary testing: Use build ID pattern for auto-cleanup + - For parallel environments: Use descriptive suffix (e.g., `sv2-long-run-centraluseuap-dev`) +4. Optionally adjust `location` +5. Run pipeline + +## Pipeline Parameters + +Parameters are organized by usage: + +### Common Parameters (Always Relevant) +| Parameter | Default | Description | +|-----------|---------|-------------| +| `location` | `centraluseuap` | Azure region for resources. Auto-generates RG name: `sv2-long-run-`. | +| `runSetupStages` | `false` | Set to `true` to create new infrastructure. `false` for scheduled test runs. | +| `subscriptionId` | `37deca37-...` | Azure subscription ID. | +| `serviceConnection` | `Azure Container Networking...` | Azure DevOps service connection. | + +### Setup-Only Parameters (Only Used When runSetupStages=true) + +| Parameter | Default | Description | +|-----------|---------|-------------| +| `resourceGroupName` | `""` (empty) | **Leave empty** to auto-generate based on usage pattern. See Resource Group Naming Conventions below. | + +**Resource Group Naming Conventions**: +- **For scheduled runs on master/main branch**: Use `sv2-long-run-` (e.g., `sv2-long-run-centraluseuap`) + - This ensures consistent naming for production scheduled tests + - Example: Creating infrastructure in `centraluseuap` for scheduled runs → `sv2-long-run-centraluseuap` +- **For test/dev runs or PR validation**: Use `sv2-long-run-$(Build.BuildId)` + - Auto-cleanup after testing + - Example: `sv2-long-run-12345` (where 12345 is the build ID) +- **For parallel environments**: Use descriptive suffix (e.g., `sv2-long-run-centraluseuap-dev`, `sv2-long-run-eastus-staging`) + +**Note**: VM SKUs are hardcoded as constants in the pipeline template: +- Default nodepool: `Standard_D4s_v3` (low-nic capacity, 1 NIC) +- NPLinux nodepool: `Standard_D16s_v3` (high-nic capacity, 7 NICs) + +Setup-only parameters are ignored when `runSetupStages=false` (scheduled runs). + +## Pipeline Stage Organization + +The pipeline is organized into stages based on workload type, allowing sequential testing of different node configurations using the same infrastructure. + +### Stage 1: AKS Cluster and Networking Setup (Conditional) +**Runs when**: `runSetupStages=true` +**Purpose**: One-time infrastructure creation +**Creates**: AKS clusters, VNets, peerings, storage accounts, NSGs, private endpoints, node labels + +### Stage 2: ManagedNodeDataPathTests (Current) +**Workload Type**: `swiftv2-linux` +**Node Label Filter**: `workload-type=swiftv2-linux` +**Jobs**: +1. Create Test Resources (8 pod scenarios) +2. Connectivity Tests (9 test cases) +3. Private Endpoint Tests (5 test cases) +4. 
Delete Test Resources (cleanup) + +**Node Selection**: +- Tests automatically filter nodes by `workload-type=swiftv2-linux` AND `nic-capacity` labels +- Environment variable `WORKLOAD_TYPE=swiftv2-linux` is set for this stage +- Ensures tests only run on nodes designated for this workload type + +### Future Stages (Planned Architecture) +Additional stages can be added to test different workload types sequentially: + +**Example: Stage 3 - BYONodeDataPathTests** +```yaml +- stage: BYONodeDataPathTests + displayName: "SwiftV2 Data Path Tests - BYO Node ID" + dependsOn: ManagedNodeDataPathTests + variables: + WORKLOAD_TYPE: "swiftv2-byonodeid" + # Same job structure as ManagedNodeDataPathTests + # Tests run on nodes labeled: workload-type=swiftv2-byonodeid +``` + +**Example: Stage 4 - WindowsNodeDataPathTests** +```yaml +- stage: WindowsNodeDataPathTests + displayName: "SwiftV2 Data Path Tests - Windows Nodes" + dependsOn: BYONodeDataPathTests + variables: + WORKLOAD_TYPE: "swiftv2-windows" + # Same job structure + # Tests run on nodes labeled: workload-type=swiftv2-windows +``` + +**Benefits of Stage-Based Architecture**: +- ✅ Sequential execution: Each workload type tested independently +- ✅ Isolated node pools: No resource contention between workload types +- ✅ Same infrastructure: All stages use the same VNets, storage, NSGs +- ✅ Same test suite: Connectivity and private endpoint tests run for each workload type +- ✅ Easy extensibility: Add new stages without modifying existing ones +- ✅ Clear results: Separate test results per workload type + +**Node Labeling for Multiple Workload Types**: +Each node pool gets labeled with its designated workload type during setup: +```bash +# During cluster creation or node pool addition: +kubectl label nodes -l agentpool=nodepool1 workload-type=swiftv2-linux +kubectl label nodes -l agentpool=byonodepool workload-type=swiftv2-byonodeid +kubectl label nodes -l agentpool=winnodepool workload-type=swiftv2-windows +``` + +## How It Works + +### Scheduled Test Flow +Every 1 hour, the pipeline: +1. Skips setup stages (infrastructure already exists) +2. **Job 1 - Create Resources**: Creates 8 test scenarios (PodNetwork, PNI, Pods with HTTP servers on port 8080) +3. **Job 2 - Connectivity Tests**: Tests HTTP connectivity between pods (9 test cases), then waits 20 minutes +4. **Job 3 - Private Endpoint Tests**: Tests private endpoint access and tenant isolation (5 test cases) +5. **Job 4 - Delete Resources**: Deletes all test resources (Phase 1: Pods, Phase 2: PNI/PN/Namespaces) +6. 
Reports results + +**Connectivity Tests (9 scenarios)**: + +| Test | Source → Destination | Expected Result | Purpose | +|------|---------------------|-----------------|---------| +| SameVNetSameSubnet | pod-c1-aks1-a1s2-low → pod-c1-aks1-a1s2-high | ✓ Success | Basic connectivity in same subnet | +| NSGBlocked_S1toS2 | pod-c1-aks1-a1s1-low → pod-c1-aks1-a1s2-high | ✗ Blocked | NSG rule blocks s1→s2 in cx_vnet_a1 | +| NSGBlocked_S2toS1 | pod-c1-aks1-a1s2-low → pod-c1-aks1-a1s1-low | ✗ Blocked | NSG rule blocks s2→s1 (bidirectional) | +| DifferentVNetSameCustomer | pod-c1-aks1-a2s1-high → pod-c1-aks2-a2s1-low | ✓ Success | Cross-cluster, same customer VNet | +| PeeredVNets | pod-c1-aks1-a1s2-low → pod-c1-aks1-a2s1-high | ✓ Success | Peered VNets (a1 ↔ a2) | +| PeeredVNets_A2toA3 | pod-c1-aks1-a2s1-high → pod-c1-aks2-a3s1-high | ✓ Success | Peered VNets across clusters | +| DifferentCustomers_A1toB1 | pod-c1-aks1-a1s2-low → pod-c2-aks2-b1s1-low | ✗ Blocked | Customer isolation (C1 → C2) | +| DifferentCustomers_A2toB1 | pod-c1-aks1-a2s1-high → pod-c2-aks2-b1s1-high | ✗ Blocked | Customer isolation (C1 → C2) | + +**Test Results**: 4 should succeed, 5 should be blocked (3 NSG rules + 2 customer isolation) + +**Private Endpoint Tests (5 scenarios)**: + +| Test | Source → Destination | Expected Result | Purpose | +|------|---------------------|-----------------|---------| +| TenantA_VNetA1_S1_to_StorageA | pod-c1-aks1-a1s1-low → Storage-A | ✓ Success | Tenant A pod can access Storage-A via private endpoint | +| TenantA_VNetA1_S2_to_StorageA | pod-c1-aks1-a1s2-low → Storage-A | ✓ Success | Tenant A pod can access Storage-A via private endpoint | +| TenantA_VNetA2_to_StorageA | pod-c1-aks1-a2s1-high → Storage-A | ✓ Success | Tenant A pod from peered VNet can access Storage-A | +| TenantA_VNetA3_to_StorageA | pod-c1-aks2-a3s1-high → Storage-A | ✓ Success | Tenant A pod from different cluster can access Storage-A | +| TenantB_to_StorageA_Isolation | pod-c2-aks2-b1s1-low → Storage-A | ✗ Blocked | Tenant B pod CANNOT access Storage-A (tenant isolation) | + +**Test Results**: 4 should succeed, 1 should be blocked (tenant isolation) + +## Test Case Details + +### 8 Pod Scenarios (Created in Job 1) + +All test scenarios create the following resources: +- **PodNetwork**: Defines the network configuration for a VNet/subnet combination +- **PodNetworkInstance**: Instance-level configuration with IP allocation +- **Pod**: Test pod running nicolaka/netshoot with HTTP server on port 8080 + +| # | Scenario | Cluster | VNet | Subnet | Node Type | Pod Name | Purpose | +|---|----------|---------|------|--------|-----------|----------|---------| +| 1 | Customer2-AKS2-VnetB1-S1-LowNic | aks-2 | cx_vnet_b1 | s1 | low-nic | pod-c2-aks2-b1s1-low | Tenant B pod for isolation testing | +| 2 | Customer2-AKS2-VnetB1-S1-HighNic | aks-2 | cx_vnet_b1 | s1 | high-nic | pod-c2-aks2-b1s1-high | Tenant B pod on high-NIC node | +| 3 | Customer1-AKS1-VnetA1-S1-LowNic | aks-1 | cx_vnet_a1 | s1 | low-nic | pod-c1-aks1-a1s1-low | Tenant A pod in NSG-protected subnet | +| 4 | Customer1-AKS1-VnetA1-S2-LowNic | aks-1 | cx_vnet_a1 | s2 | low-nic | pod-c1-aks1-a1s2-low | Tenant A pod for NSG isolation test | +| 5 | Customer1-AKS1-VnetA1-S2-HighNic | aks-1 | cx_vnet_a1 | s2 | high-nic | pod-c1-aks1-a1s2-high | Tenant A pod on high-NIC node | +| 6 | Customer1-AKS1-VnetA2-S1-HighNic | aks-1 | cx_vnet_a2 | s1 | high-nic | pod-c1-aks1-a2s1-high | Tenant A pod in peered VNet | +| 7 | Customer1-AKS2-VnetA2-S1-LowNic | aks-2 | cx_vnet_a2 | s1 | 
low-nic | pod-c1-aks2-a2s1-low | Cross-cluster same VNet test | +| 8 | Customer1-AKS2-VnetA3-S1-HighNic | aks-2 | cx_vnet_a3 | s1 | high-nic | pod-c1-aks2-a3s1-high | Private endpoint access test | + +### Connectivity Tests (9 Test Cases in Job 2) + +Tests HTTP connectivity between pods using curl with 5-second timeout: + +**Expected to SUCCEED (4 tests)**: + +| Test | Source → Destination | Validation | Purpose | +|------|---------------------|------------|---------| +| SameVNetSameSubnet | pod-c1-aks1-a1s2-low → pod-c1-aks1-a1s2-high | HTTP 200 | Basic same-subnet connectivity | +| DifferentVNetSameCustomer | pod-c1-aks1-a2s1-high → pod-c1-aks2-a2s1-low | HTTP 200 | Cross-cluster, same VNet (a2) | +| PeeredVNets | pod-c1-aks1-a1s2-low → pod-c1-aks1-a2s1-high | HTTP 200 | VNet peering (a1 ↔ a2) | +| PeeredVNets_A2toA3 | pod-c1-aks1-a2s1-high → pod-c1-aks2-a3s1-high | HTTP 200 | VNet peering across clusters | + +**Expected to FAIL (5 tests)**: + +| Test | Source → Destination | Expected Error | Purpose | +|------|---------------------|----------------|---------| +| NSGBlocked_S1toS2 | pod-c1-aks1-a1s1-low → pod-c1-aks1-a1s2-high | Connection timeout | NSG blocks s1→s2 in cx_vnet_a1 | +| NSGBlocked_S2toS1 | pod-c1-aks1-a1s2-low → pod-c1-aks1-a1s1-low | Connection timeout | NSG blocks s2→s1 (bidirectional) | +| DifferentCustomers_A1toB1 | pod-c1-aks1-a1s2-low → pod-c2-aks2-b1s1-low | Connection timeout | Customer isolation (no peering) | +| DifferentCustomers_A2toB1 | pod-c1-aks1-a2s1-high → pod-c2-aks2-b1s1-high | Connection timeout | Customer isolation (no peering) | +| UnpeeredVNets_A3toB1 | pod-c1-aks2-a3s1-high → pod-c2-aks2-b1s1-low | Connection timeout | No peering between a3 and b1 | + +**NSG Rules Configuration**: +- cx_vnet_a1 has NSG rules blocking traffic between s1 and s2 subnets: + - Deny outbound from s1 to s2 (priority 100) + - Deny inbound from s1 to s2 (priority 110) + - Deny outbound from s2 to s1 (priority 100) + - Deny inbound from s2 to s1 (priority 110) + +### Private Endpoint Tests (5 Test Cases in Job 3) + +Tests access to Azure Storage Account via Private Endpoint with public network access disabled: + +**Expected to SUCCEED (4 tests)**: + +| Test | Source → Storage | Validation | Purpose | +|------|-----------------|------------|---------| +| TenantA_VNetA1_S1_to_StorageA | pod-c1-aks1-a1s1-low → Storage-A | Blob download via SAS | Access via private endpoint from VNet A1 | +| TenantA_VNetA1_S2_to_StorageA | pod-c1-aks1-a1s2-low → Storage-A | Blob download via SAS | Access via private endpoint from VNet A1 | +| TenantA_VNetA2_to_StorageA | pod-c1-aks1-a2s1-high → Storage-A | Blob download via SAS | Access via peered VNet (A2 peered with A1) | +| TenantA_VNetA3_to_StorageA | pod-c1-aks2-a3s1-high → Storage-A | Blob download via SAS | Access via peered VNet from different cluster | + +**Expected to FAIL (1 test)**: + +| Test | Source → Storage | Expected Error | Purpose | +|------|-----------------|----------------|---------| +| TenantB_to_StorageA_Isolation | pod-c2-aks2-b1s1-low → Storage-A | Connection timeout/failed | Tenant isolation - no private endpoint access, public blocked | + +**Private Endpoint Configuration**: +- Private endpoint created in cx_vnet_a1 subnet 'pe' +- Private DNS zone `privatelink.blob.core.windows.net` linked to: + - cx_vnet_a1, cx_vnet_a2, cx_vnet_a3 (Tenant A VNets) + - aks-1 and aks-2 cluster VNets +- Storage Account 1 (Tenant A): + - Public network access: **Disabled** + - Shared key access: Disabled (Azure AD only) + - Blob public 
access: Disabled +- Storage Account 2 (Tenant B): Public access enabled (for future tests) + +**Test Flow**: +1. DNS resolution: Storage FQDN resolves to private IP for Tenant A, fails/public IP for Tenant B +2. Generate SAS token: Azure AD authentication via management plane +3. Download blob: Using curl with SAS token via data plane +4. Validation: Verify blob content matches expected value + +### Resource Creation Patterns + +**Naming Convention**: +``` +BUILD_ID = + +PodNetwork: pn--- +PodNetworkInstance: pni--- +Namespace: pn--- +Pod: pod- +``` + +**Example** (for `resourceGroupName=sv2-long-run-centraluseuap`): +``` +pn-sv2-long-run-centraluseuap-a1-s1 +pni-sv2-long-run-centraluseuap-a1-s1 +pn-sv2-long-run-centraluseuap-a1-s1 (namespace) +pod-c1-aks1-a1s1-low +``` + +**VNet Name Simplification**: +- `cx_vnet_a1` → `a1` +- `cx_vnet_a2` → `a2` +- `cx_vnet_a3` → `a3` +- `cx_vnet_b1` → `b1` + +### Setup Flow (When runSetupStages = true) +1. Create resource group with `SkipAutoDeleteTill=2032-12-31` tag +2. Create 2 AKS clusters with 2 node pools each (tagged for persistence) +3. Create 4 customer VNets with subnets and delegations (tagged for persistence) +4. Create VNet peerings +5. Create storage accounts with persistence tags +6. Create NSGs for subnet isolation +7. Run initial test (create → wait → delete) + +**All infrastructure resources are tagged with `SkipAutoDeleteTill=2032-12-31`** to prevent automatic cleanup by Azure subscription policies. + +## Resource Naming + +All test resources use the pattern: `-static-setup--` + +**Examples**: +- PodNetwork: `pn-static-setup-a1-s1` +- PodNetworkInstance: `pni-static-setup-a1-s1` +- Pod: `pod-c1-aks1-a1s1-low` +- Namespace: `pn-static-setup-a1-s1` + +VNet names are simplified: +- `cx_vnet_a1` → `a1` +- `cx_vnet_b1` → `b1` + +## Switching to a New Setup + +**Scenario**: You created a new setup in RG `sv2-long-run-eastus` and want scheduled runs to use it. + +**Steps**: +1. Go to Pipeline → Edit +2. Update location parameter default value: + ```yaml + - name: location + default: "centraluseuap" # Change this + ``` +3. Save and commit +4. RG name will automatically become `sv2-long-run-centraluseuap` + +Alternatively, manually trigger with the new location or override `resourceGroupName` directly. + +## Creating Multiple Test Setups + +**Use Case**: You want to create a new test environment without affecting the existing one (e.g., for testing different configurations, regions, or versions). + +**Steps**: +1. Go to Pipeline → Run pipeline +2. Set `runSetupStages` = `true` +3. **Set `resourceGroupName`** based on usage: + - **For scheduled runs on master/main branch**: `sv2-long-run-` (e.g., `sv2-long-run-centraluseuap`, `sv2-long-run-eastus`) + - Use this naming pattern for production scheduled tests + - **For test/dev runs**: `sv2-long-run-$(Build.BuildId)` or custom (e.g., `sv2-long-run-12345`) + - For temporary testing or PR validation + - **For parallel environments**: Custom with descriptive suffix (e.g., `sv2-long-run-centraluseuap-dev`, `sv2-long-run-centraluseuap-v2`) +4. Optionally adjust `location` +5. 
Run pipeline + +**After setup completes**: +- The new infrastructure will be tagged with `SkipAutoDeleteTill=2032-12-31` +- Resources are isolated by the unique resource group name +- To run tests against the new setup, the scheduled pipeline would need to be updated with the new RG name + +**Example Scenarios**: +| Scenario | Resource Group Name | Purpose | Naming Pattern | +|----------|-------------------|---------|----------------| +| Production scheduled (Central US EUAP) | `sv2-long-run-centraluseuap` | Daily scheduled tests on master | `sv2-long-run-` | +| Production scheduled (East US) | `sv2-long-run-eastus` | Regional scheduled testing on master | `sv2-long-run-` | +| Temporary test run | `sv2-long-run-12345` | One-time testing (Build ID: 12345) | `sv2-long-run-$(Build.BuildId)` | +| Development environment | `sv2-long-run-centraluseuap-dev` | Development/testing | Custom with suffix | +| Version upgrade testing | `sv2-long-run-centraluseuap-v2` | Parallel environment for upgrades | Custom with suffix | + +## Resource Naming + instead of ping use +The pipeline uses the **resource group name as the BUILD_ID** to ensure unique resource names per test setup. This allows multiple parallel test environments without naming collisions. + +**Generated Resource Names**: +``` +BUILD_ID = + +PodNetwork: pn--- +PodNetworkInstance: pni--- +Namespace: pn--- +Pod: pod- +``` + +**Example for `resourceGroupName=sv2-long-run-centraluseuap`**: +``` +pn-sv2-long-run-centraluseuap-b1-s1 (PodNetwork for cx_vnet_b1, subnet s1) +pni-sv2-long-run-centraluseuap-b1-s1 (PodNetworkInstance) +pn-sv2-long-run-centraluseuap-a1-s1 (PodNetwork for cx_vnet_a1, subnet s1) +pni-sv2-long-run-centraluseuap-a1-s2 (PodNetworkInstance for cx_vnet_a1, subnet s2) +``` + +**Example for different setup `resourceGroupName=sv2-long-run-eastus`**: +``` +pn-sv2-long-run-eastus-b1-s1 (Different from centraluseuap setup) +pni-sv2-long-run-eastus-b1-s1 +pn-sv2-long-run-eastus-a1-s1 +``` + +This ensures **no collision** between different test setups running in parallel. + +## Deletion Strategy +### Phase 1: Delete All Pods +Deletes all pods across all scenarios first. This ensures IP reservations are released. + +``` +Deleting pod pod-c2-aks2-b1s1-low... +Deleting pod pod-c2-aks2-b1s1-high... +... +``` + +### Phase 2: Delete Shared Resources +Groups resources by vnet/subnet/cluster and deletes PNI/PN/Namespace once per group. + +``` +Deleting PodNetworkInstance pni-static-setup-b1-s1... +Deleting PodNetwork pn-static-setup-b1-s1... +Deleting namespace pn-static-setup-b1-s1... +``` + +**Why**: Multiple pods can share the same PNI. Deleting PNI while pods exist causes "ReservationInUse" errors. 
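
As a reference, here is a minimal sketch of the same two-phase cleanup run by hand against a single vnet/subnet group. It assumes the `sv2-long-run-centraluseuap` setup, the kubeconfig path written by `create_aks.sh` (`/tmp/aks-1.kubeconfig`), and that the SwiftV2 CRDs are addressable as `podnetwork` (cluster-scoped) and `podnetworkinstance` (namespaced); adjust names for your setup:

```bash
# Phase 1: delete every pod that holds an IP reservation on the shared PNI.
kubectl --kubeconfig /tmp/aks-1.kubeconfig \
  -n pn-sv2-long-run-centraluseuap-a1-s1 delete pod pod-c1-aks1-a1s1-low --timeout=60s

# Phase 2: once all pods in the group are gone, delete the shared PNI, PodNetwork, and namespace.
kubectl --kubeconfig /tmp/aks-1.kubeconfig -n pn-sv2-long-run-centraluseuap-a1-s1 \
  delete podnetworkinstance pni-sv2-long-run-centraluseuap-a1-s1
kubectl --kubeconfig /tmp/aks-1.kubeconfig delete podnetwork pn-sv2-long-run-centraluseuap-a1-s1
kubectl --kubeconfig /tmp/aks-1.kubeconfig delete namespace pn-sv2-long-run-centraluseuap-a1-s1

# If a PNI is stuck in ReservationInUse after its pods are gone, clearing its finalizers unblocks deletion.
kubectl --kubeconfig /tmp/aks-1.kubeconfig -n pn-sv2-long-run-centraluseuap-a1-s1 \
  patch podnetworkinstance pni-sv2-long-run-centraluseuap-a1-s1 \
  --type=merge -p '{"metadata":{"finalizers":[]}}'
```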
+ +## Troubleshooting + +### Tests are running on wrong cluster +- Check `resourceGroupName` parameter points to correct RG +- Verify RG contains aks-1 and aks-2 clusters +- Check kubeconfig retrieval in logs + +### Setup stages not running +- Verify `runSetupStages` parameter is set to `true` +- Check condition: `condition: eq(${{ parameters.runSetupStages }}, true)` + +### Schedule not triggering +- Verify cron expression: `"0 */1 * * *"` (every 1 hour) +- Check branch in schedule matches your working branch +- Ensure `always: true` is set (runs even without code changes) + +### PNI stuck with "ReservationInUse" +- Check if pods were deleted first (Phase 1 logs) +- Manual fix: Delete pod → Wait 10s → Patch PNI to remove finalizers + +### Pipeline timeout after 6 hours +- This is expected behavior (timeoutInMinutes: 360) +- Tests should complete in ~30-40 minutes +- If tests hang, check deletion logs for stuck resources + +## Manual Testing + +Run locally against existing infrastructure: + +```bash +export RG="sv2-long-run-centraluseuap" # Match your resource group +export BUILD_ID="$RG" # Use same RG name as BUILD_ID for unique resource names + +cd test/integration/swiftv2/longRunningCluster +ginkgo -v -trace --timeout=6h . +``` + +## Node Pool Configuration + +### Node Labels and Architecture + +All nodes in the clusters are labeled with two key labels for workload identification and NIC capacity. These labels are applied during cluster creation by the `create_aks.sh` script. + +**1. Workload Type Label** (`workload-type`): +- Purpose: Identifies which test scenario group the node belongs to +- Current value: `swiftv2-linux` (applied to all nodes in current setup) +- Applied during: Cluster creation in Stage 1 (AKSClusterAndNetworking) +- Applied by: `.pipelines/swiftv2-long-running/scripts/create_aks.sh` +- Future use: Supports multiple workload types running as separate stages (e.g., `swiftv2-windows`, `swiftv2-byonodeid`) +- Stage isolation: Each test stage uses `WORKLOAD_TYPE` environment variable to filter nodes + +**2. 
NIC Capacity Label** (`nic-capacity`): +- Purpose: Identifies the NIC capacity tier of the node +- Applied during: Cluster creation in Stage 1 (AKSClusterAndNetworking) +- Applied by: `.pipelines/swiftv2-long-running/scripts/create_aks.sh` +- Values: + - `low-nic`: Default nodepool (nodepool1) with `Standard_D4s_v3` (1 NIC) + - `high-nic`: NPLinux nodepool (nplinux) with `Standard_D16s_v3` (7 NICs) + +**Label Application in create_aks.sh**: +```bash +# Step 1: All nodes get workload-type label +kubectl label nodes --all workload-type=swiftv2-linux --overwrite + +# Step 2: Default nodepool gets low-nic capacity label +kubectl label nodes -l agentpool=nodepool1 nic-capacity=low-nic --overwrite + +# Step 3: NPLinux nodepool gets high-nic capacity label +kubectl label nodes -l agentpool=nplinux nic-capacity=high-nic --overwrite +``` + +**Example Node Labels**: +```yaml +# Low-NIC node (nodepool1) +labels: + agentpool: nodepool1 + workload-type: swiftv2-linux + nic-capacity: low-nic + +# High-NIC node (nplinux) +labels: + agentpool: nplinux + workload-type: swiftv2-linux + nic-capacity: high-nic +``` + +### Node Selection in Tests + +Tests use these labels to select appropriate nodes dynamically: +- **Function**: `GetNodesByNicCount()` in `test/integration/swiftv2/longRunningCluster/datapath.go` +- **Filtering**: Nodes filtered by BOTH `workload-type` AND `nic-capacity` labels +- **Environment Variable**: `WORKLOAD_TYPE` (set by each test stage) determines which nodes are used + - Current: `WORKLOAD_TYPE=swiftv2-linux` in ManagedNodeDataPathTests stage + - Future: Different values for each stage (e.g., `swiftv2-byonodeid`, `swiftv2-windows`) +- **Selection Logic**: + ```go + // Get low-nic nodes with matching workload type + kubectl get nodes -l "nic-capacity=low-nic,workload-type=$WORKLOAD_TYPE" + + // Get high-nic nodes with matching workload type + kubectl get nodes -l "nic-capacity=high-nic,workload-type=$WORKLOAD_TYPE" + ``` +- **Pod Assignment**: + - Low-NIC nodes: Limited to 1 pod per node + - High-NIC nodes: Currently limited to 1 pod per node in test logic + +**Node Pool Configuration**: + +| Node Pool | VM SKU | NICs | Label | Pods per Node | +|-----------|--------|------|-------|---------------| +| nodepool1 (default) | `Standard_D4s_v3` | 1 | `nic-capacity=low-nic` | 1 | +| nplinux | `Standard_D16s_v3` | 7 | `nic-capacity=high-nic` | 1 (current test logic) | + +**Note**: VM SKUs are hardcoded as constants in the pipeline template and cannot be changed by users. 
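
To spot-check the labels on an existing setup before a scheduled run, something like the following can be run against either cluster. It assumes the kubeconfig path written by `create_aks.sh` and uses the same label selectors as the test stage:

```bash
export KUBECONFIG=/tmp/aks-1.kubeconfig   # or /tmp/aks-2.kubeconfig
export WORKLOAD_TYPE=swiftv2-linux        # matches the stage's WORKLOAD_TYPE variable

# Show the labels the tests filter on, one column per label.
kubectl get nodes -L agentpool,workload-type,nic-capacity

# Nodes the current stage would select for low-NIC and high-NIC scenarios.
kubectl get nodes -l "nic-capacity=low-nic,workload-type=${WORKLOAD_TYPE}" -o name
kubectl get nodes -l "nic-capacity=high-nic,workload-type=${WORKLOAD_TYPE}" -o name
```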
+ +## Schedule Modification + +To change test frequency, edit the cron schedule: + +```yaml +schedules: + - cron: "0 */1 * * *" # Every 1 hour (current) + # Examples: + # - cron: "0 */2 * * *" # Every 2 hours + # - cron: "0 */6 * * *" # Every 6 hours + # - cron: "0 0,8,16 * * *" # At 12am, 8am, 4pm + # - cron: "0 0 * * *" # Daily at midnight +``` + +## File Structure + +``` +.pipelines/swiftv2-long-running/ +├── pipeline.yaml # Main pipeline with schedule +├── README.md # This file +├── template/ +│ └── long-running-pipeline-template.yaml # Stage definitions (2 jobs) +└── scripts/ + ├── create_aks.sh # AKS cluster creation + ├── create_vnets.sh # VNet and subnet creation + ├── create_peerings.sh # VNet peering setup + ├── create_storage.sh # Storage account creation + ├── create_nsg.sh # Network security groups + └── create_pe.sh # Private endpoint setup + +test/integration/swiftv2/longRunningCluster/ +├── datapath_test.go # Original combined test (deprecated) +├── datapath_create_test.go # Create test scenarios (Job 1) +├── datapath_delete_test.go # Delete test scenarios (Job 2) +├── datapath.go # Resource orchestration +└── helpers/ + └── az_helpers.go # Azure/kubectl helper functions +``` + +## Best Practices + +1. **Keep infrastructure persistent**: Only recreate when necessary (cluster upgrades, config changes) +2. **Monitor scheduled runs**: Set up alerts for test failures +3. **Resource naming**: BUILD_ID is automatically set to the resource group name, ensuring unique resource names per setup +4. **Tag resources appropriately**: All setup resources automatically tagged with `SkipAutoDeleteTill=2032-12-31` + - AKS clusters + - AKS VNets + - Customer VNets (cx_vnet_a1, cx_vnet_a2, cx_vnet_a3, cx_vnet_b1) + - Storage accounts +5. **Avoid resource group collisions**: Always use unique `resourceGroupName` when creating new setups +6. **Document changes**: Update this README when modifying test scenarios or infrastructure + +## Resource Tags + +All infrastructure resources are automatically tagged during creation: + +```bash +SkipAutoDeleteTill=2032-12-31 +``` + +This prevents automatic cleanup by Azure subscription policies that delete resources after a certain period. 
The tag is applied to: +- Resource group (via create_resource_group job) +- AKS clusters (aks-1, aks-2) +- AKS cluster VNets +- Customer VNets (cx_vnet_a1, cx_vnet_a2, cx_vnet_a3, cx_vnet_b1) +- Storage accounts (sa1xxxx, sa2xxxx) + +To manually update the tag date: +```bash +az resource update --ids --set tags.SkipAutoDeleteTill=2033-12-31 +``` diff --git a/.pipelines/swiftv2-long-running/pipeline.yaml b/.pipelines/swiftv2-long-running/pipeline.yaml index b6d085901d..7abc3e1f79 100644 --- a/.pipelines/swiftv2-long-running/pipeline.yaml +++ b/.pipelines/swiftv2-long-running/pipeline.yaml @@ -1,4 +1,14 @@ trigger: none +pr: none + +# Schedule: Run every 1 hour +schedules: + - cron: "0 */3 * * *" # Every 3 hours at minute 0 + displayName: "Run tests every 3 hours" + branches: + include: + - sv2-long-running-pipeline-stage2 + always: true # Run even if there are no code changes parameters: - name: subscriptionId @@ -6,30 +16,26 @@ parameters: type: string default: "37deca37-c375-4a14-b90a-043849bd2bf1" + - name: serviceConnection + displayName: "Azure Service Connection" + type: string + default: "Azure Container Networking - Standalone Test Service Connection" + - name: location displayName: "Deployment Region" type: string default: "centraluseuap" - - name: resourceGroupName - displayName: "Resource Group Name" - type: string - default: "long-run-$(Build.BuildId)" - - - name: vmSkuDefault - displayName: "VM SKU for Default Node Pool" - type: string - default: "Standard_D2s_v3" - - - name: vmSkuHighNIC - displayName: "VM SKU for High NIC Node Pool" - type: string - default: "Standard_D16s_v3" + - name: runSetupStages + displayName: "Create New Infrastructure Setup" + type: boolean + default: false - - name: serviceConnection - displayName: "Azure Service Connection" + # Setup-only parameters (only used when runSetupStages=true) + - name: resourceGroupName + displayName: "Resource Group Name used when Create new Infrastructure Setup is selected" type: string - default: "Azure Container Networking - Standalone Test Service Connection" + default: "sv2-long-run-$(Build.BuildId)" extends: template: template/long-running-pipeline-template.yaml @@ -37,6 +43,5 @@ extends: subscriptionId: ${{ parameters.subscriptionId }} location: ${{ parameters.location }} resourceGroupName: ${{ parameters.resourceGroupName }} - vmSkuDefault: ${{ parameters.vmSkuDefault }} - vmSkuHighNIC: ${{ parameters.vmSkuHighNIC }} serviceConnection: ${{ parameters.serviceConnection }} + runSetupStages: ${{ parameters.runSetupStages }} diff --git a/.pipelines/swiftv2-long-running/scripts/create_aks.sh b/.pipelines/swiftv2-long-running/scripts/create_aks.sh index 4ab38c0f42..999a406900 100644 --- a/.pipelines/swiftv2-long-running/scripts/create_aks.sh +++ b/.pipelines/swiftv2-long-running/scripts/create_aks.sh @@ -7,57 +7,113 @@ RG=$3 VM_SKU_DEFAULT=$4 VM_SKU_HIGHNIC=$5 -CLUSTER_COUNT=2 -CLUSTER_PREFIX="aks" -DEFAULT_NODE_COUNT=1 -COMMON_TAGS="fastpathenabled=true RGOwner=LongRunningTestPipelines stampcreatorserviceinfo=true" - -wait_for_provisioning() { # Helper for safe retry/wait for provisioning states (basic) - local rg="$1" clusterName="$2" - echo "Waiting for AKS '$clusterName' in RG '$rg' to reach Succeeded/Failed (polling)..." 
+CLUSTER_COUNT=2 +CLUSTER_PREFIX="aks" + + +stamp_vnet() { + local vnet_id="$1" + + responseFile="response.txt" + modified_vnet="${vnet_id//\//%2F}" + cmd_stamp_curl="'curl -v -X PUT http://localhost:8080/VirtualNetwork/$modified_vnet/stampcreatorservicename'" + cmd_containerapp_exec="az containerapp exec -n subnetdelegator-westus-u3h4j -g subnetdelegator-westus --subscription 9b8218f9-902a-4d20-a65c-e98acec5362f --command $cmd_stamp_curl" + + max_retries=10 + sleep_seconds=15 + retry_count=0 + + while [[ $retry_count -lt $max_retries ]]; do + script --quiet -c "$cmd_containerapp_exec" "$responseFile" + if grep -qF "200 OK" "$responseFile"; then + echo "Subnet Delegator successfully stamped the vnet" + return 0 + else + echo "Subnet Delegator failed to stamp the vnet, attempt $((retry_count+1))" + cat "$responseFile" + retry_count=$((retry_count+1)) + sleep "$sleep_seconds" + fi + done + + echo "Failed to stamp the vnet even after $max_retries attempts" + exit 1 +} + +wait_for_provisioning() { + local rg="$1" clusterName="$2" + echo "Waiting for AKS '$clusterName' in RG '$rg'..." while :; do state=$(az aks show --resource-group "$rg" --name "$clusterName" --query provisioningState -o tsv 2>/dev/null || true) - if [ -z "$state" ]; then - sleep 3 - continue + if [[ "$state" =~ Succeeded ]]; then + echo "Provisioning state: $state" + break fi - case "$state" in - Succeeded|Succeeded*) echo "Provisioning state: $state"; break ;; - Failed|Canceled|Rejected) echo "Provisioning finished with state: $state"; break ;; - *) printf "."; sleep 6 ;; - esac + if [[ "$state" =~ Failed|Canceled ]]; then + echo "Provisioning finished with state: $state" + break + fi + sleep 6 done } +######################################### +# Main script starts here +######################################### + for i in $(seq 1 "$CLUSTER_COUNT"); do - echo "==============================" - echo " Working on cluster set #$i" - echo "==============================" - - CLUSTER_NAME="${CLUSTER_PREFIX}-${i}" - echo "Creating AKS cluster '$CLUSTER_NAME' in RG '$RG'" - - make -C ./hack/aks azcfg AZCLI=az REGION=$LOCATION - - make -C ./hack/aks swiftv2-podsubnet-cluster-up \ - AZCLI=az REGION=$LOCATION \ - SUB=$SUBSCRIPTION_ID \ - GROUP=$RG \ - CLUSTER=$CLUSTER_NAME \ - NODE_COUNT=$DEFAULT_NODE_COUNT \ - VM_SIZE=$VM_SKU_DEFAULT \ - - echo " - waiting for AKS provisioning state..." - wait_for_provisioning "$RG" "$CLUSTER_NAME" - - echo "Adding multi-tenant nodepool ' to '$CLUSTER_NAME'" - make -C ./hack/aks linux-swiftv2-nodepool-up \ - AZCLI=az REGION=$LOCATION \ - GROUP=$RG \ - VM_SIZE=$VM_SKU_HIGHNIC \ - CLUSTER=$CLUSTER_NAME \ - SUB=$SUBSCRIPTION_ID \ + echo "Creating cluster #$i..." 
+ CLUSTER_NAME="${CLUSTER_PREFIX}-${i}" + + make -C ./hack/aks azcfg AZCLI=az REGION=$LOCATION + + # Create cluster with SkipAutoDeleteTill tag for persistent infrastructure + make -C ./hack/aks swiftv2-podsubnet-cluster-up \ + AZCLI=az REGION=$LOCATION \ + SUB=$SUBSCRIPTION_ID \ + GROUP=$RG \ + CLUSTER=$CLUSTER_NAME \ + VM_SIZE=$VM_SKU_DEFAULT + + # Add SkipAutoDeleteTill tag to cluster (2032-12-31 for long-term persistence) + az aks update -g "$RG" -n "$CLUSTER_NAME" --tags SkipAutoDeleteTill=2032-12-31 || echo "Warning: Failed to add tag to cluster" + + wait_for_provisioning "$RG" "$CLUSTER_NAME" + + vnet_id=$(az network vnet show -g "$RG" --name "$CLUSTER_NAME" --query id -o tsv) + echo "Found VNET: $vnet_id" + + # Add SkipAutoDeleteTill tag to AKS VNet + az network vnet update --ids "$vnet_id" --set tags.SkipAutoDeleteTill=2032-12-31 || echo "Warning: Failed to add tag to vnet" + + stamp_vnet "$vnet_id" + + make -C ./hack/aks linux-swiftv2-nodepool-up \ + AZCLI=az REGION=$LOCATION \ + GROUP=$RG \ + VM_SIZE=$VM_SKU_HIGHNIC \ + CLUSTER=$CLUSTER_NAME \ + SUB=$SUBSCRIPTION_ID + + az aks get-credentials -g "$RG" -n "$CLUSTER_NAME" --admin --overwrite-existing \ + --file "/tmp/${CLUSTER_NAME}.kubeconfig" + + # Label all nodes with workload-type and nic-capacity labels + echo "==> Labeling all nodes in $CLUSTER_NAME with workload-type=swiftv2-linux" + kubectl --kubeconfig "/tmp/${CLUSTER_NAME}.kubeconfig" label nodes --all workload-type=swiftv2-linux --overwrite + echo "[OK] All nodes labeled with workload-type=swiftv2-linux" + + # Label default nodepool (nodepool1) with low-nic capacity + echo "==> Labeling default nodepool (nodepool1) nodes with nic-capacity=low-nic" + kubectl --kubeconfig "/tmp/${CLUSTER_NAME}.kubeconfig" label nodes -l agentpool=nodepool1 nic-capacity=low-nic --overwrite + echo "[OK] Default nodepool nodes labeled with nic-capacity=low-nic" + + # Label nplinux nodepool with high-nic capacity + echo "==> Labeling nplinux nodepool nodes with nic-capacity=high-nic" + kubectl --kubeconfig "/tmp/${CLUSTER_NAME}.kubeconfig" label nodes -l agentpool=nplinux nic-capacity=high-nic --overwrite + echo "[OK] nplinux nodepool nodes labeled with nic-capacity=high-nic" done -echo "All done. Created $CLUSTER_COUNT cluster set(s)." + +echo "All clusters complete." diff --git a/.pipelines/swiftv2-long-running/scripts/create_nsg.sh b/.pipelines/swiftv2-long-running/scripts/create_nsg.sh old mode 100644 new mode 100755 index cec91cd7cf..34c04f5c70 --- a/.pipelines/swiftv2-long-running/scripts/create_nsg.sh +++ b/.pipelines/swiftv2-long-running/scripts/create_nsg.sh @@ -7,9 +7,59 @@ RG=$2 LOCATION=$3 VNET_A1="cx_vnet_a1" -SUBNET1_PREFIX="10.10.1.0/24" -SUBNET2_PREFIX="10.10.2.0/24" -NSG_NAME="${VNET_A1}-nsg" + +# Get actual subnet CIDR ranges dynamically +echo "==> Retrieving actual subnet address prefixes..." +SUBNET1_PREFIX=$(az network vnet subnet show -g "$RG" --vnet-name "$VNET_A1" -n s1 --query "addressPrefix" -o tsv) +SUBNET2_PREFIX=$(az network vnet subnet show -g "$RG" --vnet-name "$VNET_A1" -n s2 --query "addressPrefix" -o tsv) + +echo "Subnet s1 CIDR: $SUBNET1_PREFIX" +echo "Subnet s2 CIDR: $SUBNET2_PREFIX" + +if [[ -z "$SUBNET1_PREFIX" || -z "$SUBNET2_PREFIX" ]]; then + echo "[ERROR] Failed to retrieve subnet address prefixes!" >&2 + exit 1 +fi + +# Wait 5 minutes for NSGs to be associated with subnets +echo "==> Waiting 5 minutes for NSG associations to complete..." 
+sleep 300 + +# Get NSG IDs associated with each subnet with retry logic +echo "==> Retrieving NSGs associated with subnets..." +max_retries=10 +retry_count=0 +retry_delay=30 + +while [[ $retry_count -lt $max_retries ]]; do + NSG_S1_ID=$(az network vnet subnet show -g "$RG" --vnet-name "$VNET_A1" -n s1 --query "networkSecurityGroup.id" -o tsv 2>/dev/null || echo "") + NSG_S2_ID=$(az network vnet subnet show -g "$RG" --vnet-name "$VNET_A1" -n s2 --query "networkSecurityGroup.id" -o tsv 2>/dev/null || echo "") + + if [[ -n "$NSG_S1_ID" && -n "$NSG_S2_ID" ]]; then + echo "[OK] Successfully retrieved NSG associations for both subnets" + break + fi + + retry_count=$((retry_count + 1)) + if [[ $retry_count -lt $max_retries ]]; then + echo "[RETRY $retry_count/$max_retries] NSG associations not ready yet. Waiting ${retry_delay}s before retry..." + echo " Subnet s1 NSG ID: ${NSG_S1_ID:-}" + echo " Subnet s2 NSG ID: ${NSG_S2_ID:-}" + sleep $retry_delay + else + echo "[ERROR] Failed to retrieve NSG associations after $max_retries attempts!" >&2 + echo " Subnet s1 NSG ID: ${NSG_S1_ID:-}" >&2 + echo " Subnet s2 NSG ID: ${NSG_S2_ID:-}" >&2 + exit 1 + fi +done + +# Extract NSG names from IDs +NSG_S1_NAME=$(basename "$NSG_S1_ID") +NSG_S2_NAME=$(basename "$NSG_S2_ID") + +echo "Subnet s1 NSG: $NSG_S1_NAME" +echo "Subnet s2 NSG: $NSG_S2_NAME" verify_nsg() { local rg="$1"; local name="$2" @@ -33,77 +83,119 @@ verify_nsg_rule() { fi } -verify_subnet_nsg_association() { - local rg="$1"; local vnet="$2"; local subnet="$3"; local nsg="$4" - echo "==> Verifying NSG association on subnet $subnet..." - local associated_nsg - associated_nsg=$(az network vnet subnet show -g "$rg" --vnet-name "$vnet" -n "$subnet" --query "networkSecurityGroup.id" -o tsv 2>/dev/null || echo "") - if [[ "$associated_nsg" == *"$nsg"* ]]; then - echo "[OK] Verified subnet $subnet is associated with NSG $nsg." - else - echo "[ERROR] Subnet $subnet is NOT associated with NSG $nsg!" >&2 - exit 1 - fi +wait_for_nsg() { + local rg="$1"; local name="$2" + echo "==> Waiting for NSG $name to become available..." + local max_attempts=30 + local attempt=0 + while [[ $attempt -lt $max_attempts ]]; do + if az network nsg show -g "$rg" -n "$name" &>/dev/null; then + local provisioning_state + provisioning_state=$(az network nsg show -g "$rg" -n "$name" --query "provisioningState" -o tsv) + if [[ "$provisioning_state" == "Succeeded" ]]; then + echo "[OK] NSG $name is available (provisioningState: $provisioning_state)." + return 0 + fi + echo "Waiting... NSG $name provisioningState: $provisioning_state" + fi + attempt=$((attempt + 1)) + sleep 10 + done + echo "[ERROR] NSG $name did not become available within the expected time!" >&2 + exit 1 } # ------------------------------- -# 1. Create NSG +# 1. Wait for NSGs to be available # ------------------------------- -echo "==> Creating Network Security Group: $NSG_NAME" -az network nsg create -g "$RG" -n "$NSG_NAME" -l "$LOCATION" --output none \ - && echo "[OK] NSG '$NSG_NAME' created." -verify_nsg "$RG" "$NSG_NAME" +wait_for_nsg "$RG" "$NSG_S1_NAME" +wait_for_nsg "$RG" "$NSG_S2_NAME" # ------------------------------- -# 2. Create NSG Rules +# 2. 
Create NSG Rules on Subnet1's NSG # ------------------------------- -echo "==> Creating NSG rule to DENY traffic from Subnet1 ($SUBNET1_PREFIX) to Subnet2 ($SUBNET2_PREFIX)" +# Rule 1: Deny Outbound traffic FROM Subnet1 TO Subnet2 +echo "==> Creating NSG rule on $NSG_S1_NAME to DENY OUTBOUND traffic from Subnet1 ($SUBNET1_PREFIX) to Subnet2 ($SUBNET2_PREFIX)" az network nsg rule create \ --resource-group "$RG" \ - --nsg-name "$NSG_NAME" \ - --name deny-subnet1-to-subnet2 \ + --nsg-name "$NSG_S1_NAME" \ + --name deny-s1-to-s2-outbound \ --priority 100 \ --source-address-prefixes "$SUBNET1_PREFIX" \ --destination-address-prefixes "$SUBNET2_PREFIX" \ - --direction Inbound \ + --source-port-ranges "*" \ + --destination-port-ranges "*" \ + --direction Outbound \ --access Deny \ --protocol "*" \ - --description "Deny all traffic from Subnet1 to Subnet2" \ + --description "Deny outbound traffic from Subnet1 to Subnet2" \ --output none \ - && echo "[OK] Deny rule from Subnet1 → Subnet2 created." + && echo "[OK] Deny outbound rule from Subnet1 → Subnet2 created on $NSG_S1_NAME." -verify_nsg_rule "$RG" "$NSG_NAME" "deny-subnet1-to-subnet2" +verify_nsg_rule "$RG" "$NSG_S1_NAME" "deny-s1-to-s2-outbound" -echo "==> Creating NSG rule to DENY traffic from Subnet2 ($SUBNET2_PREFIX) to Subnet1 ($SUBNET1_PREFIX)" +# Rule 2: Deny Inbound traffic FROM Subnet2 TO Subnet1 (for packets arriving at s1) +echo "==> Creating NSG rule on $NSG_S1_NAME to DENY INBOUND traffic from Subnet2 ($SUBNET2_PREFIX) to Subnet1 ($SUBNET1_PREFIX)" az network nsg rule create \ --resource-group "$RG" \ - --nsg-name "$NSG_NAME" \ - --name deny-subnet2-to-subnet1 \ - --priority 200 \ + --nsg-name "$NSG_S1_NAME" \ + --name deny-s2-to-s1-inbound \ + --priority 110 \ --source-address-prefixes "$SUBNET2_PREFIX" \ --destination-address-prefixes "$SUBNET1_PREFIX" \ + --source-port-ranges "*" \ + --destination-port-ranges "*" \ --direction Inbound \ --access Deny \ --protocol "*" \ - --description "Deny all traffic from Subnet2 to Subnet1" \ + --description "Deny inbound traffic from Subnet2 to Subnet1" \ --output none \ - && echo "[OK] Deny rule from Subnet2 → Subnet1 created." + && echo "[OK] Deny inbound rule from Subnet2 → Subnet1 created on $NSG_S1_NAME." -verify_nsg_rule "$RG" "$NSG_NAME" "deny-subnet2-to-subnet1" +verify_nsg_rule "$RG" "$NSG_S1_NAME" "deny-s2-to-s1-inbound" # ------------------------------- -# 3. Associate NSG with Subnets +# 3. 
Create NSG Rules on Subnet2's NSG # ------------------------------- -for SUBNET in s1 s2; do - echo "==> Associating NSG $NSG_NAME with subnet $SUBNET" - az network vnet subnet update \ - --name "$SUBNET" \ - --vnet-name "$VNET_A1" \ - --resource-group "$RG" \ - --network-security-group "$NSG_NAME" \ - --output none - verify_subnet_nsg_association "$RG" "$VNET_A1" "$SUBNET" "$NSG_NAME" -done +# Rule 3: Deny Outbound traffic FROM Subnet2 TO Subnet1 +echo "==> Creating NSG rule on $NSG_S2_NAME to DENY OUTBOUND traffic from Subnet2 ($SUBNET2_PREFIX) to Subnet1 ($SUBNET1_PREFIX)" +az network nsg rule create \ + --resource-group "$RG" \ + --nsg-name "$NSG_S2_NAME" \ + --name deny-s2-to-s1-outbound \ + --priority 100 \ + --source-address-prefixes "$SUBNET2_PREFIX" \ + --destination-address-prefixes "$SUBNET1_PREFIX" \ + --source-port-ranges "*" \ + --destination-port-ranges "*" \ + --direction Outbound \ + --access Deny \ + --protocol "*" \ + --description "Deny outbound traffic from Subnet2 to Subnet1" \ + --output none \ + && echo "[OK] Deny outbound rule from Subnet2 → Subnet1 created on $NSG_S2_NAME." + +verify_nsg_rule "$RG" "$NSG_S2_NAME" "deny-s2-to-s1-outbound" + +# Rule 4: Deny Inbound traffic FROM Subnet1 TO Subnet2 (for packets arriving at s2) +echo "==> Creating NSG rule on $NSG_S2_NAME to DENY INBOUND traffic from Subnet1 ($SUBNET1_PREFIX) to Subnet2 ($SUBNET2_PREFIX)" +az network nsg rule create \ + --resource-group "$RG" \ + --nsg-name "$NSG_S2_NAME" \ + --name deny-s1-to-s2-inbound \ + --priority 110 \ + --source-address-prefixes "$SUBNET1_PREFIX" \ + --destination-address-prefixes "$SUBNET2_PREFIX" \ + --source-port-ranges "*" \ + --destination-port-ranges "*" \ + --direction Inbound \ + --access Deny \ + --protocol "*" \ + --description "Deny inbound traffic from Subnet1 to Subnet2" \ + --output none \ + && echo "[OK] Deny inbound rule from Subnet1 → Subnet2 created on $NSG_S2_NAME." + +verify_nsg_rule "$RG" "$NSG_S2_NAME" "deny-s1-to-s2-inbound" -echo "NSG '$NSG_NAME' created successfully with bidirectional isolation between Subnet1 and Subnet2." +echo "NSG rules applied successfully on $NSG_S1_NAME and $NSG_S2_NAME with bidirectional isolation between Subnet1 and Subnet2." diff --git a/.pipelines/swiftv2-long-running/scripts/create_pe.sh b/.pipelines/swiftv2-long-running/scripts/create_pe.sh index c9f7e782e0..4d83a8a700 100644 --- a/.pipelines/swiftv2-long-running/scripts/create_pe.sh +++ b/.pipelines/swiftv2-long-running/scripts/create_pe.sh @@ -57,7 +57,7 @@ az network private-dns zone create -g "$RG" -n "$PRIVATE_DNS_ZONE" --output none verify_dns_zone "$RG" "$PRIVATE_DNS_ZONE" -# 2. Link DNS zone to VNet +# 2. Link DNS zone to Customer VNets for VNET in "$VNET_A1" "$VNET_A2" "$VNET_A3"; do LINK_NAME="${VNET}-link" echo "==> Linking DNS zone $PRIVATE_DNS_ZONE to VNet $VNET" @@ -71,9 +71,34 @@ for VNET in "$VNET_A1" "$VNET_A2" "$VNET_A3"; do verify_dns_link "$RG" "$PRIVATE_DNS_ZONE" "$LINK_NAME" done -# 3. Create Private Endpoint +# 2b. 
Link DNS zone to AKS Cluster VNets (so pods can resolve private endpoint) +echo "==> Linking DNS zone to AKS cluster VNets" +for CLUSTER in "aks-1" "aks-2"; do + echo "==> Getting VNet for $CLUSTER" + AKS_VNET_ID=$(az aks show -g "$RG" -n "$CLUSTER" --query "agentPoolProfiles[0].vnetSubnetId" -o tsv | cut -d'/' -f1-9) + + if [ -z "$AKS_VNET_ID" ]; then + echo "[WARNING] Could not get VNet for $CLUSTER, skipping DNS link" + continue + fi + + LINK_NAME="${CLUSTER}-vnet-link" + echo "==> Linking DNS zone to $CLUSTER VNet" + az network private-dns link vnet create \ + -g "$RG" -n "$LINK_NAME" \ + --zone-name "$PRIVATE_DNS_ZONE" \ + --virtual-network "$AKS_VNET_ID" \ + --registration-enabled false \ + --output none \ + && echo "[OK] Linked DNS zone to $CLUSTER VNet." + verify_dns_link "$RG" "$PRIVATE_DNS_ZONE" "$LINK_NAME" +done + +# 3. Create Private Endpoint with Private DNS Zone integration echo "==> Creating Private Endpoint for Storage Account: $SA1_NAME" SA1_ID=$(az storage account show -g "$RG" -n "$SA1_NAME" --query id -o tsv) +DNS_ZONE_ID=$(az network private-dns zone show -g "$RG" -n "$PRIVATE_DNS_ZONE" --query id -o tsv) + az network private-endpoint create \ -g "$RG" -n "$PE_NAME" -l "$LOCATION" \ --vnet-name "$VNET_A1" --subnet "$SUBNET_PE_A1" \ @@ -84,4 +109,32 @@ az network private-endpoint create \ && echo "[OK] Private Endpoint $PE_NAME created for $SA1_NAME." verify_private_endpoint "$RG" "$PE_NAME" +# 4. Create Private DNS Zone Group to auto-register the private endpoint IP +echo "==> Creating Private DNS Zone Group to register DNS record" +az network private-endpoint dns-zone-group create \ + -g "$RG" \ + --endpoint-name "$PE_NAME" \ + --name "default" \ + --private-dns-zone "$DNS_ZONE_ID" \ + --zone-name "blob" \ + --output none \ + && echo "[OK] DNS Zone Group created, DNS record will be auto-registered." + +# 5. Verify DNS record was created +echo "==> Waiting 10 seconds for DNS record propagation..." +sleep 10 + +echo "==> Verifying DNS A record for $SA1_NAME" +PE_IP=$(az network private-endpoint show -g "$RG" -n "$PE_NAME" --query 'customDnsConfigs[0].ipAddresses[0]' -o tsv) +echo "Private Endpoint IP: $PE_IP" + +DNS_RECORD=$(az network private-dns record-set a list -g "$RG" -z "$PRIVATE_DNS_ZONE" --query "[?contains(name, '$SA1_NAME')].{Name:name, IP:aRecords[0].ipv4Address}" -o tsv) +echo "DNS Record: $DNS_RECORD" + +if [ -z "$DNS_RECORD" ]; then + echo "[WARNING] DNS A record not found. Manual verification needed." +else + echo "[OK] DNS A record created successfully." +fi + echo "All Private DNS and Endpoint resources created and verified successfully." diff --git a/.pipelines/swiftv2-long-running/scripts/create_storage.sh b/.pipelines/swiftv2-long-running/scripts/create_storage.sh index caefc69294..fd5f7addae 100644 --- a/.pipelines/swiftv2-long-running/scripts/create_storage.sh +++ b/.pipelines/swiftv2-long-running/scripts/create_storage.sh @@ -26,8 +26,10 @@ for SA in "$SA1" "$SA2"; do --allow-shared-key-access false \ --https-only true \ --min-tls-version TLS1_2 \ + --tags SkipAutoDeleteTill=2032-12-31 \ --query "name" -o tsv \ && echo "Storage account $SA created successfully." + # Verify creation success echo "==> Verifying storage account $SA exists..." if az storage account show --name "$SA" --resource-group "$RG" &>/dev/null; then @@ -36,8 +38,48 @@ for SA in "$SA1" "$SA2"; do echo "[ERROR] Storage account $SA not found after creation!" 
>&2 exit 1 fi + + # Assign RBAC role to pipeline service principal for blob access + echo "==> Assigning Storage Blob Data Contributor role to service principal" + SP_OBJECT_ID=$(az ad signed-in-user show --query id -o tsv 2>/dev/null || az account show --query user.name -o tsv) + SA_SCOPE="/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}/providers/Microsoft.Storage/storageAccounts/${SA}" + + az role assignment create \ + --assignee "$SP_OBJECT_ID" \ + --role "Storage Blob Data Contributor" \ + --scope "$SA_SCOPE" \ + --output none \ + && echo "[OK] RBAC role assigned to service principal for $SA" + + # Create container and upload test blob for private endpoint testing + echo "==> Creating test container in $SA" + az storage container create \ + --name "test" \ + --account-name "$SA" \ + --auth-mode login \ + && echo "[OK] Container 'test' created in $SA" + + # Upload test blob + echo "==> Uploading test blob to $SA" + az storage blob upload \ + --account-name "$SA" \ + --container-name "test" \ + --name "hello.txt" \ + --data "Hello from Private Endpoint - Storage: $SA" \ + --auth-mode login \ + --overwrite \ + && echo "[OK] Test blob 'hello.txt' uploaded to $SA/test/" done +# # Disable public network access ONLY on SA1 (Tenant A storage with private endpoint) +# echo "==> Disabling public network access on $SA1" +# az storage account update \ +# --name "$SA1" \ +# --resource-group "$RG" \ +# --public-network-access Disabled \ +# --output none \ +# && echo "[OK] Public network access disabled on $SA1" + echo "All storage accounts created and verified successfully." # Set pipeline output variables diff --git a/.pipelines/swiftv2-long-running/scripts/create_vnets.sh b/.pipelines/swiftv2-long-running/scripts/create_vnets.sh index eb894d06ff..4649c3aca1 100644 --- a/.pipelines/swiftv2-long-running/scripts/create_vnets.sh +++ b/.pipelines/swiftv2-long-running/scripts/create_vnets.sh @@ -2,35 +2,31 @@ set -e trap 'echo "[ERROR] Failed while creating VNets or subnets. Check Azure CLI logs above." >&2' ERR -SUBSCRIPTION_ID=$1 +SUB_ID=$1 LOCATION=$2 RG=$3 +BUILD_ID=$4 -az account set --subscription "$SUBSCRIPTION_ID" - -# VNets and subnets -VNET_A1="cx_vnet_a1" -VNET_A2="cx_vnet_a2" -VNET_A3="cx_vnet_a3" -VNET_B1="cx_vnet_b1" - -A1_S1="10.10.1.0/24" -A1_S2="10.10.2.0/24" -A1_PE="10.10.100.0/24" - -A2_MAIN="10.11.1.0/24" - -A3_MAIN="10.12.1.0/24" - -B1_MAIN="10.20.1.0/24" +# --- VNet definitions --- +# Create customer vnets for two customers A and B. +# Using 172.16.0.0/12 range to avoid overlap with AKS infra 10.0.0.0/8 +VNAMES=( "cx_vnet_a1" "cx_vnet_a2" "cx_vnet_a3" "cx_vnet_b1" ) +VCIDRS=( "172.16.0.0/16" "172.17.0.0/16" "172.18.0.0/16" "172.19.0.0/16" ) +NODE_SUBNETS=( "172.16.0.0/24" "172.17.0.0/24" "172.18.0.0/24" "172.19.0.0/24" ) +EXTRA_SUBNETS_LIST=( "s1 s2 pe" "s1" "s1" "s1" ) +EXTRA_CIDRS_LIST=( "172.16.1.0/24,172.16.2.0/24,172.16.3.0/24" \ + "172.17.1.0/24" \ + "172.18.1.0/24" \ + "172.19.1.0/24" ) +az account set --subscription "$SUB_ID" # ------------------------------- # Verification functions # ------------------------------- verify_vnet() { - local rg="$1"; local vnet="$2" + local vnet="$1" echo "==> Verifying VNet: $vnet" - if az network vnet show -g "$rg" -n "$vnet" &>/dev/null; then + if az network vnet show -g "$RG" -n "$vnet" &>/dev/null; then echo "[OK] Verified VNet $vnet exists." else echo "[ERROR] VNet $vnet not found!" 
>&2 @@ -39,9 +35,9 @@ verify_vnet() { } verify_subnet() { - local rg="$1"; local vnet="$2"; local subnet="$3" + local vnet="$1"; local subnet="$2" echo "==> Verifying subnet: $subnet in $vnet" - if az network vnet subnet show -g "$rg" --vnet-name "$vnet" -n "$subnet" &>/dev/null; then + if az network vnet subnet show -g "$RG" --vnet-name "$vnet" -n "$subnet" &>/dev/null; then echo "[OK] Verified subnet $subnet exists in $vnet." else echo "[ERROR] Subnet $subnet not found in $vnet!" >&2 @@ -50,35 +46,99 @@ verify_subnet() { } # ------------------------------- -# Create VNets and Subnets -# ------------------------------- -# A1 -az network vnet create -g "$RG" -n "$VNET_A1" --address-prefix 10.10.0.0/16 --subnet-name s1 --subnet-prefix "$A1_S1" -l "$LOCATION" --output none \ - && echo "Created $VNET_A1 with subnet s1" -az network vnet subnet create -g "$RG" --vnet-name "$VNET_A1" -n s2 --address-prefix "$A1_S2" --output none \ - && echo "Created $VNET_A1 with subnet s2" -az network vnet subnet create -g "$RG" --vnet-name "$VNET_A1" -n pe --address-prefix "$A1_PE" --output none \ - && echo "Created $VNET_A1 with subnet pe" -# Verify A1 -verify_vnet "$RG" "$VNET_A1" -for sn in s1 s2 pe; do verify_subnet "$RG" "$VNET_A1" "$sn"; done +create_vnet_subets() { + local vnet="$1" + local vnet_cidr="$2" + local node_subnet_cidr="$3" + local extra_subnets="$4" + local extra_cidrs="$5" -# A2 -az network vnet create -g "$RG" -n "$VNET_A2" --address-prefix 10.11.0.0/16 --subnet-name s1 --subnet-prefix "$A2_MAIN" -l "$LOCATION" --output none \ - && echo "Created $VNET_A2 with subnet s1" -verify_vnet "$RG" "$VNET_A2" -verify_subnet "$RG" "$VNET_A2" "s1" + echo "==> Creating VNet: $vnet with CIDR: $vnet_cidr" + az network vnet create -g "$RG" -l "$LOCATION" --name "$vnet" --address-prefixes "$vnet_cidr" \ + --tags SkipAutoDeleteTill=2032-12-31 -o none + + IFS=' ' read -r -a extra_subnet_array <<< "$extra_subnets" + IFS=',' read -r -a extra_cidr_array <<< "$extra_cidrs" + + for i in "${!extra_subnet_array[@]}"; do + subnet_name="${extra_subnet_array[$i]}" + subnet_cidr="${extra_cidr_array[$i]}" + echo "==> Creating extra subnet: $subnet_name with CIDR: $subnet_cidr" + + # Only delegate pod subnets (not private endpoint subnets) + if [[ "$subnet_name" != "pe" ]]; then + az network vnet subnet create -g "$RG" \ + --vnet-name "$vnet" --name "$subnet_name" \ + --delegations Microsoft.SubnetDelegator/msfttestclients \ + --address-prefixes "$subnet_cidr" -o none + else + az network vnet subnet create -g "$RG" \ + --vnet-name "$vnet" --name "$subnet_name" \ + --address-prefixes "$subnet_cidr" -o none + fi + done +} + +delegate_subnet() { + local vnet="$1" + local subnet="$2" + local max_attempts=7 + local attempt=1 + + echo "==> Delegating subnet: $subnet in VNet: $vnet to Subnet Delegator" + subnet_id=$(az network vnet subnet show -g "$RG" --vnet-name "$vnet" -n "$subnet" --query id -o tsv) + modified_custsubnet="${subnet_id//\//%2F}" + + responseFile="delegate_response.txt" + cmd_delegator_curl="'curl -X PUT http://localhost:8080/DelegatedSubnet/$modified_custsubnet'" + cmd_containerapp_exec="az containerapp exec -n subnetdelegator-westus-u3h4j -g subnetdelegator-westus --subscription 9b8218f9-902a-4d20-a65c-e98acec5362f --command $cmd_delegator_curl" + + while [ $attempt -le $max_attempts ]; do + echo "Attempt $attempt of $max_attempts..." 
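+        # Each attempt asks the Subnet Delegator container app to register the customer
+        # subnet by PUTting its URL-encoded resource ID; a response containing "success"
+        # ends the retry loop.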
+ + # Use script command to provide PTY for az containerapp exec + script --quiet -c "$cmd_containerapp_exec" "$responseFile" + + if grep -qF "success" "$responseFile"; then + echo "Subnet Delegator successfully registered the subnet" + rm -f "$responseFile" + return 0 + else + echo "Subnet Delegator failed to register the subnet (attempt $attempt)" + cat "$responseFile" + + if [ $attempt -lt $max_attempts ]; then + echo "Retrying in 5 seconds..." + sleep 5 + fi + fi + + ((attempt++)) + done + + echo "[ERROR] Failed to delegate subnet after $max_attempts attempts" + rm -f "$responseFile" + exit 1 +} -# A3 -az network vnet create -g "$RG" -n "$VNET_A3" --address-prefix 10.12.0.0/16 --subnet-name s1 --subnet-prefix "$A3_MAIN" -l "$LOCATION" --output none \ - && echo "Created $VNET_A3 with subnet s1" -verify_vnet "$RG" "$VNET_A3" -verify_subnet "$RG" "$VNET_A3" "s1" +# --- Loop over VNets --- +for i in "${!VNAMES[@]}"; do + VNET=${VNAMES[$i]} + VNET_CIDR=${VCIDRS[$i]} + NODE_SUBNET_CIDR=${NODE_SUBNETS[$i]} + EXTRA_SUBNETS=${EXTRA_SUBNETS_LIST[$i]} + EXTRA_SUBNET_CIDRS=${EXTRA_CIDRS_LIST[$i]} -# B1 -az network vnet create -g "$RG" -n "$VNET_B1" --address-prefix 10.20.0.0/16 --subnet-name s1 --subnet-prefix "$B1_MAIN" -l "$LOCATION" --output none \ - && echo "Created $VNET_B1 with subnet s1" -verify_vnet "$RG" "$VNET_B1" -verify_subnet "$RG" "$VNET_B1" "s1" + # Create VNet + subnets + create_vnet_subets "$VNET" "$VNET_CIDR" "$NODE_SUBNET_CIDR" "$EXTRA_SUBNETS" "$EXTRA_SUBNET_CIDRS" + verify_vnet "$VNET" + # Loop over extra subnets to verify and delegate the pod subnets. + for PODSUBNET in $EXTRA_SUBNETS; do + verify_subnet "$VNET" "$PODSUBNET" + if [[ "$PODSUBNET" != "pe" ]]; then + delegate_subnet "$VNET" "$PODSUBNET" + fi + done +done -echo " All VNets and subnets created and verified successfully." +echo "All VNets and subnets created and verified successfully." 
\ No newline at end of file diff --git a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml index cc6016f17a..7236fc8776 100644 --- a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml +++ b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml @@ -5,16 +5,30 @@ parameters: type: string - name: resourceGroupName type: string - - name: vmSkuDefault - type: string - - name: vmSkuHighNIC - type: string - name: serviceConnection type: string + - name: runSetupStages + type: boolean + default: false + +variables: + - name: rgName + ${{ if eq(parameters.runSetupStages, true) }}: + value: ${{ parameters.resourceGroupName }} + ${{ else }}: + value: sv2-long-run-${{ parameters.location }} + - name: vmSkuDefault + value: "Standard_D4s_v3" + - name: vmSkuHighNIC + value: "Standard_D16s_v3" stages: + # ================================================================= + # Stage 1: AKS Cluster and Networking Setup (Conditional) + # ================================================================= - stage: AKSClusterAndNetworking displayName: "Stage: AKS Cluster and Networking Setup" + condition: eq(${{ parameters.runSetupStages }}, true) jobs: # ------------------------------------------------------------ # Job 1: Create Resource Group @@ -32,10 +46,13 @@ stages: scriptType: bash scriptLocation: inlineScript inlineScript: | - echo "==> Creating resource group ${{ parameters.resourceGroupName }} in ${{ parameters.location }}" + echo "Org: $SYSTEM_COLLECTIONURI" + echo "Project: $SYSTEM_TEAMPROJECT" + echo "==> Creating resource group $(rgName) in ${{ parameters.location }}" az group create \ - --name "${{ parameters.resourceGroupName }}" \ + --name "$(rgName)" \ --location "${{ parameters.location }}" \ + --tags SkipAutoDeleteTill=2032-12-31 \ --subscription "${{ parameters.subscriptionId }}" echo "Resource group created successfully." 
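For scheduled runs (runSetupStages=false) this stage reuses the persistent resource group derived from the location parameter, e.g. sv2-long-run-westus. A quick manual check of that group and its skip-delete tag, as a sketch only (the resource group name shown is an example for location "westus"):

    RG="sv2-long-run-westus"   # rgName as derived above
    az group show --name "$RG" --query "{location:location, tags:tags}" -o json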
@@ -59,16 +76,17 @@ stages: arguments: > ${{ parameters.subscriptionId }} ${{ parameters.location }} - ${{ parameters.resourceGroupName }} - ${{ parameters.vmSkuDefault }} - ${{ parameters.vmSkuHighNIC }} + $(rgName) + $(vmSkuDefault) + $(vmSkuHighNIC) # ------------------------------------------------------------ # Job 3: Networking & Storage # ------------------------------------------------------------ - job: NetworkingAndStorage + timeoutInMinutes: 0 displayName: "Networking and Storage Setup" - dependsOn: CreateResourceGroup + dependsOn: CreateCluster pool: vmImage: ubuntu-latest steps: @@ -85,7 +103,8 @@ stages: arguments: > ${{ parameters.subscriptionId }} ${{ parameters.location }} - ${{ parameters.resourceGroupName }} + $(rgName) + $(Build.BuildId) # Task 2: Create Peerings - task: AzureCLI@2 @@ -96,7 +115,7 @@ stages: scriptLocation: scriptPath scriptPath: ".pipelines/swiftv2-long-running/scripts/create_peerings.sh" arguments: > - ${{ parameters.resourceGroupName }} + $(rgName) # Task 3: Create Storage Accounts - task: AzureCLI@2 @@ -110,31 +129,297 @@ stages: arguments: > ${{ parameters.subscriptionId }} ${{ parameters.location }} - ${{ parameters.resourceGroupName }} + $(rgName) - # Task 4: Create NSG + # Task 4: Create Private Endpoint - task: AzureCLI@2 - displayName: "Create network security groups to restrict access between subnets" + displayName: "Create Private Endpoint for Storage Account" inputs: azureSubscription: ${{ parameters.serviceConnection }} scriptType: bash scriptLocation: scriptPath - scriptPath: ".pipelines/swiftv2-long-running/scripts/create_nsg.sh" + scriptPath: ".pipelines/swiftv2-long-running/scripts/create_pe.sh" arguments: > ${{ parameters.subscriptionId }} - ${{ parameters.resourceGroupName }} ${{ parameters.location }} - - # Task 5: Create Private Endpoint + $(rgName) + $(CreateStorageAccounts.StorageAccount1) + + # Task 5: Create NSG - task: AzureCLI@2 - displayName: "Create Private Endpoint for Storage Account" + displayName: "Create network security groups to restrict access between subnets" inputs: azureSubscription: ${{ parameters.serviceConnection }} scriptType: bash scriptLocation: scriptPath - scriptPath: ".pipelines/swiftv2-long-running/scripts/create_pe.sh" + scriptPath: ".pipelines/swiftv2-long-running/scripts/create_nsg.sh" arguments: > ${{ parameters.subscriptionId }} + $(rgName) ${{ parameters.location }} - ${{ parameters.resourceGroupName }} - $(CreateStorageAccounts.StorageAccount1) + # ================================================================= + # Stage 2: Datapath Tests + # ================================================================= + - stage: ManagedNodeDataPathTests + displayName: "Stage: Swiftv2 Data Path Tests on Linux Managed Nodes" + dependsOn: AKSClusterAndNetworking + condition: or(eq(${{ parameters.runSetupStages }}, false), succeeded()) + variables: + storageAccount1: $[ stageDependencies.AKSClusterAndNetworking.NetworkingAndStorage.outputs['CreateStorageAccounts.StorageAccount1'] ] + storageAccount2: $[ stageDependencies.AKSClusterAndNetworking.NetworkingAndStorage.outputs['CreateStorageAccounts.StorageAccount2'] ] + jobs: + # ------------------------------------------------------------ + # Job 1: Create Test Resources and Wait + # ------------------------------------------------------------ + - job: CreateTestResources + displayName: "Create Resources and Wait 20 Minutes" + timeoutInMinutes: 90 + pool: + vmImage: ubuntu-latest + steps: + - checkout: self + + - task: GoTool@0 + displayName: "Use Go 1.22.5" + 
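+          # The tagged ginkgo suites are built and run directly on the hosted agent,
+          # so Go and the ginkgo CLI are installed before the test step below.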
inputs: + version: "1.22.5" + + - task: AzureCLI@2 + displayName: "Create Test Resources" + inputs: + azureSubscription: ${{ parameters.serviceConnection }} + scriptType: bash + scriptLocation: inlineScript + inlineScript: | + echo "==> Installing Ginkgo CLI" + go install github.com/onsi/ginkgo/v2/ginkgo@latest + + echo "==> Adding Go bin to PATH" + export PATH=$PATH:$(go env GOPATH)/bin + + echo "==> Downloading Go dependencies" + go mod download + + echo "==> Setting up kubeconfig for cluster aks-1" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-1 \ + --file /tmp/aks-1.kubeconfig \ + --overwrite-existing \ + --admin + + echo "==> Setting up kubeconfig for cluster aks-2" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-2 \ + --file /tmp/aks-2.kubeconfig \ + --overwrite-existing \ + --admin + + echo "==> Verifying cluster aks-1 connectivity" + kubectl --kubeconfig /tmp/aks-1.kubeconfig get nodes + + echo "==> Verifying cluster aks-2 connectivity" + kubectl --kubeconfig /tmp/aks-2.kubeconfig get nodes + + echo "==> Creating test resources (8 scenarios)" + export RG="$(rgName)" + export BUILD_ID="$(rgName)" + export WORKLOAD_TYPE="swiftv2-linux" + cd ./test/integration/swiftv2/longRunningCluster + ginkgo -v -trace --timeout=1h --tags=create_test + + - script: | + echo "Waiting 2 minutes for pods to fully start and HTTP servers to be ready..." + sleep 120 + echo "Wait period complete, proceeding with connectivity tests" + displayName: "Wait for pods to be ready" + + # ------------------------------------------------------------ + # Job 2: Run Connectivity Tests + # ------------------------------------------------------------ + - job: ConnectivityTests + displayName: "Test Pod-to-Pod Connectivity" + dependsOn: CreateTestResources + timeoutInMinutes: 30 + pool: + vmImage: ubuntu-latest + steps: + - checkout: self + + - task: GoTool@0 + displayName: "Use Go 1.22.5" + inputs: + version: "1.22.5" + + - task: AzureCLI@2 + displayName: "Run Connectivity Tests" + inputs: + azureSubscription: ${{ parameters.serviceConnection }} + scriptType: bash + scriptLocation: inlineScript + inlineScript: | + echo "==> Installing Ginkgo CLI" + go install github.com/onsi/ginkgo/v2/ginkgo@latest + + echo "==> Adding Go bin to PATH" + export PATH=$PATH:$(go env GOPATH)/bin + + echo "==> Downloading Go dependencies" + go mod download + + echo "==> Setting up kubeconfig for cluster aks-1" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-1 \ + --file /tmp/aks-1.kubeconfig \ + --overwrite-existing \ + --admin + + echo "==> Setting up kubeconfig for cluster aks-2" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-2 \ + --file /tmp/aks-2.kubeconfig \ + --overwrite-existing \ + --admin + + echo "==> Running connectivity tests" + export RG="$(rgName)" + export BUILD_ID="$(rgName)" + export WORKLOAD_TYPE="swiftv2-linux" + cd ./test/integration/swiftv2/longRunningCluster + ginkgo -v -trace --timeout=30m --tags=connectivity_test + + # ------------------------------------------------------------ + # Job 3: Private Endpoint Connectivity Tests + # ------------------------------------------------------------ + - job: PrivateEndpointTests + displayName: "Test Private Endpoint Access" + dependsOn: ConnectivityTests + timeoutInMinutes: 30 + pool: + vmImage: ubuntu-latest + steps: + - checkout: self + + - task: GoTool@0 + displayName: "Use Go 1.22.5" + inputs: + version: "1.22.5" + + - task: AzureCLI@2 + displayName: "Run Private Endpoint Tests" + 
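+          # Storage account names come from the setup stage outputs when available; when
+          # runSetupStages=false, the script below discovers them from the resource group instead.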
inputs: + azureSubscription: ${{ parameters.serviceConnection }} + scriptType: bash + scriptLocation: inlineScript + inlineScript: | + echo "==> Installing Ginkgo CLI" + go install github.com/onsi/ginkgo/v2/ginkgo@latest + + echo "==> Adding Go bin to PATH" + export PATH=$PATH:$(go env GOPATH)/bin + + echo "==> Downloading Go dependencies" + go mod download + + echo "==> Setting up kubeconfig for cluster aks-1" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-1 \ + --file /tmp/aks-1.kubeconfig \ + --overwrite-existing \ + --admin + + echo "==> Setting up kubeconfig for cluster aks-2" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-2 \ + --file /tmp/aks-2.kubeconfig \ + --overwrite-existing \ + --admin + + echo "==> Running Private Endpoint connectivity tests" + export RG="$(rgName)" + export BUILD_ID="$(rgName)" + export WORKLOAD_TYPE="swiftv2-linux" + + # Get storage account names - either from stage variables or discover from resource group + STORAGE_ACCOUNT_1="$(storageAccount1)" + STORAGE_ACCOUNT_2="$(storageAccount2)" + + # If variables are empty (when runSetupStages=false), discover from resource group + if [ -z "$STORAGE_ACCOUNT_1" ] || [ -z "$STORAGE_ACCOUNT_2" ]; then + echo "Storage account variables not set, discovering from resource group..." + STORAGE_ACCOUNT_1=$(az storage account list -g $(rgName) --query "[?starts_with(name, 'sa1')].name" -o tsv) + STORAGE_ACCOUNT_2=$(az storage account list -g $(rgName) --query "[?starts_with(name, 'sa2')].name" -o tsv) + echo "Discovered: STORAGE_ACCOUNT_1=$STORAGE_ACCOUNT_1, STORAGE_ACCOUNT_2=$STORAGE_ACCOUNT_2" + fi + + export STORAGE_ACCOUNT_1 + export STORAGE_ACCOUNT_2 + cd ./test/integration/swiftv2/longRunningCluster + ginkgo -v -trace --timeout=30m --tags=private_endpoint_test + + # ------------------------------------------------------------ + # Job 4: Delete Test Resources + # ------------------------------------------------------------ + - job: DeleteTestResources + displayName: "Delete PodNetwork, PNI, and Pods" + dependsOn: + - CreateTestResources + - ConnectivityTests + - PrivateEndpointTests + # Always run cleanup, even if previous jobs failed + condition: always() + timeoutInMinutes: 60 + pool: + vmImage: ubuntu-latest + steps: + - checkout: self + + - task: GoTool@0 + displayName: "Use Go 1.22.5" + inputs: + version: "1.22.5" + + - task: AzureCLI@2 + displayName: "Delete Test Resources" + inputs: + azureSubscription: ${{ parameters.serviceConnection }} + scriptType: bash + scriptLocation: inlineScript + inlineScript: | + echo "==> Installing Ginkgo CLI" + go install github.com/onsi/ginkgo/v2/ginkgo@latest + + echo "==> Adding Go bin to PATH" + export PATH=$PATH:$(go env GOPATH)/bin + + echo "==> Downloading Go dependencies" + go mod download + + echo "==> Setting up kubeconfig for cluster aks-1" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-1 \ + --file /tmp/aks-1.kubeconfig \ + --overwrite-existing \ + --admin + + echo "==> Setting up kubeconfig for cluster aks-2" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-2 \ + --file /tmp/aks-2.kubeconfig \ + --overwrite-existing \ + --admin + + echo "==> Deleting test resources (8 scenarios)" + export RG="$(rgName)" + export BUILD_ID="$(rgName)" + export WORKLOAD_TYPE="swiftv2-linux" + cd ./test/integration/swiftv2/longRunningCluster + ginkgo -v -trace --timeout=1h --tags=delete_test + \ No newline at end of file diff --git a/go.mod b/go.mod index bf07d7f6ac..8096f632b3 100644 --- a/go.mod 
+++ b/go.mod @@ -1,6 +1,8 @@ module github.com/Azure/azure-container-networking -go 1.24.1 +go 1.24.0 + +toolchain go1.24.10 require ( github.com/Azure/azure-sdk-for-go/sdk/azcore v1.19.1 @@ -68,7 +70,6 @@ require ( github.com/gofrs/uuid v4.4.0+incompatible // indirect github.com/gogo/protobuf v1.3.2 // indirect github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da // indirect - github.com/hpcloud/tail v1.0.0 // indirect github.com/inconshreveable/mousetrap v1.1.0 // indirect github.com/josharian/intern v1.0.0 // indirect github.com/json-iterator/go v1.1.12 // indirect @@ -104,12 +105,9 @@ require ( golang.org/x/term v0.36.0 // indirect golang.org/x/text v0.30.0 // indirect golang.org/x/time v0.14.0 - golang.org/x/xerrors v0.0.0-20220907171357-04be3eba64a2 // indirect gomodules.xyz/jsonpatch/v2 v2.4.0 // indirect - gopkg.in/fsnotify.v1 v1.4.7 // indirect gopkg.in/inf.v0 v0.9.1 // indirect gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7 // indirect - gopkg.in/yaml.v2 v2.4.0 // indirect gopkg.in/yaml.v3 v3.0.1 // indirect k8s.io/kube-openapi v0.0.0-20250910181357-589584f1c912 // indirect sigs.k8s.io/json v0.0.0-20241014173422-cfa47c3a1cc8 // indirect @@ -125,6 +123,7 @@ require ( github.com/cilium/cilium v1.15.16 github.com/cilium/ebpf v0.19.0 github.com/jsternberg/zap-logfmt v1.3.0 + github.com/onsi/ginkgo/v2 v2.23.4 golang.org/x/sync v0.17.0 gotest.tools/v3 v3.5.2 k8s.io/kubectl v0.34.1 @@ -147,9 +146,11 @@ require ( github.com/go-openapi/spec v0.20.11 // indirect github.com/go-openapi/strfmt v0.21.9 // indirect github.com/go-openapi/validate v0.22.3 // indirect + github.com/go-task/slim-sprig/v3 v3.0.0 // indirect github.com/go-viper/mapstructure/v2 v2.4.0 // indirect github.com/google/btree v1.1.3 // indirect github.com/google/gopacket v1.1.19 // indirect + github.com/google/pprof v0.0.0-20250630185457-6e76a2b096b5 // indirect github.com/gorilla/websocket v1.5.4-0.20250319132907-e064f32e3674 // indirect github.com/hashicorp/golang-lru/v2 v2.0.7 // indirect github.com/kr/pretty v0.3.1 // indirect @@ -174,10 +175,12 @@ require ( go.opentelemetry.io/otel/sdk v1.38.0 // indirect go.opentelemetry.io/otel/sdk/metric v1.38.0 // indirect go.opentelemetry.io/otel/trace v1.38.0 // indirect + go.uber.org/automaxprocs v1.6.0 // indirect go.uber.org/dig v1.17.1 // indirect go.yaml.in/yaml/v2 v2.4.3 // indirect go.yaml.in/yaml/v3 v3.0.4 // indirect go4.org/netipx v0.0.0-20231129151722-fdeea329fbba // indirect + golang.org/x/tools v0.37.0 // indirect gopkg.in/evanphx/json-patch.v4 v4.12.0 // indirect sigs.k8s.io/randfill v1.0.0 // indirect sigs.k8s.io/structured-merge-diff/v6 v6.3.0 // indirect @@ -193,11 +196,6 @@ require ( k8s.io/kubelet v0.34.1 ) -replace ( - github.com/onsi/ginkgo => github.com/onsi/ginkgo v1.12.0 - github.com/onsi/gomega => github.com/onsi/gomega v1.10.0 -) - retract ( v1.16.17 // contains only retractions, new version to retract 1.15.22. v1.16.16 // contains only retractions, has to be newer than 1.16.15. 
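With onsi/ginkgo/v2 now a direct dependency, the tagged suites the pipeline runs can also be exercised locally against the long-running clusters. A minimal sketch, assuming kubeconfigs for aks-1 and aks-2 have already been written to /tmp as the pipeline does, and using example environment values:

    export RG="sv2-long-run-westus" BUILD_ID="$RG" WORKLOAD_TYPE="swiftv2-linux"
    cd test/integration/swiftv2/longRunningCluster
    ginkgo -v -trace --timeout=30m --tags=connectivity_test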
diff --git a/go.sum b/go.sum index dbcced8ba9..c1ac6b2891 100644 --- a/go.sum +++ b/go.sum @@ -114,6 +114,7 @@ github.com/evanphx/json-patch/v5 v5.9.11/go.mod h1:3j+LviiESTElxA4p3EMKAB9HXj3/X github.com/frankban/quicktest v1.14.6 h1:7Xjx+VpznH+oBnejlPUj8oUpdxnVs4f8XU8WnHkI4W8= github.com/frankban/quicktest v1.14.6/go.mod h1:4ptaffx2x8+WTWXmUCuVU6aPUX1/Mz7zb5vbUoiM6w0= github.com/fsnotify/fsnotify v1.4.7/go.mod h1:jwhsz4b93w/PPRr/qN1Yymfu8t87LnFCMoQvtojpjFo= +github.com/fsnotify/fsnotify v1.4.9/go.mod h1:znqG4EE+3YCdAaPaxE2ZRY/06pZUdp0tY4IgpuI1SZQ= github.com/fsnotify/fsnotify v1.6.0/go.mod h1:sl3t1tCWJFWoRz9R8WJCbQihKKwmorjAbSClcnxKAGw= github.com/fsnotify/fsnotify v1.9.0 h1:2Ml+OJNzbYCTzsxtv8vKSFD9PbJjmhYF14k/jKC7S9k= github.com/fsnotify/fsnotify v1.9.0/go.mod h1:8jBTzvmWwFyi3Pb8djgCCO5IBqzKJ/Jwo8TRcHyHii0= @@ -160,6 +161,7 @@ github.com/go-openapi/validate v0.22.3 h1:KxG9mu5HBRYbecRb37KRCihvGGtND2aXziBAv0 github.com/go-openapi/validate v0.22.3/go.mod h1:kVxh31KbfsxU8ZyoHaDbLBWU5CnMdqBUEtadQ2G4d5M= github.com/go-quicktest/qt v1.101.1-0.20240301121107-c6c8733fa1e6 h1:teYtXy9B7y5lHTp8V9KPxpYRAVA7dozigQcMiBust1s= github.com/go-quicktest/qt v1.101.1-0.20240301121107-c6c8733fa1e6/go.mod h1:p4lGIVX+8Wa6ZPNDvqcxq36XpUDLh42FLetFU7odllI= +github.com/go-task/slim-sprig v0.0.0-20210107165309-348f09dbbbc0/go.mod h1:fyg7847qk6SyHyPtNmDHnmrv/HOrqktSC+C9fM+CJOE= github.com/go-task/slim-sprig/v3 v3.0.0 h1:sUs3vkvUymDpBKi3qH1YSqBQk9+9D/8M2mN1vB6EwHI= github.com/go-task/slim-sprig/v3 v3.0.0/go.mod h1:W848ghGpv3Qj3dhTPRyJypKRiqCdHZiAzKg9hl15HA8= github.com/go-viper/mapstructure/v2 v2.4.0 h1:EBsztssimR/CONLSZZ04E8qAkxNYq4Qp9LvH92wZUgs= @@ -186,6 +188,7 @@ github.com/golang/protobuf v1.4.0-rc.2/go.mod h1:LlEzMj4AhA7rCAGe4KMBDvJI+AwstrU github.com/golang/protobuf v1.4.0-rc.4.0.20200313231945-b860323f09d0/go.mod h1:WU3c8KckQ9AFe+yFwt9sWVRKCVIyN9cPHBJSNnbL67w= github.com/golang/protobuf v1.4.0/go.mod h1:jodUvKwWbYaEsadDk5Fwe5c77LiNKVO9IDvqG2KuDX0= github.com/golang/protobuf v1.4.1/go.mod h1:U8fpvMrcmy5pZrNK1lt4xCsGvpyWQ/VVv6QDs8UjoX8= +github.com/golang/protobuf v1.4.2/go.mod h1:oDoupMAO8OvCJWAcko0GGGIgR6R6ocIYbsSw735rRwI= github.com/golang/protobuf v1.4.3/go.mod h1:oDoupMAO8OvCJWAcko0GGGIgR6R6ocIYbsSw735rRwI= github.com/golang/protobuf v1.5.4 h1:i7eJL8qZTpSEXOPTxNKhASYpMn+8e5Q6AdndVa1dWek= github.com/golang/protobuf v1.5.4/go.mod h1:lnTiLA8Wa4RWRcIUkrtSVa5nRhsEGBg48fD6rSs7xps= @@ -224,7 +227,6 @@ github.com/hashicorp/go-version v1.7.0 h1:5tqGy27NaOTB8yJKUZELlFAS/LTKJkrmONwQKe github.com/hashicorp/go-version v1.7.0/go.mod h1:fltr4n8CU8Ke44wwGCBoEymUuxUHl09ZGVZPK5anwXA= github.com/hashicorp/golang-lru/v2 v2.0.7 h1:a+bsQ5rvGLjzHuww6tVxozPZFVghXaHOwFs4luLUK2k= github.com/hashicorp/golang-lru/v2 v2.0.7/go.mod h1:QeFd9opnmA6QUJc5vARoKUSoFhyfM2/ZepoAG6RGpeM= -github.com/hpcloud/tail v1.0.0 h1:nfCOvKYfkgYP8hkirhJocXT2+zOD8yUNjXaWfTlyFKI= github.com/hpcloud/tail v1.0.0/go.mod h1:ab1qPbhIpdTxEkNHXyeSf5vhxWSCs/tWer42PpOxQnU= github.com/inconshreveable/mousetrap v1.1.0 h1:wN+x4NVGpMsO7ErUn/mUI3vEoE6Jt13X2s0bqwp9tc8= github.com/inconshreveable/mousetrap v1.1.0/go.mod h1:vpF70FUmC8bwa3OWnCshd2FqLfsEA9PFc4w1p2J65bw= @@ -297,16 +299,24 @@ github.com/mwitkow/go-conntrack v0.0.0-20190716064945-2f068394615f/go.mod h1:qRW github.com/mxk/go-flowrate v0.0.0-20140419014527-cca7078d478f h1:y5//uYreIhSUg3J1GEMiLbxo1LJaP8RfCpH6pymGZus= github.com/mxk/go-flowrate v0.0.0-20140419014527-cca7078d478f/go.mod h1:ZdcZmHo+o7JKHSa8/e818NopupXU1YMK5fe1lsApnBw= github.com/niemeyer/pretty v0.0.0-20200227124842-a10e7caefd8e/go.mod 
h1:zD1mROLANZcx1PVRCS0qkT7pwLkGfwJo4zjcN/Tysno= +github.com/nxadm/tail v1.4.4/go.mod h1:kenIhsEOeOJmVchQTgglprH7qJGnHDVpk1VPCcaMI8A= +github.com/nxadm/tail v1.4.8/go.mod h1:+ncqLTQzXmGhMZNUePPaPqPvBxHAIsmXswZKocGu+AU= github.com/nxadm/tail v1.4.11 h1:8feyoE3OzPrcshW5/MJ4sGESc5cqmGkGCWlco4l0bqY= github.com/nxadm/tail v1.4.11/go.mod h1:OTaG3NK980DZzxbRq6lEuzgU+mug70nY11sMd4JXXHc= github.com/oklog/ulid v1.3.1 h1:EGfNDEx6MqHz8B3uNV6QAib1UR2Lm97sHi3ocA6ESJ4= github.com/oklog/ulid v1.3.1/go.mod h1:CirwcVhetQ6Lv90oh/F+FBtV6XMibvdAFo93nm5qn4U= -github.com/onsi/ginkgo v1.12.0 h1:Iw5WCbBcaAAd0fpRb1c9r5YCylv4XDoCSigm1zLevwU= -github.com/onsi/ginkgo v1.12.0/go.mod h1:oUhWkIvk5aDxtKvDDuw8gItl8pKl42LzjC9KZE0HfGg= +github.com/onsi/ginkgo v1.6.0/go.mod h1:lLunBs/Ym6LB5Z9jYTR76FiuTmxDTDusOGeTQH+WWjE= +github.com/onsi/ginkgo v1.8.0/go.mod h1:lLunBs/Ym6LB5Z9jYTR76FiuTmxDTDusOGeTQH+WWjE= +github.com/onsi/ginkgo v1.12.1/go.mod h1:zj2OWP4+oCPe1qIXoGWkgMRwljMUYCdkwsT2108oapk= +github.com/onsi/ginkgo v1.16.5 h1:8xi0RTUf59SOSfEtZMvwTvXYMzG4gV23XVHOZiXNtnE= +github.com/onsi/ginkgo v1.16.5/go.mod h1:+E8gABHa3K6zRBolWtd+ROzc/U5bkGt0FwiG042wbpU= github.com/onsi/ginkgo/v2 v2.23.4 h1:ktYTpKJAVZnDT4VjxSbiBenUjmlL/5QkBEocaWXiQus= github.com/onsi/ginkgo/v2 v2.23.4/go.mod h1:Bt66ApGPBFzHyR+JO10Zbt0Gsp4uWxu5mIOTusL46e8= -github.com/onsi/gomega v1.10.0 h1:Gwkk+PTu/nfOwNMtUB/mRUv0X7ewW5dO4AERT1ThVKo= -github.com/onsi/gomega v1.10.0/go.mod h1:Ho0h+IUsWyvy1OpqCwxlQ/21gkhVunqlU8fDGcoTdcA= +github.com/onsi/gomega v1.5.0/go.mod h1:ex+gbHU/CVuBBDIJjb2X0qEXbFg53c61hWP/1CpauHY= +github.com/onsi/gomega v1.7.1/go.mod h1:XdKZgCCFLUoM/7CFJVPcG8C1xQ1AJ0vpAezJrB7JYyY= +github.com/onsi/gomega v1.10.1/go.mod h1:iN09h71vgCQne3DLsj+A5owkum+a2tYe+TOCB1ybHNo= +github.com/onsi/gomega v1.37.0 h1:CdEG8g0S133B4OswTDC/5XPSzE1OeP29QOioj2PID2Y= +github.com/onsi/gomega v1.37.0/go.mod h1:8D9+Txp43QWKhM24yyOBEdpkzN8FvJyAwecBgsU4KU0= github.com/opentracing/opentracing-go v1.2.1-0.20220228012449-10b1cf09e00b h1:FfH+VrHHk6Lxt9HdVS0PXzSXFyS2NbZKXv33FYPol0A= github.com/opentracing/opentracing-go v1.2.1-0.20220228012449-10b1cf09e00b/go.mod h1:AC62GU6hc0BrNm+9RK9VSiwa/EUe1bkIeFORAMcHvJU= github.com/patrickmn/go-cache v2.1.0+incompatible h1:HRMgzkcYKYpi3C8ajMPV8OFXaaRUnok+kx1WdO15EQc= @@ -325,6 +335,8 @@ github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 h1:Jamvg5psRI github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4= github.com/power-devops/perfstat v0.0.0-20210106213030-5aafc221ea8c h1:ncq/mPwQF4JjgDlrVEn3C11VoGHZN7m8qihwgMEtzYw= github.com/power-devops/perfstat v0.0.0-20210106213030-5aafc221ea8c/go.mod h1:OmDBASR4679mdNQnz2pUhc2G8CO2JrUAVFDRBDP/hJE= +github.com/prashantv/gostub v1.1.0 h1:BTyx3RfQjRHnUWaGF9oQos79AlQ5k8WNktv7VGvVH4g= +github.com/prashantv/gostub v1.1.0/go.mod h1:A5zLQHz7ieHGG7is6LLXLz7I8+3LZzsrV0P1IAHhP5U= github.com/prometheus/client_golang v1.23.2 h1:Je96obch5RDVy3FDMndoUsjAhG5Edi49h0RJWRi/o0o= github.com/prometheus/client_golang v1.23.2/go.mod h1:Tb1a6LWHB3/SPIzCoaDXI4I8UHKeFTEQ1YCr+0Gyqmg= github.com/prometheus/client_model v0.0.0-20190812154241-14fe0d1b01d4/go.mod h1:xMI15A0UPsDsEKsMN9yxemIoYk6Tm2C1GtYGdfGttqA= @@ -367,6 +379,7 @@ github.com/stretchr/objx v0.5.0/go.mod h1:Yh+to48EsGEfYuaHDzXPcE3xhTkx73EhmCGUpE github.com/stretchr/objx v0.5.2 h1:xuMeJ0Sdp5ZMRXx/aWO6RZxdr3beISkG5/G/aIRr3pY= github.com/stretchr/objx v0.5.2/go.mod h1:FRsXN1f5AsAjCGJKqEizvkpNtU+EGNCLh3NxZ/8L+MA= github.com/stretchr/testify v1.3.0/go.mod 
h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI= +github.com/stretchr/testify v1.5.1/go.mod h1:5W2xD1RspED5o8YsWQXVCued0rvSQ+mT+I5cxcmMvtA= github.com/stretchr/testify v1.6.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg= github.com/stretchr/testify v1.7.0/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg= github.com/stretchr/testify v1.7.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg= @@ -465,6 +478,7 @@ golang.org/x/net v0.0.0-20190311183353-d8887717615a/go.mod h1:t9HGtf8HONx5eT2rtn golang.org/x/net v0.0.0-20190404232315-eb5bcb51f2a3/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg= golang.org/x/net v0.0.0-20190620200207-3b0461eec859/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s= golang.org/x/net v0.0.0-20200226121028-0de0cce0169b/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s= +golang.org/x/net v0.0.0-20200520004742-59133d7f0dd7/go.mod h1:qpuaurCH72eLCgpAm/N6yyVIVM9cpaDIP3A8BGJEC5A= golang.org/x/net v0.0.0-20201021035429-f5854403a974/go.mod h1:sp8m0HH+o8qH0wwXwYZr8TS3Oi6o0r6Gce1SSxlDquU= golang.org/x/net v0.0.0-20201110031124-69a78807bb2b/go.mod h1:sp8m0HH+o8qH0wwXwYZr8TS3Oi6o0r6Gce1SSxlDquU= golang.org/x/net v0.0.0-20210226172049-e18ecbb05110/go.mod h1:m0MpNAwzfU5UDzcl9v0D8zg8gWTRqZa9RBIspLL5mdg= @@ -489,11 +503,15 @@ golang.org/x/sys v0.0.0-20180830151530-49385e6e1522/go.mod h1:STP8DvDyc/dI5b8T5h golang.org/x/sys v0.0.0-20180909124046-d0be0721c37e/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY= golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY= golang.org/x/sys v0.0.0-20190412213103-97732733099d/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20190904154756-749cb33beabd/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= golang.org/x/sys v0.0.0-20190916202348-b4ddaad3f8a3/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20191005200804-aed5e4c7ecf9/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= golang.org/x/sys v0.0.0-20191120155948-bd437916bb0e/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20200323222414-85ca7c5b95cd/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= golang.org/x/sys v0.0.0-20200930185726-fdedc70b468f/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= golang.org/x/sys v0.0.0-20201119102817-f84b799fce68/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= golang.org/x/sys v0.0.0-20201204225414-ed752295db88/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20210112080510-489259a85091/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= golang.org/x/sys v0.0.0-20210330210617-4fbd30eecc44/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= golang.org/x/sys v0.0.0-20210423082822-04245dca01da/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= golang.org/x/sys v0.0.0-20210510120138-977fb7262007/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= @@ -532,6 +550,7 @@ golang.org/x/tools v0.0.0-20190524140312-2c0ae7006135/go.mod h1:RgjU9mgBXZiqYHBn golang.org/x/tools v0.0.0-20191119224855-298f0cb1881e/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo= golang.org/x/tools v0.0.0-20200130002326-2f3ba24bd6e7/go.mod h1:TB2adYChydJhpapKDTa4BR/hXlZSLoq2Wpct/0txZ28= golang.org/x/tools v0.0.0-20200619180055-7c47624df98f/go.mod h1:EkVYQZoAsY45+roYkvgYkIh4xh/qjgUK9TdY2XT94GE= +golang.org/x/tools v0.0.0-20201224043029-2b0845dc783e/go.mod h1:emZCQorbCU4vsT4fOWvOPXz4eW1wZW4PmDk9uLelYpA= golang.org/x/tools 
v0.0.0-20210106214847-113979e3529a/go.mod h1:emZCQorbCU4vsT4fOWvOPXz4eW1wZW4PmDk9uLelYpA= golang.org/x/tools v0.1.1/go.mod h1:o0xws9oXOQQZyjljx8fwUC0k7L1pTE6eaCbjGeHmOkk= golang.org/x/tools v0.1.12/go.mod h1:hNGJHUnrk76NpqgfD5Aqm5Crs+Hm0VOH/i9J2+nxYbc= @@ -541,8 +560,6 @@ golang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7/go.mod h1:I/5z698sn9Ka8T golang.org/x/xerrors v0.0.0-20191011141410-1b5146add898/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= -golang.org/x/xerrors v0.0.0-20220907171357-04be3eba64a2 h1:H2TDz8ibqkAF6YGhCdN3jS9O0/s90v0rJh3X/OLHEUk= -golang.org/x/xerrors v0.0.0-20220907171357-04be3eba64a2/go.mod h1:K8+ghG5WaK9qNqU5K3HdILfMLy1f3aNYFI/wnl100a8= gomodules.xyz/jsonpatch/v2 v2.4.0 h1:Ci3iUJyx9UeRx7CeFN8ARgGbkESwJK+KB9lLcWxY/Zw= gomodules.xyz/jsonpatch/v2 v2.4.0/go.mod h1:AH3dM2RI6uoBZxn3LVrfvJ3E0/9dG4cSrbuBJT4moAY= gonum.org/v1/gonum v0.16.0 h1:5+ul4Swaf3ESvrOnidPp4GZbzf0mxVQpDCYUQE7OJfk= @@ -582,7 +599,6 @@ gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntN gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c/go.mod h1:JHkPIbrfpd72SG/EVd6muEfDQjcINNoR0C8j2r3qZ4Q= gopkg.in/evanphx/json-patch.v4 v4.12.0 h1:n6jtcsulIzXPJaxegRbvFNNrZDjbij7ny3gmSPG+6V4= gopkg.in/evanphx/json-patch.v4 v4.12.0/go.mod h1:p8EYWUEYMpynmqDbY58zCKCFZw8pRWMG4EsWvDvM72M= -gopkg.in/fsnotify.v1 v1.4.7 h1:xOHLXZwVvI9hhs+cLKq5+I5onOuwQLhQwiu63xxlHs4= gopkg.in/fsnotify.v1 v1.4.7/go.mod h1:Tz8NjZHkW78fSQdbUxIjBTcgA1z1m8ZHf0WmKUhAMys= gopkg.in/inf.v0 v0.9.1 h1:73M5CoZyi3ZLMOyDlQh031Cx6N9NDJ2Vvfl76EDAgDc= gopkg.in/inf.v0 v0.9.1/go.mod h1:cWUDdTG/fYaXco+Dcufb5Vnc6Gp2YChqWtbxRZE0mXw= @@ -590,8 +606,10 @@ gopkg.in/natefinch/lumberjack.v2 v2.2.1 h1:bBRl1b0OH9s/DuPhuXpNl+VtCaJXFZ5/uEFST gopkg.in/natefinch/lumberjack.v2 v2.2.1/go.mod h1:YD8tP3GAjkrDg1eZH7EGmyESg/lsYskCTPBJVb9jqSc= gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7 h1:uRGJdciOHaEIrze2W8Q3AKkepLTh2hOroT7a+7czfdQ= gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7/go.mod h1:dt/ZhP58zS4L8KSrWDmTeBkI65Dw0HsyUHuEVlX15mw= +gopkg.in/yaml.v2 v2.2.1/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI= gopkg.in/yaml.v2 v2.2.2/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI= gopkg.in/yaml.v2 v2.2.4/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI= +gopkg.in/yaml.v2 v2.3.0/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI= gopkg.in/yaml.v2 v2.4.0 h1:D8xgwECY7CYvx+Y2n4sBz93Jn9JRvxdiyyo8CTfuKaY= gopkg.in/yaml.v2 v2.4.0/go.mod h1:RDklbk79AGWmwhnvt/jBztapEOGDOx6ZbXqjP6csGnQ= gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= diff --git a/hack/aks/Makefile b/hack/aks/Makefile index 5e1c8f3f9b..3b31345ec5 100644 --- a/hack/aks/Makefile +++ b/hack/aks/Makefile @@ -29,6 +29,7 @@ PUBLIC_IPv6 ?= $(PUBLIC_IP_ID)/$(IP_PREFIX)-$(CLUSTER)-v6 KUBE_PROXY_JSON_PATH ?= ./kube-proxy.json LTS ?= false + # overrideable variables SUB ?= $(AZURE_SUBSCRIPTION) CLUSTER ?= $(USER)-$(REGION) @@ -280,22 +281,22 @@ swiftv2-dummy-cluster-up: rg-up ipv4 swift-net-up ## Bring up a SWIFT AzCNI clus --network-plugin azure \ --vnet-subnet-id /subscriptions/$(SUB)/resourceGroups/$(GROUP)/providers/Microsoft.Network/virtualNetworks/$(VNET)/subnets/nodenet \ --pod-subnet-id 
/subscriptions/$(SUB)/resourceGroups/$(GROUP)/providers/Microsoft.Network/virtualNetworks/$(VNET)/subnets/podnet \ + --tags stampcreatorserviceinfo=true \ --load-balancer-outbound-ips $(PUBLIC_IPv4) \ --no-ssh-key \ --yes @$(MAKE) set-kubeconf swiftv2-podsubnet-cluster-up: ipv4 swift-net-up ## Bring up a SWIFTv2 PodSubnet cluster - $(COMMON_AKS_FIELDS) + $(COMMON_AKS_FIELDS) \ --network-plugin azure \ - --nodepool-name nodepool1 \ - --load-balancer-outbound-ips $(PUBLIC_IPv4) \ + --node-vm-size $(VM_SIZE) \ --vnet-subnet-id /subscriptions/$(SUB)/resourceGroups/$(GROUP)/providers/Microsoft.Network/virtualNetworks/$(VNET)/subnets/nodenet \ --pod-subnet-id /subscriptions/$(SUB)/resourceGroups/$(GROUP)/providers/Microsoft.Network/virtualNetworks/$(VNET)/subnets/podnet \ - --service-cidr "10.0.0.0/16" \ - --dns-service-ip "10.0.0.10" \ - --tags fastpathenabled=true RGOwner=LongRunningTestPipelines stampcreatorserviceinfo=true \ + --nodepool-tags fastpathenabled=true aks-nic-enable-multi-tenancy=true \ + --tags stampcreatorserviceinfo=true \ --aks-custom-headers AKSHTTPCustomFeatures=Microsoft.ContainerService/NetworkingMultiTenancyPreview \ + --load-balancer-outbound-ips $(PUBLIC_IPv4) \ --yes @$(MAKE) set-kubeconf @@ -446,7 +447,7 @@ linux-swiftv2-nodepool-up: ## Add linux node pool to swiftv2 cluster --os-type Linux \ --max-pods 250 \ --subscription $(SUB) \ - --tags fastpathenabled=true,aks-nic-enable-multi-tenancy=true \ + --tags fastpathenabled=true aks-nic-enable-multi-tenancy=true stampcreatorserviceinfo=true\ --aks-custom-headers AKSHTTPCustomFeatures=Microsoft.ContainerService/NetworkingMultiTenancyPreview \ --pod-subnet-id /subscriptions/$(SUB)/resourceGroups/$(GROUP)/providers/Microsoft.Network/virtualNetworks/$(VNET)/subnets/podnet diff --git a/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml b/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml new file mode 100644 index 0000000000..ffb5293b18 --- /dev/null +++ b/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml @@ -0,0 +1,73 @@ +apiVersion: v1 +kind: Pod +metadata: + name: {{ .PodName }} + namespace: {{ .Namespace }} + labels: + kubernetes.azure.com/pod-network-instance: {{ .PNIName }} + kubernetes.azure.com/pod-network: {{ .PNName }} +spec: + nodeSelector: + kubernetes.io/hostname: {{ .NodeName }} + containers: + - name: net-debugger + image: {{ .Image }} + command: ["/bin/bash", "-c"] + args: + - | + echo "Pod Network Diagnostics started on $(hostname)"; + echo "Pod IP: $(hostname -i)"; + echo "Starting HTTP server on port 8080"; + + # Create a simple HTTP server directory + mkdir -p /tmp/www + cat > /tmp/www/index.html <<'EOF' + + + Network Test Pod + +

+            <h1>Pod Network Test</h1>
+            <p>Hostname: $(hostname)</p>
+            <p>IP Address: $(hostname -i)</p>
+            <p>Timestamp: $(date)</p>
+ + + EOF + + # Start Python HTTP server on port 8080 in background + cd /tmp/www && python3 -m http.server 8080 & + HTTP_PID=$! + echo "HTTP server started with PID $HTTP_PID on port 8080" + + # Give server a moment to start + sleep 2 + + # Verify server is running + if netstat -tuln | grep -q ':8080'; then + echo "HTTP server is listening on port 8080" + else + echo "WARNING: HTTP server may not be listening on port 8080" + fi + + # Keep showing network info periodically + while true; do + echo "=== Network Status at $(date) ===" + ip addr show + ip route show + echo "=== Listening ports ===" + netstat -tuln | grep LISTEN || ss -tuln | grep LISTEN + sleep 300 # Every 5 minutes + done + ports: + - containerPort: 8080 + protocol: TCP + resources: + limits: + cpu: 300m + memory: 600Mi + requests: + cpu: 300m + memory: 600Mi + securityContext: + privileged: true + restartPolicy: Always diff --git a/test/integration/manifests/swiftv2/long-running-cluster/podnetwork.yaml b/test/integration/manifests/swiftv2/long-running-cluster/podnetwork.yaml new file mode 100644 index 0000000000..25a7491d90 --- /dev/null +++ b/test/integration/manifests/swiftv2/long-running-cluster/podnetwork.yaml @@ -0,0 +1,15 @@ +apiVersion: multitenancy.acn.azure.com/v1alpha1 +kind: PodNetwork +metadata: + name: {{ .PNName }} +{{- if .SubnetToken }} + labels: + kubernetes.azure.com/override-subnet-token: "{{ .SubnetToken }}" +{{- end }} +spec: + networkID: "{{ .VnetGUID }}" +{{- if not .SubnetToken }} + subnetGUID: "{{ .SubnetGUID }}" +{{- end }} + subnetResourceID: "{{ .SubnetARMID }}" + deviceType: acn.azure.com/vnet-nic diff --git a/test/integration/manifests/swiftv2/long-running-cluster/podnetworkinstance.yaml b/test/integration/manifests/swiftv2/long-running-cluster/podnetworkinstance.yaml new file mode 100644 index 0000000000..4d1f8ca384 --- /dev/null +++ b/test/integration/manifests/swiftv2/long-running-cluster/podnetworkinstance.yaml @@ -0,0 +1,13 @@ +apiVersion: multitenancy.acn.azure.com/v1alpha1 +kind: PodNetworkInstance +metadata: + name: {{ .PNIName }} + namespace: {{ .Namespace }} +spec: + podNetworkConfigs: + - podNetwork: {{ .PNName }} + {{- if eq .Type "explicit" }} + podIPReservationSize: {{ .Reservations }} + {{- else }} + podIPReservationSize: 1 + {{- end }} diff --git a/test/integration/swiftv2/helpers/az_helpers.go b/test/integration/swiftv2/helpers/az_helpers.go new file mode 100644 index 0000000000..c6e5d4b090 --- /dev/null +++ b/test/integration/swiftv2/helpers/az_helpers.go @@ -0,0 +1,343 @@ +package helpers + +import ( + "context" + "fmt" + "os/exec" + "strings" + "time" +) + +func runAzCommand(cmd string, args ...string) (string, error) { + out, err := exec.Command(cmd, args...).CombinedOutput() + if err != nil { + return "", fmt.Errorf("failed to run %s %v: %w\nOutput: %s", cmd, args, err, string(out)) + } + return strings.TrimSpace(string(out)), nil +} + +func GetVnetGUID(rg, vnet string) (string, error) { + return runAzCommand("az", "network", "vnet", "show", "--resource-group", rg, "--name", vnet, "--query", "resourceGuid", "-o", "tsv") +} + +func GetSubnetARMID(rg, vnet, subnet string) (string, error) { + return runAzCommand("az", "network", "vnet", "subnet", "show", "--resource-group", rg, "--vnet-name", vnet, "--name", subnet, "--query", "id", "-o", "tsv") +} + +func GetSubnetGUID(rg, vnet, subnet string) (string, error) { + subnetID, err := GetSubnetARMID(rg, vnet, subnet) + if err != nil { + return "", err + } + return runAzCommand("az", "resource", "show", "--ids", subnetID, 
"--api-version", "2023-09-01", "--query", "properties.serviceAssociationLinks[0].properties.subnetId", "-o", "tsv") +} + +func GetSubnetToken(rg, vnet, subnet string) (string, error) { + // Optionally implement if you use subnet token override + return "", nil +} + +// GetClusterNodes returns a slice of node names from a cluster using the given kubeconfig +func GetClusterNodes(kubeconfig string) ([]string, error) { + cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "get", "nodes", "-o", "name") + out, err := cmd.CombinedOutput() + if err != nil { + return nil, fmt.Errorf("failed to get nodes using kubeconfig %s: %w\nOutput: %s", kubeconfig, err, string(out)) + } + + lines := strings.Split(strings.TrimSpace(string(out)), "\n") + nodes := make([]string, 0, len(lines)) + + for _, line := range lines { + // kubectl returns "node/", we strip the prefix + if strings.HasPrefix(line, "node/") { + nodes = append(nodes, strings.TrimPrefix(line, "node/")) + } + } + return nodes, nil +} + +// EnsureNamespaceExists checks if a namespace exists and creates it if it doesn't +func EnsureNamespaceExists(kubeconfig, namespace string) error { + cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "get", "namespace", namespace) + err := cmd.Run() + + if err == nil { + return nil // Namespace exists + } + + // Namespace doesn't exist, create it + cmd = exec.Command("kubectl", "--kubeconfig", kubeconfig, "create", "namespace", namespace) + out, err := cmd.CombinedOutput() + if err != nil { + return fmt.Errorf("failed to create namespace %s: %s\n%s", namespace, err, string(out)) + } + + return nil +} + +// DeletePod deletes a pod in the specified namespace and waits for it to be fully removed +func DeletePod(kubeconfig, namespace, podName string) error { + fmt.Printf("Deleting pod %s in namespace %s...\n", podName, namespace) + + // Initiate pod deletion with context timeout + ctx, cancel := context.WithTimeout(context.Background(), 90*time.Second) + defer cancel() + + cmd := exec.CommandContext(ctx, "kubectl", "--kubeconfig", kubeconfig, "delete", "pod", podName, "-n", namespace, "--ignore-not-found=true") + out, err := cmd.CombinedOutput() + if err != nil { + if ctx.Err() == context.DeadlineExceeded { + fmt.Printf("kubectl delete pod command timed out after 90s, attempting force delete...\n") + } else { + return fmt.Errorf("failed to delete pod %s in namespace %s: %s\n%s", podName, namespace, err, string(out)) + } + } + + // Wait for pod to be completely gone (critical for IP release) + fmt.Printf("Waiting for pod %s to be fully removed...\n", podName) + for attempt := 1; attempt <= 30; attempt++ { + ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) + checkCmd := exec.CommandContext(ctx, "kubectl", "--kubeconfig", kubeconfig, "get", "pod", podName, "-n", namespace, "--ignore-not-found=true", "-o", "name") + checkOut, _ := checkCmd.CombinedOutput() + cancel() + + if strings.TrimSpace(string(checkOut)) == "" { + fmt.Printf("Pod %s fully removed after %d seconds\n", podName, attempt*2) + // Extra wait to ensure IP reservation is released in DNC + time.Sleep(5 * time.Second) + return nil + } + + if attempt%5 == 0 { + fmt.Printf("Pod %s still terminating (attempt %d/30)...\n", podName, attempt) + } + time.Sleep(2 * time.Second) + } + + // If pod still exists after 60 seconds, force delete + fmt.Printf("Pod %s still exists after 60s, attempting force delete...\n", podName) + ctx, cancel = context.WithTimeout(context.Background(), 30*time.Second) + defer cancel() + + forceCmd := 
exec.CommandContext(ctx, "kubectl", "--kubeconfig", kubeconfig, "delete", "pod", podName, "-n", namespace, "--grace-period=0", "--force", "--ignore-not-found=true") + forceOut, forceErr := forceCmd.CombinedOutput() + if forceErr != nil { + fmt.Printf("Warning: Force delete failed: %s\n%s\n", forceErr, string(forceOut)) + } + + // Wait a bit more for force delete to complete + time.Sleep(10 * time.Second) + fmt.Printf("Pod %s deletion completed (may have required force)\n", podName) + return nil +} + +// DeletePodNetworkInstance deletes a PodNetworkInstance and waits for it to be removed +func DeletePodNetworkInstance(kubeconfig, namespace, pniName string) error { + fmt.Printf("Deleting PodNetworkInstance %s in namespace %s...\n", pniName, namespace) + + // Initiate PNI deletion + cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "delete", "podnetworkinstance", pniName, "-n", namespace, "--ignore-not-found=true") + out, err := cmd.CombinedOutput() + if err != nil { + return fmt.Errorf("failed to delete PodNetworkInstance %s: %s\n%s", pniName, err, string(out)) + } + + // Wait for PNI to be completely gone (it may take time for DNC to release reservations) + fmt.Printf("Waiting for PodNetworkInstance %s to be fully removed...\n", pniName) + for attempt := 1; attempt <= 60; attempt++ { + checkCmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "get", "podnetworkinstance", pniName, "-n", namespace, "--ignore-not-found=true", "-o", "name") + checkOut, _ := checkCmd.CombinedOutput() + + if strings.TrimSpace(string(checkOut)) == "" { + fmt.Printf("PodNetworkInstance %s fully removed after %d seconds\n", pniName, attempt*2) + return nil + } + + if attempt%10 == 0 { + // Check for ReservationInUse errors + descCmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "describe", "podnetworkinstance", pniName, "-n", namespace) + descOut, _ := descCmd.CombinedOutput() + descStr := string(descOut) + + if strings.Contains(descStr, "ReservationInUse") { + fmt.Printf("PNI %s still has active reservations (attempt %d/60). 
Waiting for DNC to release...\n", pniName, attempt) + } else { + fmt.Printf("PNI %s still terminating (attempt %d/60)...\n", pniName, attempt) + } + } + time.Sleep(2 * time.Second) + } + + // If PNI still exists after 120 seconds, try to remove finalizers + fmt.Printf("PNI %s still exists after 120s, attempting to remove finalizers...\n", pniName) + patchCmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "patch", "podnetworkinstance", pniName, "-n", namespace, "-p", `{"metadata":{"finalizers":[]}}`, "--type=merge") + patchOut, patchErr := patchCmd.CombinedOutput() + if patchErr != nil { + fmt.Printf("Warning: Failed to remove finalizers: %s\n%s\n", patchErr, string(patchOut)) + } else { + fmt.Printf("Finalizers removed, waiting for deletion...\n") + time.Sleep(5 * time.Second) + } + + fmt.Printf("PodNetworkInstance %s deletion completed\n", pniName) + return nil +} + +// DeletePodNetwork deletes a PodNetwork and waits for it to be removed +func DeletePodNetwork(kubeconfig, pnName string) error { + fmt.Printf("Deleting PodNetwork %s...\n", pnName) + + cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "delete", "podnetwork", pnName, "--ignore-not-found=true") + out, err := cmd.CombinedOutput() + if err != nil { + return fmt.Errorf("failed to delete PodNetwork %s: %s\n%s", pnName, err, string(out)) + } + + // Wait for PN to be completely gone + fmt.Printf("Waiting for PodNetwork %s to be fully removed...\n", pnName) + for attempt := 1; attempt <= 30; attempt++ { + checkCmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "get", "podnetwork", pnName, "--ignore-not-found=true", "-o", "name") + checkOut, _ := checkCmd.CombinedOutput() + + if strings.TrimSpace(string(checkOut)) == "" { + fmt.Printf("PodNetwork %s fully removed after %d seconds\n", pnName, attempt*2) + return nil + } + + if attempt%10 == 0 { + fmt.Printf("PodNetwork %s still terminating (attempt %d/30)...\n", pnName, attempt) + } + time.Sleep(2 * time.Second) + } + + // Try to remove finalizers if still stuck + fmt.Printf("PodNetwork %s still exists, attempting to remove finalizers...\n", pnName) + patchCmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "patch", "podnetwork", pnName, "-p", `{"metadata":{"finalizers":[]}}`, "--type=merge") + patchOut, patchErr := patchCmd.CombinedOutput() + if patchErr != nil { + fmt.Printf("Warning: Failed to remove finalizers: %s\n%s\n", patchErr, string(patchOut)) + } + + time.Sleep(5 * time.Second) + fmt.Printf("PodNetwork %s deletion completed\n", pnName) + return nil +} + +// DeleteNamespace deletes a namespace and waits for it to be removed +func DeleteNamespace(kubeconfig, namespace string) error { + fmt.Printf("Deleting namespace %s...\n", namespace) + + cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "delete", "namespace", namespace, "--ignore-not-found=true") + out, err := cmd.CombinedOutput() + if err != nil { + return fmt.Errorf("failed to delete namespace %s: %s\n%s", namespace, err, string(out)) + } + + // Wait for namespace to be completely gone + fmt.Printf("Waiting for namespace %s to be fully removed...\n", namespace) + for attempt := 1; attempt <= 60; attempt++ { + checkCmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "get", "namespace", namespace, "--ignore-not-found=true", "-o", "name") + checkOut, _ := checkCmd.CombinedOutput() + + if strings.TrimSpace(string(checkOut)) == "" { + fmt.Printf("Namespace %s fully removed after %d seconds\n", namespace, attempt*2) + return nil + } + + if attempt%15 == 0 { + fmt.Printf("Namespace 
%s still terminating (attempt %d/60)...\n", namespace, attempt) + } + time.Sleep(2 * time.Second) + } + + // Try to remove finalizers if still stuck + fmt.Printf("Namespace %s still exists, attempting to remove finalizers...\n", namespace) + patchCmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "patch", "namespace", namespace, "-p", `{"metadata":{"finalizers":[]}}`, "--type=merge") + patchOut, patchErr := patchCmd.CombinedOutput() + if patchErr != nil { + fmt.Printf("Warning: Failed to remove finalizers: %s\n%s\n", patchErr, string(patchOut)) + } + + time.Sleep(5 * time.Second) + fmt.Printf("Namespace %s deletion completed\n", namespace) + return nil +} + +// WaitForPodRunning waits for a pod to reach Running state with retries +func WaitForPodRunning(kubeconfig, namespace, podName string, maxRetries, sleepSeconds int) error { + for attempt := 1; attempt <= maxRetries; attempt++ { + cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "get", "pod", podName, "-n", namespace, "-o", "jsonpath={.status.phase}") + out, err := cmd.CombinedOutput() + + if err == nil && strings.TrimSpace(string(out)) == "Running" { + fmt.Printf("Pod %s is now Running\n", podName) + return nil + } + + if attempt < maxRetries { + fmt.Printf("Pod %s not running yet (attempt %d/%d), status: %s. Waiting %d seconds...\n", + podName, attempt, maxRetries, strings.TrimSpace(string(out)), sleepSeconds) + time.Sleep(time.Duration(sleepSeconds) * time.Second) + } + } + + return fmt.Errorf("pod %s did not reach Running state after %d attempts", podName, maxRetries) +} + +// GetPodIP retrieves the IP address of a pod +func GetPodIP(kubeconfig, namespace, podName string) (string, error) { + ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) + defer cancel() + + cmd := exec.CommandContext(ctx, "kubectl", "--kubeconfig", kubeconfig, "get", "pod", podName, + "-n", namespace, "-o", "jsonpath={.status.podIP}") + out, err := cmd.CombinedOutput() + if err != nil { + return "", fmt.Errorf("failed to get pod IP for %s in namespace %s: %w\nOutput: %s", podName, namespace, err, string(out)) + } + + ip := strings.TrimSpace(string(out)) + if ip == "" { + return "", fmt.Errorf("pod %s in namespace %s has no IP address assigned", podName, namespace) + } + + return ip, nil +} + +// GetPodDelegatedIP retrieves the eth1 IP address (delegated subnet IP) of a pod +// This is the IP used for cross-VNet communication and is subject to NSG rules +func GetPodDelegatedIP(kubeconfig, namespace, podName string) (string, error) { + ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) + defer cancel() + + // Get eth1 IP address by running 'ip addr show eth1' in the pod + cmd := exec.CommandContext(ctx, "kubectl", "--kubeconfig", kubeconfig, "exec", podName, + "-n", namespace, "--", "sh", "-c", "ip -4 addr show eth1 | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1") + out, err := cmd.CombinedOutput() + if err != nil { + return "", fmt.Errorf("failed to get eth1 IP for %s in namespace %s: %w\nOutput: %s", podName, namespace, err, string(out)) + } + + ip := strings.TrimSpace(string(out)) + if ip == "" { + return "", fmt.Errorf("pod %s in namespace %s has no eth1 IP address (delegated subnet not configured?)", podName, namespace) + } + + return ip, nil +} + +// ExecInPod executes a command in a pod and returns the output +func ExecInPod(kubeconfig, namespace, podName, command string) (string, error) { + ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second) + defer cancel() + + cmd := 
exec.CommandContext(ctx, "kubectl", "--kubeconfig", kubeconfig, "exec", podName, + "-n", namespace, "--", "sh", "-c", command) + out, err := cmd.CombinedOutput() + if err != nil { + return string(out), fmt.Errorf("failed to exec in pod %s in namespace %s: %w", podName, namespace, err) + } + + return string(out), nil +} diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go new file mode 100644 index 0000000000..4d138dca32 --- /dev/null +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -0,0 +1,690 @@ +package longRunningCluster + +import ( + "bytes" + "fmt" + "os" + "os/exec" + "strings" + "text/template" + "time" + + "github.com/Azure/azure-container-networking/test/integration/swiftv2/helpers" +) + +func applyTemplate(templatePath string, data interface{}, kubeconfig string) error { + tmpl, err := template.ParseFiles(templatePath) + if err != nil { + return err + } + + var buf bytes.Buffer + if err := tmpl.Execute(&buf, data); err != nil { + return err + } + + cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "apply", "-f", "-") + cmd.Stdin = &buf + out, err := cmd.CombinedOutput() + if err != nil { + return fmt.Errorf("kubectl apply failed: %s\n%s", err, string(out)) + } + + fmt.Println(string(out)) + return nil +} + +// ------------------------- +// PodNetwork +// ------------------------- +type PodNetworkData struct { + PNName string + VnetGUID string + SubnetGUID string + SubnetARMID string + SubnetToken string +} + +func CreatePodNetwork(kubeconfig string, data PodNetworkData, templatePath string) error { + return applyTemplate(templatePath, data, kubeconfig) +} + +// ------------------------- +// PodNetworkInstance +// ------------------------- +type PNIData struct { + PNIName string + PNName string + Namespace string + Type string + Reservations int +} + +func CreatePodNetworkInstance(kubeconfig string, data PNIData, templatePath string) error { + return applyTemplate(templatePath, data, kubeconfig) +} + +// ------------------------- +// Pod +// ------------------------- +type PodData struct { + PodName string + NodeName string + OS string + PNName string + PNIName string + Namespace string + Image string +} + +func CreatePod(kubeconfig string, data PodData, templatePath string) error { + return applyTemplate(templatePath, data, kubeconfig) +} + +// ------------------------- +// High-level orchestration +// ------------------------- + +// TestResources holds all the configuration needed for creating test resources +type TestResources struct { + Kubeconfig string + PNName string + PNIName string + VnetGUID string + SubnetGUID string + SubnetARMID string + SubnetToken string + PodNetworkTemplate string + PNITemplate string + PodTemplate string + PodImage string +} + +// PodScenario defines a single pod creation scenario +type PodScenario struct { + Name string // Descriptive name for the scenario + Cluster string // "aks-1" or "aks-2" + VnetName string // e.g., "cx_vnet_a1", "cx_vnet_b1" + SubnetName string // e.g., "s1", "s2" + NodeSelector string // "low-nic" or "high-nic" + PodNameSuffix string // Unique suffix for pod name +} + +// TestScenarios holds all pod scenarios to test +type TestScenarios struct { + ResourceGroup string + BuildID string + PodImage string + Scenarios []PodScenario + VnetSubnetCache map[string]VnetSubnetInfo // Cache for vnet/subnet info + UsedNodes map[string]bool // Tracks which nodes are already used (one pod per node for low-NIC) +} + +// VnetSubnetInfo holds 
network information for a vnet/subnet combination +type VnetSubnetInfo struct { + VnetGUID string + SubnetGUID string + SubnetARMID string + SubnetToken string +} + +// NodePoolInfo holds information about nodes in different pools +type NodePoolInfo struct { + LowNicNodes []string + HighNicNodes []string +} + +// GetNodesByNicCount categorizes nodes by NIC count based on nic-capacity labels +func GetNodesByNicCount(kubeconfig string) (NodePoolInfo, error) { + nodeInfo := NodePoolInfo{ + LowNicNodes: []string{}, + HighNicNodes: []string{}, + } + + // Get workload type from environment variable (defaults to swiftv2-linux) + workloadType := os.Getenv("WORKLOAD_TYPE") + if workloadType == "" { + workloadType = "swiftv2-linux" + } + + fmt.Printf("Filtering nodes by workload-type=%s\n", workloadType) + + // Get nodes with low-nic capacity and matching workload-type + cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "get", "nodes", + "-l", fmt.Sprintf("nic-capacity=low-nic,workload-type=%s", workloadType), "-o", "name") + out, err := cmd.CombinedOutput() + if err != nil { + return NodePoolInfo{}, fmt.Errorf("failed to get low-nic nodes: %w\nOutput: %s", err, string(out)) + } + + lines := strings.Split(strings.TrimSpace(string(out)), "\n") + for _, line := range lines { + if strings.HasPrefix(line, "node/") { + nodeInfo.LowNicNodes = append(nodeInfo.LowNicNodes, strings.TrimPrefix(line, "node/")) + } + } + + // Get nodes with high-nic capacity and matching workload-type + cmd = exec.Command("kubectl", "--kubeconfig", kubeconfig, "get", "nodes", + "-l", fmt.Sprintf("nic-capacity=high-nic,workload-type=%s", workloadType), "-o", "name") + out, err = cmd.CombinedOutput() + if err != nil { + return NodePoolInfo{}, fmt.Errorf("failed to get high-nic nodes: %w\nOutput: %s", err, string(out)) + } + + lines = strings.Split(strings.TrimSpace(string(out)), "\n") + for _, line := range lines { + if line != "" && strings.HasPrefix(line, "node/") { + nodeInfo.HighNicNodes = append(nodeInfo.HighNicNodes, strings.TrimPrefix(line, "node/")) + } + } + + fmt.Printf("Found %d low-nic nodes and %d high-nic nodes with workload-type=%s\n", + len(nodeInfo.LowNicNodes), len(nodeInfo.HighNicNodes), workloadType) + + return nodeInfo, nil +} + +// CreatePodNetworkResource creates a PodNetwork +func CreatePodNetworkResource(resources TestResources) error { + err := CreatePodNetwork(resources.Kubeconfig, PodNetworkData{ + PNName: resources.PNName, + VnetGUID: resources.VnetGUID, + SubnetGUID: resources.SubnetGUID, + SubnetARMID: resources.SubnetARMID, + SubnetToken: resources.SubnetToken, + }, resources.PodNetworkTemplate) + if err != nil { + return fmt.Errorf("failed to create PodNetwork: %w", err) + } + return nil +} + +// CreateNamespaceResource creates a namespace +func CreateNamespaceResource(kubeconfig, namespace string) error { + err := helpers.EnsureNamespaceExists(kubeconfig, namespace) + if err != nil { + return fmt.Errorf("failed to create namespace: %w", err) + } + return nil +} + +// CreatePodNetworkInstanceResource creates a PodNetworkInstance +func CreatePodNetworkInstanceResource(resources TestResources) error { + err := CreatePodNetworkInstance(resources.Kubeconfig, PNIData{ + PNIName: resources.PNIName, + PNName: resources.PNName, + Namespace: resources.PNName, + Type: "explicit", + Reservations: 2, + }, resources.PNITemplate) + if err != nil { + return fmt.Errorf("failed to create PodNetworkInstance: %w", err) + } + return nil +} + +// CreatePodResource creates a single pod on a specified node and 
waits for it to be running +func CreatePodResource(resources TestResources, podName, nodeName string) error { + err := CreatePod(resources.Kubeconfig, PodData{ + PodName: podName, + NodeName: nodeName, + OS: "linux", + PNName: resources.PNName, + PNIName: resources.PNIName, + Namespace: resources.PNName, + Image: resources.PodImage, + }, resources.PodTemplate) + if err != nil { + return fmt.Errorf("failed to create pod %s: %w", podName, err) + } + + // Wait for pod to be running with retries + err = helpers.WaitForPodRunning(resources.Kubeconfig, resources.PNName, podName, 10, 30) + if err != nil { + return fmt.Errorf("pod %s did not reach running state: %w", podName, err) + } + + return nil +} + +// GetOrFetchVnetSubnetInfo retrieves cached network info or fetches it from Azure +func GetOrFetchVnetSubnetInfo(rg, vnetName, subnetName string, cache map[string]VnetSubnetInfo) (VnetSubnetInfo, error) { + key := fmt.Sprintf("%s/%s", vnetName, subnetName) + + if info, exists := cache[key]; exists { + return info, nil + } + + // Fetch from Azure + vnetGUID, err := helpers.GetVnetGUID(rg, vnetName) + if err != nil { + return VnetSubnetInfo{}, fmt.Errorf("failed to get VNet GUID: %w", err) + } + + subnetGUID, err := helpers.GetSubnetGUID(rg, vnetName, subnetName) + if err != nil { + return VnetSubnetInfo{}, fmt.Errorf("failed to get Subnet GUID: %w", err) + } + + subnetARMID, err := helpers.GetSubnetARMID(rg, vnetName, subnetName) + if err != nil { + return VnetSubnetInfo{}, fmt.Errorf("failed to get Subnet ARM ID: %w", err) + } + + subnetToken, err := helpers.GetSubnetToken(rg, vnetName, subnetName) + if err != nil { + return VnetSubnetInfo{}, fmt.Errorf("failed to get Subnet Token: %w", err) + } + + info := VnetSubnetInfo{ + VnetGUID: vnetGUID, + SubnetGUID: subnetGUID, + SubnetARMID: subnetARMID, + SubnetToken: subnetToken, + } + + cache[key] = info + return info, nil +} + +// CreateScenarioResources creates all resources for a specific pod scenario +func CreateScenarioResources(scenario PodScenario, testScenarios TestScenarios) error { + // Get kubeconfig for the cluster + kubeconfig := fmt.Sprintf("/tmp/%s.kubeconfig", scenario.Cluster) + + // Get network info + netInfo, err := GetOrFetchVnetSubnetInfo(testScenarios.ResourceGroup, scenario.VnetName, scenario.SubnetName, testScenarios.VnetSubnetCache) + if err != nil { + return fmt.Errorf("failed to get network info for %s/%s: %w", scenario.VnetName, scenario.SubnetName, err) + } + + // Create unique names for this scenario (simplify vnet name and make K8s compatible) + // Remove "cx_vnet_" prefix and replace underscores with hyphens + vnetShort := strings.TrimPrefix(scenario.VnetName, "cx_vnet_") + vnetShort = strings.ReplaceAll(vnetShort, "_", "-") + subnetNameSafe := strings.ReplaceAll(scenario.SubnetName, "_", "-") + pnName := fmt.Sprintf("pn-%s-%s-%s", testScenarios.BuildID, vnetShort, subnetNameSafe) + pniName := fmt.Sprintf("pni-%s-%s-%s", testScenarios.BuildID, vnetShort, subnetNameSafe) + + resources := TestResources{ + Kubeconfig: kubeconfig, + PNName: pnName, + PNIName: pniName, + VnetGUID: netInfo.VnetGUID, + SubnetGUID: netInfo.SubnetGUID, + SubnetARMID: netInfo.SubnetARMID, + SubnetToken: netInfo.SubnetToken, + PodNetworkTemplate: "../../manifests/swiftv2/long-running-cluster/podnetwork.yaml", + PNITemplate: "../../manifests/swiftv2/long-running-cluster/podnetworkinstance.yaml", + PodTemplate: "../../manifests/swiftv2/long-running-cluster/pod.yaml", + PodImage: testScenarios.PodImage, + } + + // Step 1: Create PodNetwork + err = 
CreatePodNetworkResource(resources) + if err != nil { + return fmt.Errorf("scenario %s: %w", scenario.Name, err) + } + + // Step 2: Create namespace + err = CreateNamespaceResource(resources.Kubeconfig, resources.PNName) + if err != nil { + return fmt.Errorf("scenario %s: %w", scenario.Name, err) + } + + // Step 3: Create PodNetworkInstance + err = CreatePodNetworkInstanceResource(resources) + if err != nil { + return fmt.Errorf("scenario %s: %w", scenario.Name, err) + } + + // Step 4: Get nodes by NIC count + nodeInfo, err := GetNodesByNicCount(kubeconfig) + if err != nil { + return fmt.Errorf("scenario %s: failed to get nodes: %w", scenario.Name, err) + } + + // Step 5: Select appropriate node based on scenario + var targetNode string + + // Initialize used nodes tracker if not exists + if testScenarios.UsedNodes == nil { + testScenarios.UsedNodes = make(map[string]bool) + } + + if scenario.NodeSelector == "low-nic" { + if len(nodeInfo.LowNicNodes) == 0 { + return fmt.Errorf("scenario %s: no low-NIC nodes available", scenario.Name) + } + // Find first unused node in the pool (low-NIC nodes can only handle one pod) + targetNode = "" + for _, node := range nodeInfo.LowNicNodes { + if !testScenarios.UsedNodes[node] { + targetNode = node + testScenarios.UsedNodes[node] = true + break + } + } + if targetNode == "" { + return fmt.Errorf("scenario %s: all low-NIC nodes already in use", scenario.Name) + } + } else { // "high-nic" + if len(nodeInfo.HighNicNodes) == 0 { + return fmt.Errorf("scenario %s: no high-NIC nodes available", scenario.Name) + } + // Find first unused node in the pool + targetNode = "" + for _, node := range nodeInfo.HighNicNodes { + if !testScenarios.UsedNodes[node] { + targetNode = node + testScenarios.UsedNodes[node] = true + break + } + } + if targetNode == "" { + return fmt.Errorf("scenario %s: all high-NIC nodes already in use", scenario.Name) + } + } + + // Step 6: Create pod + podName := fmt.Sprintf("pod-%s", scenario.PodNameSuffix) + err = CreatePodResource(resources, podName, targetNode) + if err != nil { + return fmt.Errorf("scenario %s: %w", scenario.Name, err) + } + + fmt.Printf("Successfully created scenario: %s (pod: %s on node: %s)\n", scenario.Name, podName, targetNode) + return nil +} + +// DeleteScenarioResources deletes all resources for a specific pod scenario +func DeleteScenarioResources(scenario PodScenario, buildID string) error { + kubeconfig := fmt.Sprintf("/tmp/%s.kubeconfig", scenario.Cluster) + + // Create same names as creation (simplify vnet name and make K8s compatible) + // Remove "cx_vnet_" prefix and replace underscores with hyphens + vnetShort := strings.TrimPrefix(scenario.VnetName, "cx_vnet_") + vnetShort = strings.ReplaceAll(vnetShort, "_", "-") + subnetNameSafe := strings.ReplaceAll(scenario.SubnetName, "_", "-") + pnName := fmt.Sprintf("pn-%s-%s-%s", buildID, vnetShort, subnetNameSafe) + pniName := fmt.Sprintf("pni-%s-%s-%s", buildID, vnetShort, subnetNameSafe) + podName := fmt.Sprintf("pod-%s", scenario.PodNameSuffix) + + // Delete pod + err := helpers.DeletePod(kubeconfig, pnName, podName) + if err != nil { + return fmt.Errorf("scenario %s: failed to delete pod: %w", scenario.Name, err) + } + + // Delete PodNetworkInstance + err = helpers.DeletePodNetworkInstance(kubeconfig, pnName, pniName) + if err != nil { + return fmt.Errorf("scenario %s: failed to delete PNI: %w", scenario.Name, err) + } + + // Delete PodNetwork + err = helpers.DeletePodNetwork(kubeconfig, pnName) + if err != nil { + return fmt.Errorf("scenario %s: failed to 
delete PN: %w", scenario.Name, err) + } + + // Delete namespace + err = helpers.DeleteNamespace(kubeconfig, pnName) + if err != nil { + return fmt.Errorf("scenario %s: failed to delete namespace: %w", scenario.Name, err) + } + + fmt.Printf("Successfully deleted scenario: %s\n", scenario.Name) + return nil +} + +// CreateAllScenarios creates resources for all test scenarios +func CreateAllScenarios(testScenarios TestScenarios) error { + for _, scenario := range testScenarios.Scenarios { + fmt.Printf("\n=== Creating scenario: %s ===\n", scenario.Name) + err := CreateScenarioResources(scenario, testScenarios) + if err != nil { + return err + } + } + return nil +} + +// DeleteAllScenarios deletes resources for all test scenarios +// Strategy: Delete all pods first, then delete shared PNI/PN/Namespace resources +func DeleteAllScenarios(testScenarios TestScenarios) error { + // Phase 1: Delete all pods first + fmt.Printf("\n=== Phase 1: Deleting all pods ===\n") + for _, scenario := range testScenarios.Scenarios { + kubeconfig := fmt.Sprintf("/tmp/%s.kubeconfig", scenario.Cluster) + vnetShort := strings.TrimPrefix(scenario.VnetName, "cx_vnet_") + vnetShort = strings.ReplaceAll(vnetShort, "_", "-") + subnetNameSafe := strings.ReplaceAll(scenario.SubnetName, "_", "-") + pnName := fmt.Sprintf("pn-%s-%s-%s", testScenarios.BuildID, vnetShort, subnetNameSafe) + podName := fmt.Sprintf("pod-%s", scenario.PodNameSuffix) + + fmt.Printf("Deleting pod for scenario: %s\n", scenario.Name) + err := helpers.DeletePod(kubeconfig, pnName, podName) + if err != nil { + fmt.Printf("Warning: Failed to delete pod for scenario %s: %v\n", scenario.Name, err) + } + } + + // Phase 2: Delete shared PNI/PN/Namespace resources (grouped by vnet/subnet/cluster) + fmt.Printf("\n=== Phase 2: Deleting shared PNI/PN/Namespace resources ===\n") + resourceGroups := make(map[string]bool) + + for _, scenario := range testScenarios.Scenarios { + kubeconfig := fmt.Sprintf("/tmp/%s.kubeconfig", scenario.Cluster) + vnetShort := strings.TrimPrefix(scenario.VnetName, "cx_vnet_") + vnetShort = strings.ReplaceAll(vnetShort, "_", "-") + subnetNameSafe := strings.ReplaceAll(scenario.SubnetName, "_", "-") + pnName := fmt.Sprintf("pn-%s-%s-%s", testScenarios.BuildID, vnetShort, subnetNameSafe) + pniName := fmt.Sprintf("pni-%s-%s-%s", testScenarios.BuildID, vnetShort, subnetNameSafe) + + // Create unique key for this vnet/subnet/cluster combination + resourceKey := fmt.Sprintf("%s:%s", scenario.Cluster, pnName) + + // Skip if we already deleted resources for this combination + if resourceGroups[resourceKey] { + continue + } + resourceGroups[resourceKey] = true + + fmt.Printf("\nDeleting shared resources for %s/%s on %s\n", scenario.VnetName, scenario.SubnetName, scenario.Cluster) + + // Delete PodNetworkInstance + err := helpers.DeletePodNetworkInstance(kubeconfig, pnName, pniName) + if err != nil { + fmt.Printf("Warning: Failed to delete PNI %s: %v\n", pniName, err) + } + + // Delete PodNetwork + err = helpers.DeletePodNetwork(kubeconfig, pnName) + if err != nil { + fmt.Printf("Warning: Failed to delete PN %s: %v\n", pnName, err) + } + + // Delete namespace + err = helpers.DeleteNamespace(kubeconfig, pnName) + if err != nil { + fmt.Printf("Warning: Failed to delete namespace %s: %v\n", pnName, err) + } + } + + fmt.Printf("\n=== All scenarios deleted ===\n") + return nil +} + +// DeleteTestResources deletes all test resources in reverse order +func DeleteTestResources(kubeconfig, pnName, pniName string) error { + // Delete pods (first two nodes 
only, matching creation) + for i := 0; i < 2; i++ { + podName := fmt.Sprintf("pod-c2-%d", i) + err := helpers.DeletePod(kubeconfig, pnName, podName) + if err != nil { + return fmt.Errorf("failed to delete pod %s: %w", podName, err) + } + } + + // Delete PodNetworkInstance + err := helpers.DeletePodNetworkInstance(kubeconfig, pnName, pniName) + if err != nil { + return fmt.Errorf("failed to delete PodNetworkInstance: %w", err) + } + + // Delete PodNetwork + err = helpers.DeletePodNetwork(kubeconfig, pnName) + if err != nil { + return fmt.Errorf("failed to delete PodNetwork: %w", err) + } + + // Delete namespace + err = helpers.DeleteNamespace(kubeconfig, pnName) + if err != nil { + return fmt.Errorf("failed to delete namespace: %w", err) + } + + return nil +} + +// ConnectivityTest defines a connectivity test between two pods +type ConnectivityTest struct { + Name string + SourcePod string + SourceNamespace string // Namespace of the source pod + DestinationPod string + DestNamespace string // Namespace of the destination pod + Cluster string // Cluster where source pod is running (for backward compatibility) + DestCluster string // Cluster where destination pod is running (if different from source) + Description string + ShouldFail bool // If true, connectivity is expected to fail (NSG block, customer isolation) + + // Fields for private endpoint tests + SourceCluster string // Cluster where source pod is running + SourcePodName string // Name of the source pod + SourceNS string // Namespace of the source pod + DestEndpoint string // Destination endpoint (IP or hostname) + TestType string // Type of test: "pod-to-pod" or "storage-access" + Purpose string // Description of the test purpose +} + +// RunConnectivityTest tests HTTP connectivity between two pods +func RunConnectivityTest(test ConnectivityTest, rg, buildId string) error { + // Get kubeconfig for the source cluster + sourceKubeconfig := fmt.Sprintf("/tmp/%s.kubeconfig", test.Cluster) + + // Get kubeconfig for the destination cluster (default to source cluster if not specified) + destKubeconfig := sourceKubeconfig + if test.DestCluster != "" { + destKubeconfig = fmt.Sprintf("/tmp/%s.kubeconfig", test.DestCluster) + } + + // Get destination pod's eth1 IP (delegated subnet IP for cross-VNet connectivity) + // This is the IP that is subject to NSG rules, not the overlay eth0 IP + destIP, err := helpers.GetPodDelegatedIP(destKubeconfig, test.DestNamespace, test.DestinationPod) + if err != nil { + return fmt.Errorf("failed to get destination pod delegated IP: %w", err) + } + + fmt.Printf("Testing connectivity from %s/%s (cluster: %s) to %s/%s (cluster: %s, eth1: %s) on port 8080\n", + test.SourceNamespace, test.SourcePod, test.Cluster, + test.DestNamespace, test.DestinationPod, test.DestCluster, destIP) + + // Run curl command from source pod to destination pod using eth1 IP + // Using -m 10 for 10 second timeout, -f to fail on HTTP errors + // Using --interface eth1 to force traffic through delegated subnet interface + curlCmd := fmt.Sprintf("curl --interface eth1 -f -m 10 http://%s:8080/", destIP) + + output, err := helpers.ExecInPod(sourceKubeconfig, test.SourceNamespace, test.SourcePod, curlCmd) + if err != nil { + return fmt.Errorf("connectivity test failed: %w\nOutput: %s", err, output) + } + + fmt.Printf("Connectivity successful! 
Response preview: %s\n", truncateString(output, 100)) + return nil +} + +// Helper function to truncate long strings +func truncateString(s string, maxLen int) string { + if len(s) <= maxLen { + return s + } + return s[:maxLen] + "..." +} + +// GenerateStorageSASToken generates a SAS token for a blob in a storage account +func GenerateStorageSASToken(storageAccountName, containerName, blobName string) (string, error) { + // Calculate expiry time: 7 days from now (Azure CLI limit) + expiryTime := time.Now().UTC().Add(7 * 24 * time.Hour).Format("2006-01-02") + + cmd := exec.Command("az", "storage", "blob", "generate-sas", + "--account-name", storageAccountName, + "--container-name", containerName, + "--name", blobName, + "--permissions", "r", + "--expiry", expiryTime, + "--auth-mode", "login", + "--as-user", + "--output", "tsv") + + out, err := cmd.CombinedOutput() + if err != nil { + return "", fmt.Errorf("failed to generate SAS token: %s\n%s", err, string(out)) + } + + sasToken := strings.TrimSpace(string(out)) + if sasToken == "" { + return "", fmt.Errorf("generated SAS token is empty") + } + + return sasToken, nil +} + +// GetStoragePrivateEndpoint retrieves the private IP address of a storage account's private endpoint +func GetStoragePrivateEndpoint(resourceGroup, storageAccountName string) (string, error) { + // Return the storage account blob endpoint FQDN + // This will resolve to the private IP via Private DNS Zone + return fmt.Sprintf("%s.blob.core.windows.net", storageAccountName), nil +} + +// RunPrivateEndpointTest tests connectivity from a pod to a private endpoint (storage account) +func RunPrivateEndpointTest(testScenarios TestScenarios, test ConnectivityTest) error { + // Get kubeconfig for the cluster + kubeconfig := fmt.Sprintf("/tmp/%s.kubeconfig", test.SourceCluster) + + fmt.Printf("Testing private endpoint access from %s to %s\n", + test.SourcePodName, test.DestEndpoint) + + // Step 1: Verify DNS resolution + fmt.Printf("==> Checking DNS resolution for %s\n", test.DestEndpoint) + resolveCmd := fmt.Sprintf("nslookup %s | tail -2", test.DestEndpoint) + resolveOutput, resolveErr := helpers.ExecInPod(kubeconfig, test.SourceNS, test.SourcePodName, resolveCmd) + if resolveErr != nil { + return fmt.Errorf("DNS resolution failed: %w\nOutput: %s", resolveErr, resolveOutput) + } + fmt.Printf("DNS Resolution Result:\n%s\n", resolveOutput) + + // Step 2: Generate SAS token for test blob + fmt.Printf("==> Generating SAS token for test blob\n") + // Extract storage account name from FQDN (e.g., sa106936191.blob.core.windows.net -> sa106936191) + storageAccountName := strings.Split(test.DestEndpoint, ".")[0] + sasToken, err := GenerateStorageSASToken(storageAccountName, "test", "hello.txt") + if err != nil { + return fmt.Errorf("failed to generate SAS token: %w", err) + } + + // Step 3: Download test blob using SAS token (proves both connectivity AND data plane access) + fmt.Printf("==> Downloading test blob via private endpoint\n") + blobURL := fmt.Sprintf("https://%s/test/hello.txt?%s", test.DestEndpoint, sasToken) + curlCmd := fmt.Sprintf("curl -f -s --connect-timeout 5 --max-time 10 '%s'", blobURL) + + output, err := helpers.ExecInPod(kubeconfig, test.SourceNS, test.SourcePodName, curlCmd) + if err != nil { + return fmt.Errorf("private endpoint connectivity test failed: %w\nOutput: %s", err, output) + } + + fmt.Printf("Private endpoint access successful! 
Blob content: %s\n", truncateString(output, 100)) + return nil +} diff --git a/test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go b/test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go new file mode 100644 index 0000000000..2852992581 --- /dev/null +++ b/test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go @@ -0,0 +1,165 @@ +//go:build connectivity_test +// +build connectivity_test + +package longRunningCluster + +import ( + "fmt" + "os" + "strings" + "testing" + + "github.com/onsi/ginkgo/v2" + "github.com/onsi/gomega" +) + +func TestDatapathConnectivity(t *testing.T) { + gomega.RegisterFailHandler(ginkgo.Fail) + suiteConfig, reporterConfig := ginkgo.GinkgoConfiguration() + suiteConfig.Timeout = 0 + ginkgo.RunSpecs(t, "Datapath Connectivity Suite", suiteConfig, reporterConfig) +} + +var _ = ginkgo.Describe("Datapath Connectivity Tests", func() { + rg := os.Getenv("RG") + buildId := os.Getenv("BUILD_ID") + + if rg == "" || buildId == "" { + ginkgo.Fail(fmt.Sprintf("Missing required environment variables: RG='%s', BUILD_ID='%s'", rg, buildId)) + } + + ginkgo.It("tests HTTP connectivity between pods", ginkgo.NodeTimeout(0), func() { + // Helper function to generate namespace from vnet and subnet + // Format: pn--- + // Example: pn-sv2-long-run-centraluseuap-a1-s1 + getNamespace := func(vnetName, subnetName string) string { + // Extract vnet prefix (a1, a2, a3, b1, etc.) from cx_vnet_a1 -> a1 + vnetPrefix := strings.TrimPrefix(vnetName, "cx_vnet_") + return fmt.Sprintf("pn-%s-%s-%s", rg, vnetPrefix, subnetName) + } + + // Define connectivity test cases + // Format: {SourcePod, DestinationPod, Cluster, Description, ShouldFail} + connectivityTests := []ConnectivityTest{ + { + Name: "SameVNetSameSubnet", + SourcePod: "pod-c1-aks1-a1s2-low", + SourceNamespace: getNamespace("cx_vnet_a1", "s2"), + DestinationPod: "pod-c1-aks1-a1s2-high", + DestNamespace: getNamespace("cx_vnet_a1", "s2"), + Cluster: "aks-1", + Description: "Test connectivity between low-NIC and high-NIC pods in same VNet/Subnet (cx_vnet_a1/s2)", + ShouldFail: false, + }, + { + Name: "NSGBlocked_S1toS2", + SourcePod: "pod-c1-aks1-a1s1-low", + SourceNamespace: getNamespace("cx_vnet_a1", "s1"), + DestinationPod: "pod-c1-aks1-a1s2-high", + DestNamespace: getNamespace("cx_vnet_a1", "s2"), + Cluster: "aks-1", + Description: "Test NSG isolation: s1 -> s2 in cx_vnet_a1 (should be blocked by NSG rule)", + ShouldFail: true, + }, + { + Name: "NSGBlocked_S2toS1", + SourcePod: "pod-c1-aks1-a1s2-low", + SourceNamespace: getNamespace("cx_vnet_a1", "s2"), + DestinationPod: "pod-c1-aks1-a1s1-low", + DestNamespace: getNamespace("cx_vnet_a1", "s1"), + Cluster: "aks-1", + Description: "Test NSG isolation: s2 -> s1 in cx_vnet_a1 (should be blocked by NSG rule)", + ShouldFail: true, + }, + { + Name: "DifferentClusters_SameVNet", + SourcePod: "pod-c1-aks1-a2s1-high", + SourceNamespace: getNamespace("cx_vnet_a2", "s1"), + DestinationPod: "pod-c1-aks2-a2s1-low", + DestNamespace: getNamespace("cx_vnet_a2", "s1"), + Cluster: "aks-1", + DestCluster: "aks-2", + Description: "Test connectivity across different clusters, same customer VNet (cx_vnet_a2)", + ShouldFail: false, + }, + { + Name: "PeeredVNets", + SourcePod: "pod-c1-aks1-a1s2-low", + SourceNamespace: getNamespace("cx_vnet_a1", "s2"), + DestinationPod: "pod-c1-aks1-a2s1-high", + DestNamespace: getNamespace("cx_vnet_a2", "s1"), + Cluster: "aks-1", + Description: "Test connectivity between peered VNets (cx_vnet_a1/s2 <-> cx_vnet_a2/s1)", + 
ShouldFail: false, + }, + { + Name: "PeeredVNets_A2toA3", + SourcePod: "pod-c1-aks1-a2s1-high", + SourceNamespace: getNamespace("cx_vnet_a2", "s1"), + DestinationPod: "pod-c1-aks2-a3s1-high", + DestNamespace: getNamespace("cx_vnet_a3", "s1"), + Cluster: "aks-1", + DestCluster: "aks-2", + Description: "Test connectivity between peered VNets across clusters (cx_vnet_a2 <-> cx_vnet_a3)", + ShouldFail: false, + }, + { + Name: "DifferentCustomers_A1toB1", + SourcePod: "pod-c1-aks1-a1s2-low", + SourceNamespace: getNamespace("cx_vnet_a1", "s2"), + DestinationPod: "pod-c2-aks2-b1s1-low", + DestNamespace: getNamespace("cx_vnet_b1", "s1"), + Cluster: "aks-1", + DestCluster: "aks-2", + Description: "Test isolation: Customer 1 to Customer 2 should fail (cx_vnet_a1 -> cx_vnet_b1)", + ShouldFail: true, + }, + { + Name: "DifferentCustomers_A2toB1", + SourcePod: "pod-c1-aks1-a2s1-high", + SourceNamespace: getNamespace("cx_vnet_a2", "s1"), + DestinationPod: "pod-c2-aks2-b1s1-high", + DestNamespace: getNamespace("cx_vnet_b1", "s1"), + Cluster: "aks-1", + DestCluster: "aks-2", + Description: "Test isolation: Customer 1 to Customer 2 should fail (cx_vnet_a2 -> cx_vnet_b1)", + ShouldFail: true, + }, + } + + ginkgo.By(fmt.Sprintf("Running %d connectivity tests", len(connectivityTests))) + + successCount := 0 + failureCount := 0 + + for _, test := range connectivityTests { + ginkgo.By(fmt.Sprintf("Test: %s - %s", test.Name, test.Description)) + + err := RunConnectivityTest(test, rg, buildId) + + if test.ShouldFail { + // This test should fail (NSG blocked or customer isolation) + if err == nil { + fmt.Printf("Test %s: UNEXPECTED SUCCESS (expected to be blocked!)\n", test.Name) + failureCount++ + ginkgo.Fail(fmt.Sprintf("Test %s: Expected failure but succeeded (blocking not working!)", test.Name)) + } else { + fmt.Printf("Test %s: Correctly blocked (connection failed as expected)\n", test.Name) + successCount++ + } + } else { + // This test should succeed + if err != nil { + fmt.Printf("Test %s: FAILED - %v\n", test.Name, err) + failureCount++ + gomega.Expect(err).To(gomega.BeNil(), fmt.Sprintf("Test %s failed: %v", test.Name, err)) + } else { + fmt.Printf("Test %s: Connectivity successful\n", test.Name) + successCount++ + } + } + } + + ginkgo.By(fmt.Sprintf("Connectivity test summary: %d succeeded, %d failures", successCount, failureCount)) + }) +}) diff --git a/test/integration/swiftv2/longRunningCluster/datapath_create_test.go b/test/integration/swiftv2/longRunningCluster/datapath_create_test.go new file mode 100644 index 0000000000..9ba860e022 --- /dev/null +++ b/test/integration/swiftv2/longRunningCluster/datapath_create_test.go @@ -0,0 +1,118 @@ +//go:build create_test +// +build create_test + +package longRunningCluster + +import ( + "fmt" + "os" + "testing" + + "github.com/onsi/ginkgo/v2" + "github.com/onsi/gomega" +) + +func TestDatapathCreate(t *testing.T) { + gomega.RegisterFailHandler(ginkgo.Fail) + suiteConfig, reporterConfig := ginkgo.GinkgoConfiguration() + suiteConfig.Timeout = 0 + ginkgo.RunSpecs(t, "Datapath Create Suite", suiteConfig, reporterConfig) +} + +var _ = ginkgo.Describe("Datapath Create Tests", func() { + rg := os.Getenv("RG") + buildId := os.Getenv("BUILD_ID") + + if rg == "" || buildId == "" { + ginkgo.Fail(fmt.Sprintf("Missing required environment variables: RG='%s', BUILD_ID='%s'", rg, buildId)) + } + + ginkgo.It("creates PodNetwork, PodNetworkInstance, and Pods", ginkgo.NodeTimeout(0), func() { + // Define all test scenarios + scenarios := []PodScenario{ + // Customer 2 scenarios 
on aks-2 with cx_vnet_b1 + { + Name: "Customer2-AKS2-VnetB1-S1-LowNic", + Cluster: "aks-2", + VnetName: "cx_vnet_b1", + SubnetName: "s1", + NodeSelector: "low-nic", + PodNameSuffix: "c2-aks2-b1s1-low", + }, + { + Name: "Customer2-AKS2-VnetB1-S1-HighNic", + Cluster: "aks-2", + VnetName: "cx_vnet_b1", + SubnetName: "s1", + NodeSelector: "high-nic", + PodNameSuffix: "c2-aks2-b1s1-high", + }, + // Customer 1 scenarios + { + Name: "Customer1-AKS1-VnetA1-S1-LowNic", + Cluster: "aks-1", + VnetName: "cx_vnet_a1", + SubnetName: "s1", + NodeSelector: "low-nic", + PodNameSuffix: "c1-aks1-a1s1-low", + }, + { + Name: "Customer1-AKS1-VnetA1-S2-LowNic", + Cluster: "aks-1", + VnetName: "cx_vnet_a1", + SubnetName: "s2", + NodeSelector: "low-nic", + PodNameSuffix: "c1-aks1-a1s2-low", + }, + { + Name: "Customer1-AKS1-VnetA1-S2-HighNic", + Cluster: "aks-1", + VnetName: "cx_vnet_a1", + SubnetName: "s2", + NodeSelector: "high-nic", + PodNameSuffix: "c1-aks1-a1s2-high", + }, + { + Name: "Customer1-AKS1-VnetA2-S1-HighNic", + Cluster: "aks-1", + VnetName: "cx_vnet_a2", + SubnetName: "s1", + NodeSelector: "high-nic", + PodNameSuffix: "c1-aks1-a2s1-high", + }, + { + Name: "Customer1-AKS2-VnetA2-S1-LowNic", + Cluster: "aks-2", + VnetName: "cx_vnet_a2", + SubnetName: "s1", + NodeSelector: "low-nic", + PodNameSuffix: "c1-aks2-a2s1-low", + }, + { + Name: "Customer1-AKS2-VnetA3-S1-HighNic", + Cluster: "aks-2", + VnetName: "cx_vnet_a3", + SubnetName: "s1", + NodeSelector: "high-nic", + PodNameSuffix: "c1-aks2-a3s1-high", + }, + } + + // Initialize test scenarios with cache + testScenarios := TestScenarios{ + ResourceGroup: rg, + BuildID: buildId, + PodImage: "nicolaka/netshoot:latest", + Scenarios: scenarios, + VnetSubnetCache: make(map[string]VnetSubnetInfo), + UsedNodes: make(map[string]bool), + } + + // Create all scenario resources + ginkgo.By(fmt.Sprintf("Creating all test scenarios (%d scenarios)", len(scenarios))) + err := CreateAllScenarios(testScenarios) + gomega.Expect(err).To(gomega.BeNil(), "Failed to create test scenarios") + + ginkgo.By("Successfully created all test scenarios") + }) +}) diff --git a/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go b/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go new file mode 100644 index 0000000000..72020d609b --- /dev/null +++ b/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go @@ -0,0 +1,117 @@ +// +build delete_test + +package longRunningCluster + +import ( + "fmt" + "os" + "testing" + + "github.com/onsi/ginkgo/v2" + "github.com/onsi/gomega" +) + +func TestDatapathDelete(t *testing.T) { + gomega.RegisterFailHandler(ginkgo.Fail) + suiteConfig, reporterConfig := ginkgo.GinkgoConfiguration() + suiteConfig.Timeout = 0 + ginkgo.RunSpecs(t, "Datapath Delete Suite", suiteConfig, reporterConfig) +} + +var _ = ginkgo.Describe("Datapath Delete Tests", func() { + rg := os.Getenv("RG") + buildId := os.Getenv("BUILD_ID") + + if rg == "" || buildId == "" { + ginkgo.Fail(fmt.Sprintf("Missing required environment variables: RG='%s', BUILD_ID='%s'", rg, buildId)) + } + + ginkgo.It("deletes PodNetwork, PodNetworkInstance, and Pods", ginkgo.NodeTimeout(0), func() { + // Define all test scenarios (same as create) + scenarios := []PodScenario{ + // Customer 2 scenarios on aks-2 with cx_vnet_b1 + { + Name: "Customer2-AKS2-VnetB1-S1-LowNic", + Cluster: "aks-2", + VnetName: "cx_vnet_b1", + SubnetName: "s1", + NodeSelector: "low-nic", + PodNameSuffix: "c2-aks2-b1s1-low", + }, + { + Name: "Customer2-AKS2-VnetB1-S1-HighNic", + Cluster: 
"aks-2", + VnetName: "cx_vnet_b1", + SubnetName: "s1", + NodeSelector: "high-nic", + PodNameSuffix: "c2-aks2-b1s1-high", + }, + // Customer 1 scenarios + { + Name: "Customer1-AKS1-VnetA1-S1-LowNic", + Cluster: "aks-1", + VnetName: "cx_vnet_a1", + SubnetName: "s1", + NodeSelector: "low-nic", + PodNameSuffix: "c1-aks1-a1s1-low", + }, + { + Name: "Customer1-AKS1-VnetA1-S2-LowNic", + Cluster: "aks-1", + VnetName: "cx_vnet_a1", + SubnetName: "s2", + NodeSelector: "low-nic", + PodNameSuffix: "c1-aks1-a1s2-low", + }, + { + Name: "Customer1-AKS1-VnetA1-S2-HighNic", + Cluster: "aks-1", + VnetName: "cx_vnet_a1", + SubnetName: "s2", + NodeSelector: "high-nic", + PodNameSuffix: "c1-aks1-a1s2-high", + }, + { + Name: "Customer1-AKS1-VnetA2-S1-HighNic", + Cluster: "aks-1", + VnetName: "cx_vnet_a2", + SubnetName: "s1", + NodeSelector: "high-nic", + PodNameSuffix: "c1-aks1-a2s1-high", + }, + { + Name: "Customer1-AKS2-VnetA2-S1-LowNic", + Cluster: "aks-2", + VnetName: "cx_vnet_a2", + SubnetName: "s1", + NodeSelector: "low-nic", + PodNameSuffix: "c1-aks2-a2s1-low", + }, + { + Name: "Customer1-AKS2-VnetA3-S1-HighNic", + Cluster: "aks-2", + VnetName: "cx_vnet_a3", + SubnetName: "s1", + NodeSelector: "high-nic", + PodNameSuffix: "c1-aks2-a3s1-high", + }, + } + + // Initialize test scenarios with cache + testScenarios := TestScenarios{ + ResourceGroup: rg, + BuildID: buildId, + PodImage: "nicolaka/netshoot:latest", + Scenarios: scenarios, + VnetSubnetCache: make(map[string]VnetSubnetInfo), + UsedNodes: make(map[string]bool), + } + + // Delete all scenario resources + ginkgo.By("Deleting all test scenarios") + err := DeleteAllScenarios(testScenarios) + gomega.Expect(err).To(gomega.BeNil(), "Failed to delete test scenarios") + + ginkgo.By("Successfully deleted all test scenarios") + }) +}) diff --git a/test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go b/test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go new file mode 100644 index 0000000000..dc77302db1 --- /dev/null +++ b/test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go @@ -0,0 +1,150 @@ +//go:build private_endpoint_test +// +build private_endpoint_test + +package longRunningCluster + +import ( + "fmt" + "os" + "testing" + + "github.com/onsi/ginkgo/v2" + "github.com/onsi/gomega" +) + +func TestDatapathPrivateEndpoint(t *testing.T) { + gomega.RegisterFailHandler(ginkgo.Fail) + suiteConfig, reporterConfig := ginkgo.GinkgoConfiguration() + suiteConfig.Timeout = 0 + ginkgo.RunSpecs(t, "Datapath Private Endpoint Suite", suiteConfig, reporterConfig) +} + +var _ = ginkgo.Describe("Private Endpoint Tests", func() { + rg := os.Getenv("RG") + buildId := os.Getenv("BUILD_ID") + storageAccount1 := os.Getenv("STORAGE_ACCOUNT_1") + storageAccount2 := os.Getenv("STORAGE_ACCOUNT_2") + + ginkgo.It("tests private endpoint access and isolation", func() { + // Validate environment variables inside the It block + if rg == "" || buildId == "" { + ginkgo.Fail(fmt.Sprintf("Missing required environment variables: RG='%s', BUILD_ID='%s'", rg, buildId)) + } + + if storageAccount1 == "" || storageAccount2 == "" { + ginkgo.Fail(fmt.Sprintf("Missing storage account environment variables: STORAGE_ACCOUNT_1='%s', STORAGE_ACCOUNT_2='%s'", storageAccount1, storageAccount2)) + } + + // Initialize test scenarios with cache + testScenarios := TestScenarios{ + ResourceGroup: rg, + BuildID: buildId, + PodImage: "nicolaka/netshoot:latest", + VnetSubnetCache: make(map[string]VnetSubnetInfo), + UsedNodes: 
make(map[string]bool), + } + + // Get storage account endpoint for Tenant A (Customer 1) + storageAccountName := storageAccount1 + ginkgo.By(fmt.Sprintf("Getting private endpoint for storage account: %s", storageAccountName)) + + storageEndpoint, err := GetStoragePrivateEndpoint(testScenarios.ResourceGroup, storageAccountName) + gomega.Expect(err).To(gomega.BeNil(), "Failed to get storage account private endpoint") + gomega.Expect(storageEndpoint).NotTo(gomega.BeEmpty(), "Storage account private endpoint is empty") + + ginkgo.By(fmt.Sprintf("Storage account private endpoint: %s", storageEndpoint)) + + // Test scenarios for Private Endpoint connectivity + privateEndpointTests := []ConnectivityTest{ + // Test 1: Private Endpoint Access (Tenant A) - Pod from VNet-A1 Subnet 1 + { + Name: "Private Endpoint Access: VNet-A1-S1 to Storage-A", + SourceCluster: "aks-1", + SourcePodName: "pod-c1-aks1-a1s1-low", + SourceNS: "pn-" + testScenarios.BuildID + "-a1-s1", + DestEndpoint: storageEndpoint, + ShouldFail: false, + TestType: "storage-access", + Purpose: "Verify Tenant A pod can access Storage-A via private endpoint", + }, + // Test 2: Private Endpoint Access (Tenant A) - Pod from VNet-A1 Subnet 2 + { + Name: "Private Endpoint Access: VNet-A1-S2 to Storage-A", + SourceCluster: "aks-1", + SourcePodName: "pod-c1-aks1-a1s2-low", + SourceNS: "pn-" + testScenarios.BuildID + "-a1-s2", + DestEndpoint: storageEndpoint, + ShouldFail: false, + TestType: "storage-access", + Purpose: "Verify Tenant A pod can access Storage-A via private endpoint", + }, + // Test 3: Private Endpoint Access (Tenant A) - Pod from VNet-A2 + { + Name: "Private Endpoint Access: VNet-A2-S1 to Storage-A", + SourceCluster: "aks-1", + SourcePodName: "pod-c1-aks1-a2s1-high", + SourceNS: "pn-" + testScenarios.BuildID + "-a2-s1", + DestEndpoint: storageEndpoint, + ShouldFail: false, + TestType: "storage-access", + Purpose: "Verify Tenant A pod from peered VNet can access Storage-A", + }, + // Test 4: Private Endpoint Access (Tenant A) - Pod from VNet-A3 (cross-cluster) + { + Name: "Private Endpoint Access: VNet-A3-S1 to Storage-A (cross-cluster)", + SourceCluster: "aks-2", + SourcePodName: "pod-c1-aks2-a3s1-high", + SourceNS: "pn-" + testScenarios.BuildID + "-a3-s1", + DestEndpoint: storageEndpoint, + ShouldFail: false, + TestType: "storage-access", + Purpose: "Verify Tenant A pod from different cluster can access Storage-A", + } + } + + ginkgo.By(fmt.Sprintf("Running %d Private Endpoint connectivity tests", len(privateEndpointTests))) + + successCount := 0 + failureCount := 0 + + for _, test := range privateEndpointTests { + ginkgo.By(fmt.Sprintf("\n=== Test: %s ===", test.Name)) + ginkgo.By(fmt.Sprintf("Purpose: %s", test.Purpose)) + ginkgo.By(fmt.Sprintf("Expected: %s", func() string { + if test.ShouldFail { + return "BLOCKED" + } + return "SUCCESS" + }())) + + err := RunPrivateEndpointTest(testScenarios, test) + + if test.ShouldFail { + // Expected to fail (e.g., tenant isolation) + if err != nil { + ginkgo.By(fmt.Sprintf("Test correctly BLOCKED as expected: %s", test.Name)) + successCount++ + } else { + ginkgo.By(fmt.Sprintf("Test FAILED: Expected connection to be blocked but it succeeded: %s", test.Name)) + failureCount++ + } + } else { + // Expected to succeed + if err != nil { + ginkgo.By(fmt.Sprintf("Test FAILED: %s - Error: %v", test.Name, err)) + failureCount++ + } else { + ginkgo.By(fmt.Sprintf("Test PASSED: %s", test.Name)) + successCount++ + } + } + } + + ginkgo.By(fmt.Sprintf("\n=== Private Endpoint Test Summary ===")) + 
ginkgo.By(fmt.Sprintf("Total tests: %d", len(privateEndpointTests))) + ginkgo.By(fmt.Sprintf("Successful connections: %d", successCount)) + ginkgo.By(fmt.Sprintf("Unexpected failures: %d", failureCount)) + + gomega.Expect(failureCount).To(gomega.Equal(0), "Some private endpoint tests failed unexpectedly") + }) +}) From 873c05e37b1ace49de6567dcf09d379044f3d99d Mon Sep 17 00:00:00 2001 From: sivakami Date: Mon, 24 Nov 2025 09:04:12 -0800 Subject: [PATCH 2/3] Update readme file. --- .pipelines/swiftv2-long-running/README.md | 251 ++-------------------- 1 file changed, 14 insertions(+), 237 deletions(-) diff --git a/.pipelines/swiftv2-long-running/README.md b/.pipelines/swiftv2-long-running/README.md index b513dcab00..5b47b43ce9 100644 --- a/.pipelines/swiftv2-long-running/README.md +++ b/.pipelines/swiftv2-long-running/README.md @@ -50,33 +50,11 @@ Examples: sv2-long-run-12345, sv2-long-run-67890 - **Lifecycle**: Can be cleaned up after testing completes - **Example**: PR validation run with Build ID 12345 → `sv2-long-run-12345` -**3. Parallel/Custom Environments**: -``` -Pattern: sv2-long-run-- -Examples: sv2-long-run-centraluseuap-dev, sv2-long-run-eastus-staging -``` -- **When to use**: Parallel environments, feature testing, version upgrades -- **Purpose**: Isolated environment alongside production -- **Lifecycle**: Persistent or temporary based on use case -- **Example**: Development environment in Central US EUAP → `sv2-long-run-centraluseuap-dev` - **Important Notes**: -- ⚠️ Always follow the naming pattern for scheduled runs on master: `sv2-long-run-` -- ⚠️ Do not use build IDs for production scheduled infrastructure (it breaks continuity) -- ⚠️ Region name should match the `location` parameter for consistency -- ✅ All resource names within the setup use the resource group name as BUILD_ID prefix - -### Mode 1: Scheduled Test Runs (Default) -**Trigger**: Automated cron schedule every 1 hour -**Purpose**: Continuous validation of long-running infrastructure -**Setup Stages**: Disabled -**Test Duration**: ~30-40 minutes per run -**Resource Group**: Static (default: `sv2-long-run-`, e.g., `sv2-long-run-centraluseuap`) +- Always follow the naming pattern for scheduled runs on master: `sv2-long-run-` +- Do not use build IDs for production scheduled infrastructure (it breaks continuity) +- All resource names within the setup use the resource group name as BUILD_ID prefix -```yaml -# Runs automatically every 1 hour -# No manual/external triggers allowed -``` ### Mode 2: Initial Setup or Rebuild **Trigger**: Manual run with parameter change @@ -120,15 +98,6 @@ Parameters are organized by usage: |-----------|---------|-------------| | `resourceGroupName` | `""` (empty) | **Leave empty** to auto-generate based on usage pattern. See Resource Group Naming Conventions below. 
| -**Resource Group Naming Conventions**: -- **For scheduled runs on master/main branch**: Use `sv2-long-run-` (e.g., `sv2-long-run-centraluseuap`) - - This ensures consistent naming for production scheduled tests - - Example: Creating infrastructure in `centraluseuap` for scheduled runs → `sv2-long-run-centraluseuap` -- **For test/dev runs or PR validation**: Use `sv2-long-run-$(Build.BuildId)` - - Auto-cleanup after testing - - Example: `sv2-long-run-12345` (where 12345 is the build ID) -- **For parallel environments**: Use descriptive suffix (e.g., `sv2-long-run-centraluseuap-dev`, `sv2-long-run-eastus-staging`) - **Note**: VM SKUs are hardcoded as constants in the pipeline template: - Default nodepool: `Standard_D4s_v3` (low-nic capacity, 1 NIC) - NPLinux nodepool: `Standard_D16s_v3` (high-nic capacity, 7 NICs) @@ -161,21 +130,21 @@ The pipeline is organized into stages based on workload type, allowing sequentia ### Future Stages (Planned Architecture) Additional stages can be added to test different workload types sequentially: -**Example: Stage 3 - BYONodeDataPathTests** +**Example: Stage 3 - LinuxBYONodeDataPathTests** ```yaml -- stage: BYONodeDataPathTests +- stage: LinuxBYONodeDataPathTests displayName: "SwiftV2 Data Path Tests - BYO Node ID" dependsOn: ManagedNodeDataPathTests variables: - WORKLOAD_TYPE: "swiftv2-byonodeid" + WORKLOAD_TYPE: "swiftv2-linuxbyon" # Same job structure as ManagedNodeDataPathTests # Tests run on nodes labeled: workload-type=swiftv2-byonodeid ``` -**Example: Stage 4 - WindowsNodeDataPathTests** +**Example: Stage 4 - L1vhAccelnetNodeDataPathTests** ```yaml -- stage: WindowsNodeDataPathTests - displayName: "SwiftV2 Data Path Tests - Windows Nodes" +- stage: L1vhAccelnetNodeDataPathTests + displayName: "SwiftV2 Data Path Tests - Windows Nodes Accelnet" dependsOn: BYONodeDataPathTests variables: WORKLOAD_TYPE: "swiftv2-windows" @@ -183,27 +152,20 @@ Additional stages can be added to test different workload types sequentially: # Tests run on nodes labeled: workload-type=swiftv2-windows ``` -**Benefits of Stage-Based Architecture**: -- ✅ Sequential execution: Each workload type tested independently -- ✅ Isolated node pools: No resource contention between workload types -- ✅ Same infrastructure: All stages use the same VNets, storage, NSGs -- ✅ Same test suite: Connectivity and private endpoint tests run for each workload type -- ✅ Easy extensibility: Add new stages without modifying existing ones -- ✅ Clear results: Separate test results per workload type - **Node Labeling for Multiple Workload Types**: Each node pool gets labeled with its designated workload type during setup: ```bash # During cluster creation or node pool addition: -kubectl label nodes -l agentpool=nodepool1 workload-type=swiftv2-linux -kubectl label nodes -l agentpool=byonodepool workload-type=swiftv2-byonodeid -kubectl label nodes -l agentpool=winnodepool workload-type=swiftv2-windows +kubectl label nodes -l agentpool=<nodepool> workload-type=swiftv2-linux +kubectl label nodes -l agentpool=<nodepool> workload-type=swiftv2-linuxbyon +kubectl label nodes -l agentpool=<nodepool> workload-type=swiftv2-l1vhaccelnet +kubectl label nodes -l agentpool=<nodepool> workload-type=swiftv2-l1vhib ``` ## How It Works ### Scheduled Test Flow -Every 1 hour, the pipeline: +Every 3 hours, the pipeline: 1. Skips setup stages (infrastructure already exists) 2. **Job 1 - Create Resources**: Creates 8 test scenarios (PodNetwork, PNI, Pods with HTTP servers on port 8080) 3. 
**Job 2 - Connectivity Tests**: Tests HTTP connectivity between pods (9 test cases), then waits 20 minutes @@ -361,142 +323,6 @@ pod-c1-aks1-a1s1-low **All infrastructure resources are tagged with `SkipAutoDeleteTill=2032-12-31`** to prevent automatic cleanup by Azure subscription policies. -## Resource Naming - -All test resources use the pattern: `-static-setup--` - -**Examples**: -- PodNetwork: `pn-static-setup-a1-s1` -- PodNetworkInstance: `pni-static-setup-a1-s1` -- Pod: `pod-c1-aks1-a1s1-low` -- Namespace: `pn-static-setup-a1-s1` - -VNet names are simplified: -- `cx_vnet_a1` → `a1` -- `cx_vnet_b1` → `b1` - -## Switching to a New Setup - -**Scenario**: You created a new setup in RG `sv2-long-run-eastus` and want scheduled runs to use it. - -**Steps**: -1. Go to Pipeline → Edit -2. Update location parameter default value: - ```yaml - - name: location - default: "centraluseuap" # Change this - ``` -3. Save and commit -4. RG name will automatically become `sv2-long-run-centraluseuap` - -Alternatively, manually trigger with the new location or override `resourceGroupName` directly. - -## Creating Multiple Test Setups - -**Use Case**: You want to create a new test environment without affecting the existing one (e.g., for testing different configurations, regions, or versions). - -**Steps**: -1. Go to Pipeline → Run pipeline -2. Set `runSetupStages` = `true` -3. **Set `resourceGroupName`** based on usage: - - **For scheduled runs on master/main branch**: `sv2-long-run-` (e.g., `sv2-long-run-centraluseuap`, `sv2-long-run-eastus`) - - Use this naming pattern for production scheduled tests - - **For test/dev runs**: `sv2-long-run-$(Build.BuildId)` or custom (e.g., `sv2-long-run-12345`) - - For temporary testing or PR validation - - **For parallel environments**: Custom with descriptive suffix (e.g., `sv2-long-run-centraluseuap-dev`, `sv2-long-run-centraluseuap-v2`) -4. Optionally adjust `location` -5. Run pipeline - -**After setup completes**: -- The new infrastructure will be tagged with `SkipAutoDeleteTill=2032-12-31` -- Resources are isolated by the unique resource group name -- To run tests against the new setup, the scheduled pipeline would need to be updated with the new RG name - -**Example Scenarios**: -| Scenario | Resource Group Name | Purpose | Naming Pattern | -|----------|-------------------|---------|----------------| -| Production scheduled (Central US EUAP) | `sv2-long-run-centraluseuap` | Daily scheduled tests on master | `sv2-long-run-` | -| Production scheduled (East US) | `sv2-long-run-eastus` | Regional scheduled testing on master | `sv2-long-run-` | -| Temporary test run | `sv2-long-run-12345` | One-time testing (Build ID: 12345) | `sv2-long-run-$(Build.BuildId)` | -| Development environment | `sv2-long-run-centraluseuap-dev` | Development/testing | Custom with suffix | -| Version upgrade testing | `sv2-long-run-centraluseuap-v2` | Parallel environment for upgrades | Custom with suffix | - -## Resource Naming - instead of ping use -The pipeline uses the **resource group name as the BUILD_ID** to ensure unique resource names per test setup. This allows multiple parallel test environments without naming collisions. 
- -**Generated Resource Names**: -``` -BUILD_ID = - -PodNetwork: pn--- -PodNetworkInstance: pni--- -Namespace: pn--- -Pod: pod- -``` - -**Example for `resourceGroupName=sv2-long-run-centraluseuap`**: -``` -pn-sv2-long-run-centraluseuap-b1-s1 (PodNetwork for cx_vnet_b1, subnet s1) -pni-sv2-long-run-centraluseuap-b1-s1 (PodNetworkInstance) -pn-sv2-long-run-centraluseuap-a1-s1 (PodNetwork for cx_vnet_a1, subnet s1) -pni-sv2-long-run-centraluseuap-a1-s2 (PodNetworkInstance for cx_vnet_a1, subnet s2) -``` - -**Example for different setup `resourceGroupName=sv2-long-run-eastus`**: -``` -pn-sv2-long-run-eastus-b1-s1 (Different from centraluseuap setup) -pni-sv2-long-run-eastus-b1-s1 -pn-sv2-long-run-eastus-a1-s1 -``` - -This ensures **no collision** between different test setups running in parallel. - -## Deletion Strategy -### Phase 1: Delete All Pods -Deletes all pods across all scenarios first. This ensures IP reservations are released. - -``` -Deleting pod pod-c2-aks2-b1s1-low... -Deleting pod pod-c2-aks2-b1s1-high... -... -``` - -### Phase 2: Delete Shared Resources -Groups resources by vnet/subnet/cluster and deletes PNI/PN/Namespace once per group. - -``` -Deleting PodNetworkInstance pni-static-setup-b1-s1... -Deleting PodNetwork pn-static-setup-b1-s1... -Deleting namespace pn-static-setup-b1-s1... -``` - -**Why**: Multiple pods can share the same PNI. Deleting PNI while pods exist causes "ReservationInUse" errors. - -## Troubleshooting - -### Tests are running on wrong cluster -- Check `resourceGroupName` parameter points to correct RG -- Verify RG contains aks-1 and aks-2 clusters -- Check kubeconfig retrieval in logs - -### Setup stages not running -- Verify `runSetupStages` parameter is set to `true` -- Check condition: `condition: eq(${{ parameters.runSetupStages }}, true)` - -### Schedule not triggering -- Verify cron expression: `"0 */1 * * *"` (every 1 hour) -- Check branch in schedule matches your working branch -- Ensure `always: true` is set (runs even without code changes) - -### PNI stuck with "ReservationInUse" -- Check if pods were deleted first (Phase 1 logs) -- Manual fix: Delete pod → Wait 10s → Patch PNI to remove finalizers - -### Pipeline timeout after 6 hours -- This is expected behavior (timeoutInMinutes: 360) -- Tests should complete in ~30-40 minutes -- If tests hang, check deletion logs for stuck resources ## Manual Testing @@ -544,21 +370,6 @@ kubectl label nodes -l agentpool=nodepool1 nic-capacity=low-nic --overwrite kubectl label nodes -l agentpool=nplinux nic-capacity=high-nic --overwrite ``` -**Example Node Labels**: -```yaml -# Low-NIC node (nodepool1) -labels: - agentpool: nodepool1 - workload-type: swiftv2-linux - nic-capacity: low-nic - -# High-NIC node (nplinux) -labels: - agentpool: nplinux - workload-type: swiftv2-linux - nic-capacity: high-nic -``` - ### Node Selection in Tests Tests use these labels to select appropriate nodes dynamically: @@ -588,20 +399,6 @@ Tests use these labels to select appropriate nodes dynamically: **Note**: VM SKUs are hardcoded as constants in the pipeline template and cannot be changed by users. 
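+
+To sanity-check which nodes the tests will consider, the same label selectors that the `GetNodesByNicCount` helper uses can be queried directly. This is a minimal sketch, assuming the cluster kubeconfigs have already been fetched to the paths the tests expect (`/tmp/aks-1.kubeconfig`, `/tmp/aks-2.kubeconfig`):
+
+```bash
+# Nodes eligible for low-NIC scenarios (the default WORKLOAD_TYPE is swiftv2-linux)
+kubectl --kubeconfig /tmp/aks-1.kubeconfig get nodes \
+  -l nic-capacity=low-nic,workload-type=swiftv2-linux -o name
+
+# Nodes eligible for high-NIC scenarios
+kubectl --kubeconfig /tmp/aks-1.kubeconfig get nodes \
+  -l nic-capacity=high-nic,workload-type=swiftv2-linux -o name
+```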
-## Schedule Modification - -To change test frequency, edit the cron schedule: - -```yaml -schedules: - - cron: "0 */1 * * *" # Every 1 hour (current) - # Examples: - # - cron: "0 */2 * * *" # Every 2 hours - # - cron: "0 */6 * * *" # Every 6 hours - # - cron: "0 0,8,16 * * *" # At 12am, 8am, 4pm - # - cron: "0 0 * * *" # Daily at midnight -``` - ## File Structure ``` @@ -639,23 +436,3 @@ test/integration/swiftv2/longRunningCluster/ - Storage accounts 5. **Avoid resource group collisions**: Always use unique `resourceGroupName` when creating new setups 6. **Document changes**: Update this README when modifying test scenarios or infrastructure - -## Resource Tags - -All infrastructure resources are automatically tagged during creation: - -```bash -SkipAutoDeleteTill=2032-12-31 -``` - -This prevents automatic cleanup by Azure subscription policies that delete resources after a certain period. The tag is applied to: -- Resource group (via create_resource_group job) -- AKS clusters (aks-1, aks-2) -- AKS cluster VNets -- Customer VNets (cx_vnet_a1, cx_vnet_a2, cx_vnet_a3, cx_vnet_b1) -- Storage accounts (sa1xxxx, sa2xxxx) - -To manually update the tag date: -```bash -az resource update --ids --set tags.SkipAutoDeleteTill=2033-12-31 -``` From 33954156525ae67742b555a17c521125a6b87ac4 Mon Sep 17 00:00:00 2001 From: sivakami Date: Mon, 24 Nov 2025 09:37:55 -0800 Subject: [PATCH 3/3] fix syntax for pe test. --- .../longRunningCluster/datapath_private_endpoint_test.go | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go b/test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go index dc77302db1..0d94087a50 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go +++ b/test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go @@ -99,7 +99,7 @@ var _ = ginkgo.Describe("Private Endpoint Tests", func() { ShouldFail: false, TestType: "storage-access", Purpose: "Verify Tenant A pod from different cluster can access Storage-A", - } + }, } ginkgo.By(fmt.Sprintf("Running %d Private Endpoint connectivity tests", len(privateEndpointTests)))
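
For local debugging against an existing long-running setup, the individual suites can be run one build tag at a time with the environment variables the tests read. This is a minimal sketch, not the pipeline's own invocation: it assumes `az login` has been done (the SAS token step uses `--auth-mode login`), that the kubeconfigs are already at `/tmp/aks-1.kubeconfig` and `/tmp/aks-2.kubeconfig`, and that the storage account names below are placeholders for the values produced by the setup stage.

```bash
# RG doubles as BUILD_ID because all resource names derive from the resource group name.
export RG=sv2-long-run-centraluseuap
export BUILD_ID="$RG"
export STORAGE_ACCOUNT_1=<sa1-name>   # placeholder; actual name comes from the setup stage output variables
export STORAGE_ACCOUNT_2=<sa2-name>   # placeholder

# Run from inside the suite directory so the relative manifest paths resolve.
cd test/integration/swiftv2/longRunningCluster
go test -tags=create_test           -run TestDatapathCreate           -timeout 0 -v .
go test -tags=connectivity_test     -run TestDatapathConnectivity     -timeout 0 -v .
go test -tags=private_endpoint_test -run TestDatapathPrivateEndpoint  -timeout 0 -v .
go test -tags=delete_test           -run TestDatapathDelete           -timeout 0 -v .
```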