| 1 | +# SwiftV2 Long-Running Pipeline |
| 2 | + |
| 3 | +This pipeline tests SwiftV2 pod networking in a persistent environment with scheduled test runs. |
| 4 | + |
| 5 | +## Architecture Overview |
| 6 | + |
| 7 | +**Infrastructure (Persistent)**: |
| 8 | +- **2 AKS Clusters**: aks-1, aks-2 (4 nodes each: 2 low-NIC default pool, 2 high-NIC nplinux pool) |
| 9 | +- **4 VNets**: cx_vnet_a1, cx_vnet_a2, cx_vnet_a3 (Customer 1 with PE to storage), cx_vnet_b1 (Customer 2) |
| 10 | +- **VNet Peerings**: Two of Customer 1's three VNets are peered. |
| 11 | +- **Storage Account**: With private endpoint from cx_vnet_a1 |
| 12 | +- **NSGs**: Restricting traffic between subnets (s1, s2) in vnet cx_vnet_a1. |
| 13 | + |
| 14 | +**Test Scenarios (8 total)**: |
| 15 | +- Multiple pods across 2 clusters, 4 VNets, different subnets (s1, s2), and node types (low-NIC, high-NIC) |
| 16 | +- Each test run: Create all resources → Wait 20 minutes → Delete all resources |
| 17 | +- Tests run automatically every 1 hour via scheduled trigger |
| 18 | + |
| 19 | +## Pipeline Modes |
| 20 | + |
| 21 | +### Mode 1: Scheduled Test Runs (Default) |
| 22 | +**Trigger**: Automated cron schedule every 1 hour |
| 23 | +**Purpose**: Continuous validation of long-running infrastructure |
| 24 | +**Setup Stages**: Disabled |
| 25 | +**Test Duration**: ~30-40 minutes per run |
| 26 | +**Resource Group**: Static (default: `sv2-long-run-<region>`, e.g., `sv2-long-run-centraluseuap`) |
| 27 | + |
| 28 | +```yaml |
| 29 | +# Runs automatically every 1 hour |
| 30 | +# No manual/external triggers allowed |
| 31 | +``` |
| 32 | + |
| 33 | +### Mode 2: Initial Setup or Rebuild |
| 34 | +**Trigger**: Manual run with parameter change |
| 35 | +**Purpose**: Create new infrastructure or rebuild existing |
| 36 | +**Setup Stages**: Enabled via `runSetupStages: true` |
| 37 | +**Resource Group**: Configurable via parameter |
| 38 | + |
| 39 | +**To create new infrastructure**: |
| 40 | +1. Go to Pipeline → Run pipeline |
| 41 | +2. **IMPORTANT**: Change `resourceGroupName` to a unique value (e.g., `sv2-long-run-eastus-test2`) |
| 42 | + - Default uses location: `sv2-long-run-<location>` |
| 43 | + - To avoid collisions, always use a unique name for new setups |
| 44 | +3. Set `runSetupStages` = `true` |
| 45 | +4. Optionally change `location` if deploying to different region |
| 46 | +5. Run pipeline |
| 47 | + |
| 48 | +**⚠️ Warning**: If you don't change the resource group name when creating a new setup, the new setup will overwrite or conflict with the existing default setup used by scheduled runs! |
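Before starting a setup run, it can help to confirm that the chosen resource group name is not already in use. The following is a minimal sketch, not part of the pipeline: the function name `rg_is_free` is hypothetical, and it assumes the Azure CLI is installed and logged in to the target subscription. It relies on `az group exists`, which prints `true` or `false`.

```bash
# Hypothetical pre-flight check; not taken from the pipeline scripts.
# Succeeds (exit 0) only when no resource group with the given name exists.
rg_is_free() {
  # `az group exists` prints "true" or "false" on stdout.
  [ "$(az group exists --name "$1")" = "false" ]
}
```

For example, `rg_is_free sv2-long-run-eastus-test2 || echo "name already in use"` would warn before `runSetupStages` is set to `true` against an existing group.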
| 49 | + |
| 50 | +## Pipeline Parameters |
| 51 | + |
| 52 | +| Parameter | Default | Description | |
| 53 | +|-----------|---------|-------------| |
| 54 | +| `subscriptionId` | `37deca37-c375-4a14-b90a-043849bd2bf1` | Azure subscription for deployment. | |
| 55 | +| `location` | `centraluseuap` | Azure region for resources. | |
| 56 | +| `resourceGroupName` | `sv2-long-run-<location>` | Static RG name for tests. Dynamically includes region (e.g., `sv2-long-run-centraluseuap`). **MUST be changed to unique value when creating new setup!** | |
| 57 | +| `runSetupStages` | `false` | Set to `true` to create/recreate AKS clusters and networking. **WARNING: Always set unique `resourceGroupName` when true!** | |
| 58 | +| `vmSkuDefault` | `Standard_D4s_v3` | VM SKU for low-NIC node pool (1 NIC). | |
| 59 | +| `vmSkuHighNIC` | `Standard_D16s_v3` | VM SKU for high-NIC node pool (7 NICs). | |
| 60 | +| `serviceConnection` | `Azure Container Networking - Standalone Test Service Connection` | Azure DevOps service connection. | |
| 61 | + |
| 62 | +## How It Works |
| 63 | + |
| 64 | +### Scheduled Test Flow |
| 65 | +Every 1 hour, the pipeline: |
| 66 | +1. Skips setup stages (infrastructure already exists) |
| 67 | +2. **Job 1 - Create and Wait**: Creates 8 test scenarios (PodNetwork, PNI, Pods), then waits 20 minutes |
| 68 | +3. **Job 2 - Delete Resources**: Deletes all test resources (Phase 1: Pods, Phase 2: PNI/PN/Namespaces) |
| 69 | +4. Reports results |
| 70 | + |
| 71 | +### Setup Flow (When runSetupStages = true) |
| 72 | +1. Create resource group with `SkipAutoDeleteTill=2032-12-31` tag |
| 73 | +2. Create 2 AKS clusters with 2 node pools each (tagged for persistence) |
| 74 | +3. Create 4 customer VNets with subnets and delegations (tagged for persistence) |
| 75 | +4. Create VNet peerings |
| 76 | +5. Create storage accounts with persistence tags |
| 77 | +6. Create NSGs for subnet isolation |
| 78 | +7. Run initial test (create → wait → delete) |
| 79 | + |
| 80 | +**All infrastructure resources are tagged with `SkipAutoDeleteTill=2032-12-31`** to prevent automatic cleanup by Azure subscription policies. |
| 81 | + |
| 82 | +## Resource Naming |
| 83 | + |
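| 84 | +PodNetwork, PodNetworkInstance, and Namespace resources use the pattern `<type>-static-setup-<vnet>-<subnet>`; pods are named per scenario (for example, `pod-c1-aks1-a1s1-low`). |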
| 85 | + |
| 86 | +**Examples**: |
| 87 | +- PodNetwork: `pn-static-setup-a1-s1` |
| 88 | +- PodNetworkInstance: `pni-static-setup-a1-s1` |
| 89 | +- Pod: `pod-c1-aks1-a1s1-low` |
| 90 | +- Namespace: `pn-static-setup-a1-s1` |
| 91 | + |
| 92 | +VNet names are simplified: |
| 93 | +- `cx_vnet_a1` → `a1` |
| 94 | +- `cx_vnet_b1` → `b1` |
| 95 | + |
| 96 | +## Switching to a New Setup |
| 97 | + |
| 98 | +**Scenario**: You created a new setup in RG `sv2-long-run-eastus` and want scheduled runs to use it. |
| 99 | + |
| 100 | +**Steps**: |
| 101 | +1. Go to Pipeline → Edit |
| 102 | +2. Update location parameter default value: |
| 103 | + ```yaml |
| 104 | + - name: location |
| 105 | + default: "centraluseuap" # Change this to the new region, e.g. "eastus" |
| 106 | + ``` |
| 107 | +3. Save and commit |
| 108 | +4. The RG name automatically follows the new location (for `eastus`, it becomes `sv2-long-run-eastus`) |
| 109 | + |
| 110 | +Alternatively, manually trigger with the new location or override `resourceGroupName` directly. |
| 111 | + |
| 112 | +## Creating Multiple Test Setups |
| 113 | + |
| 114 | +**Use Case**: You want to create a new test environment without affecting the existing one (e.g., for testing different configurations, regions, or versions). |
| 115 | + |
| 116 | +**Steps**: |
| 117 | +1. Go to Pipeline → Run pipeline |
| 118 | +2. **Change `resourceGroupName`** to a unique value: |
| 119 | + - For different region: `sv2-long-run-eastus` |
| 120 | + - For parallel test: `sv2-long-run-centraluseuap-v2` |
| 121 | + - For experimental: `sv2-long-run-centraluseuap-experimental` |
| 122 | +3. Set `runSetupStages` = `true` |
| 123 | +4. Optionally change `location` parameter |
| 124 | +5. Run pipeline |
| 125 | + |
| 126 | +**After setup completes**: |
| 127 | +- The new infrastructure will be tagged with `SkipAutoDeleteTill=2032-12-31` |
| 128 | +- To run tests against this new setup, either: |
| 129 | + - **Option A**: Update the pipeline default `resourceGroupName` parameter |
| 130 | + - **Option B**: Manually trigger test runs with the new `resourceGroupName` |
| 131 | + |
| 132 | +**Example Scenarios**: |
| 133 | + |
| 134 | +| Scenario | Resource Group Name | Purpose | |
| 135 | +|----------|-------------------|---------| |
| 136 | +| Default production | `sv2-long-run-centraluseuap` | Hourly scheduled tests | |
| 137 | +| East US environment | `sv2-long-run-eastus` | Regional testing | |
| 138 | +| Test new features | `sv2-long-run-centraluseuap-dev` | Development/testing | |
| 139 | +| Version upgrade | `sv2-long-run-centraluseuap-v2` | Parallel environment for upgrades | |
| 140 | + |
| 141 | +## Resource Naming |
| 142 | + |
| 143 | +The pipeline uses the **resource group name as the BUILD_ID** to ensure unique resource names per test setup. This allows multiple parallel test environments without naming collisions. |
| 144 | + |
| 145 | +**Generated Resource Names**: |
| 146 | +``` |
| 147 | +BUILD_ID = <resourceGroupName> |
| 148 | + |
| 149 | +PodNetwork: pn-<BUILD_ID>-<vnet>-<subnet> |
| 150 | +PodNetworkInstance: pni-<BUILD_ID>-<vnet>-<subnet> |
| 151 | +Namespace: pn-<BUILD_ID>-<vnet>-<subnet> |
| 152 | +Pod: pod-<scenario-suffix> |
| 153 | +``` |
| 154 | + |
| 155 | +**Example for `resourceGroupName=sv2-long-run-centraluseuap`**: |
| 156 | +``` |
| 157 | +pn-sv2-long-run-centraluseuap-b1-s1 (PodNetwork for cx_vnet_b1, subnet s1) |
| 158 | +pni-sv2-long-run-centraluseuap-b1-s1 (PodNetworkInstance) |
| 159 | +pn-sv2-long-run-centraluseuap-a1-s1 (PodNetwork for cx_vnet_a1, subnet s1) |
| 160 | +pni-sv2-long-run-centraluseuap-a1-s2 (PodNetworkInstance for cx_vnet_a1, subnet s2) |
| 161 | +``` |
| 162 | + |
| 163 | +**Example for different setup `resourceGroupName=sv2-long-run-eastus`**: |
| 164 | +``` |
| 165 | +pn-sv2-long-run-eastus-b1-s1 (Different from centraluseuap setup) |
| 166 | +pni-sv2-long-run-eastus-b1-s1 |
| 167 | +pn-sv2-long-run-eastus-a1-s1 |
| 168 | +``` |
| 169 | + |
| 170 | +This ensures **no collision** between different test setups running in parallel. |
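The naming scheme above can be sketched as a small shell helper. This is an illustration only, not the pipeline's actual code: the function names `short_vnet` and `resource_name` are hypothetical, and the sketch assumes every customer VNet name carries the `cx_vnet_` prefix.

```bash
# Hypothetical helpers illustrating the naming scheme; not taken from the pipeline scripts.

# Strip the "cx_vnet_" prefix: cx_vnet_a1 -> a1
short_vnet() {
  printf '%s\n' "${1#cx_vnet_}"
}

# Build "<type>-<BUILD_ID>-<vnet>-<subnet>".
# Arguments: type (e.g. pn or pni), BUILD_ID, full VNet name, subnet.
resource_name() {
  printf '%s-%s-%s-%s\n' "$1" "$2" "$(short_vnet "$3")" "$4"
}
```

For example, `resource_name pni sv2-long-run-eastus cx_vnet_b1 s1` prints `pni-sv2-long-run-eastus-b1-s1`. Because the BUILD_ID is part of every name, two setups with different resource group names cannot produce the same PodNetwork or PodNetworkInstance name.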
| 171 | + |
| 172 | +## Deletion Strategy |
| 173 | +### Phase 1: Delete All Pods |
| 174 | +Deletes all pods across all scenarios first. This ensures IP reservations are released. |
| 175 | + |
| 176 | +``` |
| 177 | +Deleting pod pod-c2-aks2-b1s1-low... |
| 178 | +Deleting pod pod-c2-aks2-b1s1-high... |
| 179 | +... |
| 180 | +``` |
| 181 | + |
| 182 | +### Phase 2: Delete Shared Resources |
| 183 | +Groups resources by vnet/subnet/cluster and deletes PNI/PN/Namespace once per group. |
| 184 | + |
| 185 | +``` |
| 186 | +Deleting PodNetworkInstance pni-static-setup-b1-s1... |
| 187 | +Deleting PodNetwork pn-static-setup-b1-s1... |
| 188 | +Deleting namespace pn-static-setup-b1-s1... |
| 189 | +``` |
| 190 | + |
| 191 | +**Why**: Multiple pods can share the same PNI. Deleting PNI while pods exist causes "ReservationInUse" errors. |
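The two-phase order can be illustrated with a dry-run sketch that only prints the commands it would run. This is not the logic in `datapath_delete_test.go`: the function name `plan_deletion` and its input format are hypothetical, the `podnetworkinstance` and `podnetwork` resource kinds and the cluster-scoped PodNetwork are assumptions about the CRDs, and the sketch ignores the per-cluster kubeconfig for brevity.

```bash
# Hypothetical dry-run sketch of the two-phase deletion order; it only prints commands.
# Reads lines of "<pod> <pni> <podnetwork> <namespace>" from stdin.
# Resource kinds passed to kubectl are assumptions about the CRD names.
plan_deletion() {
  _rows="$(cat)"

  # Phase 1: delete every pod first so IP reservations are released.
  printf '%s\n' "$_rows" | while read -r pod pni pn ns; do
    [ -n "$pod" ] || continue
    echo "kubectl delete pod $pod -n $ns"
  done

  # Phase 2: delete shared resources once per unique PNI/PN/namespace group.
  printf '%s\n' "$_rows" | awk 'NF >= 4 {print $2, $3, $4}' | sort -u |
  while read -r pni pn ns; do
    echo "kubectl delete podnetworkinstance $pni -n $ns"
    echo "kubectl delete podnetwork $pn"
    echo "kubectl delete namespace $ns"
  done
}
```

Feeding it two pods that share one PNI yields two pod deletions followed by a single PNI, PodNetwork, and namespace deletion, which is the ordering that avoids "ReservationInUse".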
| 192 | + |
| 193 | +## Troubleshooting |
| 194 | + |
| 195 | +### Tests are running on wrong cluster |
| 196 | +- Check `resourceGroupName` parameter points to correct RG |
| 197 | +- Verify RG contains aks-1 and aks-2 clusters |
| 198 | +- Check kubeconfig retrieval in logs |
| 199 | + |
| 200 | +### Setup stages not running |
| 201 | +- Verify `runSetupStages` parameter is set to `true` |
| 202 | +- Check condition: `condition: eq(${{ parameters.runSetupStages }}, true)` |
| 203 | + |
| 204 | +### Schedule not triggering |
| 205 | +- Verify cron expression: `"0 */1 * * *"` (every 1 hour) |
| 206 | +- Check branch in schedule matches your working branch |
| 207 | +- Ensure `always: true` is set (runs even without code changes) |
| 208 | + |
| 209 | +### PNI stuck with "ReservationInUse" |
| 210 | +- Check if pods were deleted first (Phase 1 logs) |
| 211 | +- Manual fix: Delete pod → Wait 10s → Patch PNI to remove finalizers |
| 212 | + |
| 213 | +### Pipeline timeout after 6 hours |
| 214 | +- This is expected behavior (timeoutInMinutes: 360) |
| 215 | +- Tests should complete in ~30-40 minutes |
| 216 | +- If tests hang, check deletion logs for stuck resources |
| 217 | + |
| 218 | +## Manual Testing |
| 219 | + |
| 220 | +Run locally against existing infrastructure: |
| 221 | + |
| 222 | +```bash |
| 223 | +export RG="sv2-long-run-centraluseuap" # Match your resource group |
| 224 | +export BUILD_ID="$RG" # Use same RG name as BUILD_ID for unique resource names |
| 225 | + |
| 226 | +cd test/integration/swiftv2/longRunningCluster |
| 227 | +ginkgo -v -trace --timeout=6h . |
| 228 | +``` |
| 229 | + |
| 230 | +## Node Pool Configuration |
| 231 | + |
| 232 | +- **Low-NIC nodes** (`Standard_D4s_v3`): 1 NIC, selected with `agentpool!=nplinux` |
| 233 | + - Can only run 1 pod at a time |
| 234 | + |
| 235 | +- **High-NIC nodes** (`Standard_D16s_v3`): 7 NICs, label `agentpool=nplinux` |
| 236 | + - Currently limited to 1 pod per node in test logic |
| 237 | + |
| 238 | +## Schedule Modification |
| 239 | + |
| 240 | +To change test frequency, edit the cron schedule: |
| 241 | + |
| 242 | +```yaml |
| 243 | +schedules: |
| 244 | + - cron: "0 */1 * * *" # Every 1 hour (current) |
| 245 | + # Examples: |
| 246 | + # - cron: "0 */2 * * *" # Every 2 hours |
| 247 | + # - cron: "0 */6 * * *" # Every 6 hours |
| 248 | + # - cron: "0 0,8,16 * * *" # At 00:00, 08:00, and 16:00 UTC |
| 249 | + # - cron: "0 0 * * *" # Daily at midnight UTC |
| 250 | +``` |
| 251 | + |
| 252 | +## File Structure |
| 253 | + |
| 254 | +``` |
| 255 | +.pipelines/swiftv2-long-running/ |
| 256 | +├── pipeline.yaml # Main pipeline with schedule |
| 257 | +├── README.md # This file |
| 258 | +├── template/ |
| 259 | +│ └── long-running-pipeline-template.yaml # Stage definitions (2 jobs) |
| 260 | +└── scripts/ |
| 261 | + ├── create_aks.sh # AKS cluster creation |
| 262 | + ├── create_vnets.sh # VNet and subnet creation |
| 263 | + ├── create_peerings.sh # VNet peering setup |
| 264 | + ├── create_storage.sh # Storage account creation |
| 265 | + ├── create_nsg.sh # Network security groups |
| 266 | + └── create_pe.sh # Private endpoint setup |
| 267 | + |
| 268 | +test/integration/swiftv2/longRunningCluster/ |
| 269 | +├── datapath_test.go # Original combined test (deprecated) |
| 270 | +├── datapath_create_test.go # Create test scenarios (Job 1) |
| 271 | +├── datapath_delete_test.go # Delete test scenarios (Job 2) |
| 272 | +├── datapath.go # Resource orchestration |
| 273 | +└── helpers/ |
| 274 | + └── az_helpers.go # Azure/kubectl helper functions |
| 275 | +``` |
| 276 | + |
| 277 | +## Best Practices |
| 278 | + |
| 279 | +1. **Keep infrastructure persistent**: Only recreate when necessary (cluster upgrades, config changes) |
| 280 | +2. **Monitor scheduled runs**: Set up alerts for test failures |
| 281 | +3. **Resource naming**: BUILD_ID is automatically set to the resource group name, ensuring unique resource names per setup |
| 282 | +4. **Tag resources appropriately**: All setup resources automatically tagged with `SkipAutoDeleteTill=2032-12-31` |
| 283 | + - AKS clusters |
| 284 | + - AKS VNets |
| 285 | + - Customer VNets (cx_vnet_a1, cx_vnet_a2, cx_vnet_a3, cx_vnet_b1) |
| 286 | + - Storage accounts |
| 287 | +5. **Avoid resource group collisions**: Always use unique `resourceGroupName` when creating new setups |
| 288 | +6. **Document changes**: Update this README when modifying test scenarios or infrastructure |
| 289 | + |
| 290 | +## Resource Tags |
| 291 | + |
| 292 | +All infrastructure resources are automatically tagged during creation: |
| 293 | + |
| 294 | +```bash |
| 295 | +SkipAutoDeleteTill=2032-12-31 |
| 296 | +``` |
| 297 | + |
| 298 | +This prevents automatic cleanup by Azure subscription policies that delete resources after a certain period. The tag is applied to: |
| 299 | +- Resource group (via create_resource_group job) |
| 300 | +- AKS clusters (aks-1, aks-2) |
| 301 | +- AKS cluster VNets |
| 302 | +- Customer VNets (cx_vnet_a1, cx_vnet_a2, cx_vnet_a3, cx_vnet_b1) |
| 303 | +- Storage accounts (sa1xxxx, sa2xxxx) |
| 304 | + |
| 305 | +To manually update the tag date: |
| 306 | +```bash |
| 307 | +az resource update --ids <resource-id> --set tags.SkipAutoDeleteTill=2033-12-31 |
| 308 | +``` |