|
| 1 | +# SwiftV2 Long-Running Pipeline |
| 2 | + |
| 3 | +This pipeline tests SwiftV2 pod networking in a persistent environment with scheduled test runs. |
| 4 | + |
| 5 | +## Architecture Overview |
| 6 | + |
| 7 | +**Infrastructure (Persistent)**: |
| 8 | +- **2 AKS Clusters**: aks-1, aks-2 (4 nodes each: 2 low-NIC default pool, 2 high-NIC nplinux pool) |
| 9 | +- **4 VNets**: cx_vnet_a1, cx_vnet_a2, cx_vnet_a3 (Customer 1 with PE to storage), cx_vnet_b1 (Customer 2) |
| 10 | +- **VNet Peerings**: Two of Customer 1's three VNets are peered. |
| 11 | +- **Storage Account**: With private endpoint from cx_vnet_a1 |
| 12 | +- **NSGs**: Restricting traffic between subnets (s1, s2) in vnet cx_vnet_a1. |
| 13 | + |
| 14 | +**Test Scenarios (8 total)**: |
| 15 | +- Multiple pods across 2 clusters, 4 VNets, different subnets (s1, s2), and node types (low-NIC, high-NIC) |
| 16 | +- Each test run: Create all resources → Wait 20 minutes → Delete all resources |
| 17 | +- Tests run automatically every 1 hour via scheduled trigger |
| 18 | + |
| 19 | +## Pipeline Modes |
| 20 | + |
| 21 | +### Mode 1: Scheduled Test Runs (Default) |
| 22 | +**Trigger**: Automated cron schedule every 1 hour |
| 23 | +**Purpose**: Continuous validation of long-running infrastructure |
| 24 | +**Setup Stages**: Disabled |
| 25 | +**Test Duration**: ~30-40 minutes per run |
| 26 | +**Resource Group**: Static (default: `sv2-long-run-<region>`, e.g., `sv2-long-run-centraluseuap`) |
| 27 | + |
| 28 | +```yaml |
| 29 | +# Runs automatically every 1 hour |
| 30 | +# No manual/external triggers allowed |
| 31 | +``` |
| 32 | + |
| 33 | +### Mode 2: Initial Setup or Rebuild |
| 34 | +**Trigger**: Manual run with parameter change |
| 35 | +**Purpose**: Create new infrastructure or rebuild existing |
| 36 | +**Setup Stages**: Enabled via `runSetupStages: true` |
| 37 | +**Resource Group**: Auto-generated or custom |
| 38 | + |
| 39 | +**To create new infrastructure**: |
| 40 | +1. Go to Pipeline → Run pipeline |
| 41 | +2. Set `runSetupStages` = `true` |
| 42 | +3. **Optional**: Leave `resourceGroupName` empty to auto-generate `sv2-long-run-<location>` |
| 43 | + - Or provide custom name for parallel setups (e.g., `sv2-long-run-eastus-dev`) |
| 44 | +4. Optionally adjust `location`, `vmSkuDefault`, `vmSkuHighNIC` |
| 45 | +5. Run pipeline |
| 46 | + |
| 47 | +## Pipeline Parameters |
| 48 | + |
| 49 | +Parameters are organized by usage: |
| 50 | + |
| 51 | +### Common Parameters (Always Relevant) |
| 52 | +| Parameter | Default | Description | |
| 53 | +|-----------|---------|-------------| |
| 54 | +| `location` | `centraluseuap` | Azure region for resources. Auto-generates RG name: `sv2-long-run-<location>`. | |
| 55 | +| `runSetupStages` | `false` | Set to `true` to create new infrastructure. `false` for scheduled test runs. | |
| 56 | +| `subscriptionId` | `37deca37-...` | Azure subscription ID. | |
| 57 | +| `serviceConnection` | `Azure Container Networking...` | Azure DevOps service connection. | |
| 58 | + |
| 59 | +### Setup-Only Parameters (Only Used When runSetupStages=true) |
| 60 | + |
| 61 | +| Parameter | Default | Description | |
| 62 | +|-----------|---------|-------------| |
| 63 | +| `resourceGroupName` | `""` (empty) | **Leave empty** to auto-generate `sv2-long-run-<location>`. Provide custom name only for parallel setups (e.g., `sv2-long-run-eastus-dev`). | |
| 64 | +| `vmSkuDefault` | `Standard_D4s_v3` | VM SKU for low-NIC node pool (1 NIC). | |
| 65 | +| `vmSkuHighNIC` | `Standard_D16s_v3` | VM SKU for high-NIC node pool (7 NICs). | |
| 66 | + |
| 67 | +**Note**: Setup-only parameters are ignored when `runSetupStages=false` (scheduled runs). |
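The defaulting rule for `resourceGroupName` described in the tables above can be sketched in shell. This is a minimal illustration of the documented behavior only; the function name `resolve_resource_group` is hypothetical and is not taken from the pipeline scripts.

```bash
# Hypothetical helper: mirrors the documented rule, not the pipeline's actual code.
# Usage: resolve_resource_group <resourceGroupName> <location>
resolve_resource_group() {
  local rg_name="$1"
  local location="$2"
  if [ -n "$rg_name" ]; then
    # A custom name was supplied (e.g. for a parallel setup): use it unchanged.
    echo "$rg_name"
  else
    # Empty name: fall back to the auto-generated sv2-long-run-<location>.
    echo "sv2-long-run-${location}"
  fi
}
```

For example, `resolve_resource_group "" centraluseuap` yields the default RG name, while `resolve_resource_group sv2-long-run-eastus-dev eastus` keeps the custom name.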
| 68 | + |
| 69 | +## How It Works |
| 70 | + |
| 71 | +### Scheduled Test Flow |
| 72 | +Every 1 hour, the pipeline: |
| 73 | +1. Skips setup stages (infrastructure already exists) |
| 74 | +2. **Job 1 - Create and Wait**: Creates 8 test scenarios (PodNetwork, PNI, Pods), then waits 20 minutes |
| 75 | +3. **Job 2 - Delete Resources**: Deletes all test resources (Phase 1: Pods, Phase 2: PNI/PN/Namespaces) |
| 76 | +4. Reports results |
| 77 | + |
| 78 | +### Setup Flow (When runSetupStages = true) |
| 79 | +1. Create resource group with `SkipAutoDeleteTill=2032-12-31` tag |
| 80 | +2. Create 2 AKS clusters with 2 node pools each (tagged for persistence) |
| 81 | +3. Create 4 customer VNets with subnets and delegations (tagged for persistence) |
| 82 | +4. Create VNet peerings |
| 83 | +5. Create storage accounts with persistence tags |
| 84 | +6. Create NSGs for subnet isolation |
| 85 | +7. Run initial test (create → wait → delete) |
| 86 | + |
| 87 | +**All infrastructure resources are tagged with `SkipAutoDeleteTill=2032-12-31`** to prevent automatic cleanup by Azure subscription policies. |
| 88 | + |
| 89 | +## Resource Naming |
| 90 | + |
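| 91 | +PodNetwork, PodNetworkInstance, and Namespace names follow the pattern `<type>-<BUILD_ID>-<vnet>-<subnet>`; pods are named `pod-<scenario-suffix>`. The examples below use `static-setup` as the `BUILD_ID`. |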
| 92 | + |
| 93 | +**Examples**: |
| 94 | +- PodNetwork: `pn-static-setup-a1-s1` |
| 95 | +- PodNetworkInstance: `pni-static-setup-a1-s1` |
| 96 | +- Pod: `pod-c1-aks1-a1s1-low` |
| 97 | +- Namespace: `pn-static-setup-a1-s1` |
| 98 | + |
| 99 | +VNet names are simplified: |
| 100 | +- `cx_vnet_a1` → `a1` |
| 101 | +- `cx_vnet_b1` → `b1` |
| 102 | + |
| 103 | +## Switching to a New Setup |
| 104 | + |
| 105 | +**Scenario**: You created a new setup in RG `sv2-long-run-eastus` and want scheduled runs to use it. |
| 106 | + |
| 107 | +**Steps**: |
| 108 | +1. Go to Pipeline → Edit |
| 109 | +2. Update the `location` parameter default to the new region: |
| 110 | + ```yaml |
| 111 | + - name: location |
| 112 | + default: "eastus" # Previously "centraluseuap" |
| 113 | + ``` |
| 114 | +3. Save and commit |
| 115 | +4. The RG name automatically becomes `sv2-long-run-eastus` |
| 116 | + |
| 117 | +Alternatively, manually trigger with the new location or override `resourceGroupName` directly. |
| 118 | + |
| 119 | +## Creating Multiple Test Setups |
| 120 | + |
| 121 | +**Use Case**: You want to create a new test environment without affecting the existing one (e.g., for testing different configurations, regions, or versions). |
| 122 | + |
| 123 | +**Steps**: |
| 124 | +1. Go to Pipeline → Run pipeline |
| 125 | +2. Set `runSetupStages` = `true` |
| 126 | +3. **Set `resourceGroupName`** to a unique value: |
| 127 | + - For different region: `sv2-long-run-eastus` |
| 128 | + - For parallel test: `sv2-long-run-centraluseuap-dev` |
| 129 | + - For experimental: `sv2-long-run-centraluseuap-v2` |
| 130 | + - Or leave empty to use auto-generated `sv2-long-run-<location>` |
| 131 | +4. Optionally adjust `location`, `vmSkuDefault`, `vmSkuHighNIC` |
| 132 | +5. Run pipeline |
| 133 | + |
| 134 | +**After setup completes**: |
| 135 | +- The new infrastructure will be tagged with `SkipAutoDeleteTill=2032-12-31` |
| 136 | +- Resources are isolated by the unique resource group name |
| 137 | +- To run tests against the new setup, the scheduled pipeline would need to be updated with the new RG name |
| 138 | + |
| 139 | +**Example Scenarios**: |
| 140 | +| Scenario | Resource Group Name | Purpose | |
| 141 | +|----------|-------------------|---------| |
| 142 | +| Default production | `sv2-long-run-centraluseuap` | Hourly scheduled tests | |
| 143 | +| East US environment | `sv2-long-run-eastus` | Regional testing | |
| 144 | +| Test new features | `sv2-long-run-centraluseuap-dev` | Development/testing | |
| 145 | +| Version upgrade | `sv2-long-run-centraluseuap-v2` | Parallel environment for upgrades | |
| 146 | + |
| 147 | +## Resource Naming |
| 148 | + |
| 149 | +The pipeline uses the **resource group name as the BUILD_ID** to ensure unique resource names per test setup. This allows multiple parallel test environments without naming collisions. |
| 150 | + |
| 151 | +**Generated Resource Names**: |
| 152 | +``` |
| 153 | +BUILD_ID = <resourceGroupName> |
| 154 | + |
| 155 | +PodNetwork: pn-<BUILD_ID>-<vnet>-<subnet> |
| 156 | +PodNetworkInstance: pni-<BUILD_ID>-<vnet>-<subnet> |
| 157 | +Namespace: pn-<BUILD_ID>-<vnet>-<subnet> |
| 158 | +Pod: pod-<scenario-suffix> |
| 159 | +``` |
| 160 | + |
| 161 | +**Example for `resourceGroupName=sv2-long-run-centraluseuap`**: |
| 162 | +``` |
| 163 | +pn-sv2-long-run-centraluseuap-b1-s1 (PodNetwork for cx_vnet_b1, subnet s1) |
| 164 | +pni-sv2-long-run-centraluseuap-b1-s1 (PodNetworkInstance) |
| 165 | +pn-sv2-long-run-centraluseuap-a1-s1 (PodNetwork for cx_vnet_a1, subnet s1) |
| 166 | +pni-sv2-long-run-centraluseuap-a1-s2 (PodNetworkInstance for cx_vnet_a1, subnet s2) |
| 167 | +``` |
| 168 | + |
| 169 | +**Example for different setup `resourceGroupName=sv2-long-run-eastus`**: |
| 170 | +``` |
| 171 | +pn-sv2-long-run-eastus-b1-s1 (Different from centraluseuap setup) |
| 172 | +pni-sv2-long-run-eastus-b1-s1 |
| 173 | +pn-sv2-long-run-eastus-a1-s1 |
| 174 | +``` |
| 175 | + |
| 176 | +This ensures **no collision** between different test setups running in parallel. |
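The naming scheme above can be sketched as a pair of small shell functions. This is an illustrative sketch, assuming the `cx_vnet_` prefix is simply stripped to get the short VNet name; `short_vnet` and `resource_name` are hypothetical names, not functions from the test code.

```bash
# Hypothetical helpers illustrating the documented naming pattern.

# cx_vnet_a1 -> a1 (assumes the short name is the part after "cx_vnet_")
short_vnet() {
  echo "${1#cx_vnet_}"
}

# Usage: resource_name <type> <BUILD_ID> <vnet> <subnet>
# <type> is "pn" for PodNetwork and Namespace, "pni" for PodNetworkInstance.
resource_name() {
  local type="$1"
  local build_id="$2"
  local vnet="$3"
  local subnet="$4"
  echo "${type}-${build_id}-$(short_vnet "$vnet")-${subnet}"
}
```

With `BUILD_ID` set to the resource group name, `resource_name pni sv2-long-run-eastus cx_vnet_b1 s1` produces the `pni-sv2-long-run-eastus-b1-s1` name listed in the example above.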
| 177 | + |
| 178 | +## Deletion Strategy |
| 179 | +### Phase 1: Delete All Pods |
| 180 | +Deletes all pods across all scenarios first. This ensures IP reservations are released. |
| 181 | + |
| 182 | +``` |
| 183 | +Deleting pod pod-c2-aks2-b1s1-low... |
| 184 | +Deleting pod pod-c2-aks2-b1s1-high... |
| 185 | +... |
| 186 | +``` |
| 187 | + |
| 188 | +### Phase 2: Delete Shared Resources |
| 189 | +Groups resources by vnet/subnet/cluster and deletes PNI/PN/Namespace once per group. |
| 190 | + |
| 191 | +``` |
| 192 | +Deleting PodNetworkInstance pni-static-setup-b1-s1... |
| 193 | +Deleting PodNetwork pn-static-setup-b1-s1... |
| 194 | +Deleting namespace pn-static-setup-b1-s1... |
| 195 | +``` |
| 196 | + |
| 197 | +**Why**: Multiple pods can share the same PNI. Deleting PNI while pods exist causes "ReservationInUse" errors. |
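The two-phase ordering can be sketched as a dry-run planner that only prints the deletions it would perform. This is a simplified illustration, not the logic in `datapath_delete_test.go`; the function name `plan_deletion` and the input format (one `<pod> <cluster> <BUILD_ID>-<vnet>-<subnet>` triple per line) are assumptions made for the example.

```bash
# Hypothetical dry-run planner: prints the deletion order, touches no cluster.
# Input: newline-separated "<pod> <cluster> <BUILD_ID>-<vnet>-<subnet>" triples.
plan_deletion() {
  local scenarios="$1"

  # Phase 1: delete every pod first so IP reservations are released.
  printf '%s\n' "$scenarios" | while read -r pod cluster suffix; do
    if [ -n "$pod" ]; then
      echo "[$cluster] delete pod $pod"
    fi
  done

  # Phase 2: delete shared resources once per unique (cluster, suffix) group.
  printf '%s\n' "$scenarios" | awk 'NF {print $2, $3}' | sort -u |
    while read -r cluster suffix; do
      echo "[$cluster] delete podnetworkinstance pni-$suffix"
      echo "[$cluster] delete podnetwork pn-$suffix"
      echo "[$cluster] delete namespace pn-$suffix"
    done
}
```

Two pods that share `static-setup-b1-s1` on the same cluster produce two pod deletions followed by a single PNI/PN/Namespace deletion, which is the ordering that avoids the "ReservationInUse" error.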
| 198 | + |
| 199 | +## Troubleshooting |
| 200 | + |
| 201 | +### Tests are running on wrong cluster |
| 202 | +- Check `resourceGroupName` parameter points to correct RG |
| 203 | +- Verify RG contains aks-1 and aks-2 clusters |
| 204 | +- Check kubeconfig retrieval in logs |
| 205 | +
|
| 206 | +### Setup stages not running |
| 207 | +- Verify `runSetupStages` parameter is set to `true` |
| 208 | +- Check condition: `condition: eq(${{ parameters.runSetupStages }}, true)` |
| 209 | +
|
| 210 | +### Schedule not triggering |
| 211 | +- Verify cron expression: `"0 */1 * * *"` (every 1 hour) |
| 212 | +- Check branch in schedule matches your working branch |
| 213 | +- Ensure `always: true` is set (runs even without code changes) |
| 214 | +
|
| 215 | +### PNI stuck with "ReservationInUse" |
| 216 | +- Check if pods were deleted first (Phase 1 logs) |
| 217 | +- Manual fix: Delete pod → Wait 10s → Patch PNI to remove finalizers |
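The manual fix could look roughly like the sketch below. It assumes the PodNetworkInstance is namespaced and addressable by `kubectl` as `podnetworkinstance`; verify the resource name with `kubectl api-resources` first. The function name `force_release_pni` and the `WAIT_SECONDS` variable are hypothetical. Removing finalizers bypasses the controller's own cleanup, so treat this as a last resort.

```bash
# Hypothetical helper for the manual fix; resource name and arguments are assumptions.
# Usage: force_release_pni <namespace> <pod-name> <pni-name>
force_release_pni() {
  local namespace="$1"
  local pod="$2"
  local pni="$3"

  # Step 1: make sure the pod holding the reservation is gone.
  kubectl delete pod "$pod" -n "$namespace" --ignore-not-found

  # Step 2: give the reservation time to be released (10s by default).
  sleep "${WAIT_SECONDS:-10}"

  # Step 3: clear finalizers with a merge patch so the PNI deletion can complete.
  kubectl patch podnetworkinstance "$pni" -n "$namespace" \
    --type merge -p '{"metadata":{"finalizers":null}}'
}
```

An example call, using names from the Resource Naming section, would be `force_release_pni pn-static-setup-b1-s1 pod-c2-aks2-b1s1-low pni-static-setup-b1-s1`.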
| 218 | + |
| 219 | +### Pipeline timeout after 6 hours |
| 220 | +- This is expected behavior (timeoutInMinutes: 360) |
| 221 | +- Tests should complete in ~30-40 minutes |
| 222 | +- If tests hang, check deletion logs for stuck resources |
| 223 | + |
| 224 | +## Manual Testing |
| 225 | + |
| 226 | +Run locally against existing infrastructure: |
| 227 | + |
| 228 | +```bash |
| 229 | +export RG="sv2-long-run-centraluseuap" # Match your resource group |
| 230 | +export BUILD_ID="$RG" # Use same RG name as BUILD_ID for unique resource names |
| 231 | + |
| 232 | +cd test/integration/swiftv2/longRunningCluster |
| 233 | +ginkgo -v -trace --timeout=6h . |
| 234 | +``` |
| 235 | + |
| 236 | +## Node Pool Configuration |
| 237 | + |
| 238 | +- **Low-NIC nodes** (`Standard_D4s_v3`): 1 NIC, selected with `agentpool!=nplinux` |
| 239 | + - Can only run 1 pod at a time |
| 240 | + |
| 241 | +- **High-NIC nodes** (`Standard_D16s_v3`): 7 NICs, label `agentpool=nplinux` |
| 242 | + - Currently limited to 1 pod per node in test logic |
| 243 | + |
| 244 | +## Schedule Modification |
| 245 | + |
| 246 | +To change test frequency, edit the cron schedule: |
| 247 | + |
| 248 | +```yaml |
| 249 | +schedules: |
| 250 | + - cron: "0 */1 * * *" # Every 1 hour (current) |
| 251 | + # Examples: |
| 252 | + # - cron: "0 */2 * * *" # Every 2 hours |
| 253 | + # - cron: "0 */6 * * *" # Every 6 hours |
| 254 | + # - cron: "0 0,8,16 * * *" # At 00:00, 08:00, 16:00 UTC |
| 255 | + # - cron: "0 0 * * *" # Daily at midnight UTC |
| 256 | +``` |
| 257 | + |
| 258 | +## File Structure |
| 259 | + |
| 260 | +``` |
| 261 | +.pipelines/swiftv2-long-running/ |
| 262 | +├── pipeline.yaml # Main pipeline with schedule |
| 263 | +├── README.md # This file |
| 264 | +├── template/ |
| 265 | +│ └── long-running-pipeline-template.yaml # Stage definitions (2 jobs) |
| 266 | +└── scripts/ |
| 267 | + ├── create_aks.sh # AKS cluster creation |
| 268 | + ├── create_vnets.sh # VNet and subnet creation |
| 269 | + ├── create_peerings.sh # VNet peering setup |
| 270 | + ├── create_storage.sh # Storage account creation |
| 271 | + ├── create_nsg.sh # Network security groups |
| 272 | + └── create_pe.sh # Private endpoint setup |
| 273 | +
|
| 274 | +test/integration/swiftv2/longRunningCluster/ |
| 275 | +├── datapath_test.go # Original combined test (deprecated) |
| 276 | +├── datapath_create_test.go # Create test scenarios (Job 1) |
| 277 | +├── datapath_delete_test.go # Delete test scenarios (Job 2) |
| 278 | +├── datapath.go # Resource orchestration |
| 279 | +└── helpers/ |
| 280 | + └── az_helpers.go # Azure/kubectl helper functions |
| 281 | +``` |
| 282 | + |
| 283 | +## Best Practices |
| 284 | + |
| 285 | +1. **Keep infrastructure persistent**: Only recreate when necessary (cluster upgrades, config changes) |
| 286 | +2. **Monitor scheduled runs**: Set up alerts for test failures |
| 287 | +3. **Resource naming**: BUILD_ID is automatically set to the resource group name, ensuring unique resource names per setup |
| 288 | +4. **Tag resources appropriately**: All setup resources automatically tagged with `SkipAutoDeleteTill=2032-12-31` |
| 289 | + - AKS clusters |
| 290 | + - AKS VNets |
| 291 | + - Customer VNets (cx_vnet_a1, cx_vnet_a2, cx_vnet_a3, cx_vnet_b1) |
| 292 | + - Storage accounts |
| 293 | +5. **Avoid resource group collisions**: Always use unique `resourceGroupName` when creating new setups |
| 294 | +6. **Document changes**: Update this README when modifying test scenarios or infrastructure |
| 295 | + |
| 296 | +## Resource Tags |
| 297 | + |
| 298 | +All infrastructure resources are automatically tagged during creation: |
| 299 | + |
| 300 | +```bash |
| 301 | +SkipAutoDeleteTill=2032-12-31 |
| 302 | +``` |
| 303 | + |
| 304 | +This prevents automatic cleanup by Azure subscription policies that delete resources after a certain period. The tag is applied to: |
| 305 | +- Resource group (via create_resource_group job) |
| 306 | +- AKS clusters (aks-1, aks-2) |
| 307 | +- AKS cluster VNets |
| 308 | +- Customer VNets (cx_vnet_a1, cx_vnet_a2, cx_vnet_a3, cx_vnet_b1) |
| 309 | +- Storage accounts (sa1xxxx, sa2xxxx) |
| 310 | + |
| 311 | +To manually update the tag date: |
| 312 | +```bash |
| 313 | +az resource update --ids <resource-id> --set tags.SkipAutoDeleteTill=2033-12-31 |
| 314 | +``` |
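To extend the tag on every resource in a setup at once, the single-resource command above can be wrapped in a loop. This is a sketch under the assumption that all infrastructure for a setup lives in one resource group; the function name `extend_skip_auto_delete` is hypothetical. It does not update the tag on the resource group itself.

```bash
# Hypothetical helper: applies the documented tag update to each resource in an RG.
# Usage: extend_skip_auto_delete <resource-group> <YYYY-MM-DD>
extend_skip_auto_delete() {
  local rg="$1"
  local until_date="$2"

  # List the IDs of all resources in the group, one per line.
  az resource list --resource-group "$rg" --query "[].id" --output tsv |
    while read -r resource_id; do
      # Same update as the single-resource command, applied per ID.
      az resource update --ids "$resource_id" \
        --set "tags.SkipAutoDeleteTill=${until_date}"
    done
}
```

For example, `extend_skip_auto_delete sv2-long-run-centraluseuap 2033-12-31` would move the date forward for the default setup.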