Commit b07b697

Author: sivakami
Add SwiftV2 long-running pipeline with scheduled tests
- Implemented scheduled pipeline running every 1 hour with persistent infrastructure
- Split test execution into 2 jobs: Create (with 20min wait) and Delete
- Added 8 test scenarios across 2 AKS clusters, 4 VNets, different subnets
- Implemented two-phase deletion strategy to prevent PNI ReservationInUse errors
- Added context timeouts on kubectl commands with force delete fallbacks
- Resource naming uses RG name as BUILD_ID for uniqueness across parallel setups
- Added SkipAutoDeleteTill tags to prevent automatic resource cleanup
- Conditional setup stages controlled by runSetupStages parameter
- Auto-generate RG name from location or allow custom names for parallel setups
- Added comprehensive README with setup instructions and troubleshooting
- Node selection by agentpool labels with usage tracking to prevent conflicts
- Kubernetes naming compliance (RFC 1123) for all resources
1 parent b9f406a commit b07b697

17 files changed: +2020 -137 lines changed

Lines changed: 314 additions & 0 deletions
@@ -0,0 +1,314 @@
# SwiftV2 Long-Running Pipeline

This pipeline tests SwiftV2 pod networking in a persistent environment with scheduled test runs.

## Architecture Overview

**Infrastructure (Persistent)**:
- **2 AKS Clusters**: aks-1, aks-2 (4 nodes each: 2 in the low-NIC default pool, 2 in the high-NIC nplinux pool)
- **4 VNets**: cx_vnet_a1, cx_vnet_a2, cx_vnet_a3 (Customer 1, with a private endpoint to storage), cx_vnet_b1 (Customer 2)
- **VNet Peerings**: Two of Customer 1's three VNets are peered.
- **Storage Account**: With a private endpoint from cx_vnet_a1
- **NSGs**: Restrict traffic between subnets (s1, s2) in VNet cx_vnet_a1.

**Test Scenarios (8 total)**:
- Multiple pods across 2 clusters, 4 VNets, different subnets (s1, s2), and node types (low-NIC, high-NIC)
- Each test run: Create all resources → Wait 20 minutes → Delete all resources
- Tests run automatically every hour via a scheduled trigger

## Pipeline Modes

### Mode 1: Scheduled Test Runs (Default)
- **Trigger**: Automated cron schedule, every hour
- **Purpose**: Continuous validation of long-running infrastructure
- **Setup Stages**: Disabled
- **Test Duration**: ~30-40 minutes per run
- **Resource Group**: Static (default: `sv2-long-run-<region>`, e.g., `sv2-long-run-centraluseuap`)

```yaml
# Runs automatically every 1 hour
# CI and PR triggers are disabled (trigger: none, pr: none)
```

### Mode 2: Initial Setup or Rebuild
- **Trigger**: Manual run with parameter change
- **Purpose**: Create new infrastructure or rebuild existing infrastructure
- **Setup Stages**: Enabled via `runSetupStages: true`
- **Resource Group**: Auto-generated or custom

**To create new infrastructure**:
1. Go to Pipeline → Run pipeline
2. Set `runSetupStages` = `true`
3. **Optional**: Leave `resourceGroupName` empty to auto-generate `sv2-long-run-<location>`
   - Or provide a custom name for parallel setups (e.g., `sv2-long-run-eastus-dev`)
4. Optionally adjust `location`, `vmSkuDefault`, `vmSkuHighNIC`
5. Run pipeline

## Pipeline Parameters

Parameters are organized by usage:

### Common Parameters (Always Relevant)
| Parameter | Default | Description |
|-----------|---------|-------------|
| `location` | `centraluseuap` | Azure region for resources. Auto-generates RG name: `sv2-long-run-<location>`. |
| `runSetupStages` | `false` | Set to `true` to create new infrastructure. `false` for scheduled test runs. |
| `subscriptionId` | `37deca37-...` | Azure subscription ID. |
| `serviceConnection` | `Azure Container Networking...` | Azure DevOps service connection. |

### Setup-Only Parameters (Only Used When runSetupStages=true)

| Parameter | Default | Description |
|-----------|---------|-------------|
| `resourceGroupName` | `""` (empty) | **Leave empty** to auto-generate `sv2-long-run-<location>`. Provide custom name only for parallel setups (e.g., `sv2-long-run-eastus-dev`). |
| `vmSkuDefault` | `Standard_D4s_v3` | VM SKU for low-NIC node pool (1 NIC). |
| `vmSkuHighNIC` | `Standard_D16s_v3` | VM SKU for high-NIC node pool (7 NICs). |

**Note**: Setup-only parameters are ignored when `runSetupStages=false` (scheduled runs).
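
The resource group naming rule from the parameter tables can be sketched in shell as follows. This is only an illustration: the function name `resolve_rg_name` is hypothetical and does not come from the pipeline, and the sketch assumes that an empty `resourceGroupName` always falls back to `sv2-long-run-<location>`.

```bash
# Hypothetical helper, not taken from the pipeline scripts.
# Usage: resolve_rg_name <location> [resourceGroupName]
resolve_rg_name() {
  local location="$1"
  local custom="${2:-}"
  if [ -n "$custom" ]; then
    # A custom name was supplied (parallel setup): use it unchanged.
    printf '%s\n' "$custom"
  else
    # Empty name: fall back to the auto-generated default.
    printf 'sv2-long-run-%s\n' "$location"
  fi
}

resolve_rg_name centraluseuap                    # prints sv2-long-run-centraluseuap
resolve_rg_name eastus sv2-long-run-eastus-dev   # prints sv2-long-run-eastus-dev
```

Keeping the fallback in one place makes it easier to see which resource group a run with a given `location` would target.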

## How It Works

### Scheduled Test Flow
Every hour, the pipeline:
1. Skips setup stages (infrastructure already exists)
2. **Job 1 - Create and Wait**: Creates 8 test scenarios (PodNetwork, PNI, Pods), then waits 20 minutes
3. **Job 2 - Delete Resources**: Deletes all test resources (Phase 1: Pods, Phase 2: PNI/PN/Namespaces)
4. Reports results

### Setup Flow (When runSetupStages = true)
1. Create resource group with `SkipAutoDeleteTill=2032-12-31` tag
2. Create 2 AKS clusters with 2 node pools each (tagged for persistence)
3. Create 4 customer VNets with subnets and delegations (tagged for persistence)
4. Create VNet peerings
5. Create storage accounts with persistence tags
6. Create NSGs for subnet isolation
7. Run initial test (create → wait → delete)

**All infrastructure resources are tagged with `SkipAutoDeleteTill=2032-12-31`** to prevent automatic cleanup by Azure subscription policies.

## Resource Naming

All test resources use the pattern: `<type>-static-setup-<vnet>-<subnet>`

**Examples**:
- PodNetwork: `pn-static-setup-a1-s1`
- PodNetworkInstance: `pni-static-setup-a1-s1`
- Pod: `pod-c1-aks1-a1s1-low`
- Namespace: `pn-static-setup-a1-s1`

VNet names are simplified:
- `cx_vnet_a1` → `a1`
- `cx_vnet_b1` → `b1`

## Switching to a New Setup

**Scenario**: You created a new setup in RG `sv2-long-run-eastus` and want scheduled runs to use it.

**Steps**:
1. Go to Pipeline → Edit
2. Update the default value of the `location` parameter:
   ```yaml
   - name: location
     default: "centraluseuap"  # Change this to the new region, e.g. "eastus"
   ```
3. Save and commit
4. The RG name will automatically become `sv2-long-run-<location>` (here, `sv2-long-run-eastus`)

Alternatively, trigger a run manually with the new location, or override `resourceGroupName` directly.

## Creating Multiple Test Setups

**Use Case**: You want to create a new test environment without affecting the existing one (e.g., for testing different configurations, regions, or versions).

**Steps**:
1. Go to Pipeline → Run pipeline
2. Set `runSetupStages` = `true`
3. **Set `resourceGroupName`** to a unique value:
   - For a different region: `sv2-long-run-eastus`
   - For a parallel test: `sv2-long-run-centraluseuap-dev`
   - For an experimental setup: `sv2-long-run-centraluseuap-v2`
   - Or leave empty to use the auto-generated `sv2-long-run-<location>`
4. Optionally adjust `location`, `vmSkuDefault`, `vmSkuHighNIC`
5. Run pipeline

**After setup completes**:
- The new infrastructure is tagged with `SkipAutoDeleteTill=2032-12-31`
- Resources are isolated by the unique resource group name
- To run scheduled tests against the new setup, update the scheduled pipeline to use the new RG name (see "Switching to a New Setup")

**Example Scenarios**:
| Scenario | Resource Group Name | Purpose |
|----------|-------------------|---------|
| Default production | `sv2-long-run-centraluseuap` | Hourly scheduled tests |
| East US environment | `sv2-long-run-eastus` | Regional testing |
| Test new features | `sv2-long-run-centraluseuap-dev` | Development/testing |
| Version upgrade | `sv2-long-run-centraluseuap-v2` | Parallel environment for upgrades |

## Resource Naming

The pipeline uses the **resource group name as the BUILD_ID** to ensure unique resource names per test setup. This allows multiple parallel test environments without naming collisions.

**Generated Resource Names**:
```
BUILD_ID = <resourceGroupName>

PodNetwork:         pn-<BUILD_ID>-<vnet>-<subnet>
PodNetworkInstance: pni-<BUILD_ID>-<vnet>-<subnet>
Namespace:          pn-<BUILD_ID>-<vnet>-<subnet>
Pod:                pod-<scenario-suffix>
```

**Example for `resourceGroupName=sv2-long-run-centraluseuap`**:
```
pn-sv2-long-run-centraluseuap-b1-s1   (PodNetwork for cx_vnet_b1, subnet s1)
pni-sv2-long-run-centraluseuap-b1-s1  (PodNetworkInstance)
pn-sv2-long-run-centraluseuap-a1-s1   (PodNetwork for cx_vnet_a1, subnet s1)
pni-sv2-long-run-centraluseuap-a1-s2  (PodNetworkInstance for cx_vnet_a1, subnet s2)
```

**Example for different setup `resourceGroupName=sv2-long-run-eastus`**:
```
pn-sv2-long-run-eastus-b1-s1   (Different from centraluseuap setup)
pni-sv2-long-run-eastus-b1-s1
pn-sv2-long-run-eastus-a1-s1
```

This ensures **no collision** between different test setups running in parallel.
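
As a rough illustration of the naming scheme, the sketch below builds a name from `BUILD_ID`, a VNet, and a subnet, and checks it against the RFC 1123 label form that Kubernetes requires for namespace names. The helper names (`short_vnet`, `resource_name`, `is_rfc1123_label`) are hypothetical and are not taken from the test code; the actual implementation may differ.

```bash
# Hypothetical helpers, not taken from the test code.

# short_vnet cx_vnet_a1 prints a1 (strips the "cx_vnet_" prefix).
short_vnet() {
  printf '%s\n' "${1#cx_vnet_}"
}

# resource_name <prefix> <build_id> <vnet> <subnet>
resource_name() {
  printf '%s-%s-%s-%s\n' "$1" "$2" "$(short_vnet "$3")" "$4"
}

# is_rfc1123_label <name>: succeeds if the name is a valid DNS label
# (lowercase alphanumerics and '-', alphanumeric at both ends, max 63 chars).
is_rfc1123_label() {
  [ "${#1}" -le 63 ] &&
    printf '%s\n' "$1" | grep -Eq '^[a-z0-9]([-a-z0-9]*[a-z0-9])?$'
}

BUILD_ID="sv2-long-run-centraluseuap"
name="$(resource_name pni "$BUILD_ID" cx_vnet_a1 s2)"
if is_rfc1123_label "$name"; then
  echo "$name is a valid RFC 1123 label"
fi
```

Because the resource group name is embedded in every resource name, a long custom `resourceGroupName` can push names toward the 63-character label limit, so a check like this may be worth running before setup.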

## Deletion Strategy
### Phase 1: Delete All Pods
Deletes all pods across all scenarios first. This ensures IP reservations are released.

```
Deleting pod pod-c2-aks2-b1s1-low...
Deleting pod pod-c2-aks2-b1s1-high...
...
```

### Phase 2: Delete Shared Resources
Groups resources by vnet/subnet/cluster and deletes PNI/PN/Namespace once per group.

```
Deleting PodNetworkInstance pni-static-setup-b1-s1...
Deleting PodNetwork pn-static-setup-b1-s1...
Deleting namespace pn-static-setup-b1-s1...
```

**Why**: Multiple pods can share the same PNI. Deleting PNI while pods exist causes "ReservationInUse" errors.
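
The ordering of the two phases can be sketched as a small planning function. This is a simplified illustration, not the logic in `datapath.go`: the function name `deletion_plan`, its input format, and its output lines are all assumptions, and it only prints a plan rather than calling `kubectl`.

```bash
# Hypothetical sketch: prints a deletion plan instead of calling kubectl.
# Usage: deletion_plan <build_id> < scenarios
# Each input line: "<pod> <cluster> <vnet> <subnet>"
deletion_plan() {
  local build_id="$1"
  local scenarios
  scenarios="$(cat)"

  # Phase 1: every pod, so that IP reservations are released first.
  printf '%s\n' "$scenarios" |
    awk 'NF == 4 { print "phase1 pod " $1 " on " $2 }'

  # Phase 2: one PNI/PN/Namespace group per unique cluster/vnet/subnet.
  printf '%s\n' "$scenarios" | awk 'NF == 4 { print $2, $3, $4 }' | sort -u |
    while read -r cluster vnet subnet; do
      echo "phase2 pni pni-$build_id-$vnet-$subnet on $cluster"
      echo "phase2 pn pn-$build_id-$vnet-$subnet on $cluster"
      echo "phase2 namespace pn-$build_id-$vnet-$subnet on $cluster"
    done
}

deletion_plan sv2-long-run-centraluseuap <<'EOF'
pod-c2-aks2-b1s1-low aks-2 b1 s1
pod-c2-aks2-b1s1-high aks-2 b1 s1
EOF
```

In this example both pods share one vnet/subnet/cluster group, so the plan lists two pod deletions followed by a single PNI, PodNetwork, and namespace deletion.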

## Troubleshooting

### Tests are running on wrong cluster
- Check `resourceGroupName` parameter points to correct RG
- Verify RG contains aks-1 and aks-2 clusters
- Check kubeconfig retrieval in logs

### Setup stages not running
- Verify `runSetupStages` parameter is set to `true`
- Check condition: `condition: eq(${{ parameters.runSetupStages }}, true)`

### Schedule not triggering
- Verify cron expression: `"0 */1 * * *"` (every 1 hour)
- Check branch in schedule matches your working branch
- Ensure `always: true` is set (runs even without code changes)

### PNI stuck with "ReservationInUse"
- Check if pods were deleted first (Phase 1 logs)
- Manual fix: Delete pod → Wait 10s → Patch PNI to remove finalizers
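
The manual fix could look roughly like the commands printed by the sketch below. It prints the commands for review instead of running them, since removing finalizers bypasses normal cleanup. The function name `pni_unstick_commands` is hypothetical, and the sketch assumes that the CRD can be addressed as `podnetworkinstance` and that the pod and the PNI live in the same namespace; verify both against your cluster before running anything.

```bash
# Hypothetical sketch: prints the manual recovery commands for review.
# Assumes the PNI resource type is "podnetworkinstance" and that the pod
# and the PNI share a namespace.
# Usage: pni_unstick_commands <namespace> <pod> <pni>
pni_unstick_commands() {
  local ns="$1" pod="$2" pni="$3"
  cat <<EOF
kubectl delete pod $pod -n $ns
sleep 10
kubectl patch podnetworkinstance $pni -n $ns --type=merge -p '{"metadata":{"finalizers":null}}'
EOF
}

pni_unstick_commands pn-static-setup-a1-s1 pod-c1-aks1-a1s1-low pni-static-setup-a1-s1
```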

### Pipeline timeout after 6 hours
- The 6-hour limit is expected: the pipeline sets `timeoutInMinutes: 360`
- Tests should complete in ~30-40 minutes
- If tests hang, check deletion logs for stuck resources

## Manual Testing

Run locally against existing infrastructure:

```bash
export RG="sv2-long-run-centraluseuap"  # Match your resource group
export BUILD_ID="$RG"                   # Use same RG name as BUILD_ID for unique resource names

cd test/integration/swiftv2/longRunningCluster
ginkgo -v -trace --timeout=6h .
```

## Node Pool Configuration

- **Low-NIC nodes** (`Standard_D4s_v3`): 1 NIC, selected with `agentpool!=nplinux`
  - Can only run 1 pod at a time

- **High-NIC nodes** (`Standard_D16s_v3`): 7 NICs, selected with `agentpool=nplinux`
  - Currently limited to 1 pod per node in test logic
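
The mapping from node type to label selector can be written down as a small helper. The function name `node_selector` is hypothetical and not part of the test helpers; the sketch only prints the `kubectl` command that would list the matching nodes.

```bash
# Hypothetical helper: maps a node type to the label selector described above.
node_selector() {
  case "$1" in
    low)  printf '%s\n' 'agentpool!=nplinux' ;;
    high) printf '%s\n' 'agentpool=nplinux' ;;
    *)    echo "unknown node type: $1" >&2; return 1 ;;
  esac
}

# Print (rather than run) the kubectl command that would list matching nodes.
echo "kubectl get nodes -l '$(node_selector high)'"
```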

## Schedule Modification

To change test frequency, edit the cron schedule:

```yaml
schedules:
  - cron: "0 */1 * * *"  # Every 1 hour (current)
    # Examples (cron schedules are evaluated in UTC):
    # - cron: "0 */2 * * *"    # Every 2 hours
    # - cron: "0 */6 * * *"    # Every 6 hours
    # - cron: "0 0,8,16 * * *" # At 12am, 8am, 4pm UTC
    # - cron: "0 0 * * *"      # Daily at midnight UTC
```

## File Structure

```
.pipelines/swiftv2-long-running/
├── pipeline.yaml                              # Main pipeline with schedule
├── README.md                                  # This file
├── template/
│   └── long-running-pipeline-template.yaml    # Stage definitions (2 jobs)
└── scripts/
    ├── create_aks.sh                          # AKS cluster creation
    ├── create_vnets.sh                        # VNet and subnet creation
    ├── create_peerings.sh                     # VNet peering setup
    ├── create_storage.sh                      # Storage account creation
    ├── create_nsg.sh                          # Network security groups
    └── create_pe.sh                           # Private endpoint setup

test/integration/swiftv2/longRunningCluster/
├── datapath_test.go                           # Original combined test (deprecated)
├── datapath_create_test.go                    # Create test scenarios (Job 1)
├── datapath_delete_test.go                    # Delete test scenarios (Job 2)
├── datapath.go                                # Resource orchestration
└── helpers/
    └── az_helpers.go                          # Azure/kubectl helper functions
```

## Best Practices

1. **Keep infrastructure persistent**: Only recreate when necessary (cluster upgrades, config changes)
2. **Monitor scheduled runs**: Set up alerts for test failures
3. **Resource naming**: BUILD_ID is automatically set to the resource group name, ensuring unique resource names per setup
4. **Tag resources appropriately**: All setup resources are automatically tagged with `SkipAutoDeleteTill=2032-12-31`
   - AKS clusters
   - AKS VNets
   - Customer VNets (cx_vnet_a1, cx_vnet_a2, cx_vnet_a3, cx_vnet_b1)
   - Storage accounts
5. **Avoid resource group collisions**: Always use unique `resourceGroupName` when creating new setups
6. **Document changes**: Update this README when modifying test scenarios or infrastructure

## Resource Tags

All infrastructure resources are automatically tagged during creation:

```bash
SkipAutoDeleteTill=2032-12-31
```

This prevents automatic cleanup by Azure subscription policies that delete resources after a certain period. The tag is applied to:
- Resource group (via create_resource_group job)
- AKS clusters (aks-1, aks-2)
- AKS cluster VNets
- Customer VNets (cx_vnet_a1, cx_vnet_a2, cx_vnet_a3, cx_vnet_b1)
- Storage accounts (sa1xxxx, sa2xxxx)

To manually update the tag date:
```bash
az resource update --ids <resource-id> --set tags.SkipAutoDeleteTill=2033-12-31
```
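
To decide when the tag needs renewing, a date comparison along the following lines might help. This is a minimal sketch: the function name `tag_expired` is hypothetical, and it assumes the tag value is always a `YYYY-MM-DD` date, as in the examples above.

```bash
# Hypothetical helper: succeeds if a SkipAutoDeleteTill date (YYYY-MM-DD)
# is earlier than the reference date (YYYY-MM-DD, defaults to today in UTC).
tag_expired() {
  local tag_date="$1"
  local today="${2:-$(date -u +%Y-%m-%d)}"
  # Removing the dashes turns each date into an integer such as 20321231.
  [ "$(printf '%s' "$tag_date" | tr -d '-')" -lt "$(printf '%s' "$today" | tr -d '-')" ]
}

if tag_expired 2032-12-31; then
  echo "SkipAutoDeleteTill has passed: renew the tag"
else
  echo "SkipAutoDeleteTill is still in the future"
fi
```

A date equal to the reference date is treated as not yet expired in this sketch.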

Lines changed: 27 additions & 10 deletions
@@ -1,36 +1,52 @@
 trigger: none
+pr: none
+
+# Schedule: Run every 1 hour
+schedules:
+  - cron: "0 */1 * * *"  # Every 1 hour at minute 0
+    displayName: "Run tests every 1 hour"
+    branches:
+      include:
+        - sv2-long-running-pipeline
+    always: true  # Run even if there are no code changes
 
 parameters:
   - name: subscriptionId
     displayName: "Azure Subscription ID"
     type: string
     default: "37deca37-c375-4a14-b90a-043849bd2bf1"
 
+  - name: serviceConnection
+    displayName: "Azure Service Connection"
+    type: string
+    default: "Azure Container Networking - Standalone Test Service Connection"
+
   - name: location
     displayName: "Deployment Region"
     type: string
     default: "centraluseuap"
 
+  - name: runSetupStages
+    displayName: "Create New Infrastructure Setup"
+    type: boolean
+    default: false
+
+  # Setup-only parameters (only used when runSetupStages=true)
   - name: resourceGroupName
-    displayName: "Resource Group Name"
+    displayName: "Resource Group Name used when runSetupStages is true"
     type: string
-    default: "long-run-$(Build.BuildId)"
+    default: "sv2-long-run-$(Build.BuildId)"
 
   - name: vmSkuDefault
-    displayName: "VM SKU for Default Node Pool"
+    displayName: "VM SKU for Default Node Pool used when runSetupStages is true"
     type: string
-    default: "Standard_D2s_v3"
+    default: "Standard_D4s_v3"
 
   - name: vmSkuHighNIC
-    displayName: "VM SKU for High NIC Node Pool"
+    displayName: "VM SKU for additional Node Pool used when runSetupStages is true"
     type: string
     default: "Standard_D16s_v3"
 
-  - name: serviceConnection
-    displayName: "Azure Service Connection"
-    type: string
-    default: "Azure Container Networking - Standalone Test Service Connection"
-
 extends:
   template: template/long-running-pipeline-template.yaml
   parameters:
@@ -40,3 +56,4 @@ extends:
     vmSkuDefault: ${{ parameters.vmSkuDefault }}
     vmSkuHighNIC: ${{ parameters.vmSkuHighNIC }}
     serviceConnection: ${{ parameters.serviceConnection }}
+    runSetupStages: ${{ parameters.runSetupStages }}