Commit 4e465cc

author
sivakami
committed
Update long running pipeline template.
1 parent d61b0f6 commit 4e465cc

9 files changed

+660
-38
lines changed

Lines changed: 308 additions & 0 deletions
@@ -0,0 +1,308 @@
1+
# SwiftV2 Long-Running Pipeline
2+
3+
This pipeline tests SwiftV2 pod networking in a persistent environment with scheduled test runs.
4+
5+
## Architecture Overview
6+
7+
**Infrastructure (Persistent)**:
8+
- **2 AKS Clusters**: aks-1, aks-2 (4 nodes each: 2 low-NIC default pool, 2 high-NIC nplinux pool)
9+
- **4 VNets**: cx_vnet_a1, cx_vnet_a2, cx_vnet_a3 (Customer 1 with PE to storage), cx_vnet_b1 (Customer 2)
10+
- **VNet Peerings**: Two of Customer 1's three VNets are peered.
11+
- **Storage Account**: With private endpoint from cx_vnet_a1
12+
- **NSGs**: Restricting traffic between subnets (s1, s2) in vnet cx_vnet_a1.
13+
14+
**Test Scenarios (8 total)**:
15+
- Multiple pods across 2 clusters, 4 VNets, different subnets (s1, s2), and node types (low-NIC, high-NIC)
16+
- Each test run: Create all resources → Wait 20 minutes → Delete all resources
17+
- Tests run automatically every 1 hour via scheduled trigger
18+
19+
## Pipeline Modes
20+
21+
### Mode 1: Scheduled Test Runs (Default)
22+
**Trigger**: Automated cron schedule every 1 hour
23+
**Purpose**: Continuous validation of long-running infrastructure
24+
**Setup Stages**: Disabled
25+
**Test Duration**: ~30-40 minutes per run
26+
**Resource Group**: Static (default: `sv2-long-run-<region>`, e.g., `sv2-long-run-centraluseuap`)
27+
28+
```yaml
29+
# Runs automatically every 1 hour
30+
# CI and PR triggers are disabled (trigger: none, pr: none); manual runs are still possible (see Mode 2)
31+
```
32+
33+
### Mode 2: Initial Setup or Rebuild
34+
**Trigger**: Manual run with parameter change
35+
**Purpose**: Create new infrastructure or rebuild existing
36+
**Setup Stages**: Enabled via `runSetupStages: true`
37+
**Resource Group**: Configurable via parameter
38+
39+
**To create new infrastructure**:
40+
1. Go to Pipeline → Run pipeline
41+
2. **IMPORTANT**: Change `resourceGroupName` to a unique value (e.g., `sv2-long-run-eastus-test2`)
42+
- Default uses location: `sv2-long-run-<location>`
43+
- To avoid collisions, always use a unique name for new setups
44+
3. Set `runSetupStages` = `true`
45+
4. Optionally change `location` if deploying to different region
46+
5. Run pipeline
47+
48+
**⚠️ Warning**: If you don't change the resource group name when creating a new setup, it will overwrite/conflict with the existing default setup used by scheduled runs!
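
The warning above can be checked before a manual run with a small guard. This is a minimal sketch, not part of the pipeline: the function name `check_setup_params` is hypothetical, and it assumes the default resource group follows the `sv2-long-run-<location>` pattern described in this README.

```bash
# Hypothetical pre-flight guard; not part of the pipeline scripts.
# Usage: check_setup_params <resourceGroupName> <location> <runSetupStages>
check_setup_params() {
  rg="$1"
  location="$2"
  run_setup="$3"
  default_rg="sv2-long-run-${location}"   # assumed default naming pattern

  # Setup against the default RG would collide with the scheduled runs.
  if [ "$run_setup" = "true" ] && [ "$rg" = "$default_rg" ]; then
    echo "Refusing setup: '$rg' is the default RG used by scheduled runs" >&2
    return 1
  fi
  return 0
}
```

For example, `check_setup_params sv2-long-run-eastus-test2 eastus true` succeeds, while passing `sv2-long-run-centraluseuap centraluseuap true` returns a non-zero status.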
49+
50+
## Pipeline Parameters
51+
52+
| Parameter | Default | Description |
53+
|-----------|---------|-------------|
54+
| `subscriptionId` | `37deca37-c375-4a14-b90a-043849bd2bf1` | Azure subscription for deployment. |
55+
| `location` | `centraluseuap` | Azure region for resources. |
56+
| `resourceGroupName` | `sv2-long-run-<location>` | Static RG name for tests. Dynamically includes region (e.g., `sv2-long-run-centraluseuap`). **MUST be changed to unique value when creating new setup!** |
57+
| `runSetupStages` | `false` | Set to `true` to create/recreate AKS clusters and networking. **WARNING: Always set unique `resourceGroupName` when true!** |
58+
| `vmSkuDefault` | `Standard_D4s_v3` | VM SKU for low-NIC node pool (1 NIC). |
59+
| `vmSkuHighNIC` | `Standard_D16s_v3` | VM SKU for high-NIC node pool (7 NICs). |
60+
| `serviceConnection` | `Azure Container Networking - Standalone Test Service Connection` | Azure DevOps service connection. |
61+
62+
## How It Works
63+
64+
### Scheduled Test Flow
65+
Every 1 hour, the pipeline:
66+
1. Skips setup stages (infrastructure already exists)
67+
2. **Job 1 - Create and Wait**: Creates 8 test scenarios (PodNetwork, PNI, Pods), then waits 20 minutes
68+
3. **Job 2 - Delete Resources**: Deletes all test resources (Phase 1: Pods, Phase 2: PNI/PN/Namespaces)
69+
4. Reports results
70+
71+
### Setup Flow (When runSetupStages = true)
72+
1. Create resource group with `SkipAutoDeleteTill=2032-12-31` tag
73+
2. Create 2 AKS clusters with 2 node pools each (tagged for persistence)
74+
3. Create 4 customer VNets with subnets and delegations (tagged for persistence)
75+
4. Create VNet peerings
76+
5. Create storage accounts with persistence tags
77+
6. Create NSGs for subnet isolation
78+
7. Run initial test (create → wait → delete)
79+
80+
**All infrastructure resources are tagged with `SkipAutoDeleteTill=2032-12-31`** to prevent automatic cleanup by Azure subscription policies.
81+
82+
## Resource Naming
83+
84+
All test resources use the pattern: `<type>-static-setup-<vnet>-<subnet>`
85+
86+
**Examples**:
87+
- PodNetwork: `pn-static-setup-a1-s1`
88+
- PodNetworkInstance: `pni-static-setup-a1-s1`
89+
- Pod: `pod-c1-aks1-a1s1-low`
90+
- Namespace: `pn-static-setup-a1-s1`
91+
92+
VNet names are simplified:
93+
- `cx_vnet_a1` → `a1`
94+
- `cx_vnet_b1` → `b1`
95+
96+
## Switching to a New Setup
97+
98+
**Scenario**: You created a new setup in RG `sv2-long-run-eastus` and want scheduled runs to use it.
99+
100+
**Steps**:
101+
1. Go to Pipeline → Edit
102+
2. Update location parameter default value:
103+
```yaml
104+
- name: location
105+
default: "centraluseuap" # Change this to the new region, e.g. "eastus"
106+
```
107+
3. Save and commit
108+
4. The RG name automatically follows the new location (e.g., `sv2-long-run-eastus` after changing the default to `eastus`)
109+
110+
Alternatively, manually trigger with the new location or override `resourceGroupName` directly.
111+
112+
## Creating Multiple Test Setups
113+
114+
**Use Case**: You want to create a new test environment without affecting the existing one (e.g., for testing different configurations, regions, or versions).
115+
116+
**Steps**:
117+
1. Go to Pipeline → Run pipeline
118+
2. **Change `resourceGroupName`** to a unique value:
119+
- For different region: `sv2-long-run-eastus`
120+
- For parallel test: `sv2-long-run-centraluseuap-v2`
121+
- For experimental: `sv2-long-run-centraluseuap-experimental`
122+
3. Set `runSetupStages` = `true`
123+
4. Optionally change `location` parameter
124+
5. Run pipeline
125+
126+
**After setup completes**:
127+
- The new infrastructure will be tagged with `SkipAutoDeleteTill=2032-12-31`
128+
- To run tests against this new setup, either:
129+
- **Option A**: Update the pipeline default `resourceGroupName` parameter
130+
- **Option B**: Manually trigger test runs with the new `resourceGroupName`
131+
132+
**Example Scenarios**:
133+
134+
| Scenario | Resource Group Name | Purpose |
135+
|----------|-------------------|---------|
136+
| Default production | `sv2-long-run-centraluseuap` | Hourly scheduled tests |
137+
| East US environment | `sv2-long-run-eastus` | Regional testing |
138+
| Test new features | `sv2-long-run-centraluseuap-dev` | Development/testing |
139+
| Version upgrade | `sv2-long-run-centraluseuap-v2` | Parallel environment for upgrades |
140+
141+
## Resource Naming per Setup
142+
143+
The pipeline uses the **resource group name as the BUILD_ID** to ensure unique resource names per test setup. This allows multiple parallel test environments without naming collisions.
144+
145+
**Generated Resource Names**:
146+
```
147+
BUILD_ID = <resourceGroupName>
148+
149+
PodNetwork: pn-<BUILD_ID>-<vnet>-<subnet>
150+
PodNetworkInstance: pni-<BUILD_ID>-<vnet>-<subnet>
151+
Namespace: pn-<BUILD_ID>-<vnet>-<subnet>
152+
Pod: pod-<scenario-suffix>
153+
```
154+
155+
**Example for `resourceGroupName=sv2-long-run-centraluseuap`**:
156+
```
157+
pn-sv2-long-run-centraluseuap-b1-s1 (PodNetwork for cx_vnet_b1, subnet s1)
158+
pni-sv2-long-run-centraluseuap-b1-s1 (PodNetworkInstance)
159+
pn-sv2-long-run-centraluseuap-a1-s1 (PodNetwork for cx_vnet_a1, subnet s1)
160+
pni-sv2-long-run-centraluseuap-a1-s2 (PodNetworkInstance for cx_vnet_a1, subnet s2)
161+
```
162+
163+
**Example for different setup `resourceGroupName=sv2-long-run-eastus`**:
164+
```
165+
pn-sv2-long-run-eastus-b1-s1 (Different from centraluseuap setup)
166+
pni-sv2-long-run-eastus-b1-s1
167+
pn-sv2-long-run-eastus-a1-s1
168+
```
169+
170+
This ensures **no collision** between different test setups running in parallel.
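
The naming scheme above can be sketched as two small shell helpers. These are hypothetical illustrations (the names `short_vnet` and `resource_name` do not come from the test code), assuming VNet names always carry the `cx_vnet_` prefix.

```bash
# Hypothetical helpers mirroring the naming scheme described above.

# Strip the customer VNet prefix: cx_vnet_a1 -> a1
short_vnet() {
  echo "${1#cx_vnet_}"
}

# Usage: resource_name <prefix> <BUILD_ID> <vnet> <subnet>
resource_name() {
  echo "$1-$2-$(short_vnet "$3")-$4"
}

BUILD_ID="sv2-long-run-centraluseuap"
resource_name pn  "$BUILD_ID" cx_vnet_b1 s1   # PodNetwork (and namespace) name
resource_name pni "$BUILD_ID" cx_vnet_a1 s2   # PodNetworkInstance name
```

With the `BUILD_ID` shown, the two calls print `pn-sv2-long-run-centraluseuap-b1-s1` and `pni-sv2-long-run-centraluseuap-a1-s2`, matching the examples listed earlier in this section.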
171+
172+
## Deletion Strategy
173+
### Phase 1: Delete All Pods
174+
Deletes all pods across all scenarios first. This ensures IP reservations are released.
175+
176+
```
177+
Deleting pod pod-c2-aks2-b1s1-low...
178+
Deleting pod pod-c2-aks2-b1s1-high...
179+
...
180+
```
181+
182+
### Phase 2: Delete Shared Resources
183+
Groups resources by vnet/subnet/cluster and deletes PNI/PN/Namespace once per group.
184+
185+
```
186+
Deleting PodNetworkInstance pni-static-setup-b1-s1...
187+
Deleting PodNetwork pn-static-setup-b1-s1...
188+
Deleting namespace pn-static-setup-b1-s1...
189+
```
190+
191+
**Why**: Multiple pods can share the same PNI. Deleting PNI while pods exist causes "ReservationInUse" errors.
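
The two-phase ordering can be illustrated with a dry-run planner that only prints the deletion order. This is a sketch under stated assumptions: `plan_deletion` and its input format are hypothetical, and the real logic lives in the Go tests rather than in shell.

```bash
# Hypothetical dry-run planner: prints the deletion order, deletes nothing.
# Usage: plan_deletion <BUILD_ID>
# Input (stdin): one scenario per line, "<pod-name> <cluster> <vnet>-<subnet>".
plan_deletion() {
  build_id="$1"
  scenarios=$(cat)

  # Phase 1: every pod goes first so its IP reservation is released.
  echo "$scenarios" | while read -r pod _rest; do
    if [ -n "$pod" ]; then
      echo "Deleting pod $pod..."
    fi
  done

  # Phase 2: one PNI/PN/namespace deletion per unique cluster + vnet/subnet
  # group. The cluster column is used only for grouping here.
  echo "$scenarios" | awk 'NF >= 3 { print $2, $3 }' | sort -u |
  while read -r _cluster group; do
    echo "Deleting PodNetworkInstance pni-$build_id-$group..."
    echo "Deleting PodNetwork pn-$build_id-$group..."
    echo "Deleting namespace pn-$build_id-$group..."
  done
}

printf '%s\n' \
  "pod-c2-aks2-b1s1-low aks-2 b1-s1" \
  "pod-c2-aks2-b1s1-high aks-2 b1-s1" | plan_deletion static-setup
```

Both pods share the `b1-s1` group on the same cluster, so the plan lists two pod deletions followed by a single PNI, PodNetwork, and namespace deletion.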
192+
193+
## Troubleshooting
194+
195+
### Tests are running on wrong cluster
196+
- Check `resourceGroupName` parameter points to correct RG
197+
- Verify RG contains aks-1 and aks-2 clusters
198+
- Check kubeconfig retrieval in logs
199+
200+
### Setup stages not running
201+
- Verify `runSetupStages` parameter is set to `true`
202+
- Check condition: `condition: eq(${{ parameters.runSetupStages }}, true)`
203+
204+
### Schedule not triggering
205+
- Verify cron expression: `"0 */1 * * *"` (every 1 hour)
206+
- Check branch in schedule matches your working branch
207+
- Ensure `always: true` is set (runs even without code changes)
208+
209+
### PNI stuck with "ReservationInUse"
210+
- Check if pods were deleted first (Phase 1 logs)
211+
- Manual fix: Delete pod → Wait 10s → Patch PNI to remove finalizers
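
The manual fix above might look like the following sketch. It is not taken from the pipeline: `fix_stuck_pni` and `run_or_echo` are hypothetical names, and the code assumes the CRD is exposed to kubectl as `podnetworkinstance` and is namespaced (check with `kubectl api-resources`). By default it only prints the commands; set `DRY_RUN=0` to execute them.

```bash
# Print the command in dry-run mode (default), otherwise execute it.
run_or_echo() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper. Usage: fix_stuck_pni <pod> <pni> <namespace>
# Assumes the resource name "podnetworkinstance"; adjust to your cluster.
fix_stuck_pni() {
  pod="$1"
  pni="$2"
  ns="$3"

  run_or_echo kubectl delete pod "$pod" -n "$ns" --ignore-not-found
  run_or_echo sleep 10
  # Removing finalizers bypasses controller cleanup; use only as a last resort.
  run_or_echo kubectl patch podnetworkinstance "$pni" -n "$ns" \
    --type=merge -p '{"metadata":{"finalizers":null}}'
}
```

Running `fix_stuck_pni <pod> <pni> <namespace>` without `DRY_RUN=0` is a safe way to review the three commands before applying them.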
212+
213+
### Pipeline timeout after 6 hours
214+
- This is expected behavior (timeoutInMinutes: 360)
215+
- Tests should complete in ~30-40 minutes
216+
- If tests hang, check deletion logs for stuck resources
217+
218+
## Manual Testing
219+
220+
Run locally against existing infrastructure:
221+
222+
```bash
223+
export RG="sv2-long-run-centraluseuap" # Match your resource group
224+
export BUILD_ID="$RG" # Use same RG name as BUILD_ID for unique resource names
225+
226+
cd test/integration/swiftv2/longRunningCluster
227+
ginkgo -v -trace --timeout=6h .
228+
```
229+
230+
## Node Pool Configuration
231+
232+
- **Low-NIC nodes** (`Standard_D4s_v3`): 1 NIC, node selector `agentpool!=nplinux`
233+
- Can only run 1 pod at a time
234+
235+
- **High-NIC nodes** (`Standard_D16s_v3`): 7 NICs, label `agentpool=nplinux`
236+
- Currently limited to 1 pod per node in test logic
237+
238+
## Schedule Modification
239+
240+
To change test frequency, edit the cron schedule:
241+
242+
```yaml
243+
schedules:
244+
- cron: "0 */1 * * *" # Every 1 hour (current)
245+
# Examples:
246+
# - cron: "0 */2 * * *" # Every 2 hours
247+
# - cron: "0 */6 * * *" # Every 6 hours
248+
# - cron: "0 0,8,16 * * *" # At 00:00, 08:00, and 16:00 UTC
249+
# - cron: "0 0 * * *" # Daily at midnight
250+
```
251+
252+
## File Structure
253+
254+
```
255+
.pipelines/swiftv2-long-running/
256+
├── pipeline.yaml # Main pipeline with schedule
257+
├── README.md # This file
258+
├── template/
259+
│ └── long-running-pipeline-template.yaml # Stage definitions (2 jobs)
260+
└── scripts/
261+
├── create_aks.sh # AKS cluster creation
262+
├── create_vnets.sh # VNet and subnet creation
263+
├── create_peerings.sh # VNet peering setup
264+
├── create_storage.sh # Storage account creation
265+
├── create_nsg.sh # Network security groups
266+
└── create_pe.sh # Private endpoint setup
267+
268+
test/integration/swiftv2/longRunningCluster/
269+
├── datapath_test.go # Original combined test (deprecated)
270+
├── datapath_create_test.go # Create test scenarios (Job 1)
271+
├── datapath_delete_test.go # Delete test scenarios (Job 2)
272+
├── datapath.go # Resource orchestration
273+
└── helpers/
274+
└── az_helpers.go # Azure/kubectl helper functions
275+
```
276+
277+
## Best Practices
278+
279+
1. **Keep infrastructure persistent**: Only recreate when necessary (cluster upgrades, config changes)
280+
2. **Monitor scheduled runs**: Set up alerts for test failures
281+
3. **Resource naming**: BUILD_ID is automatically set to the resource group name, ensuring unique resource names per setup
282+
4. **Tag resources appropriately**: All setup resources are automatically tagged with `SkipAutoDeleteTill=2032-12-31`
283+
- AKS clusters
284+
- AKS VNets
285+
- Customer VNets (cx_vnet_a1, cx_vnet_a2, cx_vnet_a3, cx_vnet_b1)
286+
- Storage accounts
287+
5. **Avoid resource group collisions**: Always use unique `resourceGroupName` when creating new setups
288+
6. **Document changes**: Update this README when modifying test scenarios or infrastructure
289+
290+
## Resource Tags
291+
292+
All infrastructure resources are automatically tagged during creation:
293+
294+
```bash
295+
SkipAutoDeleteTill=2032-12-31
296+
```
297+
298+
This prevents automatic cleanup by Azure subscription policies that delete resources after a certain period. The tag is applied to:
299+
- Resource group (via create_resource_group job)
300+
- AKS clusters (aks-1, aks-2)
301+
- AKS cluster VNets
302+
- Customer VNets (cx_vnet_a1, cx_vnet_a2, cx_vnet_a3, cx_vnet_b1)
303+
- Storage accounts (sa1xxxx, sa2xxxx)
304+
305+
To manually update the tag date:
306+
```bash
307+
az resource update --ids <resource-id> --set tags.SkipAutoDeleteTill=2033-12-31
308+
```
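
To spot resources that are missing the tag, a small filter over `az resource list` output can help. This is a sketch only: `missing_tag` is a hypothetical helper, and the commented `az` invocation assumes a logged-in CLI and an `RG` variable set to the resource group.

```bash
# Hypothetical audit helper. Reads "name<TAB>tag-value" lines on stdin and
# prints the names whose tag value differs from the expected date.
# Usage: missing_tag [expected-date]   (default: 2032-12-31)
missing_tag() {
  expected="${1:-2032-12-31}"
  awk -F '\t' -v want="$expected" '$2 != want { print $1 }'
}

# Assumed usage against a live resource group (requires az login):
# az resource list -g "$RG" \
#   --query "[].[name, tags.SkipAutoDeleteTill]" -o tsv | missing_tag
```

Any name printed by the filter is a candidate for the `az resource update` command shown above.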

.pipelines/swiftv2-long-running/pipeline.yaml

Lines changed: 18 additions & 3 deletions
@@ -1,4 +1,14 @@
11
trigger: none
2+
pr: none
3+
4+
# Schedule: Run every 1 hour
5+
schedules:
6+
- cron: "0 */1 * * *" # Every 1 hour at minute 0
7+
displayName: "Run tests every 1 hour"
8+
branches:
9+
include:
10+
- sv2-long-running-pipeline
11+
always: true # Run even if there are no code changes
212

313
parameters:
414
- name: subscriptionId
@@ -12,9 +22,14 @@ parameters:
1222
default: "centraluseuap"
1323

1424
- name: resourceGroupName
15-
displayName: "Resource Group Name"
25+
displayName: "Resource Group Name (IMPORTANT: Change this when creating new setup to avoid collisions!)"
1626
type: string
17-
default: "long-run-$(Build.BuildId)"
27+
default: "sv2-long-run-${{ parameters.location }}"
28+
29+
- name: runSetupStages
30+
displayName: "Create new setup(AKS + Network) - WARNING: Set resourceGroupName to unique value if creating new setup!"
31+
type: boolean
32+
default: false
1833

1934
- name: vmSkuDefault
2035
displayName: "VM SKU for Default Node Pool"
@@ -29,7 +44,6 @@ parameters:
2944
- name: serviceConnection
3045
displayName: "Azure Service Connection"
3146
type: string
32-
#default: "Azure Network Agent - Test Standalone - Service Connection"
3347
default: "Azure Container Networking - Standalone Test Service Connection"
3448

3549
extends:
@@ -41,3 +55,4 @@ extends:
4155
vmSkuDefault: ${{ parameters.vmSkuDefault }}
4256
vmSkuHighNIC: ${{ parameters.vmSkuHighNIC }}
4357
serviceConnection: ${{ parameters.serviceConnection }}
58+
runSetupStages: ${{ parameters.runSetupStages }}

.pipelines/swiftv2-long-running/scripts/create_aks.sh

Lines changed: 7 additions & 0 deletions
@@ -69,17 +69,24 @@ for i in $(seq 1 "$CLUSTER_COUNT"); do
6969

7070
make -C ./hack/aks azcfg AZCLI=az REGION=$LOCATION
7171

72+
# Create cluster with SkipAutoDeleteTill tag for persistent infrastructure
7273
make -C ./hack/aks swiftv2-podsubnet-cluster-up \
7374
AZCLI=az REGION=$LOCATION \
7475
SUB=$SUBSCRIPTION_ID \
7576
GROUP=$RG \
7677
CLUSTER=$CLUSTER_NAME \
7778
VM_SIZE=$VM_SKU_DEFAULT
79+
80+
# Add SkipAutoDeleteTill tag to cluster (2032-12-31 for long-term persistence)
81+
az aks update -g "$RG" -n "$CLUSTER_NAME" --tags SkipAutoDeleteTill=2032-12-31 || echo "Warning: Failed to add tag to cluster"
7882

7983
wait_for_provisioning "$RG" "$CLUSTER_NAME"
8084

8185
vnet_id=$(az network vnet show -g "$RG" --name "$CLUSTER_NAME" --query id -o tsv)
8286
echo "Found VNET: $vnet_id"
87+
88+
# Add SkipAutoDeleteTill tag to AKS VNet
89+
az network vnet update --ids "$vnet_id" --set tags.SkipAutoDeleteTill=2032-12-31 || echo "Warning: Failed to add tag to vnet"
8390

8491
stamp_vnet "$vnet_id"
8592

.pipelines/swiftv2-long-running/scripts/create_storage.sh

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@ for SA in "$SA1" "$SA2"; do
2626
--allow-shared-key-access false \
2727
--https-only true \
2828
--min-tls-version TLS1_2 \
29+
--tags SkipAutoDeleteTill=2032-12-31 \
2930
--query "name" -o tsv \
3031
&& echo "Storage account $SA created successfully."
3132
# Verify creation success

.pipelines/swiftv2-long-running/scripts/create_vnets.sh

Lines changed: 2 additions & 1 deletion
@@ -53,7 +53,8 @@ create_vnet_subets() {
5353
local extra_cidrs="$5"
5454

5555
echo "==> Creating VNet: $vnet with CIDR: $vnet_cidr"
56-
az network vnet create -g "$RG" -l "$LOCATION" --name "$vnet" --address-prefixes "$vnet_cidr" -o none
56+
az network vnet create -g "$RG" -l "$LOCATION" --name "$vnet" --address-prefixes "$vnet_cidr" \
57+
--tags SkipAutoDeleteTill=2032-12-31 -o none
5758

5859
IFS=' ' read -r -a extra_subnet_array <<< "$extra_subnets"
5960
IFS=',' read -r -a extra_cidr_array <<< "$extra_cidrs"
