diff --git a/.pipelines/swiftv2-long-running/README.md b/.pipelines/swiftv2-long-running/README.md
Lines changed: 308 additions & 0 deletions
diff --git a/.pipelines/swiftv2-long-running/pipeline.yaml b/.pipelines/swiftv2-long-running/pipeline.yaml
Lines changed: 18 additions & 3 deletions
diff --git a/.pipelines/swiftv2-long-running/scripts/create_aks.sh b/.pipelines/swiftv2-long-running/scripts/create_aks.sh
Lines changed: 7 additions & 0 deletions
diff --git a/.pipelines/swiftv2-long-running/scripts/create_storage.sh b/.pipelines/swiftv2-long-running/scripts/create_storage.sh
Lines changed: 1 addition & 0 deletions
diff --git a/.pipelines/swiftv2-long-running/scripts/create_vnets.sh b/.pipelines/swiftv2-long-running/scripts/create_vnets.sh
Lines changed: 2 additions & 1 deletion
@@ -0,0 +1,308 @@
+# SwiftV2 Long-Running Pipeline
+
+This pipeline tests SwiftV2 pod networking in a persistent environment with scheduled test runs.
+
+## Architecture Overview
+
+**Infrastructure (Persistent)**:
+- **2 AKS Clusters**: aks-1, aks-2 (4 nodes each: 2 low-NIC default pool, 2 high-NIC nplinux pool)
+- **4 VNets**: cx_vnet_a1, cx_vnet_a2, cx_vnet_a3 (Customer 1), cx_vnet_b1 (Customer 2)
+- **VNet Peerings**: Two of Customer 1's three VNets are peered.
+- **Storage Accounts**: Two accounts, with a private endpoint from cx_vnet_a1
+- **NSGs**: Restrict traffic between subnets s1 and s2 in cx_vnet_a1.
+
+**Test Scenarios (8 total)**:
+- Multiple pods across 2 clusters, 4 VNets, different subnets (s1, s2), and node types (low-NIC, high-NIC)
+- Each test run: Create all resources → Wait 20 minutes → Delete all resources
+- Tests run automatically every hour via a scheduled trigger
+
+## Pipeline Modes
+
+### Mode 1: Scheduled Test Runs (Default)
+**Trigger**: Automated cron schedule, every hour  
+**Purpose**: Continuous validation of long-running infrastructure  
+**Setup Stages**: Disabled  
+**Test Duration**: ~30-40 minutes per run  
+**Resource Group**: Static (default: `sv2-long-run-<region>`, e.g., `sv2-long-run-centraluseuap`)
+
+```yaml
+# Runs automatically every hour; CI and PR triggers are disabled
+trigger: none
+pr: none
+schedules:
+  - cron: "0 */1 * * *"
+    always: true
+```
+
+### Mode 2: Initial Setup or Rebuild
+**Trigger**: Manual run with parameter change  
+**Purpose**: Create new infrastructure or rebuild existing  
+**Setup Stages**: Enabled via `runSetupStages: true`  
+**Resource Group**: Configurable via parameter
+
+**To create new infrastructure**:
+1. Go to Pipeline → Run pipeline
+2. **IMPORTANT**: Change `resourceGroupName` to a unique value (e.g., `sv2-long-run-eastus-test2`)
+   - The default is derived from the location: `sv2-long-run-<location>`
+   - To avoid collisions, always use a unique name for new setups
+3. Set `runSetupStages` = `true`
+4. Optionally change `location` if deploying to a different region
+5. Run pipeline
+
+**⚠️ Warning**: If you don't change the resource group name when creating a new setup, the setup stages will target the default resource group used by scheduled runs and can overwrite or conflict with that setup!
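The warning above can be turned into a pre-flight check. The sketch below is a hypothetical helper (the function names are not part of the pipeline); it assumes the default resource group name is `sv2-long-run-<location>`, as described in this README, and flags setup runs that would reuse it.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check, not part of the pipeline.
# Assumption: the default RG name is "sv2-long-run-<location>".

# Print the default resource group name for a location.
default_rg_name() {
  printf 'sv2-long-run-%s\n' "$1"
}

# Usage: check_new_setup_rg <location> <resourceGroupName> <runSetupStages>
# Returns 0 if the combination is safe, 1 (with a message on stderr) if a
# setup run would reuse the default resource group for that location.
check_new_setup_rg() {
  local location="$1" rg="$2" run_setup="$3"
  if [[ "$run_setup" == "true" && "$rg" == "$(default_rg_name "$location")" ]]; then
    echo "ERROR: runSetupStages=true with default RG '$rg'; pick a unique resourceGroupName" >&2
    return 1
  fi
  return 0
}
```

For example, `check_new_setup_rg centraluseuap sv2-long-run-centraluseuap-v2 true` passes, while the same call with `sv2-long-run-centraluseuap` fails. Test-only runs (`runSetupStages=false`) are always allowed.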
+
+## Pipeline Parameters
+
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `subscriptionId` | `37deca37-c375-4a14-b90a-043849bd2bf1` | Azure subscription for deployment. |
+| `location` | `centraluseuap` | Azure region for resources. |
+| `resourceGroupName` | `sv2-long-run-<location>` | Static RG name used by test runs. The default is derived from `location` (e.g., `sv2-long-run-centraluseuap`). **MUST be changed to a unique value when creating a new setup!** |
+| `runSetupStages` | `false` | Set to `true` to create/recreate AKS clusters and networking. **WARNING: Always set unique `resourceGroupName` when true!** |
+| `vmSkuDefault` | `Standard_D4s_v3` | VM SKU for low-NIC node pool (1 NIC). |
+| `vmSkuHighNIC` | `Standard_D16s_v3` | VM SKU for high-NIC node pool (7 NICs). |
+| `serviceConnection` | `Azure Container Networking - Standalone Test Service Connection` | Azure DevOps service connection. |
+
+## How It Works
+
+### Scheduled Test Flow
+Every hour, the pipeline:
+1. Skips setup stages (infrastructure already exists)
+2. **Job 1 - Create and Wait**: Creates the resources for all 8 test scenarios (PodNetworks, PNIs, pods), then waits 20 minutes
+3. **Job 2 - Delete Resources**: Deletes all test resources (Phase 1: pods; Phase 2: PNIs, PodNetworks, namespaces)
+4. Reports results
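The create → wait → delete cycle can be sketched as a small driver. This is an illustration only: in the pipeline the two phases are separate jobs running Ginkgo tests, and `create_fn`/`delete_fn` are placeholder names. Running the delete step even when creation fails is this sketch's own design choice, not documented pipeline behavior.

```bash
#!/usr/bin/env bash
# Illustrative driver for one scheduled run: create -> wait -> delete.
# create_fn / delete_fn are placeholder function names standing in for
# Job 1 and Job 2. Deleting after a failed create is this sketch's choice.

# Usage: run_test_cycle <create_fn> <delete_fn> [wait_seconds]
run_test_cycle() {
  local create_fn="$1" delete_fn="$2" wait_seconds="${3:-1200}"  # 1200 s = 20 min
  local create_rc=0 delete_rc=0

  "$create_fn" || create_rc=$?
  if (( create_rc == 0 )); then
    echo "Resources created; waiting ${wait_seconds}s before deletion"
    sleep "$wait_seconds"
  fi

  # Always attempt cleanup so a failed run does not leak pods or PNIs.
  "$delete_fn" || delete_rc=$?

  # Report the create failure first, otherwise the delete result.
  if (( create_rc != 0 )); then
    return "$create_rc"
  fi
  return "$delete_rc"
}
```

Passing `0` as the wait time makes the cycle easy to exercise locally with stub functions.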
+
+### Setup Flow (When runSetupStages = true)
+1. Create resource group with `SkipAutoDeleteTill=2032-12-31` tag
+2. Create 2 AKS clusters with 2 node pools each (tagged for persistence)
+3. Create 4 customer VNets with subnets and delegations (tagged for persistence)
+4. Create VNet peerings 
+5. Create storage accounts with persistence tags
+6. Create NSGs for subnet isolation
+7. Run initial test (create → wait → delete)
+
+**The resource group, AKS clusters, VNets, and storage accounts are tagged with `SkipAutoDeleteTill=2032-12-31`** to prevent automatic cleanup by Azure subscription policies.
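Step 1 of the setup flow could look like the following sketch. The function name is hypothetical, and `AZ` defaults to the real Azure CLI; setting `AZ=echo` prints the command instead of running it.

```bash
#!/usr/bin/env bash
# Sketch of setup step 1: create the resource group with the persistence tag.
# create_persistent_rg is a hypothetical name. Override AZ (e.g. AZ=echo)
# for a dry run.

AZ="${AZ:-az}"
SKIP_AUTO_DELETE_TILL="${SKIP_AUTO_DELETE_TILL:-2032-12-31}"

# Usage: create_persistent_rg <resource-group> <location>
create_persistent_rg() {
  local rg="$1" location="$2"
  "$AZ" group create \
    --name "$rg" \
    --location "$location" \
    --tags "SkipAutoDeleteTill=${SKIP_AUTO_DELETE_TILL}" \
    -o none
}
```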
+
+## Resource Naming
+
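+All test resources except pods use the pattern `<type>-<BUILD_ID>-<vnet>-<subnet>`, where `BUILD_ID` is the resource group name. Pods are named per scenario.
+
+**Examples** (for `resourceGroupName=sv2-long-run-centraluseuap`):
+- PodNetwork: `pn-sv2-long-run-centraluseuap-a1-s1`
+- PodNetworkInstance: `pni-sv2-long-run-centraluseuap-a1-s1`
+- Pod: `pod-c1-aks1-a1s1-low`
+- Namespace: `pn-sv2-long-run-centraluseuap-a1-s1`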
+
+VNet names are simplified:
+- `cx_vnet_a1` → `a1`
+- `cx_vnet_b1` → `b1`
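Combining the VNet short names with the `<type>-<BUILD_ID>-<vnet>-<subnet>` pattern (where `BUILD_ID` is the resource group name) gives a sketch like the one below. `short_vnet` and `resource_name` are hypothetical helpers for illustration; the actual names come from the test code under `test/integration/swiftv2/longRunningCluster`.

```bash
#!/usr/bin/env bash
# Sketch of the naming scheme. short_vnet and resource_name are
# hypothetical helpers, not functions from the test code.

# Strip the "cx_vnet_" prefix: cx_vnet_a1 -> a1
short_vnet() {
  printf '%s\n' "${1#cx_vnet_}"
}

# Usage: resource_name <type-prefix> <build-id> <vnet> <subnet>
# e.g. resource_name pn sv2-long-run-centraluseuap cx_vnet_a1 s1
resource_name() {
  local prefix="$1" build_id="$2" vnet="$3" subnet="$4"
  printf '%s-%s-%s-%s\n' "$prefix" "$build_id" "$(short_vnet "$vnet")" "$subnet"
}
```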
+
+## Switching to a New Setup
+
+**Scenario**: You created a new setup in RG `sv2-long-run-eastus` and want scheduled runs to use it.
+
+**Steps**:
+1. Go to Pipeline → Edit
+2. Update the `location` parameter's default value:
+   ```yaml
+   - name: location
+     default: "eastus"  # was "centraluseuap"
+   ```
+3. Save and commit
+4. The RG name will automatically become `sv2-long-run-eastus`
+
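+Alternatively, manually trigger a run with the new location, or override `resourceGroupName` directly. A manual override applies only to that run; scheduled runs always use the parameter defaults.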
+
+## Creating Multiple Test Setups
+
+**Use Case**: You want to create a new test environment without affecting the existing one (e.g., for testing different configurations, regions, or versions).
+
+**Steps**:
+1. Go to Pipeline → Run pipeline
+2. **Change `resourceGroupName`** to a unique value:
+   - For different region: `sv2-long-run-eastus`
+   - For parallel test: `sv2-long-run-centraluseuap-v2`
+   - For experimental: `sv2-long-run-centraluseuap-experimental`
+3. Set `runSetupStages` = `true`
+4. Optionally change `location` parameter
+5. Run pipeline
+
+**After setup completes**:
+- The new infrastructure will be tagged with `SkipAutoDeleteTill=2032-12-31`
+- To run tests against this new setup, either:
+  - **Option A**: Update the pipeline default `resourceGroupName` parameter
+  - **Option B**: Manually trigger test runs with the new `resourceGroupName`
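Option B can also be done from a terminal. This sketch assumes the `azure-devops` Azure CLI extension is installed and a default organization and project are configured (`az devops configure --defaults ...`); `PIPELINE_NAME` is a placeholder, and `BRANCH` defaults to the branch used by the schedule.

```bash
#!/usr/bin/env bash
# Sketch: queue a one-off run against an existing setup (Option B).
# Assumes the azure-devops CLI extension and configured org/project defaults.
# PIPELINE_NAME is a placeholder; BRANCH defaults to the scheduled branch.

AZ="${AZ:-az}"
PIPELINE_NAME="${PIPELINE_NAME:-<pipeline-name>}"
BRANCH="${BRANCH:-sv2-long-running-pipeline}"

# Usage: queue_test_run <resourceGroupName> <location>
queue_test_run() {
  local rg="$1" location="$2"
  # runSetupStages=false: only run tests, never rebuild the infrastructure.
  "$AZ" pipelines run \
    --name "$PIPELINE_NAME" \
    --branch "$BRANCH" \
    --parameters "resourceGroupName=${rg}" "location=${location}" "runSetupStages=false"
}
```

For example, `queue_test_run sv2-long-run-eastus eastus` queues a test-only run against the East US setup without touching the scheduled defaults.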
+
+**Example Scenarios**:
+
+| Scenario | Resource Group Name | Purpose |
+|----------|-------------------|---------|
+| Default production | `sv2-long-run-centraluseuap` | Hourly scheduled tests |
+| East US environment | `sv2-long-run-eastus` | Regional testing |
+| Test new features | `sv2-long-run-centraluseuap-dev` | Development/testing |
+| Version upgrade | `sv2-long-run-centraluseuap-v2` | Parallel environment for upgrades |
+
+## Unique Names per Setup (BUILD_ID)
+
+The pipeline uses the **resource group name as the BUILD_ID** to ensure unique resource names per test setup. This allows multiple parallel test environments without naming collisions.
+
+**Generated Resource Names**:
+```
+BUILD_ID = <resourceGroupName>
+
+PodNetwork:         pn-<BUILD_ID>-<vnet>-<subnet>
+PodNetworkInstance: pni-<BUILD_ID>-<vnet>-<subnet>
+Namespace:          pn-<BUILD_ID>-<vnet>-<subnet>
+Pod:                pod-<scenario-suffix>
+```
+
+**Example for `resourceGroupName=sv2-long-run-centraluseuap`**:
+```
+pn-sv2-long-run-centraluseuap-b1-s1       (PodNetwork for cx_vnet_b1, subnet s1)
+pni-sv2-long-run-centraluseuap-b1-s1      (PodNetworkInstance)
+pn-sv2-long-run-centraluseuap-a1-s1       (PodNetwork for cx_vnet_a1, subnet s1)
+pni-sv2-long-run-centraluseuap-a1-s2      (PodNetworkInstance for cx_vnet_a1, subnet s2)
+```
+
+**Example for different setup `resourceGroupName=sv2-long-run-eastus`**:
+```
+pn-sv2-long-run-eastus-b1-s1       (Different from centraluseuap setup)
+pni-sv2-long-run-eastus-b1-s1
+pn-sv2-long-run-eastus-a1-s1
+```
+
+This ensures **no collision** between different test setups running in parallel.
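Because `BUILD_ID` is the resource group name, the RG name also has to produce valid Kubernetes names. Namespace names must be RFC 1123 DNS labels (lowercase alphanumerics or `-`, starting and ending with an alphanumeric, at most 63 characters), while Azure allows resource group names containing uppercase letters, `_`, `.`, and parentheses. The hypothetical check below tests the namespace name `pn-<BUILD_ID>-a1-s1`, assuming two-character VNet and subnet short names as in the examples above.

```bash
#!/usr/bin/env bash
# Hypothetical check that a resource group name is usable as BUILD_ID.
# Kubernetes namespace names must be RFC 1123 labels: lowercase alphanumerics
# or '-', starting and ending with an alphanumeric, at most 63 characters.

# Return 0 if $1 is a valid RFC 1123 label. Characters are listed explicitly
# so the check does not depend on locale-specific [a-z] ranges.
is_dns1123_label() {
  local s="$1"
  local an='abcdefghijklmnopqrstuvwxyz0123456789'
  local re="^[${an}]([-${an}]*[${an}])?\$"
  (( ${#s} <= 63 )) && [[ "$s" =~ $re ]]
}

# Usage: check_build_id <resourceGroupName>
# Assumes two-character VNet/subnet short names, e.g. pn-<BUILD_ID>-a1-s1.
check_build_id() {
  local ns="pn-$1-a1-s1"
  if ! is_dns1123_label "$ns"; then
    echo "BUILD_ID '$1' gives invalid namespace name '$ns'" >&2
    return 1
  fi
}
```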
+
+## Deletion Strategy
+### Phase 1: Delete All Pods
+Deletes all pods across all scenarios first, so that their IP reservations are released before any PNI is deleted.
+
+```
+Deleting pod pod-c2-aks2-b1s1-low...
+Deleting pod pod-c2-aks2-b1s1-high...
+...
+```
+
+### Phase 2: Delete Shared Resources
+Groups resources by VNet, subnet, and cluster, then deletes each PNI, PodNetwork, and namespace once per group.
+
+```
+Deleting PodNetworkInstance pni-sv2-long-run-centraluseuap-b1-s1...
+Deleting PodNetwork pn-sv2-long-run-centraluseuap-b1-s1...
+Deleting namespace pn-sv2-long-run-centraluseuap-b1-s1...
+```
+
+**Why**: Multiple pods can share the same PNI. Deleting PNI while pods exist causes "ReservationInUse" errors.
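The two-phase order can be sketched for a single cluster (the current kubeconfig context) as follows. The input format (`<pod> <namespace> <pni> <podnetwork>` per line), the function name, and the CRD resource names `podnetworkinstance`/`podnetwork` are assumptions, as are PNIs living in the pod's namespace and PodNetworks being cluster-scoped; the real logic lives in `datapath_delete_test.go`.

```bash
#!/usr/bin/env bash
# Sketch of the two-phase deletion for one cluster (current kubeconfig context).
# Assumptions: input lines are "<pod> <namespace> <pni> <podnetwork>"; PNIs live
# in the pod's namespace; PodNetworks are cluster-scoped; CRD resource names are
# "podnetworkinstance" and "podnetwork". KUBECTL defaults to kubectl.

KUBECTL="${KUBECTL:-kubectl}"

delete_scenarios() {
  local pod ns pni pn e
  local -a entries=()
  local -A seen=()

  # Read all scenarios from stdin.
  while read -r pod ns pni pn; do
    if [[ -n "$pod" ]]; then
      entries+=("$pod $ns $pni $pn")
    fi
  done

  # Phase 1: delete every pod first so their IP reservations are released.
  for e in "${entries[@]}"; do
    read -r pod ns pni pn <<< "$e"
    echo "Deleting pod $pod..."
    "$KUBECTL" delete pod "$pod" -n "$ns" --ignore-not-found
  done

  # Phase 2: delete each shared PNI / PodNetwork / namespace group once.
  for e in "${entries[@]}"; do
    read -r pod ns pni pn <<< "$e"
    [[ -n "${seen[$ns/$pni/$pn]:-}" ]] && continue
    seen[$ns/$pni/$pn]=1
    echo "Deleting PodNetworkInstance $pni, PodNetwork $pn, namespace $ns..."
    "$KUBECTL" delete podnetworkinstance "$pni" -n "$ns" --ignore-not-found
    "$KUBECTL" delete podnetwork "$pn" --ignore-not-found
    "$KUBECTL" delete namespace "$ns" --ignore-not-found
  done
}
```

Tracking `<namespace>/<pni>/<pn>` keys in an associative array is what ensures each shared group is deleted exactly once, and only after every pod that references it is gone.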
+
+## Troubleshooting
+
+### Tests are running on wrong cluster
+- Check `resourceGroupName` parameter points to correct RG
+- Verify RG contains aks-1 and aks-2 clusters
+- Check kubeconfig retrieval in logs
+
+### Setup stages not running
+- Verify `runSetupStages` parameter is set to `true`
+- Check condition: `condition: eq(${{ parameters.runSetupStages }}, true)`
+
+### Schedule not triggering
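+- Verify the cron expression: `"0 */1 * * *"` (every hour at minute 0; Azure Pipelines evaluates cron schedules in UTC)
+- Check that the branch listed under `branches.include` in the schedule (`sv2-long-running-pipeline`) is the branch you expect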
+- Ensure `always: true` is set (runs even without code changes)
+
+### PNI stuck with "ReservationInUse"
+- Check if pods were deleted first (Phase 1 logs)
+- Manual fix: Delete pod → Wait 10s → Patch PNI to remove finalizers
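The manual fix can be scripted as below. This is a sketch: `podnetworkinstance` is the assumed resource name for PNIs and `force_release_pni` is a hypothetical name. Removing finalizers skips whatever cleanup they guard, so only do it after confirming no pod still uses the PNI.

```bash
#!/usr/bin/env bash
# Sketch of the manual fix: delete the pod, wait, then strip the PNI finalizers.
# "podnetworkinstance" is an assumed resource name. KUBECTL defaults to kubectl.

KUBECTL="${KUBECTL:-kubectl}"

# Usage: force_release_pni <namespace> <pod> <pni> [wait_seconds]
force_release_pni() {
  local ns="$1" pod="$2" pni="$3" wait_seconds="${4:-10}"
  "$KUBECTL" delete pod "$pod" -n "$ns" --ignore-not-found
  sleep "$wait_seconds"
  # A JSON merge patch that sets finalizers to null removes the whole list.
  "$KUBECTL" patch podnetworkinstance "$pni" -n "$ns" \
    --type=merge -p '{"metadata":{"finalizers":null}}'
}
```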
+
+### Pipeline timeout after 6 hours
+- The job timeout is 360 minutes (`timeoutInMinutes: 360`), so a hung run is cancelled after 6 hours
+- Tests should complete in ~30-40 minutes
+- If tests hang, check deletion logs for stuck resources
+
+## Manual Testing
+
+Run locally against existing infrastructure:
+
+```bash
+export RG="sv2-long-run-centraluseuap"  # Match your resource group
+export BUILD_ID="$RG"  # Use same RG name as BUILD_ID for unique resource names
+
+cd test/integration/swiftv2/longRunningCluster
+ginkgo -v -trace --timeout=6h .
+```
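Running the tests locally presumably also requires kubeconfigs for aks-1 and aks-2 (the pipeline retrieves them, per the troubleshooting notes). A sketch for fetching them follows; `fetch_kubeconfigs` is a hypothetical name and the output location `<dir>/<cluster>.kubeconfig` is a hypothetical choice, so check the test helpers for the paths or environment variables they actually read.

```bash
#!/usr/bin/env bash
# Sketch: fetch kubeconfigs for aks-1 and aks-2 before running Ginkgo locally.
# The <dir>/<cluster>.kubeconfig layout is hypothetical. AZ defaults to az.

AZ="${AZ:-az}"

# Usage: fetch_kubeconfigs <resource-group> <output-dir>
fetch_kubeconfigs() {
  local rg="$1" dir="$2" cluster
  mkdir -p "$dir"
  for cluster in aks-1 aks-2; do
    "$AZ" aks get-credentials -g "$rg" -n "$cluster" \
      --file "$dir/$cluster.kubeconfig" --overwrite-existing
  done
}
```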
+
+## Node Pool Configuration
+
+- **Low-NIC nodes** (`Standard_D4s_v3`): 1 NIC, label `agentpool!=nplinux`
+  - Can only run 1 pod at a time
+  
+- **High-NIC nodes** (`Standard_D16s_v3`): 7 NICs, label `agentpool=nplinux`
+  - Currently limited to 1 pod per node in test logic
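To see which nodes fall into each pool, the `agentpool` label selectors above can be used directly. `list_pool_nodes` is a hypothetical wrapper and `KUBECTL` defaults to `kubectl`. Note that a `!=` selector also matches nodes that lack the label entirely.

```bash
#!/usr/bin/env bash
# Sketch: list nodes per pool using the agentpool label described above.
# list_pool_nodes is a hypothetical wrapper. KUBECTL defaults to kubectl.

KUBECTL="${KUBECTL:-kubectl}"

# Usage: list_pool_nodes <low|high>
list_pool_nodes() {
  local pool="$1"
  case "$pool" in
    low)  "$KUBECTL" get nodes -l 'agentpool!=nplinux' -o name ;;
    high) "$KUBECTL" get nodes -l 'agentpool=nplinux' -o name ;;
    *)    echo "unknown pool: $pool" >&2; return 2 ;;
  esac
}
```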
+
+## Schedule Modification
+
+To change test frequency, edit the cron schedule:
+
+```yaml
+schedules:
+  - cron: "0 */1 * * *"  # Every 1 hour (current)
+  # Examples:
+  # - cron: "0 */2 * * *"  # Every 2 hours
+  # - cron: "0 */6 * * *"  # Every 6 hours
+  # - cron: "0 0,8,16 * * *"  # At 00:00, 08:00, and 16:00 UTC
+  # - cron: "0 0 * * *"  # Daily at midnight UTC
+```
+
+## File Structure
+
+```
+.pipelines/swiftv2-long-running/
+├── pipeline.yaml                    # Main pipeline with schedule
+├── README.md                        # This file
+├── template/
+│   └── long-running-pipeline-template.yaml  # Stage definitions (2 jobs)
+└── scripts/
+    ├── create_aks.sh               # AKS cluster creation
+    ├── create_vnets.sh             # VNet and subnet creation
+    ├── create_peerings.sh          # VNet peering setup
+    ├── create_storage.sh           # Storage account creation
+    ├── create_nsg.sh               # Network security groups
+    └── create_pe.sh                # Private endpoint setup
+
+test/integration/swiftv2/longRunningCluster/
+├── datapath_test.go                # Original combined test (deprecated)
+├── datapath_create_test.go         # Create test scenarios (Job 1)
+├── datapath_delete_test.go         # Delete test scenarios (Job 2)
+├── datapath.go                     # Resource orchestration
+└── helpers/
+    └── az_helpers.go               # Azure/kubectl helper functions
+```
+
+## Best Practices
+
+1. **Keep infrastructure persistent**: Only recreate when necessary (cluster upgrades, config changes)
+2. **Monitor scheduled runs**: Set up alerts for test failures
+3. **Resource naming**: BUILD_ID is automatically set to the resource group name, ensuring unique resource names per setup
+4. **Tag resources appropriately**: The following setup resources are automatically tagged with `SkipAutoDeleteTill=2032-12-31`:
+   - AKS clusters
+   - AKS VNets
+   - Customer VNets (cx_vnet_a1, cx_vnet_a2, cx_vnet_a3, cx_vnet_b1)
+   - Storage accounts
+5. **Avoid resource group collisions**: Always use unique `resourceGroupName` when creating new setups
+6. **Document changes**: Update this README when modifying test scenarios or infrastructure
+
+## Resource Tags
+
+Infrastructure resources are tagged during setup:
+
+```bash
+SkipAutoDeleteTill=2032-12-31
+```
+
+This prevents automatic cleanup by Azure subscription policies that delete resources after a certain period. The tag is applied to:
+- Resource group (via create_resource_group job)
+- AKS clusters (aks-1, aks-2)
+- AKS cluster VNets
+- Customer VNets (cx_vnet_a1, cx_vnet_a2, cx_vnet_a3, cx_vnet_b1)
+- Storage accounts (sa1xxxx, sa2xxxx)
+
+To manually update the tag date:
+```bash
+az resource update --ids <resource-id> --set tags.SkipAutoDeleteTill=2033-12-31
+```
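To find resources that are missing the tag and add or refresh it in bulk, a sketch like the one below may help. The function names are hypothetical. It uses `az tag update --operation Merge`, which keeps the resource's other tags, and `az resource list` only covers resources inside the group, not the group itself.

```bash
#!/usr/bin/env bash
# Sketch: audit and refresh the SkipAutoDeleteTill tag in one resource group.
# Function names are hypothetical. AZ defaults to the Azure CLI.

AZ="${AZ:-az}"

# Print the IDs of resources in <rg> that have no SkipAutoDeleteTill tag.
list_untagged() {
  local rg="$1"
  # The JMESPath literal `null` sits in single quotes so bash leaves the backticks alone.
  "$AZ" resource list -g "$rg" \
    --query '[?tags.SkipAutoDeleteTill == `null`].id' -o tsv
}

# Usage: set_skip_tag <yyyy-mm-dd> <resource-id>...
set_skip_tag() {
  local till="$1" id
  shift
  for id in "$@"; do
    "$AZ" tag update --resource-id "$id" --operation Merge \
      --tags "SkipAutoDeleteTill=${till}" -o none
  done
}
```

For example, `mapfile -t ids < <(list_untagged "$RG")` followed by `set_skip_tag 2033-12-31 "${ids[@]}"` adds the tag wherever it is missing.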
@@ -1,4 +1,14 @@
 trigger: none
+pr: none
+
+# Schedule: Run every 1 hour
+schedules:
+  - cron: "0 */1 * * *"  # Every 1 hour at minute 0
+    displayName: "Run tests every 1 hour"
+    branches:
+      include:
+        - sv2-long-running-pipeline
+    always: true  # Run even if there are no code changes
 
 parameters:
   - name: subscriptionId
@@ -12,9 +22,14 @@ parameters:
     default: "centraluseuap"
 
   - name: resourceGroupName
-    displayName: "Resource Group Name"
+    displayName: "Resource Group Name (IMPORTANT: Change this when creating new setup to avoid collisions!)"
     type: string
-    default: "long-run-$(Build.BuildId)"
+    default: "sv2-long-run-${{ parameters.location }}"
+
+  - name: runSetupStages
+    displayName: "Create new setup (AKS + Network) - WARNING: Set resourceGroupName to a unique value if creating a new setup!"
+    type: boolean
+    default: false
 
   - name: vmSkuDefault
     displayName: "VM SKU for Default Node Pool"
@@ -29,7 +44,6 @@ parameters:
   - name: serviceConnection
     displayName: "Azure Service Connection"
     type: string
-    #default: "Azure Network Agent - Test Standalone - Service Connection"
     default: "Azure Container Networking - Standalone Test Service Connection"
 
 extends:
@@ -41,3 +55,4 @@ extends:
     vmSkuDefault: ${{ parameters.vmSkuDefault }}
     vmSkuHighNIC: ${{ parameters.vmSkuHighNIC }}
     serviceConnection: ${{ parameters.serviceConnection }}
+    runSetupStages: ${{ parameters.runSetupStages }}
@@ -69,17 +69,24 @@ for i in $(seq 1 "$CLUSTER_COUNT"); do
 
     make -C ./hack/aks azcfg AZCLI=az REGION=$LOCATION
 
+    # Create the cluster; the SkipAutoDeleteTill tag is added right after via az aks update
     make -C ./hack/aks swiftv2-podsubnet-cluster-up \
       AZCLI=az REGION=$LOCATION \
       SUB=$SUBSCRIPTION_ID \
       GROUP=$RG \
       CLUSTER=$CLUSTER_NAME \
       VM_SIZE=$VM_SKU_DEFAULT
+    
+    # Add SkipAutoDeleteTill tag to cluster (2032-12-31 for long-term persistence)
+    az aks update -g "$RG" -n "$CLUSTER_NAME" --tags SkipAutoDeleteTill=2032-12-31 || echo "Warning: Failed to add tag to cluster"
 
     wait_for_provisioning "$RG" "$CLUSTER_NAME"
 
     vnet_id=$(az network vnet show -g "$RG" --name "$CLUSTER_NAME" --query id -o tsv)
     echo "Found VNET: $vnet_id"
+    
+    # Add SkipAutoDeleteTill tag to AKS VNet
+    az network vnet update --ids "$vnet_id" --set tags.SkipAutoDeleteTill=2032-12-31 || echo "Warning: Failed to add tag to vnet"
 
     stamp_vnet "$vnet_id"
 
 
@@ -26,6 +26,7 @@ for SA in "$SA1" "$SA2"; do
     --allow-shared-key-access false \
     --https-only true \
     --min-tls-version TLS1_2 \
+    --tags SkipAutoDeleteTill=2032-12-31 \
     --query "name" -o tsv \
   && echo "Storage account $SA created successfully."
   # Verify creation success
 
@@ -53,7 +53,8 @@ create_vnet_subets() {
   local extra_cidrs="$5"
 
   echo "==> Creating VNet: $vnet with CIDR: $vnet_cidr"
-  az network vnet create -g "$RG" -l "$LOCATION" --name "$vnet" --address-prefixes "$vnet_cidr" -o none
+  az network vnet create -g "$RG" -l "$LOCATION" --name "$vnet" --address-prefixes "$vnet_cidr" \
+    --tags SkipAutoDeleteTill=2032-12-31 -o none
 
   IFS=' ' read -r -a extra_subnet_array <<< "$extra_subnets"
   IFS=',' read -r -a extra_cidr_array <<< "$extra_cidrs"