
Commit 873c05e

Author: sivakami
Commit message: Update readme file.
1 parent 896ff1f commit 873c05e

File tree: 1 file changed (+14/-237 lines)

.pipelines/swiftv2-long-running/README.md

Lines changed: 14 additions & 237 deletions
@@ -50,33 +50,11 @@ Examples: sv2-long-run-12345, sv2-long-run-67890
 - **Lifecycle**: Can be cleaned up after testing completes
 - **Example**: PR validation run with Build ID 12345 → `sv2-long-run-12345`
 
-**3. Parallel/Custom Environments**:
-```
-Pattern: sv2-long-run-<region>-<suffix>
-Examples: sv2-long-run-centraluseuap-dev, sv2-long-run-eastus-staging
-```
-- **When to use**: Parallel environments, feature testing, version upgrades
-- **Purpose**: Isolated environment alongside production
-- **Lifecycle**: Persistent or temporary based on use case
-- **Example**: Development environment in Central US EUAP → `sv2-long-run-centraluseuap-dev`
-
 **Important Notes**:
-- ⚠️ Always follow the naming pattern for scheduled runs on master: `sv2-long-run-<region>`
-- ⚠️ Do not use build IDs for production scheduled infrastructure (it breaks continuity)
-- ⚠️ Region name should match the `location` parameter for consistency
-- ✅ All resource names within the setup use the resource group name as BUILD_ID prefix
-
-### Mode 1: Scheduled Test Runs (Default)
-**Trigger**: Automated cron schedule every 1 hour
-**Purpose**: Continuous validation of long-running infrastructure
-**Setup Stages**: Disabled
-**Test Duration**: ~30-40 minutes per run
-**Resource Group**: Static (default: `sv2-long-run-<region>`, e.g., `sv2-long-run-centraluseuap`)
+- Always follow the naming pattern for scheduled runs on master: `sv2-long-run-<region>`
+- Do not use build IDs for production scheduled infrastructure (it breaks continuity)
+- All resource names within the setup use the resource group name as BUILD_ID prefix
 
-```yaml
-# Runs automatically every 1 hour
-# No manual/external triggers allowed
-```
 
 ### Mode 2: Initial Setup or Rebuild
 **Trigger**: Manual run with parameter change
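
The naming conventions above reduce to simple string assembly; a minimal bash sketch (the `REGION` and `BUILD_ID` shell variables are illustrative stand-ins, not names the pipeline defines):

```bash
#!/usr/bin/env bash
REGION="centraluseuap"   # should match the pipeline's `location` parameter
BUILD_ID="12345"         # e.g. $(Build.BuildId) in Azure Pipelines

# Scheduled runs on master: stable, region-keyed resource group.
echo "sv2-long-run-${REGION}"     # -> sv2-long-run-centraluseuap

# PR validation / temporary runs: build-keyed, safe to clean up later.
echo "sv2-long-run-${BUILD_ID}"   # -> sv2-long-run-12345
```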
@@ -120,15 +98,6 @@ Parameters are organized by usage:
 |-----------|---------|-------------|
 | `resourceGroupName` | `""` (empty) | **Leave empty** to auto-generate based on usage pattern. See Resource Group Naming Conventions below. |
 
-**Resource Group Naming Conventions**:
-- **For scheduled runs on master/main branch**: Use `sv2-long-run-<region>` (e.g., `sv2-long-run-centraluseuap`)
-  - This ensures consistent naming for production scheduled tests
-  - Example: Creating infrastructure in `centraluseuap` for scheduled runs → `sv2-long-run-centraluseuap`
-- **For test/dev runs or PR validation**: Use `sv2-long-run-$(Build.BuildId)`
-  - Auto-cleanup after testing
-  - Example: `sv2-long-run-12345` (where 12345 is the build ID)
-- **For parallel environments**: Use descriptive suffix (e.g., `sv2-long-run-centraluseuap-dev`, `sv2-long-run-eastus-staging`)
-
 **Note**: VM SKUs are hardcoded as constants in the pipeline template:
 - Default nodepool: `Standard_D4s_v3` (low-nic capacity, 1 NIC)
 - NPLinux nodepool: `Standard_D16s_v3` (high-nic capacity, 7 NICs)
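
For reference, recreating a comparable high-NIC pool by hand might look like the sketch below (resource group, cluster name, and node count are placeholders; the pipeline's own templating may differ):

```bash
# Illustrative only; the pipeline template hardcodes these SKUs.
az aks nodepool add \
  --resource-group sv2-long-run-centraluseuap \
  --cluster-name aks-1 \
  --name nplinux \
  --node-vm-size Standard_D16s_v3 \
  --node-count 2 \
  --labels workload-type=swiftv2-linux nic-capacity=high-nic
```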
@@ -161,49 +130,42 @@ The pipeline is organized into stages based on workload type, allowing sequentia
 ### Future Stages (Planned Architecture)
 Additional stages can be added to test different workload types sequentially:
 
-**Example: Stage 3 - BYONodeDataPathTests**
+**Example: Stage 3 - LinuxBYONodeDataPathTests**
 ```yaml
-- stage: BYONodeDataPathTests
+- stage: LinuxBYONodeDataPathTests
   displayName: "SwiftV2 Data Path Tests - BYO Node ID"
   dependsOn: ManagedNodeDataPathTests
   variables:
-    WORKLOAD_TYPE: "swiftv2-byonodeid"
+    WORKLOAD_TYPE: "swiftv2-linuxbyon"
   # Same job structure as ManagedNodeDataPathTests
   # Tests run on nodes labeled: workload-type=swiftv2-byonodeid
 ```
 
-**Example: Stage 4 - WindowsNodeDataPathTests**
+**Example: Stage 4 - L1vhAccelnetNodeDataPathTests**
 ```yaml
-- stage: WindowsNodeDataPathTests
-  displayName: "SwiftV2 Data Path Tests - Windows Nodes"
+- stage: L1vhAccelnetNodeDataPathTests
+  displayName: "SwiftV2 Data Path Tests - Windows Nodes Accelnet"
   dependsOn: BYONodeDataPathTests
   variables:
     WORKLOAD_TYPE: "swiftv2-windows"
   # Same job structure
   # Tests run on nodes labeled: workload-type=swiftv2-windows
 ```
 
-**Benefits of Stage-Based Architecture**:
-- ✅ Sequential execution: Each workload type tested independently
-- ✅ Isolated node pools: No resource contention between workload types
-- ✅ Same infrastructure: All stages use the same VNets, storage, NSGs
-- ✅ Same test suite: Connectivity and private endpoint tests run for each workload type
-- ✅ Easy extensibility: Add new stages without modifying existing ones
-- ✅ Clear results: Separate test results per workload type
-
 **Node Labeling for Multiple Workload Types**:
 Each node pool gets labeled with its designated workload type during setup:
 ```bash
 # During cluster creation or node pool addition:
-kubectl label nodes -l agentpool=nodepool1 workload-type=swiftv2-linux
-kubectl label nodes -l agentpool=byonodepool workload-type=swiftv2-byonodeid
-kubectl label nodes -l agentpool=winnodepool workload-type=swiftv2-windows
+kubectl label nodes -l agentpool=<linux-pool> workload-type=swiftv2-linux
+kubectl label nodes -l agentpool=<byon-pool> workload-type=swiftv2-linuxbyon
+kubectl label nodes -l agentpool=<accelnet-pool> workload-type=swiftv2-l1vhaccelnet
+kubectl label nodes -l agentpool=<ib-pool> workload-type=swiftv2-l1vhib
 ```
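
(`kubectl label` needs both a node selector and a `key=value` pair to apply; the pool names in angle brackets above are placeholders.) The resulting labels can be sanity-checked afterwards, for example:

```bash
# Show the workload-type label as a column for every node.
kubectl get nodes -L workload-type

# List only the nodes designated for a given workload type.
kubectl get nodes -l workload-type=swiftv2-linux -o name
```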
 
 ## How It Works
 
 ### Scheduled Test Flow
-Every 1 hour, the pipeline:
+Every 3 hours, the pipeline:
 1. Skips setup stages (infrastructure already exists)
 2. **Job 1 - Create Resources**: Creates 8 test scenarios (PodNetwork, PNI, Pods with HTTP servers on port 8080)
 3. **Job 2 - Connectivity Tests**: Tests HTTP connectivity between pods (9 test cases), then waits 20 minutes
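
Conceptually, each connectivity case in Job 2 reduces to an HTTP probe between two test pods. A hedged sketch, with pod names taken from this README's examples (the pipeline's actual test harness may differ, and the pod image is assumed to ship `curl`):

```bash
# Illustrative only: probe one pod's HTTP server (port 8080) from another.
NS="pn-static-setup-a1-s1"
SERVER_IP=$(kubectl get pod pod-c1-aks1-a1s1-low -n "$NS" \
  -o jsonpath='{.status.podIP}')
kubectl exec -n "$NS" pod-c1-aks1-a1s1-high -- \
  curl -s --max-time 5 "http://${SERVER_IP}:8080" \
  && echo "connectivity OK"
```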
@@ -361,142 +323,6 @@ pod-c1-aks1-a1s1-low
 
 **All infrastructure resources are tagged with `SkipAutoDeleteTill=2032-12-31`** to prevent automatic cleanup by Azure subscription policies.
 
-## Resource Naming
-
-All test resources use the pattern: `<type>-static-setup-<vnet>-<subnet>`
-
-**Examples**:
-- PodNetwork: `pn-static-setup-a1-s1`
-- PodNetworkInstance: `pni-static-setup-a1-s1`
-- Pod: `pod-c1-aks1-a1s1-low`
-- Namespace: `pn-static-setup-a1-s1`
-
-VNet names are simplified:
-- `cx_vnet_a1` → `a1`
-- `cx_vnet_b1` → `b1`
-
-## Switching to a New Setup
-
-**Scenario**: You created a new setup in RG `sv2-long-run-eastus` and want scheduled runs to use it.
-
-**Steps**:
-1. Go to Pipeline → Edit
-2. Update the `location` parameter's default value:
-```yaml
-- name: location
-  default: "eastus" # changed from "centraluseuap"
-```
-3. Save and commit
-4. The RG name will automatically become `sv2-long-run-eastus`
-
-Alternatively, manually trigger with the new location or override `resourceGroupName` directly.
-
-## Creating Multiple Test Setups
-
-**Use Case**: You want to create a new test environment without affecting the existing one (e.g., for testing different configurations, regions, or versions).
-
-**Steps**:
-1. Go to Pipeline → Run pipeline
-2. Set `runSetupStages` = `true`
-3. **Set `resourceGroupName`** based on usage:
-   - **For scheduled runs on master/main branch**: `sv2-long-run-<region>` (e.g., `sv2-long-run-centraluseuap`, `sv2-long-run-eastus`)
-     - Use this naming pattern for production scheduled tests
-   - **For test/dev runs**: `sv2-long-run-$(Build.BuildId)` or custom (e.g., `sv2-long-run-12345`)
-     - For temporary testing or PR validation
-   - **For parallel environments**: Custom with a descriptive suffix (e.g., `sv2-long-run-centraluseuap-dev`, `sv2-long-run-centraluseuap-v2`)
-4. Optionally adjust `location`
-5. Run pipeline
-
-**After setup completes**:
-- The new infrastructure will be tagged with `SkipAutoDeleteTill=2032-12-31`
-- Resources are isolated by the unique resource group name
-- To run tests against the new setup, the scheduled pipeline would need to be updated with the new RG name
-
-**Example Scenarios**:
-| Scenario | Resource Group Name | Purpose | Naming Pattern |
-|----------|-------------------|---------|----------------|
-| Production scheduled (Central US EUAP) | `sv2-long-run-centraluseuap` | Daily scheduled tests on master | `sv2-long-run-<region>` |
-| Production scheduled (East US) | `sv2-long-run-eastus` | Regional scheduled testing on master | `sv2-long-run-<region>` |
-| Temporary test run | `sv2-long-run-12345` | One-time testing (Build ID: 12345) | `sv2-long-run-$(Build.BuildId)` |
-| Development environment | `sv2-long-run-centraluseuap-dev` | Development/testing | Custom with suffix |
-| Version upgrade testing | `sv2-long-run-centraluseuap-v2` | Parallel environment for upgrades | Custom with suffix |
-
-## Resource Naming
-
-The pipeline uses the **resource group name as the BUILD_ID** to ensure unique resource names per test setup. This allows multiple parallel test environments without naming collisions.
-
-**Generated Resource Names**:
-```
-BUILD_ID = <resourceGroupName>
-
-PodNetwork:         pn-<BUILD_ID>-<vnet>-<subnet>
-PodNetworkInstance: pni-<BUILD_ID>-<vnet>-<subnet>
-Namespace:          pn-<BUILD_ID>-<vnet>-<subnet>
-Pod:                pod-<scenario-suffix>
-```
-
-**Example for `resourceGroupName=sv2-long-run-centraluseuap`**:
-```
-pn-sv2-long-run-centraluseuap-b1-s1   (PodNetwork for cx_vnet_b1, subnet s1)
-pni-sv2-long-run-centraluseuap-b1-s1  (PodNetworkInstance)
-pn-sv2-long-run-centraluseuap-a1-s1   (PodNetwork for cx_vnet_a1, subnet s1)
-pni-sv2-long-run-centraluseuap-a1-s2  (PodNetworkInstance for cx_vnet_a1, subnet s2)
-```
-
-**Example for a different setup, `resourceGroupName=sv2-long-run-eastus`**:
-```
-pn-sv2-long-run-eastus-b1-s1   (different from the centraluseuap setup)
-pni-sv2-long-run-eastus-b1-s1
-pn-sv2-long-run-eastus-a1-s1
-```
-
-This ensures **no collision** between different test setups running in parallel.
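
Since the generated names are pure concatenation, they are easy to reproduce in a shell; a sketch (variable names illustrative):

```bash
BUILD_ID="sv2-long-run-centraluseuap"   # = resourceGroupName
VNET="b1"
SUBNET="s1"

echo "pn-${BUILD_ID}-${VNET}-${SUBNET}"    # pn-sv2-long-run-centraluseuap-b1-s1
echo "pni-${BUILD_ID}-${VNET}-${SUBNET}"   # pni-sv2-long-run-centraluseuap-b1-s1
```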
-
-## Deletion Strategy
-### Phase 1: Delete All Pods
-Deletes all pods across all scenarios first. This ensures IP reservations are released.
-
-```
-Deleting pod pod-c2-aks2-b1s1-low...
-Deleting pod pod-c2-aks2-b1s1-high...
-...
-```
-
-### Phase 2: Delete Shared Resources
-Groups resources by vnet/subnet/cluster and deletes PNI/PN/Namespace once per group.
-
-```
-Deleting PodNetworkInstance pni-static-setup-b1-s1...
-Deleting PodNetwork pn-static-setup-b1-s1...
-Deleting namespace pn-static-setup-b1-s1...
-```
-
-**Why**: Multiple pods can share the same PNI. Deleting PNI while pods exist causes "ReservationInUse" errors.
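
The same two-phase order, done by hand, might look like this sketch (names from the examples above; assumes the PodNetwork/PodNetworkInstance CRDs are addressable by those kind names and that the PNI shares its name with the namespace):

```bash
NS="pn-static-setup-b1-s1"

# Phase 1: delete pods first so their IP reservations are released.
kubectl delete pod pod-c2-aks2-b1s1-low pod-c2-aks2-b1s1-high -n "$NS"

# Phase 2: shared resources, once per vnet/subnet group.
kubectl delete podnetworkinstance pni-static-setup-b1-s1 -n "$NS"
kubectl delete podnetwork pn-static-setup-b1-s1
kubectl delete namespace "$NS"
```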
-
-## Troubleshooting
-
-### Tests are running on wrong cluster
-- Check `resourceGroupName` parameter points to correct RG
-- Verify RG contains aks-1 and aks-2 clusters
-- Check kubeconfig retrieval in logs
-
-### Setup stages not running
-- Verify `runSetupStages` parameter is set to `true`
-- Check condition: `condition: eq(${{ parameters.runSetupStages }}, true)`
-
-### Schedule not triggering
-- Verify cron expression: `"0 */1 * * *"` (every 1 hour)
-- Check branch in schedule matches your working branch
-- Ensure `always: true` is set (runs even without code changes)
-
-### PNI stuck with "ReservationInUse"
-- Check if pods were deleted first (Phase 1 logs)
-- Manual fix: Delete pod → Wait 10s → Patch PNI to remove finalizers
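
A sketch of that manual fix (resource and namespace names illustrative; assumes the PNI CRD is addressable as `podnetworkinstance`):

```bash
NS="pn-static-setup-a1-s1"
kubectl delete pod pod-c1-aks1-a1s1-low -n "$NS"
sleep 10
# Clearing finalizers forces the stuck PNI to be removed; use with care.
kubectl patch podnetworkinstance pni-static-setup-a1-s1 -n "$NS" \
  --type=merge -p '{"metadata":{"finalizers":null}}'
```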
-
-### Pipeline timeout after 6 hours
-- This is expected behavior (`timeoutInMinutes: 360`)
-- Tests should complete in ~30-40 minutes
-- If tests hang, check deletion logs for stuck resources
 
 ## Manual Testing
 
@@ -544,21 +370,6 @@ kubectl label nodes -l agentpool=nodepool1 nic-capacity=low-nic --overwrite
 kubectl label nodes -l agentpool=nplinux nic-capacity=high-nic --overwrite
 ```
 
-**Example Node Labels**:
-```yaml
-# Low-NIC node (nodepool1)
-labels:
-  agentpool: nodepool1
-  workload-type: swiftv2-linux
-  nic-capacity: low-nic
-
-# High-NIC node (nplinux)
-labels:
-  agentpool: nplinux
-  workload-type: swiftv2-linux
-  nic-capacity: high-nic
-```
-
 ### Node Selection in Tests
 
 Tests use these labels to select appropriate nodes dynamically:
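
In practice, dynamic selection is just a label filter; for instance, a test could resolve its target node with something like (illustrative):

```bash
# Pick the first node that is both Linux-workload and high-NIC capacity.
HIGH_NIC_NODE=$(kubectl get nodes \
  -l 'workload-type=swiftv2-linux,nic-capacity=high-nic' \
  -o jsonpath='{.items[0].metadata.name}')
echo "running high-NIC test cases on ${HIGH_NIC_NODE}"
```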
@@ -588,20 +399,6 @@ Tests use these labels to select appropriate nodes dynamically:
 
 **Note**: VM SKUs are hardcoded as constants in the pipeline template and cannot be changed by users.
 
-## Schedule Modification
-
-To change test frequency, edit the cron schedule:
-
-```yaml
-schedules:
-  - cron: "0 */1 * * *" # Every 1 hour (current)
-    # Examples:
-    # - cron: "0 */2 * * *" # Every 2 hours
-    # - cron: "0 */6 * * *" # Every 6 hours
-    # - cron: "0 0,8,16 * * *" # At 12am, 8am, 4pm
-    # - cron: "0 0 * * *" # Daily at midnight
-```
-
 ## File Structure
 
 ```
@@ -639,23 +436,3 @@ test/integration/swiftv2/longRunningCluster/
 - Storage accounts
 5. **Avoid resource group collisions**: Always use unique `resourceGroupName` when creating new setups
 6. **Document changes**: Update this README when modifying test scenarios or infrastructure
-
-## Resource Tags
-
-All infrastructure resources are automatically tagged during creation:
-
-```bash
-SkipAutoDeleteTill=2032-12-31
-```
-
-This prevents automatic cleanup by Azure subscription policies that delete resources after a certain period. The tag is applied to:
-- Resource group (via create_resource_group job)
-- AKS clusters (aks-1, aks-2)
-- AKS cluster VNets
-- Customer VNets (cx_vnet_a1, cx_vnet_a2, cx_vnet_a3, cx_vnet_b1)
-- Storage accounts (sa1xxxx, sa2xxxx)
-
-To manually update the tag date:
-```bash
-az resource update --ids <resource-id> --set tags.SkipAutoDeleteTill=2033-12-31
-```
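
To audit which resources still carry the tag, a query along these lines can help (illustrative):

```bash
az resource list --tag SkipAutoDeleteTill=2032-12-31 \
  --query "[].{name:name, type:type}" -o table
```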
