Commit 39229b1

Expand /prow-job:analyze-test-failure skill to process intervals
This surfaces indicators of unusual activity in the cluster at the time an e2e test failed, which provides helpful context when debugging some failures.
1 parent 681ca28 commit 39229b1

1 file changed: +31 -1 lines changed
  • plugins/prow-job/skills/prow-job-analyze-test-failure

plugins/prow-job/skills/prow-job-analyze-test-failure/SKILL.md

Lines changed: 31 additions & 1 deletion
@@ -18,7 +18,9 @@ Identical with "Prow Job Analyze Resource" skill.
 ## Input Format

 The user will provide:
+
 1. **Prow job URL** - gcsweb URL containing `test-platform-results/`
+
    - Example: `https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_hypershift/6731/pull-ci-openshift-hypershift-main-e2e-aws/1962527613477982208`
    - URL may or may not have trailing slash
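
As a rough illustration of the input format described above, the following sketch splits a gcsweb URL into a bucket path and a build ID. The function name `parse_prow_url` is hypothetical, and treating the final numeric path segment as the build ID is an assumption based on the example URL; the authoritative steps are the ones in the "Prow Job Analyze Resource" skill, which may differ.

```python
from urllib.parse import urlparse

# Marker named in the skill text; everything after it is treated as the bucket path.
MARKER = "test-platform-results/"


def parse_prow_url(url):
    """Return (bucket_path, build_id) for a gcsweb Prow job URL.

    Assumption: the build ID is the last path segment and is numeric,
    as in the example URL. A trailing slash is tolerated.
    """
    path = urlparse(url).path
    if MARKER not in path:
        raise ValueError("URL does not contain 'test-platform-results/'")
    bucket_path = path.split(MARKER, 1)[1].strip("/")
    build_id = bucket_path.rsplit("/", 1)[-1]
    if not build_id.isdigit():
        raise ValueError("could not find a numeric build ID at the end of the URL")
    return bucket_path, build_id
```

Stripping slashes from both ends of the bucket path means the same result is produced whether or not the user includes a trailing slash.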

@@ -37,6 +39,7 @@ Use the "Parse and Validate URL" steps from "Prow Job Analyze Resource" skill
 ### Step 2: Create Working Directory

 1. **Check for existing artifacts first**
+
    - Check if `.work/prow-job-analyze-test-failure/{build_id}/logs/` directory exists and has content
    - If it exists with content:
      - Use AskUserQuestion tool to ask:
@@ -70,16 +73,41 @@ Use the "Download and Validate prowjob.json" steps from "Prow Job Analyze Resour
 ### Step 4: Analyze Test Failure

 1. **Download build-log.txt**
+
    ```bash
    gcloud storage cp gs://test-platform-results/{bucket-path}/build-log.txt .work/prow-job-analyze-test-failure/{build_id}/logs/build-log.txt --no-user-output-enabled
    ```

 2. **Parse and validate**
+
    - Read `.work/prow-job-analyze-test-failure/{build_id}/logs/build-log.txt`
    - Search for the Test name
    - Gather stack trace related to the test

-3. **Determine root cause**
+3. **Examine interval files for cluster activity during E2E failures**
+
+   - Search recursively for E2E timeline artifacts (known as "interval files") within the bucket-path:
+     ```bash
+     gcloud storage ls 'gs://test-platform-results/{bucket-path}/**/e2e-timelines_spyglass_*.json'
+     ```
+   - The files can be nested at unpredictable levels below the bucket-path
+   - There could be as many as two matching files
+   - Download all matching interval files (use the full paths from the search results):
+     ```bash
+     gcloud storage cp 'gs://test-platform-results/{bucket-path}/**/e2e-timelines_spyglass_*.json' .work/prow-job-analyze-test-failure/{build_id}/logs/ --no-user-output-enabled
+     ```
+   - If the wildcard copy doesn't work, copy each file individually using the full paths from the search results
+   - **Scan interval files for test failure timing:**
+     - Look for intervals where `source = "E2ETest"` and `message.annotations.status = "Failed"`
+     - Note the `from` and `to` timestamps on this interval; they indicate when the test was running
+   - **Scan interval files for related cluster events:**
+     - Look for intervals that overlap the timeframe when the failed test was running
+     - Filter for intervals with:
+       - `level = "Error"` or `level = "Warning"`
+       - `source = "OperatorState"`
+     - These events may indicate cluster issues that caused or contributed to the test failure
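
The two scanning passes above could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not part of the commit: it assumes each interval file is a JSON object with a top-level `items` list, that `from` and `to` are RFC 3339 timestamps, and that either filter condition (an `Error`/`Warning` level, or an `OperatorState` source) is enough to make an interval interesting. The helper names are hypothetical.

```python
import json
from datetime import datetime


def parse_ts(value):
    """Parse an RFC 3339 timestamp such as 2025-09-01T12:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def load_items(path):
    """Load intervals from a file; assumes a top-level "items" list."""
    with open(path) as fh:
        return json.load(fh).get("items", [])


def failed_test_windows(items):
    """Return (start, end) pairs for intervals that record a failed E2E test."""
    windows = []
    for item in items:
        annotations = (item.get("message") or {}).get("annotations") or {}
        if item.get("source") == "E2ETest" and annotations.get("status") == "Failed":
            windows.append((parse_ts(item["from"]), parse_ts(item["to"])))
    return windows


def related_events(items, start, end):
    """Return interesting intervals that overlap the window [start, end]."""
    hits = []
    for item in items:
        # Assumption: matching either condition qualifies an interval.
        interesting = (
            item.get("level") in ("Error", "Warning")
            or item.get("source") == "OperatorState"
        )
        if not interesting or not item.get("from") or not item.get("to"):
            continue
        # Two intervals overlap when each one starts before the other ends.
        if parse_ts(item["from"]) <= end and parse_ts(item["to"]) >= start:
            hits.append(item)
    return hits
```

A caller would run `failed_test_windows(load_items(path))` on each downloaded file and then pass each window to `related_events`. Note that the failed test's own interval can appear in the results if it carries an `Error` or `Warning` level.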
+
+4. **Determine root cause**
    - Determine a possible root cause for the test failure
    - Analyze stack traces
    - Analyze related code in the code repository
@@ -91,6 +119,7 @@ Use the "Download and Validate prowjob.json" steps from "Prow Job Analyze Resour
 ### Step 5: Present Results to User

 1. **Display summary**
+
    ```text
    Test Failure Analysis Complete

@@ -104,6 +133,7 @@ Use the "Download and Validate prowjob.json" steps from "Prow Job Analyze Resour

    Artifacts downloaded to: .work/prow-job-analyze-test-failure/{build_id}/logs/
    ```
+
 ## Error Handling

 Handle errors in the same way as "Error handling" in "Prow Job Analyze Resource" skill
