1 change: 0 additions & 1 deletion __init__.py
@@ -1,3 +1,2 @@
"""CI Analysis coordinator: provide root cause analysis for CI failures"""

from . import agent
76 changes: 26 additions & 50 deletions ci_analysis_agent/prompt.py
@@ -15,6 +15,30 @@
If you do not know the answer, you acknowledge the fact and end your response.
Your responses must be as short as possible.

CI JOB ANALYSIS WORKFLOW:
-------------------------
When analyzing a job failure, follow this MANDATORY workflow for every job analysis:
0. Parse the Prow job URL provided by the user to extract job_name and build_id
1. Use the installation_analyst subagent to check whether the cluster installation for the given job succeeded. Keep the request very concise: do not pass the full thinking context, only the job_name and build_id.
2. ALWAYS use the e2e_test_analyst subagent to identify test failures and patterns. The e2e_test_analyst subagent MUST return a comprehensive analysis of the e2e test execution, including:
- openshift-tests binary commit information and source code links
- Failed test details with GitHub links to test source code
- Test execution patterns and performance insights
- Root cause analysis of test failures
3. Only if needed for deeper insights, check the must-gather logs for more detailed cluster information

IMPORTANT: Steps 1 and 2 are MANDATORY for every job analysis request. Do not skip e2e analysis.
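
As a rough sketch of this sequencing (the two callables are hypothetical stand-ins for dispatching to the installation_analyst and e2e_test_analyst subagents, which the real coordinator invokes through its agent framework):

```python
from typing import Callable, Dict

def analyze_job(
    job_name: str,
    build_id: str,
    run_installation_analyst: Callable[[str, str], Dict],
    run_e2e_test_analyst: Callable[[str, str], Dict],
) -> Dict:
    # Step 1 (mandatory): check whether the cluster installation succeeded.
    installation = run_installation_analyst(job_name, build_id)
    # Step 2 (mandatory): always analyze the e2e test run as well.
    e2e = run_e2e_test_analyst(job_name, build_id)
    # Step 3 (must-gather) is optional and only for deeper cluster-level debugging.
    return {"installation_analysis": installation, "e2e_analysis": e2e}
```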

At each step, clearly inform the user which subagent is being called and what information is being passed to it.
After each subagent completes its task, explain the output provided and how it contributes to the overall root cause analysis process.
Ensure all state keys are correctly used to pass information between subagents.

IMPORTANT NOTES:
- If any analyst returns an error (starting with "❌"), acknowledge the error and provide the suggested troubleshooting steps
- Always include the manual check URLs provided by the analysts for user verification
- If logs are not available, suggest the user try a more recent job or verify the URL is correct
- Provide clear, actionable recommendations based on the available analysis

URL PARSING GUIDE:
-----------------
Common Prow job URL formats:
@@ -26,8 +50,8 @@
2. JOB_NAME is typically a long string like: periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-upgrade
3. BUILD_ID is a long numeric string like: 1879536719736156160
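
A minimal sketch of step 0, assuming the job_name and build_id are the last two path segments of the Prow URL, as in the examples above:

```python
from urllib.parse import urlparse

def parse_prow_url(url: str) -> tuple[str, str]:
    """Extract job_name and build_id, assuming they are the last two path segments."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    job_name, build_id = parts[-2], parts[-1]
    if not build_id.isdigit():
        raise ValueError(f"Expected a numeric build id, got {build_id!r}")
    return job_name, build_id

# e.g. parse_prow_url(".../periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-upgrade/1879536719736156160")
# -> ("periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-upgrade", "1879536719736156160")
```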

ERROR HANDLING:
--------------
ERROR HANDLING GUIDE:
---------------------
If either analyst returns an error message starting with "❌", this indicates:
1. Invalid job name or build ID
2. Logs not available for this job/build
@@ -39,53 +63,5 @@
3. Suggest the user try a different, more recent job
4. Provide the manual check URL for user verification

CI JOB ANALYSIS WORKFLOW:
-------------------------
When analyzing a job failure, follow this MANDATORY workflow for every job analysis:
1. ALWAYS start with installation analysis to understand the cluster setup
2. ALWAYS perform e2e test analysis to identify test failures and patterns
3. Only if needed for deeper insights, check the must-gather logs for more detailed cluster information

IMPORTANT: Steps 1 and 2 are MANDATORY for every job analysis request. Do not skip e2e analysis.

At each step, clearly inform the user about the current subagent being called and the specific information required from them.
After each subagent completes its task, explain the output provided and how it contributes to the overall root cause analysis process.
Ensure all state keys are correctly used to pass information between subagents.
Here's the step-by-step breakdown.
For each step, explicitly call the designated subagent and adhere strictly to the specified input and output formats:

* Installation Analysis (Subagent: installation_analyst) - MANDATORY

Input: Prompt the user to provide the link to the prow job they wish to analyze.
Action: Parse the URL for the job_name and build_id. Call the installation_analyst subagent, passing the user-provided job_name and build_id.
Expected Output: The installation_analyst subagent MUST return the job's job_name, build_id, test_name and a comprehensive data analysis for the installation of the cluster for the given job.

* E2E Test Analysis (Subagent: e2e_test_analyst) - MANDATORY

Input: The installation_analysis_output from the installation_analyst subagent.
Action: ALWAYS call the e2e_test_analyst subagent, passing the job_name and build_id from the installation analysis. This will analyze the e2e test logs, extract openshift-tests binary commit information, identify failed tests, and provide source code links.
Expected Output: The e2e_test_analyst subagent MUST return a comprehensive analysis of the e2e test execution, including:
- openshift-tests binary commit information and source code links
- Failed test details with GitHub links to test source code
- Test execution patterns and performance insights
- Root cause analysis of test failures

* Must_Gather Analysis (Subagent: mustgather_analyst) - OPTIONAL

Input: The installation_analysis_output from the installation_analyst subagent. Use /tmp/must-gather as the target_folder for the must-gather directory.
Action: Only call if additional cluster-level debugging is needed. Call the mustgather_analyst subagent, passing the job_name, test_name and build_id. Download the must-gather logs: use /tmp/must-gather as the target_folder. Then analyze them by navigating the directory structure, reading files and searching for relevant information.
Expected Output: The mustgather_analyst subagent MUST return a comprehensive data analysis for the execution of the given job.

WORKFLOW EXECUTION:
1. Parse the Prow job URL to extract job_name and build_id
2. Call installation_analyst with job_name and build_id
3. IMMEDIATELY call e2e_test_analyst with the same job_name and build_id
4. Provide a comprehensive summary combining both analyses
5. Only call mustgather_analyst if specifically requested or if deeper analysis is needed

IMPORTANT NOTES:
- If any analyst returns an error (starting with "❌"), acknowledge the error and provide the suggested troubleshooting steps
- Always include the manual check URLs provided by the analysts for user verification
- If logs are not available, suggest the user try a more recent job or verify the URL is correct
- Provide clear, actionable recommendations based on the available analysis
"""
62 changes: 42 additions & 20 deletions sub_agents/installation_analyst/agent.py
@@ -6,6 +6,7 @@

import asyncio
import httpx
import re
import threading
import concurrent.futures
import re
@@ -30,8 +31,8 @@ def extract_installation_info(log_content: str) -> Dict[str, Any]:

# Extract openshift-install version and commit (can be on separate lines)
version_patterns = [
r'openshift-install v([^\s"]+)',
r'"openshift-install v([^\s"]+)"'
r'openshift-install v([0-9][^\s"]+)',
r'"openshift-install v([0-9][^\s"]+)"'
]

for pattern in version_patterns:
@@ -135,36 +136,57 @@ def extract_installation_info(log_content: str) -> Dict[str, Any]:

return install_info
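
The tightened patterns above require the version to begin with a digit, presumably so that text such as "openshift-install version" no longer produces a bogus match; a quick illustration with made-up log lines:

```python
import re

pattern = r'openshift-install v([0-9][^\s"]+)'

# Illustrative log lines, not taken from a real build-log.txt
assert re.search(pattern, 'openshift-install v4.20.0-ec.2').group(1) == "4.20.0-ec.2"
assert re.search(pattern, 'openshift-install version') is None  # non-version text no longer matches
```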

def get_job_metadata(raw_data: str) -> Dict[str, Any]:
# Extract test_name from data using regex
test_name = None
match_test_name = re.search(r"Running multi-stage test ([^\s]*)", raw_data)
if match_test_name:
test_name = match_test_name.group(1)
print(f"test_name: {test_name}")
status = None
match_status = re.search(r"Reporting job state '([^']*)'", raw_data)
if match_status:
status = match_status.group(1)
print(f"status: {status}")
data = {
"status": status,
"test_name": test_name,
}

match_reason = re.search(r"Reporting job state '([^']*)' with reason '([^']*)'", raw_data)
if match_reason:
status = match_reason.group(1)
failure_reason = match_reason.group(2)
print(f"failure_reason: {failure_reason}")
data["failure_reason"] = failure_reason


return data
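
For illustration, here is what get_job_metadata extracts from a build log containing the two messages its regexes look for; the snippet below is fabricated, not taken from a real build-log.txt:

```python
sample_log = """
Running multi-stage test e2e-aws-ovn-upgrade
...
Reporting job state 'failed' with reason 'step e2e-aws-ovn-upgrade-openshift-e2e-test failed'
"""

metadata = get_job_metadata(sample_log)
# -> {"status": "failed",
#     "test_name": "e2e-aws-ovn-upgrade",
#     "failure_reason": "step e2e-aws-ovn-upgrade-openshift-e2e-test failed"}
```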
# Prow tool functions for installation analysis
async def get_job_metadata_async(job_name: str, build_id: str) -> Dict[str, Any]:
"""Get the metadata and status for a specific Prow job name and build id."""
url = f"{GCS_URL}/{job_name}/{build_id}/prowjob.json"
url = f"{GCS_URL}/{job_name}/{build_id}/build-log.txt"
try:
async with httpx.AsyncClient() as client:
response = await client.get(url)
response.raise_for_status()
data = response.json()
data = response.text

if not data:
return {"error": "No response from Prow API"}

job_spec = data.get("spec", {})
job_status = data.get("status", {})

build_id_from_status = job_status.get("build_id")
status = job_status.get("state")
args = job_spec.get("pod_spec", {}).get("containers", [])[0].get("args", [])
test_name = ""
for arg in args:
if arg.startswith("--target="):
test_name = arg.replace("--target=", "")

return {
"status": status,
"build_id": build_id_from_status,
print(f"was able to get build-log.txt")
somedata = get_job_metadata(data)
metadata = {
"build_id": build_id,
"job_name": job_name,
"test_name": test_name
"test_name": somedata["test_name"],
# "job_overall_status": somedata["status"],
# "job_overall_failure_reason": somedata["failure_reason"]
}
print(f"metadata: {metadata}")

return metadata


except Exception as e:
return {"error": f"Failed to fetch job info: {str(e)}"}
4 changes: 2 additions & 2 deletions sub_agents/installation_analyst/prompt.py
@@ -49,8 +49,8 @@ def get_user_prompt():
- get_install_logs: Fetch and analyze build-log.txt with structured information extraction

ANALYSIS WORKFLOW:
1. Start with job metadata to understand the test context
2. Fetch installation logs from build-log.txt which automatically extracts:
1. Call the get_job_metadata tool, which provides the test_name
2. Fetch installation logs from build-log.txt (by calling the get_install_logs tool with build_id, job_name, and test_name), which automatically extracts:
- Installer binary version and commit
- Instance types and cluster configuration
- Installation duration and success status