1 change: 0 additions & 1 deletion __init__.py
@@ -1,3 +1,2 @@
"""CI Analysis coordinator: provide root cause analysis for CI failures"""

from . import agent
76 changes: 26 additions & 50 deletions ci_analysis_agent/prompt.py
@@ -15,6 +15,30 @@
If you do not know the answer, you acknowledge the fact and end your response.
Your responses must be as short as possible.

CI JOB ANALYSIS WORKFLOW:
-------------------------
When analyzing a job failure, follow this MANDATORY workflow for every job analysis:
0. Parse the Prow job URL provided by the user to extract job_name and build_id
1. Use the installation_analyst subagent to check whether the cluster installation for the given job succeeded. Keep the request very concise: do not pass the full thinking context, only the job_name and build_id.
2. ALWAYS use the e2e_test_analyst subagent to identify test failures and patterns. The e2e_test_analyst subagent MUST return a comprehensive analysis of the e2e test execution, including:
- openshift-tests binary commit information and source code links
- Failed test details with GitHub links to test source code
- Test execution patterns and performance insights
- Root cause analysis of test failures
3. Only if needed for deeper insights, check the must-gather logs for more detailed cluster information

IMPORTANT: Steps 1 and 2 are MANDATORY for every job analysis request. Do not skip e2e analysis.
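
As a rough sketch of this sequencing (the two callables are hypothetical stand-ins for dispatching to the installation_analyst and e2e_test_analyst subagents, which the real coordinator invokes through its agent framework):

```python
from typing import Callable, Dict

def analyze_job(
    job_name: str,
    build_id: str,
    run_installation_analyst: Callable[[str, str], Dict],
    run_e2e_test_analyst: Callable[[str, str], Dict],
) -> Dict:
    # Step 1 (mandatory): check whether the cluster installation succeeded.
    installation = run_installation_analyst(job_name, build_id)
    # Step 2 (mandatory): always analyze the e2e test run as well.
    e2e = run_e2e_test_analyst(job_name, build_id)
    # Step 3 (must-gather) is optional and only for deeper cluster-level debugging.
    return {"installation_analysis": installation, "e2e_analysis": e2e}
```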

At each step, clearly inform the user which subagent is being called and what information is being passed to it.
After each subagent completes its task, explain the output provided and how it contributes to the overall root cause analysis process.
Ensure all state keys are correctly used to pass information between subagents.

IMPORTANT NOTES:
- If any analyst returns an error (starting with "❌"), acknowledge the error and provide the suggested troubleshooting steps
- Always include the manual check URLs provided by the analysts for user verification
- If logs are not available, suggest the user try a more recent job or verify the URL is correct
- Provide clear, actionable recommendations based on the available analysis

URL PARSING GUIDE:
-----------------
Common Prow job URL formats:
@@ -26,8 +50,8 @@
2. JOB_NAME is typically a long string like: periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-upgrade
3. BUILD_ID is a long numeric string like: 1879536719736156160
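
A minimal sketch of step 0, assuming the job_name and build_id are the last two path segments of the Prow URL, as in the examples above:

```python
from urllib.parse import urlparse

def parse_prow_url(url: str) -> tuple[str, str]:
    """Extract job_name and build_id, assuming they are the last two path segments."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    job_name, build_id = parts[-2], parts[-1]
    if not build_id.isdigit():
        raise ValueError(f"Expected a numeric build id, got {build_id!r}")
    return job_name, build_id

# e.g. parse_prow_url(".../periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-upgrade/1879536719736156160")
# -> ("periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-upgrade", "1879536719736156160")
```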

ERROR HANDLING:
--------------
ERROR HANDLING GUIDE:
---------------------
If either analyst returns an error message starting with "❌", this indicates:
1. Invalid job name or build ID
2. Logs not available for this job/build
@@ -39,53 +63,5 @@
3. Suggest the user try a different, more recent job
4. Provide the manual check URL for user verification

CI JOB ANALYSIS WORKFLOW:
-------------------------
When analyzing a job failure, follow this MANDATORY workflow for every job analysis:
1. ALWAYS start with installation analysis to understand the cluster setup
2. ALWAYS perform e2e test analysis to identify test failures and patterns
3. Only if needed for deeper insights, check the must-gather logs for more detailed cluster information

IMPORTANT: Steps 1 and 2 are MANDATORY for every job analysis request. Do not skip e2e analysis.

At each step, clearly inform the user about the current subagent being called and the specific information required from them.
After each subagent completes its task, explain the output provided and how it contributes to the overall root cause analysis process.
Ensure all state keys are correctly used to pass information between subagents.
Here's the step-by-step breakdown.
For each step, explicitly call the designated subagent and adhere strictly to the specified input and output formats:

* Installation Analysis (Subagent: installation_analyst) - MANDATORY

Input: Prompt the user to provide the link to the prow job they wish to analyze.
Action: Parse the URL for the job_name and build_id. Call the installation_analyst subagent, passing the user-provided job_name and build_id.
Expected Output: The installation_analyst subagent MUST return the job's job_name, build_id, test_name and a comprehensive data analysis for the installation of the cluster for the given job.

* E2E Test Analysis (Subagent: e2e_test_analyst) - MANDATORY

Input: The installation_analysis_output from the installation_analyst subagent.
Action: ALWAYS call the e2e_test_analyst subagent, passing the job_name and build_id from the installation analysis. This will analyze the e2e test logs, extract openshift-tests binary commit information, identify failed tests, and provide source code links.
Expected Output: The e2e_test_analyst subagent MUST return a comprehensive analysis of the e2e test execution, including:
- openshift-tests binary commit information and source code links
- Failed test details with GitHub links to test source code
- Test execution patterns and performance insights
- Root cause analysis of test failures

* Must_Gather Analysis (Subagent: mustgather_analyst) - OPTIONAL

Input: The installation_analysis_output from the installation_analyst subagent. Use /tmp/must-gather as the target_folder for the must-gather directory.
Action: Only call if additional cluster-level debugging is needed. Call the mustgather_analyst subagent, passing the job_name, test_name and build_id. Download the must-gather logs: use /tmp/must-gather as the target_folder. Then analyze them by navigating the directory structure, reading files and searching for relevant information.
Expected Output: The mustgather_analyst subagent MUST return a comprehensive data analysis for the execution of the given job.

WORKFLOW EXECUTION:
1. Parse the Prow job URL to extract job_name and build_id
2. Call installation_analyst with job_name and build_id
3. IMMEDIATELY call e2e_test_analyst with the same job_name and build_id
4. Provide a comprehensive summary combining both analyses
5. Only call mustgather_analyst if specifically requested or if deeper analysis is needed

IMPORTANT NOTES:
- If any analyst returns an error (starting with "❌"), acknowledge the error and provide the suggested troubleshooting steps
- Always include the manual check URLs provided by the analysts for user verification
- If logs are not available, suggest the user try a more recent job or verify the URL is correct
- Provide clear, actionable recommendations based on the available analysis
"""
62 changes: 42 additions & 20 deletions sub_agents/installation_analyst/agent.py
@@ -6,6 +6,7 @@

import asyncio
import httpx
import re
import threading
import concurrent.futures
import re
@@ -30,8 +31,8 @@ def extract_installation_info(log_content: str) -> Dict[str, Any]:

# Extract openshift-install version and commit (can be on separate lines)
version_patterns = [
r'openshift-install v([^\s"]+)',
r'"openshift-install v([^\s"]+)"'
r'openshift-install v([0-9][^\s"]+)',
r'"openshift-install v([0-9][^\s"]+)"'
]

for pattern in version_patterns:
@@ -135,36 +136,57 @@ def extract_installation_info(log_content: str) -> Dict[str, Any]:

return install_info
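
The tightened patterns above require the version to begin with a digit, presumably so that text such as "openshift-install version" no longer produces a bogus match; a quick illustration with made-up log lines:

```python
import re

pattern = r'openshift-install v([0-9][^\s"]+)'

# Illustrative log lines, not taken from a real build-log.txt
assert re.search(pattern, 'openshift-install v4.20.0-ec.2').group(1) == "4.20.0-ec.2"
assert re.search(pattern, 'openshift-install version') is None  # non-version text no longer matches
```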

def get_job_metadata(raw_data: str) -> Dict[str, Any]:
# Extract test_name from data using regex
test_name = None
match_test_name = re.search(r"Running multi-stage test ([^\s]*)", raw_data)
if match_test_name:
test_name = match_test_name.group(1)
print(f"test_name: {test_name}")
status = None
match_status = re.search(r"Reporting job state '([^']*)'", raw_data)
if match_status:
status = match_status.group(1)
print(f"status: {status}")
data = {
"status": status,
"test_name": test_name,
}

match_reason = re.search(r"Reporting job state '([^']*)' with reason '([^']*)'", raw_data)
if match_reason:
status = match_reason.group(1)
failure_reason = match_reason.group(2)
print(f"failure_reason: {failure_reason}")
data["failure_reason"] = failure_reason


return data
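
For illustration, here is what get_job_metadata extracts from a build log containing the two messages its regexes look for; the snippet below is fabricated, not taken from a real build-log.txt:

```python
sample_log = """
Running multi-stage test e2e-aws-ovn-upgrade
...
Reporting job state 'failed' with reason 'step e2e-aws-ovn-upgrade-openshift-e2e-test failed'
"""

metadata = get_job_metadata(sample_log)
# -> {"status": "failed",
#     "test_name": "e2e-aws-ovn-upgrade",
#     "failure_reason": "step e2e-aws-ovn-upgrade-openshift-e2e-test failed"}
```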
# Prow tool functions for installation analysis
async def get_job_metadata_async(job_name: str, build_id: str) -> Dict[str, Any]:
"""Get the metadata and status for a specific Prow job name and build id."""
url = f"{GCS_URL}/{job_name}/{build_id}/prowjob.json"
url = f"{GCS_URL}/{job_name}/{build_id}/build-log.txt"
try:
async with httpx.AsyncClient() as client:
response = await client.get(url)
response.raise_for_status()
data = response.json()
data = response.text

if not data:
return {"error": "No response from Prow API"}

job_spec = data.get("spec", {})
job_status = data.get("status", {})

build_id_from_status = job_status.get("build_id")
status = job_status.get("state")
args = job_spec.get("pod_spec", {}).get("containers", [])[0].get("args", [])
test_name = ""
for arg in args:
if arg.startswith("--target="):
test_name = arg.replace("--target=", "")

return {
"status": status,
"build_id": build_id_from_status,
print(f"was able to get build-log.txt")
somedata = get_job_metadata(data)
metadata = {
"build_id": build_id,
"job_name": job_name,
"test_name": test_name
"test_name": somedata["test_name"],
# "job_overall_status": somedata["status"],
# "job_overall_failure_reason": somedata["failure_reason"]
}
print(f"metadata: {metadata}")

return metadata


except Exception as e:
return {"error": f"Failed to fetch job info: {str(e)}"}
4 changes: 2 additions & 2 deletions sub_agents/installation_analyst/prompt.py
@@ -49,8 +49,8 @@ def get_user_prompt():
- get_install_logs: Fetch and analyze build-log.txt with structured information extraction

ANALYSIS WORKFLOW:
1. Start with job metadata to understand the test context
2. Fetch installation logs from build-log.txt which automatically extracts:
1. Call the get_job_metadata tool, which provides the test_name
2. Fetch installation logs from build-log.txt (by calling the get_install_logs tool with build_id, job_name, and test_name), which automatically extracts:
- Installer binary version and commit
- Instance types and cluster configuration
- Installation duration and success status