Integrate known failure pattern classification into RCA skill #13
Base: main
Changes from all commits: e51712b, 63b1bd7, b6fc32e, 6715cee
```diff
@@ -44,7 +44,7 @@ The `cli.py analyze` command automatically runs all steps:
 - **Step 3**: Correlate → Merge AAP and Splunk events into unified timeline
 - **Step 4**: Fetch GitHub files → Parse job metadata, fetch AgnosticV configs and AgnosticD workload code (requires `GITHUB_TOKEN` to be configured)

-**Outputs**: `.analysis/<job-id>/step1_job_context.json`, `step2_splunk_logs.json`, `step3_correlation.json`, `step4_github_fetch_history.json`
+**Outputs**: `.analysis/<job-id>/step1_job_context.json`, `step2_splunk_logs.json`, `step3_correlation.json`, `step4_github_fetch_history.json`, `classification.json`

 This skill automatically searches for job logs in the configured `JOB_LOGS_DIR`.
```
```diff
@@ -77,7 +77,8 @@ python3 -m venv .venv
 1. **REQUIRED**: `step1_job_context.json` - Job metadata and failed task details
 2. **REQUIRED**: `step3_correlation.json` - Correlated timeline with relevant pod logs (DO NOT read step2 unless needed)
 3. **REQUIRED**: `step4_github_fetch_history.json` - Configuration and code context
-4. **CONDITIONAL**: `step2_splunk_logs.json` - Only read if step3 indicates errors needing deeper investigation
+4. **REQUIRED**: `classification.json` - Known failure pattern matches (if present). Use these verified categories instead of guessing. If a match exists, use its `error_category` as the root cause category. If no matches, flag as novel/unclassified failure.
```
**Review comment:** Line 80 is internally contradictory: the file is marked **REQUIRED** but qualified with "(if present)". Proposed doc fix:

```diff
-4. **REQUIRED**: `classification.json` - Known failure pattern matches (if present). Use these verified categories instead of guessing. If a match exists, use its `error_category` as the root cause category. If no matches, flag as novel/unclassified failure.
+4. **REQUIRED**: `classification.json` - Known failure pattern matching result. Always read this file. If a match exists, use its `error_category` as the root cause category. If no matches, treat as novel/unclassified failure.
```

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
```diff
+5. **CONDITIONAL**: `step2_splunk_logs.json` - Only read if step3 indicates errors needing deeper investigation

 **Output**: `.analysis/<job-id>/step5_analysis_summary.json`
```
```diff
@@ -96,7 +97,7 @@ python3 -m venv .venv
 ### Summary Requirements

-1. **Root Cause**: Category (`configuration|infrastructure|workload_bug|credential|resource|dependency`), summary, confidence
+1. **Root Cause**: Category — prefer `classification.json` categories when matched (`platform_failure|connectivity_failure|authentication_failure|resource_failure|timeout_failure|automation_failure|infrastructure_failure`). Fall back to (`configuration|infrastructure|application_bug|secrets|resource|dependency`) only for novel/unclassified errors. Include summary and confidence.
```
**Review comment:** Line 100's preferred list omits `general_failure`, which `classify_error` uses as the default category. Proposed doc fix:

```diff
-1. **Root Cause**: Category — prefer `classification.json` categories when matched (`platform_failure|connectivity_failure|authentication_failure|resource_failure|timeout_failure|automation_failure|infrastructure_failure`). Fall back to (`configuration|infrastructure|application_bug|secrets|resource|dependency`) only for novel/unclassified errors. Include summary and confidence.
+1. **Root Cause**: Category — prefer `classification.json` categories when matched (`platform_failure|connectivity_failure|authentication_failure|resource_failure|timeout_failure|automation_failure|infrastructure_failure|general_failure`). Fall back to (`configuration|infrastructure|application_bug|secrets|resource|dependency`) only for novel/unclassified errors. Include summary and confidence.
```

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
```diff
 2. **Evidence**: Supporting evidence from AAP logs, Splunk logs, and GitHub configs/code
    - **REQUIRED**: When `source` is `agnosticv_config` or `agnosticd_code`, **MUST** include `github_path` in format `owner/repo:path/to/file.yml:line`
    - Extract GitHub paths from step4:
```
```diff
@@ -180,6 +181,7 @@ See `schemas/summary.schema.json` for complete structure. Example:
 | 2 | `step2_splunk_logs.json` | Python |
 | 3 | `step3_correlation.json` | Python |
 | 4 | `step4_github_fetch_history.json` | Python (Optional Claude updates for MCP verification) |
+| — | `classification.json` | Python (known failure pattern matching) |
 | 5 | `step5_analysis_summary.json` | Claude |

 All files in `.analysis/<job-id>/`
```
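For orientation, a matched entry written to `classification.json` carries the fields assembled by the classifier (`error_category`, `matched_pattern`, `failure_description`, plus `source` and `task` metadata). A sketch with hypothetical values:

```python
# Illustrative shape of one classification.json match entry. Field names
# come from classify_error/classify_job_errors; the values are hypothetical.
match_entry = {
    "error_category": "authentication_failure",  # pattern's category
    "matched_pattern": r"401 Unauthorized",      # regex that matched
    "failure_description": "Credentials rejected by the API",
    "source": "aap_failed_task",                 # or "correlation_timeline"
    "task": "Provision sandbox account",         # set only for AAP failed tasks
}
```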
New file (`@@ -0,0 +1,170 @@`):

```python
"""Classify error messages against known failure patterns.

Loads a curated YAML file of regex-based error patterns and matches them
against error messages from RCA steps 1 and 3. The YAML file can be
provided via URL, local file path, or CLI flags.

Configuration (in .claude/settings.local.json env block):
    KNOWN_FAILED_YAML_URL — URL to fetch the YAML file (cached locally)
    KNOWN_FAILED_YAML — local file path (fallback)
"""

import os
import re
import tempfile
from pathlib import Path

import yaml

# Cache dir for downloaded known_failed.yaml
_CACHE_DIR = Path(tempfile.gettempdir()) / "rhdp-rca"
_CACHE_FILE = _CACHE_DIR / "known_failed.yaml"


def fetch_known_failures_from_url(url: str) -> list[dict]:
    """Fetch known failure patterns YAML from a URL.

    Caches the file locally. Returns the parsed failures list.
    """
    _CACHE_DIR.mkdir(parents=True, exist_ok=True)

    import requests

    headers = {}
    github_token = os.environ.get("GITHUB_TOKEN", "")
    if github_token and "api.github.com" in url:
        headers["Authorization"] = f"token {github_token}"
        headers["Accept"] = "application/vnd.github.v3.raw"

    try:
        resp = requests.get(url, headers=headers, timeout=15)
        resp.raise_for_status()
        _CACHE_FILE.write_text(resp.text)
        return _parse_yaml_content(resp.text)
    except (requests.RequestException, yaml.YAMLError) as e:
        # Fall back to cache if fetch fails
        if _CACHE_FILE.exists():
            return load_known_failures(_CACHE_FILE)
        print(f"  Warning: Failed to fetch known failure patterns: {e}")
        return []
```
**Review comment** (on lines +19 to +49): Cache the YAML per source, not in one global file.

CI (GitHub Actions): `[error] 31-31: mypy: Library stubs not installed for "requests" [import-untyped]`

**Author reply:** Valid point — the single global cache file could cause cross-contamination if multiple URLs are used. In practice this is low risk since the configuration pattern is one URL per environment, but it is a real correctness issue. Tracking as a follow-up improvement. For now the single-URL assumption holds for all current usage.
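The module docstring above refers to a curated YAML file with a `failures` list. A minimal sketch of the expected shape and the filtering `_parse_yaml_content` applies, shown here on an already-parsed dict (what `yaml.safe_load` would return); the pattern values are hypothetical:

```python
# Hypothetical parsed contents of known_failed.yaml: a top-level "failures"
# list of entries with error_string (regex), category, and description.
parsed = {
    "failures": [
        {
            "error_string": r"Timeout waiting for .* to become ready",
            "category": "timeout_failure",
            "description": "Resource did not become ready in time",
        },
        "malformed entry that should be dropped",  # not a dict
    ]
}


def extract_failures(data: object) -> list[dict]:
    """Mirror of the module's _parse_yaml_content filtering logic."""
    if not isinstance(data, dict):
        return []
    failures = data.get("failures", [])
    if not isinstance(failures, list):
        return []
    return [f for f in failures if isinstance(f, dict)]


failures = extract_failures(parsed)
# Non-dict entries are filtered out, leaving one valid pattern.
```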
```python
def load_known_failures(yaml_path: str | Path) -> list[dict]:
    """Load known failure patterns from a local YAML file."""
    path = Path(yaml_path)
    if not path.exists():
        return []
    try:
        with open(path) as f:
            return _parse_yaml_content(f.read())
    except (yaml.YAMLError, OSError):
        return []


def _parse_yaml_content(content: str) -> list[dict]:
    """Parse YAML content and extract the failures list."""
    data = yaml.safe_load(content)
    if not isinstance(data, dict):
        return []
    failures = data.get("failures", [])
    if not isinstance(failures, list):
        return []
    return [f for f in failures if isinstance(f, dict)]


def classify_error(error_message: str, known_failures: list[dict]) -> dict | None:
    """Match an error message against known failure patterns.

    Returns a dict with classification info on match, or None.
    """
    if not error_message or not known_failures:
        return None

    error_message = error_message.strip()

    for failure in known_failures:
        pattern = failure.get("error_string", "")
        if not pattern:
            continue
        try:
            if re.search(pattern, error_message, re.IGNORECASE | re.DOTALL):
                return {
                    "error_category": failure.get("category", "general_failure"),
                    "matched_pattern": pattern,
                    "failure_description": failure.get("description", ""),
                }
        except re.error:
            continue

    return None


def classify_job_errors(
    job_context: dict, correlation: dict, known_failures: list[dict]
) -> list[dict]:
    """Classify all error messages found in step1 and step3 outputs.

    Returns a list of classification results (one per matched error).
    """
    results: list[dict] = []
    seen_messages: set[str] = set()

    # Collect error messages from step1 failed tasks
    for task in job_context.get("failed_tasks", []):
        msg = task.get("error_message", "")
        if msg and msg not in seen_messages:
            seen_messages.add(msg)
            match = classify_error(msg, known_failures)
            if match:
                match["source"] = "aap_failed_task"
                match["task"] = task.get("task", "")
                results.append(match)

    # Collect error messages from step3 timeline events.
    # Timeline events store messages in details.message (for both aap_job
    # and splunk_ocp events) or details.error_message, not at the top level.
    for event in correlation.get("timeline_events", []):
        details = event.get("details", {})
        msg = details.get("message", "") or details.get("error_message", "")
        if msg and msg not in seen_messages:
            seen_messages.add(msg)
            match = classify_error(msg, known_failures)
            if match:
                match["source"] = "correlation_timeline"
                results.append(match)

    return results


def resolve_known_failures(url: str | None = None, local_path: str | None = None) -> list[dict]:
    """Resolve and load known failure patterns.

    Args:
        url: URL to fetch YAML from (overrides env var)
        local_path: Local file path (overrides env var)

    Priority:
        1. Explicit url/local_path arguments (from CLI flags)
        2. KNOWN_FAILED_YAML_URL env var — fetch from URL (cached locally)
        3. KNOWN_FAILED_YAML env var — read from local file path
        4. Returns empty list if none configured
    """
    # CLI flag: URL
    if url:
        return fetch_known_failures_from_url(url)

    # CLI flag: local path
    if local_path:
        return load_known_failures(local_path)

    # Env var: URL
    env_url = os.environ.get("KNOWN_FAILED_YAML_URL", "")
    if env_url:
        return fetch_known_failures_from_url(env_url)

    # Env var: local path
    env_path = os.environ.get("KNOWN_FAILED_YAML", "")
    if env_path:
        return load_known_failures(env_path)

    return []
```