Integrate known failure pattern classification into RCA skill #13
PalmPalm7 wants to merge 4 commits into redhat-et:main
Conversation
Add optional error classification step that matches job errors against a curated YAML of regex-based failure patterns. This gives Claude verified ground-truth categories instead of guessing, using a standardized 8-category taxonomy (platform_failure, connectivity_failure, authentication_failure, resource_failure, timeout_failure, etc.). The known_failed.yaml is fetched at runtime from a configurable URL or local path — nothing is vendored.

Configure via:
- CLI: --known-failures-url <url> or --known-failures-file <path>
- Env: KNOWN_FAILED_YAML_URL or KNOWN_FAILED_YAML in .claude/settings.local.json

Classification is fully optional — the pipeline works without it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 Walkthrough
Adds a regex-based error classification feature that loads known-failure patterns from YAML (URL or local, with caching) and integrates a classification step into the CLI pipeline producing classification.json.
Sequence Diagram

sequenceDiagram
participant CLI
participant Classifier as classify.py
participant Fetcher as YAML Fetcher
participant Cache as Local Cache
participant Matcher as Regex Matcher
participant Output as classification.json
CLI->>Classifier: analyze (with known-failures url/file)
Classifier->>Fetcher: resolve_known_failures(url, local_path)
Fetcher->>Cache: check cache file (_CACHE_FILE)
alt cached content available
Cache-->>Fetcher: return cached YAML
else fetch required
Fetcher->>Fetcher: requests.get(url) (may use GH token)
Fetcher->>Cache: write downloaded YAML
end
Fetcher-->>Classifier: parsed failures list
CLI->>Classifier: classify_job_errors(job_context, correlation, patterns)
Classifier->>Matcher: for each unique error message -> classify_error(message, patterns)
loop per pattern
Matcher->>Matcher: re.search(pattern, message, IGNORECASE|DOTALL)
alt match
Matcher-->>Classifier: classification {error_category, matched_pattern, description}
end
end
Classifier->>Output: write {patterns_loaded, matches} to classification.json
Output-->>CLI: success / results
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
The `requests` import in classify.py was top-level, causing ModuleNotFoundError in CI since it's not in pyproject.toml deps. Move it to a lazy import inside fetch_known_failures_from_url() where it's actually needed. Add PyYAML to pyproject.toml since classify.py directly uses it. Fix ruff I001 import sorting in test_classify.py. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
skills/root-cause-analysis/scripts/cli.py (1)
10-22: ⚠️ Potential issue | 🟠 Major: Direct-execution support is incomplete.
The conditional imports on lines 10-27 correctly support both direct and module execution, but `cmd_query()` uses a relative import (`from .splunk_client import SplunkClient` on line 316) that will fail when the script is run directly. When `python scripts/cli.py query ...` is executed, `__package__` will be `None`, causing the relative import to raise an error. The `query` command will not work in direct-execution mode.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/root-cause-analysis/scripts/cli.py` around lines 10 - 22, The cmd_query function currently does a relative import "from .splunk_client import SplunkClient" which fails when the script is executed directly; update the import resolution to match the top-of-file pattern by using a conditional/try-import fallback so SplunkClient is imported correctly in both direct and module execution modes (e.g., attempt relative import first, and if that fails import using the top-level path or vice versa), and apply this change where SplunkClient is referenced (cmd_query) so the query command works for both "python scripts/cli.py" and "python -m scripts.cli".
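The try/except import fallback this prompt describes can be sketched as follows. The `splunk_client` stub module here is a stand-in so the snippet runs alone; the real cli.py would simply wrap its existing `SplunkClient` import this way:

```python
import sys
import types

# Stand-in for the real scripts/splunk_client.py so this sketch is
# self-contained; the actual module defines the real SplunkClient.
fake = types.ModuleType("splunk_client")


class _StubSplunkClient:
    """Hypothetical stub; the real client talks to Splunk."""


fake.SplunkClient = _StubSplunkClient
sys.modules["splunk_client"] = fake


def import_splunk_client():
    """Resolve SplunkClient in both module and direct execution modes."""
    try:
        # Module mode: "python -m scripts.cli" sets __package__
        from .splunk_client import SplunkClient
    except ImportError:
        # Direct mode: __package__ is None, so the relative import fails
        from splunk_client import SplunkClient
    return SplunkClient


client_cls = import_splunk_client()
```

This mirrors the conditional-import pattern the reviewer says already exists at the top of cli.py, applied at the one call site that still uses a bare relative import.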
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pyproject.toml`:
- Around line 23-26: The pyproject.toml dependencies list is missing the runtime
dependency for the requests library used by fetch_known_failures_from_url() in
classify.py; add "requests>=2.0.0" (or an appropriate minimum version) to the
dependencies array so installations include requests and the URL-backed
known-failures code path (e.g., --known-failures-url / KNOWN_FAILED_YAML_URL)
does not raise ModuleNotFoundError.
In `@skills/root-cause-analysis/scripts/classify.py`:
- Around line 19-49: fetch_known_failures_from_url currently writes every
fetched YAML to a single global _CACHE_FILE so a later failed fetch for a
different URL can incorrectly return another URL's patterns; change the caching
to be URL-specific by deriving a per-source cache filename (e.g., hash the url
or sanitize it) inside _CACHE_DIR, write/read that per-URL cache instead of
_CACHE_FILE, and when falling back in the except block call load_known_failures
with the per-URL cache path (ensure _CACHE_DIR.mkdir remains and keep
headers/github token logic intact).
- Around line 64-69: The YAML loader in _parse_yaml_content should validate the
parsed shape before using it: ensure the top-level result is a dict and that the
"failures" key is a list of dicts; if not, return an empty list. Update
_parse_yaml_content to treat non-dict top-level values (e.g., [], str, int) as
invalid, treat missing or non-list "failures" as empty, and filter the failures
list to include only dict entries so downstream classify_error() never receives
unexpected types.
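The shape validation described above can be sketched on already-parsed data (the real `_parse_yaml_content` would run PyYAML first; `validate_failures` is a hypothetical helper name):

```python
def validate_failures(data: object) -> list[dict]:
    """Validate the parsed YAML shape per the review comment.

    Returns [] unless data is a dict whose "failures" key is a list;
    non-dict entries in that list are filtered out.
    """
    if not isinstance(data, dict):
        return []  # top-level [], str, int, None -> invalid
    failures = data.get("failures")
    if not isinstance(failures, list):
        return []  # missing or non-list "failures" -> empty
    return [entry for entry in failures if isinstance(entry, dict)]


# Valid shape keeps dict entries only
ok = validate_failures({"failures": [{"pattern": "x"}, "stray-string", 7]})
# Invalid top-level shapes all collapse to []
bad = [validate_failures(v) for v in ([], "text", 42, None)]
```

With this guard in place, downstream classify_error() only ever iterates over dict entries.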
In `@skills/root-cause-analysis/scripts/cli.py`:
- Around line 209-237: The code currently skips writing classification.json when
no known failures are configured; change the logic so classification_path is
always written with a stable payload. Specifically, when
resolve_known_failures(...) returns falsy, set classifications = [] and
patterns_loaded = 0 (or compute patterns_loaded = len(known_failures) in the
truthy case), then always build classification_result = {"patterns_loaded":
patterns_loaded, "matches": classifications} and json.dump it to
classification_path; keep the existing prints for matched vs no matches but
ensure the file is created even in the else branch — update calls around
resolve_known_failures, classify_job_errors, classification_path, and
classification_result accordingly.
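A sketch of the always-write behavior requested here, using a hypothetical `write_classification` helper (the real logic lives inline in cli.py):

```python
import json
import tempfile
from pathlib import Path


def write_classification(path: Path, known_failures: list[dict],
                         classifications: list[dict]) -> dict:
    """Always write classification.json with a stable payload shape.

    Even with zero configured patterns, the file is created so Step 5
    can rely on {"patterns_loaded": N, "matches": [...]} existing.
    """
    result = {
        "patterns_loaded": len(known_failures),
        "matches": classifications if known_failures else [],
    }
    path.write_text(json.dumps(result, indent=2))
    return result


out = Path(tempfile.mkdtemp()) / "classification.json"
# No patterns configured: the file is still created, with an empty payload
payload = write_classification(out, [], [])
```

The point of the change is the pipeline contract: consumers never need to branch on whether the file exists.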
In `@skills/root-cause-analysis/SKILL.md`:
- Line 100: The fallback category list in the "Root Cause" instruction of
SKILL.md includes invalid enums ("workload_bug" and "credential") that don't
match schemas/summary.schema.json and will produce invalid
step5_analysis_summary.json; update the fallback list in the "Root Cause" line
to only include enum values that exist in schemas/summary.schema.json (remove or
replace "workload_bug" and "credential" with the schema-approved equivalents),
and ensure any example outputs or references to step5_analysis_summary.json are
updated accordingly and validate against schemas/summary.schema.json.
---
Outside diff comments:
In `@skills/root-cause-analysis/scripts/cli.py`:
- Around line 10-22: The cmd_query function currently does a relative import
"from .splunk_client import SplunkClient" which fails when the script is
executed directly; update the import resolution to match the top-of-file pattern
by using a conditional/try-import fallback so SplunkClient is imported correctly
in both direct and module execution modes (e.g., attempt relative import first,
and if that fails import using the top-level path or vice versa), and apply this
change where SplunkClient is referenced (cmd_query) so the query command works
for both "python scripts/cli.py" and "python -m scripts.cli".
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: cbad026d-413d-4ac5-a4d5-8d718dd66cb3
📒 Files selected for processing (7)
pyproject.toml
skills/root-cause-analysis/README.md
skills/root-cause-analysis/SKILL.md
skills/root-cause-analysis/schemas/summary.schema.json
skills/root-cause-analysis/scripts/classify.py
skills/root-cause-analysis/scripts/cli.py
skills/root-cause-analysis/tests/test_classify.py
```python
# Cache dir for downloaded known_failed.yaml
_CACHE_DIR = Path(tempfile.gettempdir()) / "rhdp-rca"
_CACHE_FILE = _CACHE_DIR / "known_failed.yaml"


def fetch_known_failures_from_url(url: str) -> list[dict]:
    """Fetch known failure patterns YAML from a URL.

    Caches the file locally. Returns the parsed failures list.
    """
    _CACHE_DIR.mkdir(parents=True, exist_ok=True)

    import requests

    headers = {}
    github_token = os.environ.get("GITHUB_TOKEN", "")
    if github_token and "api.github.com" in url:
        headers["Authorization"] = f"token {github_token}"
        headers["Accept"] = "application/vnd.github.v3.raw"

    try:
        resp = requests.get(url, headers=headers, timeout=15)
        resp.raise_for_status()
        _CACHE_FILE.write_text(resp.text)
        return _parse_yaml_content(resp.text)
    except (requests.RequestException, yaml.YAMLError) as e:
        # Fall back to cache if fetch fails
        if _CACHE_FILE.exists():
            return load_known_failures(_CACHE_FILE)
        print(f"  Warning: Failed to fetch known failure patterns: {e}")
        return []
```
Cache the YAML per source, not in one global file.
_CACHE_FILE is shared across every configured URL. If URL A is cached and URL B later fails to fetch, this code silently classifies B with A’s patterns and returns the wrong “verified” category set. Use a URL-derived cache key, or persist the source URL alongside the cache before falling back.
🧰 Tools
🪛 GitHub Actions: CI
[error] 31-31: mypy: Library stubs not installed for "requests" [import-untyped]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@skills/root-cause-analysis/scripts/classify.py` around lines 19 - 49,
fetch_known_failures_from_url currently writes every fetched YAML to a single
global _CACHE_FILE so a later failed fetch for a different URL can incorrectly
return another URL's patterns; change the caching to be URL-specific by deriving
a per-source cache filename (e.g., hash the url or sanitize it) inside
_CACHE_DIR, write/read that per-URL cache instead of _CACHE_FILE, and when
falling back in the except block call load_known_failures with the per-URL cache
path (ensure _CACHE_DIR.mkdir remains and keep headers/github token logic
intact).
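A URL-derived cache key along the lines the comment suggests could look like this sketch (hash-based filename; only `_CACHE_DIR` is taken from the snippet above, `cache_path_for` is a hypothetical helper):

```python
import hashlib
import tempfile
from pathlib import Path

_CACHE_DIR = Path(tempfile.gettempdir()) / "rhdp-rca"


def cache_path_for(url: str) -> Path:
    """Derive a per-source cache file from the URL.

    With a distinct file per URL, a failed fetch of URL B can never
    silently fall back to URL A's cached patterns.
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return _CACHE_DIR / f"known_failed_{digest}.yaml"


a = cache_path_for("https://example.com/a/known_failed.yaml")
b = cache_path_for("https://example.com/b/known_failed.yaml")
```

The digest is deterministic, so repeated runs against the same URL reuse the same cache file, while different URLs can never collide in practice.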
Valid point — the single global cache file could cause cross-contamination if multiple URLs are used. In practice this is low risk since the configuration pattern is one URL per environment, but it is a real correctness issue. Tracking as a follow-up improvement. For now the single-URL assumption holds for all current usage.
@PalmPalm7, understood — the single-URL-per-environment assumption makes the risk negligible for current usage. Would you like me to open a GitHub issue to track the per-URL cache key improvement so it doesn't get lost?
…n, schema-aligned categories
- Validate YAML shape in _parse_yaml_content: reject non-dict top-level, non-list failures, and non-dict entries
- Always emit classification.json even when no patterns are configured, so Step 5 has a stable pipeline contract
- Fix SKILL.md fallback categories to match summary.schema.json enum (workload_bug → application_bug, credential → secrets)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Re: CodeRabbit outside-diff comment on
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@skills/root-cause-analysis/SKILL.md`:
- Line 80: The doc is contradictory about classification.json being "REQUIRED"
yet "if present"; update the text in SKILL.md to state classification.json is an
unconditional/deterministic input (since scripts/cli.py always writes it),
remove "if present", and clearly instruct: always consult classification.json
for known failure pattern matches and use its error_category when a match
exists, otherwise mark the failure as novel/unclassified; reference
classification.json and scripts/cli.py in the text so readers know the source
and determinism.
- Line 100: The preferred category list in SKILL.md omits `general_failure`,
causing mismatch with scripts/classify.py which may emit `general_failure` for
matched patterns; update the preferred categories line (the list containing
`platform_failure|connectivity_failure|authentication_failure|resource_failure|timeout_failure|automation_failure|infrastructure_failure`)
to include `general_failure` among the preferred classifications so that
`general_failure` is treated as a primary match and not force fallback
handling—ensure the doc text still mentions the fallback set
(`configuration|infrastructure|application_bug|secrets|resource|dependency`)
remains unchanged and keep summary and confidence requirements intact.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 6e94fc05-c7c8-44d4-b0c9-9815a9472120
📒 Files selected for processing (3)
skills/root-cause-analysis/SKILL.md
skills/root-cause-analysis/scripts/classify.py
skills/root-cause-analysis/scripts/cli.py
🚧 Files skipped from review as they are similar to previous changes (2)
- skills/root-cause-analysis/scripts/cli.py
- skills/root-cause-analysis/scripts/classify.py
2. **REQUIRED**: `step3_correlation.json` - Correlated timeline with relevant pod logs (DO NOT read step2 unless needed)
3. **REQUIRED**: `step4_github_fetch_history.json` - Configuration and code context
4. **CONDITIONAL**: `step2_splunk_logs.json` - Only read if step3 indicates errors needing deeper investigation
4. **REQUIRED**: `classification.json` - Known failure pattern matches (if present). Use these verified categories instead of guessing. If a match exists, use its `error_category` as the root cause category. If no matches, flag as novel/unclassified failure.
Clarify classification.json as unconditionally required input.
Line 80 is internally contradictory (REQUIRED vs “if present”). scripts/cli.py always writes classification.json, so Step 5 guidance should be deterministic.
Proposed doc fix
-4. **REQUIRED**: `classification.json` - Known failure pattern matches (if present). Use these verified categories instead of guessing. If a match exists, use its `error_category` as the root cause category. If no matches, flag as novel/unclassified failure.
+4. **REQUIRED**: `classification.json` - Known failure pattern matching result. Always read this file. If a match exists, use its `error_category` as the root cause category. If no matches, treat as novel/unclassified failure.

As per coding guidelines, “Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@skills/root-cause-analysis/SKILL.md` at line 80, The doc is contradictory
about classification.json being "REQUIRED" yet "if present"; update the text in
SKILL.md to state classification.json is an unconditional/deterministic input
(since scripts/cli.py always writes it), remove "if present", and clearly
instruct: always consult classification.json for known failure pattern matches
and use its error_category when a match exists, otherwise mark the failure as
novel/unclassified; reference classification.json and scripts/cli.py in the text
so readers know the source and determinism.
### Summary Requirements

-1. **Root Cause**: Category (`configuration|infrastructure|workload_bug|credential|resource|dependency`), summary, confidence
+1. **Root Cause**: Category — prefer `classification.json` categories when matched (`platform_failure|connectivity_failure|authentication_failure|resource_failure|timeout_failure|automation_failure|infrastructure_failure`). Fall back to (`configuration|infrastructure|application_bug|secrets|resource|dependency`) only for novel/unclassified errors. Include summary and confidence.
Include general_failure in preferred classification categories.
Line 100’s preferred list omits general_failure, but scripts/classify.py can emit general_failure as the default category for a matched pattern entry. That mismatch can force unnecessary fallback handling and weaken category consistency.
Proposed doc fix
-1. **Root Cause**: Category — prefer `classification.json` categories when matched (`platform_failure|connectivity_failure|authentication_failure|resource_failure|timeout_failure|automation_failure|infrastructure_failure`). Fall back to (`configuration|infrastructure|application_bug|secrets|resource|dependency`) only for novel/unclassified errors. Include summary and confidence.
+1. **Root Cause**: Category — prefer `classification.json` categories when matched (`platform_failure|connectivity_failure|authentication_failure|resource_failure|timeout_failure|automation_failure|infrastructure_failure|general_failure`). Fall back to (`configuration|infrastructure|application_bug|secrets|resource|dependency`) only for novel/unclassified errors. Include summary and confidence.

As per coding guidelines, “Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@skills/root-cause-analysis/SKILL.md` at line 100, The preferred category list
in SKILL.md omits `general_failure`, causing mismatch with scripts/classify.py
which may emit `general_failure` for matched patterns; update the preferred
categories line (the list containing
`platform_failure|connectivity_failure|authentication_failure|resource_failure|timeout_failure|automation_failure|infrastructure_failure`)
to include `general_failure` among the preferred classifications so that
`general_failure` is treated as a primary match and not force fallback
handling—ensure the doc text still mentions the fallback set
(`configuration|infrastructure|application_bug|secrets|resource|dependency`)
remains unchanged and keep summary and confidence requirements intact.
Summary
- known_failed.yaml is configurable via KNOWN_FAILED_YAML in the Claude Code settings.json env block, with fallback to bundled data/known_failed.yaml

Changes

- scripts/classify.py (~80 LOC) — pattern matching logic adapted from /classify_errors.py
- scripts/cli.py — integrates classification after Step 4, outputs classification.json
- SKILL.md — instructs Claude to read classification.json and prefer verified categories
- tests/test_classify.py — 9 new tests

Test plan

- classification.json categories in Step 5 summary

Upstream target: redhat-et/rhdp-rca-plugin (open PR there once permissions are granted)

🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Tests
Chores