
Integrate known failure pattern classification into RCA skill#13

Open
PalmPalm7 wants to merge 4 commits into redhat-et:main from PalmPalm7:feat/known-failed-classification

Conversation

Contributor

@PalmPalm7 PalmPalm7 commented Mar 23, 2026

Summary

  • Adds an error classification step between Step 4 and Step 5 that matches job errors against known_failed.yaml
  • Gives Claude verified ground-truth categories instead of guessing, using the same category taxonomy as the RHDP ETL pipeline
  • The path to known_failed.yaml is configurable via KNOWN_FAILED_YAML in the Claude Code settings.json env block, with a fallback to the bundled data/known_failed.yaml
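For context, a known-failures file in this style might look like the following. The field names here are an assumption based on the classifier description in this PR, not the actual RHDP file:

```yaml
# Hypothetical shape of known_failed.yaml (field names are assumptions)
failures:
  - error_category: timeout_failure
    pattern: "Timeout \\(\\d+s\\) exceeded while waiting for"
    description: Provisioning step exceeded its wait timeout
  - error_category: authentication_failure
    pattern: "401 Unauthorized|invalid credentials"
    description: API call rejected due to bad credentials
```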

Changes

  • scripts/classify.py (~80 LOC) — pattern matching logic adapted from /classify_errors.py
  • scripts/cli.py — integrates classification after Step 4, outputs classification.json
  • SKILL.md — instructs Claude to read classification.json and prefer verified categories
  • tests/test_classify.py — 9 new tests

Test plan

  • All 70 tests pass (61 existing + 9 new)
  • Run against a real failed job to verify classification output
  • Verify Claude uses classification.json categories in Step 5 summary

Upstream target: redhat-et/rhdp-rca-plugin (open PR there once permissions are granted)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added error classification that matches errors against configurable known-failure patterns and writes a classification.json artifact
    • Expanded root-cause categories with more specific failure types
    • Added CLI options and env var support for providing known-failures (with local cache/fallback)
  • Documentation

    • Updated skill README and docs to describe the classification step and classification.json output
  • Tests

    • Added tests covering classification logic and known-failure loading
  • Chores

    • Added PyYAML>=6.0 to project dependencies

Add optional error classification step that matches job errors against
a curated YAML of regex-based failure patterns. This gives Claude
verified ground-truth categories instead of guessing, using a
standardized 8-category taxonomy (platform_failure, connectivity_failure,
authentication_failure, resource_failure, timeout_failure, etc.).

The known_failed.yaml is fetched at runtime from a configurable URL
or local path — nothing is vendored. Configure via:
- CLI: --known-failures-url <url> or --known-failures-file <path>
- Env: KNOWN_FAILED_YAML_URL or KNOWN_FAILED_YAML in
  .claude/settings.local.json

Classification is fully optional — the pipeline works without it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai bot commented Mar 23, 2026

📝 Walkthrough

Walkthrough

Adds a regex-based error classification feature that loads known-failure patterns from YAML (URL or local, with caching), integrates a classification step into the CLI pipeline producing classification.json, extends schema categories, updates docs, and adds tests.

Changes

Cohort / File(s) Summary
Dependency Management
pyproject.toml
Added PyYAML>=6.0 to project.dependencies for YAML parsing.
Documentation
skills/root-cause-analysis/README.md, skills/root-cause-analysis/SKILL.md
Documented KNOWN_FAILED_YAML_URL/file inputs, caching behavior, and new classification.json artifact; updated Step 5 to prefer classification.json categories and include summary/confidence fields.
Schema
skills/root-cause-analysis/schemas/summary.schema.json
Extended root_cause.category enum with: platform_failure, connectivity_failure, authentication_failure, resource_failure, timeout_failure, automation_failure, infrastructure_failure, general_failure.
Classification Engine
skills/root-cause-analysis/scripts/classify.py
New module: resolves known-failures from a URL (with cache) or local file, parses YAML, matches error messages with regex (case-insensitive, dot-all), skips invalid patterns, and returns classification dicts; an aggregator extracts unique errors from the job and timeline.
CLI Integration
skills/root-cause-analysis/scripts/cli.py
Added --known-failures-url and --known-failures-file args; added "Classify" phase that calls resolver and classifier, prints results/skip messages, and writes analysis_dir/classification.json.
Tests
skills/root-cause-analysis/tests/test_classify.py
New tests for YAML loading (valid/missing/blank), regex matching and skipping invalid patterns, empty-input handling, deduplication, and end-to-end classify_job_errors behavior.

Sequence Diagram

sequenceDiagram
    participant CLI
    participant Classifier as classify.py
    participant Fetcher as YAML Fetcher
    participant Cache as Local Cache
    participant Matcher as Regex Matcher
    participant Output as classification.json

    CLI->>Classifier: analyze (with known-failures url/file)
    Classifier->>Fetcher: resolve_known_failures(url, local_path)
    Fetcher->>Cache: check cache file (_CACHE_FILE)
    alt cached content available
        Cache-->>Fetcher: return cached YAML
    else fetch required
        Fetcher->>Fetcher: requests.get(url) (may use GH token)
        Fetcher->>Cache: write downloaded YAML
    end
    Fetcher-->>Classifier: parsed failures list

    CLI->>Classifier: classify_job_errors(job_context, correlation, patterns)
    Classifier->>Matcher: for each unique error message -> classify_error(message, patterns)
    loop per pattern
        Matcher->>Matcher: re.search(pattern, message, IGNORECASE|DOTALL)
        alt match
            Matcher-->>Classifier: classification {error_category, matched_pattern, description}
        end
    end
    Classifier->>Output: write {patterns_loaded, matches} to classification.json
    Output-->>CLI: success / results

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title accurately summarizes the main change: integrating error classification via known failure patterns into the RCA skill pipeline, which is the core objective across all modified files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@PalmPalm7 PalmPalm7 closed this Mar 23, 2026
@PalmPalm7 PalmPalm7 reopened this Mar 23, 2026
PalmPalm7 and others added 2 commits April 10, 2026 17:44
The `requests` import in classify.py was top-level, causing
ModuleNotFoundError in CI since it's not in pyproject.toml deps.
Move it to a lazy import inside fetch_known_failures_from_url()
where it's actually needed. Add PyYAML to pyproject.toml since
classify.py directly uses it. Fix ruff I001 import sorting in
test_classify.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@PalmPalm7 PalmPalm7 marked this pull request as ready for review April 10, 2026 09:48

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
skills/root-cause-analysis/scripts/cli.py (1)

10-22: ⚠️ Potential issue | 🟠 Major

Direct-execution support is incomplete.

The conditional imports on lines 10-27 correctly support both direct and module execution, but cmd_query() uses a relative import (from .splunk_client import SplunkClient on line 316) that will fail when the script is run directly. When python scripts/cli.py query ... is executed, __package__ will be None, causing the relative import to raise an error. The query command will not work in direct-execution mode.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/root-cause-analysis/scripts/cli.py` around lines 10 - 22, The
cmd_query function currently does a relative import "from .splunk_client import
SplunkClient" which fails when the script is executed directly; update the
import resolution to match the top-of-file pattern by using a
conditional/try-import fallback so SplunkClient is imported correctly in both
direct and module execution modes (e.g., attempt relative import first, and if
that fails import using the top-level path or vice versa), and apply this change
where SplunkClient is referenced (cmd_query) so the query command works for both
"python scripts/cli.py" and "python -m scripts.cli".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pyproject.toml`:
- Around line 23-26: The pyproject.toml dependencies list is missing the runtime
dependency for the requests library used by fetch_known_failures_from_url() in
classify.py; add "requests>=2.0.0" (or an appropriate minimum version) to the
dependencies array so installations include requests and the URL-backed
known-failures code path (e.g., --known-failures-url / KNOWN_FAILED_YAML_URL)
does not raise ModuleNotFoundError.

In `@skills/root-cause-analysis/scripts/classify.py`:
- Around line 19-49: fetch_known_failures_from_url currently writes every
fetched YAML to a single global _CACHE_FILE so a later failed fetch for a
different URL can incorrectly return another URL's patterns; change the caching
to be URL-specific by deriving a per-source cache filename (e.g., hash the url
or sanitize it) inside _CACHE_DIR, write/read that per-URL cache instead of
_CACHE_FILE, and when falling back in the except block call load_known_failures
with the per-URL cache path (ensure _CACHE_DIR.mkdir remains and keep
headers/github token logic intact).
- Around line 64-69: The YAML loader in _parse_yaml_content should validate the
parsed shape before using it: ensure the top-level result is a dict and that the
"failures" key is a list of dicts; if not, return an empty list. Update
_parse_yaml_content to treat non-dict top-level values (e.g., [], str, int) as
invalid, treat missing or non-list "failures" as empty, and filter the failures
list to include only dict entries so downstream classify_error() never receives
unexpected types.
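A defensive validation step along the lines this prompt describes might look like the following sketch. In classify.py it would run on the value returned by `yaml.safe_load`; the function name is an assumption:

```python
def validate_failures(parsed) -> list[dict]:
    """Keep only well-formed entries from a parsed known-failures document.

    Rejects non-dict top-level values, a missing or non-list "failures" key,
    and non-dict list entries, so classify_error() never sees unexpected types.
    """
    if not isinstance(parsed, dict):
        return []
    failures = parsed.get("failures")
    if not isinstance(failures, list):
        return []
    return [entry for entry in failures if isinstance(entry, dict)]


print(validate_failures({"failures": [{"pattern": "x"}, "bad", 3]}))
# → [{'pattern': 'x'}]
```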

In `@skills/root-cause-analysis/scripts/cli.py`:
- Around line 209-237: The code currently skips writing classification.json when
no known failures are configured; change the logic so classification_path is
always written with a stable payload. Specifically, when
resolve_known_failures(...) returns falsy, set classifications = [] and
patterns_loaded = 0 (or compute patterns_loaded = len(known_failures) in the
truthy case), then always build classification_result = {"patterns_loaded":
patterns_loaded, "matches": classifications} and json.dump it to
classification_path; keep the existing prints for matched vs no matches but
ensure the file is created even in the else branch — update calls around
resolve_known_failures, classify_job_errors, classification_path, and
classification_result accordingly.
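The always-write contract the prompt asks for can be sketched like this (helper name and payload keys follow the prompt's description; not the merged code):

```python
import json
import tempfile
from pathlib import Path


def write_classification(analysis_dir: Path, patterns: list[dict],
                         classifications: list[dict]) -> Path:
    """Always write classification.json so Step 5 has a stable pipeline contract."""
    result = {"patterns_loaded": len(patterns), "matches": classifications}
    out = analysis_dir / "classification.json"
    out.write_text(json.dumps(result, indent=2))
    return out


tmp = Path(tempfile.mkdtemp())
path = write_classification(tmp, [], [])  # no patterns configured: still written
print(json.loads(path.read_text()))
# → {'patterns_loaded': 0, 'matches': []}
```

Emitting `{"patterns_loaded": 0, "matches": []}` instead of skipping the file lets Step 5 read the artifact unconditionally.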

In `@skills/root-cause-analysis/SKILL.md`:
- Line 100: The fallback category list in the "Root Cause" instruction of
SKILL.md includes invalid enums ("workload_bug" and "credential") that don't
match schemas/summary.schema.json and will produce invalid
step5_analysis_summary.json; update the fallback list in the "Root Cause" line
to only include enum values that exist in schemas/summary.schema.json (remove or
replace "workload_bug" and "credential" with the schema-approved equivalents),
and ensure any example outputs or references to step5_analysis_summary.json are
updated accordingly and validate against schemas/summary.schema.json.

---

Outside diff comments:
In `@skills/root-cause-analysis/scripts/cli.py`:
- Around line 10-22: The cmd_query function currently does a relative import
"from .splunk_client import SplunkClient" which fails when the script is
executed directly; update the import resolution to match the top-of-file pattern
by using a conditional/try-import fallback so SplunkClient is imported correctly
in both direct and module execution modes (e.g., attempt relative import first,
and if that fails import using the top-level path or vice versa), and apply this
change where SplunkClient is referenced (cmd_query) so the query command works
for both "python scripts/cli.py" and "python -m scripts.cli".
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cbad026d-413d-4ac5-a4d5-8d718dd66cb3

📥 Commits

Reviewing files that changed from the base of the PR and between 4ba3f2d and b6fc32e.

📒 Files selected for processing (7)
  • pyproject.toml
  • skills/root-cause-analysis/README.md
  • skills/root-cause-analysis/SKILL.md
  • skills/root-cause-analysis/schemas/summary.schema.json
  • skills/root-cause-analysis/scripts/classify.py
  • skills/root-cause-analysis/scripts/cli.py
  • skills/root-cause-analysis/tests/test_classify.py

Comment on lines +19 to +49
# Cache dir for downloaded known_failed.yaml
_CACHE_DIR = Path(tempfile.gettempdir()) / "rhdp-rca"
_CACHE_FILE = _CACHE_DIR / "known_failed.yaml"


def fetch_known_failures_from_url(url: str) -> list[dict]:
    """Fetch known failure patterns YAML from a URL.

    Caches the file locally. Returns the parsed failures list.
    """
    _CACHE_DIR.mkdir(parents=True, exist_ok=True)

    import requests

    headers = {}
    github_token = os.environ.get("GITHUB_TOKEN", "")
    if github_token and "api.github.com" in url:
        headers["Authorization"] = f"token {github_token}"
        headers["Accept"] = "application/vnd.github.v3.raw"

    try:
        resp = requests.get(url, headers=headers, timeout=15)
        resp.raise_for_status()
        _CACHE_FILE.write_text(resp.text)
        return _parse_yaml_content(resp.text)
    except (requests.RequestException, yaml.YAMLError) as e:
        # Fall back to cache if fetch fails
        if _CACHE_FILE.exists():
            return load_known_failures(_CACHE_FILE)
        print(f"  Warning: Failed to fetch known failure patterns: {e}")
        return []

@coderabbitai coderabbitai bot Apr 10, 2026


⚠️ Potential issue | 🟠 Major

Cache the YAML per source, not in one global file.

_CACHE_FILE is shared across every configured URL. If URL A is cached and URL B later fails to fetch, this code silently classifies B with A’s patterns and returns the wrong “verified” category set. Use a URL-derived cache key, or persist the source URL alongside the cache before falling back.

🧰 Tools
🪛 GitHub Actions: CI

[error] 31-31: mypy: Library stubs not installed for "requests" [import-untyped]

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/root-cause-analysis/scripts/classify.py` around lines 19 - 49,
fetch_known_failures_from_url currently writes every fetched YAML to a single
global _CACHE_FILE so a later failed fetch for a different URL can incorrectly
return another URL's patterns; change the caching to be URL-specific by deriving
a per-source cache filename (e.g., hash the url or sanitize it) inside
_CACHE_DIR, write/read that per-URL cache instead of _CACHE_FILE, and when
falling back in the except block call load_known_failures with the per-URL cache
path (ensure _CACHE_DIR.mkdir remains and keep headers/github token logic
intact).

Contributor Author


Valid point — the single global cache file could cause cross-contamination if multiple URLs are used. In practice this is low risk since the configuration pattern is one URL per environment, but it is a real correctness issue. Tracking as a follow-up improvement. For now the single-URL assumption holds for all current usage.


@PalmPalm7, understood — the single-URL-per-environment assumption makes the risk negligible for current usage. Would you like me to open a GitHub issue to track the per-URL cache key improvement so it doesn't get lost?

…n, schema-aligned categories

- Validate YAML shape in _parse_yaml_content: reject non-dict top-level,
  non-list failures, and non-dict entries
- Always emit classification.json even when no patterns are configured,
  so Step 5 has a stable pipeline contract
- Fix SKILL.md fallback categories to match summary.schema.json enum
  (workload_bug → application_bug, credential → secrets)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@PalmPalm7
Contributor Author

Re: CodeRabbit outside-diff comment on cli.py:10-22 (direct-execution `cmd_query` import) — this is a pre-existing issue not introduced by this PR. The `from .splunk_client import SplunkClient` relative import in `cmd_query()` does need the same conditional pattern as the top-of-file imports, but since it is outside this PR's scope, it should be tracked as a separate fix.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@skills/root-cause-analysis/SKILL.md`:
- Line 80: The doc is contradictory about classification.json being "REQUIRED"
yet "if present"; update the text in SKILL.md to state classification.json is an
unconditional/deterministic input (since scripts/cli.py always writes it),
remove "if present", and clearly instruct: always consult classification.json
for known failure pattern matches and use its error_category when a match
exists, otherwise mark the failure as novel/unclassified; reference
classification.json and scripts/cli.py in the text so readers know the source
and determinism.
- Line 100: The preferred category list in SKILL.md omits `general_failure`,
causing mismatch with scripts/classify.py which may emit `general_failure` for
matched patterns; update the preferred categories line (the list containing
`platform_failure|connectivity_failure|authentication_failure|resource_failure|timeout_failure|automation_failure|infrastructure_failure`)
to include `general_failure` among the preferred classifications so that
`general_failure` is treated as a primary match and not force fallback
handling—ensure the doc text still mentions the fallback set
(`configuration|infrastructure|application_bug|secrets|resource|dependency`)
remains unchanged and keep summary and confidence requirements intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6e94fc05-c7c8-44d4-b0c9-9815a9472120

📥 Commits

Reviewing files that changed from the base of the PR and between b6fc32e and 6715cee.

📒 Files selected for processing (3)
  • skills/root-cause-analysis/SKILL.md
  • skills/root-cause-analysis/scripts/classify.py
  • skills/root-cause-analysis/scripts/cli.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • skills/root-cause-analysis/scripts/cli.py
  • skills/root-cause-analysis/scripts/classify.py

2. **REQUIRED**: `step3_correlation.json` - Correlated timeline with relevant pod logs (DO NOT read step2 unless needed)
3. **REQUIRED**: `step4_github_fetch_history.json` - Configuration and code context
4. **CONDITIONAL**: `step2_splunk_logs.json` - Only read if step3 indicates errors needing deeper investigation
4. **REQUIRED**: `classification.json` - Known failure pattern matches (if present). Use these verified categories instead of guessing. If a match exists, use its `error_category` as the root cause category. If no matches, flag as novel/unclassified failure.


⚠️ Potential issue | 🟠 Major

Clarify classification.json as unconditionally required input.

Line 80 is internally contradictory (REQUIRED vs “if present”). scripts/cli.py always writes classification.json, so Step 5 guidance should be deterministic.

Proposed doc fix
-4. **REQUIRED**: `classification.json` - Known failure pattern matches (if present). Use these verified categories instead of guessing. If a match exists, use its `error_category` as the root cause category. If no matches, flag as novel/unclassified failure.
+4. **REQUIRED**: `classification.json` - Known failure pattern matching result. Always read this file. If a match exists, use its `error_category` as the root cause category. If no matches, treat as novel/unclassified failure.

As per coding guidelines, “Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/root-cause-analysis/SKILL.md` at line 80, The doc is contradictory
about classification.json being "REQUIRED" yet "if present"; update the text in
SKILL.md to state classification.json is an unconditional/deterministic input
(since scripts/cli.py always writes it), remove "if present", and clearly
instruct: always consult classification.json for known failure pattern matches
and use its error_category when a match exists, otherwise mark the failure as
novel/unclassified; reference classification.json and scripts/cli.py in the text
so readers know the source and determinism.

### Summary Requirements

1. **Root Cause**: Category (`configuration|infrastructure|workload_bug|credential|resource|dependency`), summary, confidence
1. **Root Cause**: Category — prefer `classification.json` categories when matched (`platform_failure|connectivity_failure|authentication_failure|resource_failure|timeout_failure|automation_failure|infrastructure_failure`). Fall back to (`configuration|infrastructure|application_bug|secrets|resource|dependency`) only for novel/unclassified errors. Include summary and confidence.


⚠️ Potential issue | 🟠 Major

Include general_failure in preferred classification categories.

Line 100’s preferred list omits general_failure, but scripts/classify.py can emit general_failure as the default category for a matched pattern entry. That mismatch can force unnecessary fallback handling and weaken category consistency.

Proposed doc fix
-1. **Root Cause**: Category — prefer `classification.json` categories when matched (`platform_failure|connectivity_failure|authentication_failure|resource_failure|timeout_failure|automation_failure|infrastructure_failure`). Fall back to (`configuration|infrastructure|application_bug|secrets|resource|dependency`) only for novel/unclassified errors. Include summary and confidence.
+1. **Root Cause**: Category — prefer `classification.json` categories when matched (`platform_failure|connectivity_failure|authentication_failure|resource_failure|timeout_failure|automation_failure|infrastructure_failure|general_failure`). Fall back to (`configuration|infrastructure|application_bug|secrets|resource|dependency`) only for novel/unclassified errors. Include summary and confidence.

As per coding guidelines, “Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/root-cause-analysis/SKILL.md` at line 100, The preferred category list
in SKILL.md omits `general_failure`, causing mismatch with scripts/classify.py
which may emit `general_failure` for matched patterns; update the preferred
categories line (the list containing
`platform_failure|connectivity_failure|authentication_failure|resource_failure|timeout_failure|automation_failure|infrastructure_failure`)
to include `general_failure` among the preferred classifications so that
`general_failure` is treated as a primary match and not force fallback
handling—ensure the doc text still mentions the fallback set
(`configuration|infrastructure|application_bug|secrets|resource|dependency`)
remains unchanged and keep summary and confidence requirements intact.
