
Integrate known failure pattern classification into RCA skill#13

Open
PalmPalm7 wants to merge 4 commits into redhat-et:main from PalmPalm7:feat/known-failed-classification

Conversation

Contributor

@PalmPalm7 PalmPalm7 commented Mar 23, 2026

Summary

  • Adds an error classification step between Step 4 and Step 5 that matches job errors against known_failed.yaml
  • Gives Claude verified ground-truth categories instead of guessing, using the same category taxonomy as the RHDP ETL pipeline
  • The path to known_failed.yaml is configurable via KNOWN_FAILED_YAML in the Claude Code settings.json env block, with a fallback to the bundled data/known_failed.yaml
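For context, a known-failures file in this style might look like the following. The field names here are an assumption based on the classifier description in this PR, not the actual RHDP file:

```yaml
# Hypothetical shape of known_failed.yaml (field names are assumptions)
failures:
  - error_category: timeout_failure
    pattern: "Timeout \\(\\d+s\\) exceeded while waiting for"
    description: Provisioning step exceeded its wait timeout
  - error_category: authentication_failure
    pattern: "401 Unauthorized|invalid credentials"
    description: API call rejected due to bad credentials
```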

Changes

  • scripts/classify.py (~80 LOC) — pattern matching logic adapted from /classify_errors.py
  • scripts/cli.py — integrates classification after Step 4, outputs classification.json
  • SKILL.md — instructs Claude to read classification.json and prefer verified categories
  • tests/test_classify.py — 9 new tests

Test plan

  • All 70 tests pass (61 existing + 9 new)
  • Run against a real failed job to verify classification output
  • Verify Claude uses classification.json categories in Step 5 summary

Upstream target: redhat-et/rhdp-rca-plugin (open PR there once permissions are granted)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added error classification that matches errors against configurable known-failure patterns and writes a classification.json artifact
    • Expanded root-cause categories with more specific failure types
    • Added CLI options and env var support for providing known-failures (with local cache/fallback)
  • Documentation

    • Updated skill README and docs to describe the classification step and classification.json output
  • Tests

    • Added tests covering classification logic and known-failure loading
  • Chores

    • Added PyYAML>=6.0 to project dependencies

Add optional error classification step that matches job errors against
a curated YAML of regex-based failure patterns. This gives Claude
verified ground-truth categories instead of guessing, using a
standardized 8-category taxonomy (platform_failure, connectivity_failure,
authentication_failure, resource_failure, timeout_failure, etc.).

The known_failed.yaml is fetched at runtime from a configurable URL
or local path — nothing is vendored. Configure via:
- CLI: --known-failures-url <url> or --known-failures-file <path>
- Env: KNOWN_FAILED_YAML_URL or KNOWN_FAILED_YAML in
  .claude/settings.local.json

Classification is fully optional — the pipeline works without it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai bot commented Mar 23, 2026

📝 Walkthrough

Walkthrough

Adds a regex-based error classification feature that loads known-failure patterns from YAML (URL or local, with caching), integrates a classification step into the CLI pipeline producing classification.json, extends schema categories, updates docs, and adds tests.

Changes

Cohort / File(s) Summary
Dependency Management
pyproject.toml
Added PyYAML>=6.0 to project.dependencies for YAML parsing.
Documentation
skills/root-cause-analysis/README.md, skills/root-cause-analysis/SKILL.md
Documented KNOWN_FAILED_YAML_URL/file inputs, caching behavior, and new classification.json artifact; updated Step 5 to prefer classification.json categories and include summary/confidence fields.
Schema
skills/root-cause-analysis/schemas/summary.schema.json
Extended root_cause.category enum with: platform_failure, connectivity_failure, authentication_failure, resource_failure, timeout_failure, automation_failure, infrastructure_failure, general_failure.
Classification Engine
skills/root-cause-analysis/scripts/classify.py
New module: resolves known-failures from a URL (with cache) or local file, parses YAML, matches error messages with regex (case-insensitive, dot-all), skips invalid patterns, and returns classification dicts; an aggregator extracts unique errors from the job and timeline.
CLI Integration
skills/root-cause-analysis/scripts/cli.py
Added --known-failures-url and --known-failures-file args; added "Classify" phase that calls resolver and classifier, prints results/skip messages, and writes analysis_dir/classification.json.
Tests
skills/root-cause-analysis/tests/test_classify.py
New tests for YAML loading (valid/missing/blank), regex matching and skipping invalid patterns, empty-input handling, deduplication, and end-to-end classify_job_errors behavior.

Sequence Diagram

sequenceDiagram
    participant CLI
    participant Classifier as classify.py
    participant Fetcher as YAML Fetcher
    participant Cache as Local Cache
    participant Matcher as Regex Matcher
    participant Output as classification.json

    CLI->>Classifier: analyze (with known-failures url/file)
    Classifier->>Fetcher: resolve_known_failures(url, local_path)
    Fetcher->>Cache: check cache file (_CACHE_FILE)
    alt cached content available
        Cache-->>Fetcher: return cached YAML
    else fetch required
        Fetcher->>Fetcher: requests.get(url) (may use GH token)
        Fetcher->>Cache: write downloaded YAML
    end
    Fetcher-->>Classifier: parsed failures list

    CLI->>Classifier: classify_job_errors(job_context, correlation, patterns)
    Classifier->>Matcher: for each unique error message -> classify_error(message, patterns)
    loop per pattern
        Matcher->>Matcher: re.search(pattern, message, IGNORECASE|DOTALL)
        alt match
            Matcher-->>Classifier: classification {error_category, matched_pattern, description}
        end
    end
    Classifier->>Output: write {patterns_loaded, matches} to classification.json
    Output-->>CLI: success / results

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title accurately summarizes the main change: integrating error classification via known failure patterns into the RCA skill pipeline, which is the core objective across all modified files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@PalmPalm7 PalmPalm7 closed this Mar 23, 2026
@PalmPalm7 PalmPalm7 reopened this Mar 23, 2026
PalmPalm7 and others added 2 commits April 10, 2026 17:44
The `requests` import in classify.py was top-level, causing
ModuleNotFoundError in CI since it's not in pyproject.toml deps.
Move it to a lazy import inside fetch_known_failures_from_url()
where it's actually needed. Add PyYAML to pyproject.toml since
classify.py directly uses it. Fix ruff I001 import sorting in
test_classify.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@PalmPalm7 PalmPalm7 marked this pull request as ready for review April 10, 2026 09:48

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
skills/root-cause-analysis/scripts/cli.py (1)

10-22: ⚠️ Potential issue | 🟠 Major

Direct-execution support is incomplete.

The conditional imports on lines 10-27 correctly support both direct and module execution, but cmd_query() uses a relative import (from .splunk_client import SplunkClient on line 316) that will fail when the script is run directly. When python scripts/cli.py query ... is executed, __package__ will be None, causing the relative import to raise an error. The query command will not work in direct-execution mode.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/root-cause-analysis/scripts/cli.py` around lines 10 - 22, The
cmd_query function currently does a relative import "from .splunk_client import
SplunkClient" which fails when the script is executed directly; update the
import resolution to match the top-of-file pattern by using a
conditional/try-import fallback so SplunkClient is imported correctly in both
direct and module execution modes (e.g., attempt relative import first, and if
that fails import using the top-level path or vice versa), and apply this change
where SplunkClient is referenced (cmd_query) so the query command works for both
"python scripts/cli.py" and "python -m scripts.cli".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pyproject.toml`:
- Around line 23-26: The pyproject.toml dependencies list is missing the runtime
dependency for the requests library used by fetch_known_failures_from_url() in
classify.py; add "requests>=2.0.0" (or an appropriate minimum version) to the
dependencies array so installations include requests and the URL-backed
known-failures code path (e.g., --known-failures-url / KNOWN_FAILED_YAML_URL)
does not raise ModuleNotFoundError.

In `@skills/root-cause-analysis/scripts/classify.py`:
- Around line 19-49: fetch_known_failures_from_url currently writes every
fetched YAML to a single global _CACHE_FILE so a later failed fetch for a
different URL can incorrectly return another URL's patterns; change the caching
to be URL-specific by deriving a per-source cache filename (e.g., hash the url
or sanitize it) inside _CACHE_DIR, write/read that per-URL cache instead of
_CACHE_FILE, and when falling back in the except block call load_known_failures
with the per-URL cache path (ensure _CACHE_DIR.mkdir remains and keep
headers/github token logic intact).
- Around line 64-69: The YAML loader in _parse_yaml_content should validate the
parsed shape before using it: ensure the top-level result is a dict and that the
"failures" key is a list of dicts; if not, return an empty list. Update
_parse_yaml_content to treat non-dict top-level values (e.g., [], str, int) as
invalid, treat missing or non-list "failures" as empty, and filter the failures
list to include only dict entries so downstream classify_error() never receives
unexpected types.
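A defensive validation step along the lines this prompt describes might look like the following sketch. In classify.py it would run on the value returned by `yaml.safe_load`; the function name is an assumption:

```python
def validate_failures(parsed) -> list[dict]:
    """Keep only well-formed entries from a parsed known-failures document.

    Rejects non-dict top-level values, a missing or non-list "failures" key,
    and non-dict list entries, so classify_error() never sees unexpected types.
    """
    if not isinstance(parsed, dict):
        return []
    failures = parsed.get("failures")
    if not isinstance(failures, list):
        return []
    return [entry for entry in failures if isinstance(entry, dict)]


print(validate_failures({"failures": [{"pattern": "x"}, "bad", 3]}))
# → [{'pattern': 'x'}]
```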

In `@skills/root-cause-analysis/scripts/cli.py`:
- Around line 209-237: The code currently skips writing classification.json when
no known failures are configured; change the logic so classification_path is
always written with a stable payload. Specifically, when
resolve_known_failures(...) returns falsy, set classifications = [] and
patterns_loaded = 0 (or compute patterns_loaded = len(known_failures) in the
truthy case), then always build classification_result = {"patterns_loaded":
patterns_loaded, "matches": classifications} and json.dump it to
classification_path; keep the existing prints for matched vs no matches but
ensure the file is created even in the else branch — update calls around
resolve_known_failures, classify_job_errors, classification_path, and
classification_result accordingly.
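The always-write contract the prompt asks for can be sketched like this (helper name and payload keys follow the prompt's description; not the merged code):

```python
import json
import tempfile
from pathlib import Path


def write_classification(analysis_dir: Path, patterns: list[dict],
                         classifications: list[dict]) -> Path:
    """Always write classification.json so Step 5 has a stable pipeline contract."""
    result = {"patterns_loaded": len(patterns), "matches": classifications}
    out = analysis_dir / "classification.json"
    out.write_text(json.dumps(result, indent=2))
    return out


tmp = Path(tempfile.mkdtemp())
path = write_classification(tmp, [], [])  # no patterns configured: still written
print(json.loads(path.read_text()))
# → {'patterns_loaded': 0, 'matches': []}
```

Emitting `{"patterns_loaded": 0, "matches": []}` instead of skipping the file lets Step 5 read the artifact unconditionally.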

In `@skills/root-cause-analysis/SKILL.md`:
- Line 100: The fallback category list in the "Root Cause" instruction of
SKILL.md includes invalid enums ("workload_bug" and "credential") that don't
match schemas/summary.schema.json and will produce invalid
step5_analysis_summary.json; update the fallback list in the "Root Cause" line
to only include enum values that exist in schemas/summary.schema.json (remove or
replace "workload_bug" and "credential" with the schema-approved equivalents),
and ensure any example outputs or references to step5_analysis_summary.json are
updated accordingly and validate against schemas/summary.schema.json.

---

Outside diff comments:
In `@skills/root-cause-analysis/scripts/cli.py`:
- Around line 10-22: The cmd_query function currently does a relative import
"from .splunk_client import SplunkClient" which fails when the script is
executed directly; update the import resolution to match the top-of-file pattern
by using a conditional/try-import fallback so SplunkClient is imported correctly
in both direct and module execution modes (e.g., attempt relative import first,
and if that fails import using the top-level path or vice versa), and apply this
change where SplunkClient is referenced (cmd_query) so the query command works
for both "python scripts/cli.py" and "python -m scripts.cli".
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cbad026d-413d-4ac5-a4d5-8d718dd66cb3

📥 Commits

Reviewing files that changed from the base of the PR and between 4ba3f2d and b6fc32e.

📒 Files selected for processing (7)
  • pyproject.toml
  • skills/root-cause-analysis/README.md
  • skills/root-cause-analysis/SKILL.md
  • skills/root-cause-analysis/schemas/summary.schema.json
  • skills/root-cause-analysis/scripts/classify.py
  • skills/root-cause-analysis/scripts/cli.py
  • skills/root-cause-analysis/tests/test_classify.py

Comment on lines +19 to +49
# Cache dir for downloaded known_failed.yaml
_CACHE_DIR = Path(tempfile.gettempdir()) / "rhdp-rca"
_CACHE_FILE = _CACHE_DIR / "known_failed.yaml"


def fetch_known_failures_from_url(url: str) -> list[dict]:
    """Fetch known failure patterns YAML from a URL.

    Caches the file locally. Returns the parsed failures list.
    """
    _CACHE_DIR.mkdir(parents=True, exist_ok=True)

    import requests

    headers = {}
    github_token = os.environ.get("GITHUB_TOKEN", "")
    if github_token and "api.github.com" in url:
        headers["Authorization"] = f"token {github_token}"
        headers["Accept"] = "application/vnd.github.v3.raw"

    try:
        resp = requests.get(url, headers=headers, timeout=15)
        resp.raise_for_status()
        _CACHE_FILE.write_text(resp.text)
        return _parse_yaml_content(resp.text)
    except (requests.RequestException, yaml.YAMLError) as e:
        # Fall back to cache if fetch fails
        if _CACHE_FILE.exists():
            return load_known_failures(_CACHE_FILE)
        print(f"  Warning: Failed to fetch known failure patterns: {e}")
        return []

@coderabbitai coderabbitai bot Apr 10, 2026


⚠️ Potential issue | 🟠 Major

Cache the YAML per source, not in one global file.

_CACHE_FILE is shared across every configured URL. If URL A is cached and URL B later fails to fetch, this code silently classifies B with A’s patterns and returns the wrong “verified” category set. Use a URL-derived cache key, or persist the source URL alongside the cache before falling back.

🧰 Tools
🪛 GitHub Actions: CI

[error] 31-31: mypy: Library stubs not installed for "requests" [import-untyped]

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/root-cause-analysis/scripts/classify.py` around lines 19 - 49,
fetch_known_failures_from_url currently writes every fetched YAML to a single
global _CACHE_FILE so a later failed fetch for a different URL can incorrectly
return another URL's patterns; change the caching to be URL-specific by deriving
a per-source cache filename (e.g., hash the url or sanitize it) inside
_CACHE_DIR, write/read that per-URL cache instead of _CACHE_FILE, and when
falling back in the except block call load_known_failures with the per-URL cache
path (ensure _CACHE_DIR.mkdir remains and keep headers/github token logic
intact).

Contributor Author


Valid point — the single global cache file could cause cross-contamination if multiple URLs are used. In practice this is low risk since the configuration pattern is one URL per environment, but it is a real correctness issue. Tracking as a follow-up improvement. For now the single-URL assumption holds for all current usage.


@PalmPalm7, understood — the single-URL-per-environment assumption makes the risk negligible for current usage. Would you like me to open a GitHub issue to track the per-URL cache key improvement so it doesn't get lost?

…n, schema-aligned categories

- Validate YAML shape in _parse_yaml_content: reject non-dict top-level,
  non-list failures, and non-dict entries
- Always emit classification.json even when no patterns are configured,
  so Step 5 has a stable pipeline contract
- Fix SKILL.md fallback categories to match summary.schema.json enum
  (workload_bug → application_bug, credential → secrets)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@PalmPalm7
Contributor Author

Re: CodeRabbit outside-diff comment on cli.py:10-22 (direct-execution `cmd_query` import) — this is a pre-existing issue not introduced by this PR. The `from .splunk_client import SplunkClient` relative import in `cmd_query()` does need the same conditional pattern as the top-of-file imports, but since it is outside this PR's scope, it should be tracked as a separate fix.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@skills/root-cause-analysis/SKILL.md`:
- Line 80: The doc is contradictory about classification.json being "REQUIRED"
yet "if present"; update the text in SKILL.md to state classification.json is an
unconditional/deterministic input (since scripts/cli.py always writes it),
remove "if present", and clearly instruct: always consult classification.json
for known failure pattern matches and use its error_category when a match
exists, otherwise mark the failure as novel/unclassified; reference
classification.json and scripts/cli.py in the text so readers know the source
and determinism.
- Line 100: The preferred category list in SKILL.md omits `general_failure`,
causing mismatch with scripts/classify.py which may emit `general_failure` for
matched patterns; update the preferred categories line (the list containing
`platform_failure|connectivity_failure|authentication_failure|resource_failure|timeout_failure|automation_failure|infrastructure_failure`)
to include `general_failure` among the preferred classifications so that
`general_failure` is treated as a primary match and not force fallback
handling—ensure the doc text still mentions the fallback set
(`configuration|infrastructure|application_bug|secrets|resource|dependency`)
remains unchanged and keep summary and confidence requirements intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6e94fc05-c7c8-44d4-b0c9-9815a9472120

📥 Commits

Reviewing files that changed from the base of the PR and between b6fc32e and 6715cee.

📒 Files selected for processing (3)
  • skills/root-cause-analysis/SKILL.md
  • skills/root-cause-analysis/scripts/classify.py
  • skills/root-cause-analysis/scripts/cli.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • skills/root-cause-analysis/scripts/cli.py
  • skills/root-cause-analysis/scripts/classify.py

2. **REQUIRED**: `step3_correlation.json` - Correlated timeline with relevant pod logs (DO NOT read step2 unless needed)
3. **REQUIRED**: `step4_github_fetch_history.json` - Configuration and code context
4. **CONDITIONAL**: `step2_splunk_logs.json` - Only read if step3 indicates errors needing deeper investigation
4. **REQUIRED**: `classification.json` - Known failure pattern matches (if present). Use these verified categories instead of guessing. If a match exists, use its `error_category` as the root cause category. If no matches, flag as novel/unclassified failure.


⚠️ Potential issue | 🟠 Major

Clarify classification.json as unconditionally required input.

Line 80 is internally contradictory (REQUIRED vs “if present”). scripts/cli.py always writes classification.json, so Step 5 guidance should be deterministic.

Proposed doc fix
-4. **REQUIRED**: `classification.json` - Known failure pattern matches (if present). Use these verified categories instead of guessing. If a match exists, use its `error_category` as the root cause category. If no matches, flag as novel/unclassified failure.
+4. **REQUIRED**: `classification.json` - Known failure pattern matching result. Always read this file. If a match exists, use its `error_category` as the root cause category. If no matches, treat as novel/unclassified failure.

As per coding guidelines, “Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/root-cause-analysis/SKILL.md` at line 80, The doc is contradictory
about classification.json being "REQUIRED" yet "if present"; update the text in
SKILL.md to state classification.json is an unconditional/deterministic input
(since scripts/cli.py always writes it), remove "if present", and clearly
instruct: always consult classification.json for known failure pattern matches
and use its error_category when a match exists, otherwise mark the failure as
novel/unclassified; reference classification.json and scripts/cli.py in the text
so readers know the source and determinism.

### Summary Requirements

1. **Root Cause**: Category (`configuration|infrastructure|workload_bug|credential|resource|dependency`), summary, confidence
1. **Root Cause**: Category — prefer `classification.json` categories when matched (`platform_failure|connectivity_failure|authentication_failure|resource_failure|timeout_failure|automation_failure|infrastructure_failure`). Fall back to (`configuration|infrastructure|application_bug|secrets|resource|dependency`) only for novel/unclassified errors. Include summary and confidence.


⚠️ Potential issue | 🟠 Major

Include general_failure in preferred classification categories.

Line 100’s preferred list omits general_failure, but scripts/classify.py can emit general_failure as the default category for a matched pattern entry. That mismatch can force unnecessary fallback handling and weaken category consistency.

Proposed doc fix
-1. **Root Cause**: Category — prefer `classification.json` categories when matched (`platform_failure|connectivity_failure|authentication_failure|resource_failure|timeout_failure|automation_failure|infrastructure_failure`). Fall back to (`configuration|infrastructure|application_bug|secrets|resource|dependency`) only for novel/unclassified errors. Include summary and confidence.
+1. **Root Cause**: Category — prefer `classification.json` categories when matched (`platform_failure|connectivity_failure|authentication_failure|resource_failure|timeout_failure|automation_failure|infrastructure_failure|general_failure`). Fall back to (`configuration|infrastructure|application_bug|secrets|resource|dependency`) only for novel/unclassified errors. Include summary and confidence.

As per coding guidelines, “Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/root-cause-analysis/SKILL.md` at line 100, The preferred category list
in SKILL.md omits `general_failure`, causing mismatch with scripts/classify.py
which may emit `general_failure` for matched patterns; update the preferred
categories line (the list containing
`platform_failure|connectivity_failure|authentication_failure|resource_failure|timeout_failure|automation_failure|infrastructure_failure`)
to include `general_failure` among the preferred classifications so that
`general_failure` is treated as a primary match and not force fallback
handling—ensure the doc text still mentions the fallback set
(`configuration|infrastructure|application_bug|secrets|resource|dependency`)
remains unchanged and keep summary and confidence requirements intact.
