Integrate known_failed.yaml into RCA Step 5 for verified error classification #11

@PalmPalm7

Description

Labels: Phase3 (ideas for phase 3), enhancement (new feature or request)

Problem

The RCA skill's Step 5 asks Claude to categorize failures and determine root cause purely from raw evidence. Claude has no reference for what failure patterns are already known and verified by the RHDP operations team, leading to:

  • Inconsistent category names across analyses (e.g., "network issue" vs "connectivity failure" vs "SSH problem")
  • No validation against ground truth — Claude guesses when a verified answer already exists
  • Output taxonomy doesn't align with what John's AAP2 ETL pipeline uses downstream

Solution

Integrate the existing known_failed.yaml file from the rhpds/aap2-agents repo into the RCA skill. This file contains regex-based error patterns, each curated with a verified error_category and a human-readable description.

Implementation

1. Vendor the data

Copy known_failed.yaml into skills/root-cause-analysis/data/.

2. Add classify.py (~50 LOC)
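A minimal sketch of what classify.py could look like. The function names, the assumed YAML schema (`pattern` / `error_category` / `description` keys), and the env-var fallback order are inferred from this issue's description, not taken from the rhpds/aap2-agents code:

```python
import os
import re

# Default vendored location (step 1); overridden by KNOWN_FAILED_YAML (step 3).
DEFAULT_PATH = "skills/root-cause-analysis/data/known_failed.yaml"

def load_patterns(path=None):
    """Load and compile known-failure patterns.

    Assumed schema: a YAML list of entries, each with `pattern` (regex),
    `error_category`, and `description` keys.
    """
    path = path or os.environ.get("KNOWN_FAILED_YAML") or DEFAULT_PATH
    if not os.path.exists(path):
        return []  # missing file: degrade to "no known patterns"
    import yaml  # PyYAML; imported lazily so classify() works without it
    with open(path) as fh:
        entries = yaml.safe_load(fh) or []  # empty file -> no patterns
    compiled = []
    for entry in entries:
        try:
            compiled.append((re.compile(entry["pattern"]), entry))
        except re.error:
            continue  # invalid regex patterns are skipped gracefully
    return compiled

def classify(message, patterns):
    """Return the first matching entry, or None for a novel failure."""
    for regex, entry in patterns:
        if regex.search(message):
            return entry
    return None
```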

3. Configure path via Claude Code settings

The path to known_failed.yaml should be set in Claude Code's settings.json (not .env), under the skill's env block:

{
  "env": {
    "KNOWN_FAILED_YAML": "/path/to/known_failed.yaml"
  }
}

The script falls back to skills/root-cause-analysis/data/known_failed.yaml if the setting is not provided.

4. Call between Step 4 and Step 5

After Steps 1-4 produce structured evidence, run classify.py against the error messages extracted from Steps 1 and 3. Inject the result into Claude's Step 5 prompt context:

Known failure match: connectivity_failure — "Unable to reach bastion host" (matched pattern: redacted)

If no pattern matches, flag the failure as novel/unclassified and worth human review.
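The injected context can be a single formatted line. A hypothetical helper (format_match is not an existing function; the wording mirrors the example line above):

```python
def format_match(entry, pattern_text=None):
    """Build the line injected into Claude's Step 5 prompt context.

    `entry` is a matched pattern dict (or None for no match); keys are
    assumed to be `error_category` and `description`.
    """
    if entry is None:
        return ("No known pattern matched -- flag as a novel/unclassified "
                "failure worth human review.")
    line = f'Known failure match: {entry["error_category"]} -- "{entry["description"]}"'
    if pattern_text:
        line += f" (matched pattern: {pattern_text})"
    return line
```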

5. Tests

  • Pattern loading (valid YAML, empty file, missing file)
  • Matching accuracy against a handful of known error strings from test fixtures
  • No-match returns None
  • Invalid regex patterns are skipped gracefully
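The cases above could be covered by a pytest-style sketch along these lines; compile_patterns and classify are inline stand-ins for classify.py's logic (hypothetical names) so the tests run standalone:

```python
import re

def compile_patterns(entries):
    """Inline stand-in for classify.py's loader: compile regexes, skip bad ones."""
    compiled = []
    for entry in entries or []:
        try:
            compiled.append((re.compile(entry["pattern"]), entry))
        except re.error:
            continue  # invalid regex patterns are skipped gracefully
    return compiled

def classify(message, patterns):
    for regex, entry in patterns:
        if regex.search(message):
            return entry
    return None

def test_empty_or_missing_input_yields_no_patterns():
    assert compile_patterns([]) == []
    assert compile_patterns(None) == []

def test_known_error_string_matches():
    patterns = compile_patterns(
        [{"pattern": r"Unable to reach", "error_category": "connectivity_failure"}])
    match = classify("Unable to reach bastion host", patterns)
    assert match["error_category"] == "connectivity_failure"

def test_no_match_returns_none():
    patterns = compile_patterns(
        [{"pattern": r"Unable to reach", "error_category": "connectivity_failure"}])
    assert classify("disk is full", patterns) is None

def test_invalid_regex_is_skipped():
    patterns = compile_patterns([{"pattern": r"[unclosed", "error_category": "x"},
                                 {"pattern": r"ok", "error_category": "y"}])
    assert len(patterns) == 1
```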

Why this matters

  • Accuracy: Claude validates against verified patterns instead of guessing
  • Consistency: Output uses the same 8-category taxonomy as John's ETL pipeline
  • Downstream compatibility: When multi-job analysis and Jira ticket creation are built, categories will already align with the deduplicator system designed in deduplicator_concept.md
  • Low effort: The matching logic and pattern data already exist — this is integration, not invention

References

  • known_failed.yaml — curated error patterns
  • classify_errors.py — existing matching logic to adapt
  • deduplicator_concept.md — future system that consumes these categories
