Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion .claude-plugin/marketplace.json
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,8 @@
"./skills/root-cause-analysis",
"./skills/context-fetcher",
"./skills/logs-fetcher",
"./skills/feedback-capture"
"./skills/feedback-capture",
"./skills/rca-annotator"
]
}
]
Expand Down
233 changes: 233 additions & 0 deletions skills/rca-annotator/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,233 @@
---
name: rca-annotator
description: LLM-as-judge skill that independently evaluates root-cause-analysis outputs to produce ground-truth annotations with multi-pass consistency checks, difficulty calibration, and complete evidence traceability.
allowed-tools:
- Read
- Write
- Bash
---

# RCA Annotator (LLM-as-Judge)

An LLM-as-judge skill that independently re-analyzes raw RCA evidence (step1, step3, step4 outputs) to produce ground-truth annotations. The judge forms its own conclusion from the same source evidence the agent had — without reading the agent's diagnosis — producing an unbiased reference label for scoring agent accuracy.

| | `root-cause-analysis` (Agent) | `rca-annotator` (Judge) |
|---|---|---|
| **Purpose** | Diagnose failures for users | Produce ground-truth labels for evaluation |
| **Reads** | Logs, Splunk, GitHub (live) | Step 1/3/4 output files (offline) |
| **Output** | Human-readable diagnosis | Structured `annotation_draft.json` |

**Use when**: steps 1-4 outputs exist and you need labeled ground-truth data, difficulty benchmarking, or alternative diagnosis identification.

**Do NOT use** to perform initial RCA (use `root-cause-analysis`) or directly score agent output (compare annotation externally).

## Prerequisites

Verify root-cause-analysis has been completed.

- `JUMPBOX_URI` (optional) — SSH connection string (e.g. `"user@host -p 2222"`). If unset, uses local `.analysis/` only.
- SSH keys configured in `~/.ssh/config` if using jumpbox; `ssh` and `rsync` installed.
- **Required files** in `.analysis/<job_id>/`:
- `step1_job_context.json` — Job metadata, failed tasks
- `step3_correlation.json` — Timeline with AAP + Splunk events
- `step4_github_fetch_history.json` — Configuration and code context

If missing, run `root-cause-analysis` skill first.

## Workflow

0. Download from jumpbox (if `JUMPBOX_URI` set) or verify local files
1. Read step1, step3, step4 (never the agent's diagnosis)
2. Independently determine root cause with evidence scoring and traceability
3. *(Optional)* Multi-pass consistency check
4. Write `annotation_draft.json`
5. Upload to jumpbox (if `JUMPBOX_URI` set)

---

## Step 0: Download Analysis Files

```bash
cd skills/rca-annotator
python scripts/cli.py download --job-id <job_id>
```

Downloads from jumpbox `/usr/local/mlflow/<job_id>/` to local `.analysis/<job_id>/`. If `JUMPBOX_URI` unset, validates local files only. Errors on missing remote directory, missing required files, or connection failure.

---

## Step 1: Read Outputs

Read in order: **step1** (job metadata, failed tasks, errors) → **step3** (correlated timeline incl. Splunk) → **step4** (config hierarchy and code). Skip `step2_splunk_logs.json` — step3 already includes correlated events.

---

## Step 2: Analyze and Annotate

### A. Root Cause Category

**From Step1**: Review `failed_tasks[]` — task names, actions, errors, durations. Action patterns hint at categories (e.g. `k8s_info` → infrastructure/rbac, `uri` → network/secrets, `command/shell` → configuration/dependency).

**From Step3**: Examine `timeline_events[]` chronologically. Pod errors before task failures → infrastructure; task failures before pod errors → configuration/code.

**From Step4**: Review `github_fetches[]` for config hierarchy. Check `fetched_configs` for missing variables/conflicts and `fetched_workload` for code at failure points. Note any `status: 404` or `error: all_paths_failed`.

**Categories**: `configuration` | `infrastructure` | `application_bug` | `dependency` | `network` | `resource` | `cloud_api` | `credential` | `secrets` | `unknown`

### B. Evidence Construction with Traceability

Each evidence item requires:
- **source** — `step1` | `step3` | `step4`
- **source_file** — path to analysis file
- **json_path** — exact path in source JSON
- **exact_value** and/or **exact_quote** — literal value or text
- **confidence** — `high` | `medium` | `low`
- **is_root_cause** — exactly ONE item must be `true`

Additional fields for code/config sources: `line_number`, `github_path` (`owner/repo:path:line`).

```json
{
"source": "step1",
"source_file": ".analysis/<job_id>/step1_job_context.json",
"json_path": "failed_tasks[0].duration",
"exact_value": 917.565567,
"message": "Task duration indicates timeout, not immediate failure",
"confidence": "medium",
"is_root_cause": false
}
```

For `agnosticd_code` or `agnosticv_config` sources, add `line_number`, `exact_quote`, and `github_path`.

**Process**: Start with Step3 timeline → filter to `task_failed`/`pod_error` → identify root cause evidence → extract traceability → assign confidence → include supporting evidence.

### C. Difficulty Calibration

Score 0-10 based on:
- **+3**: Requires cross-source correlation (AAP + Splunk + GitHub)
- **+2**: Requires understanding code behavior
- **+2**: Error message is generic/misleading
- **+1**: Requires variable precedence/override knowledge
- **+1**: Requires domain knowledge (K8s, Ansible, cloud APIs)
- **+1**: Multiple plausible alternatives exist
- **+1**: Timing dependencies critical

Mapping: 0-3 = easy, 4-6 = medium, 7-10 = hard. Include score and justification.

### D. Alternative Diagnoses

Identify plausible-but-incorrect hypotheses with:
- `category`, `summary`, `why_wrong`
- `plausibility`: `high` (shares many characteristics) | `medium` (some evidence) | `low` (superficial similarity)
- `supporting_evidence` and `contradicting_evidence` lists

---

## Step 2b: Multi-Pass Consistency Check (Optional)

For high-stakes annotations or medium/low initial confidence. Run 2-3 independent passes and compare:

- All agree on category → consensus, confidence: high
- 2/3 agree → majority, confidence: medium, note discrepancy
- All differ → confidence: low, flag for human review

Skip if initial confidence is high and difficulty is easy.

---

## Step 3: Write Annotation

Verify before writing:
- All required files read (step1, step3, step4)
- Root cause category matches evidence
- Exactly one evidence item has `is_root_cause: true`
- All evidence has traceability (source_file, json_path, exact_value/quote)
- All evidence has confidence level
- Difficulty score calculated with justification
- Alternative diagnoses have plausibility levels

Write `annotation_draft.json` to `.analysis/<job_id>/`.

---

## Step 4: Upload Annotation

```bash
cd skills/rca-annotator
python scripts/cli.py upload --job-id <job_id>
```

Uploads `.analysis/<job_id>/annotation_draft.json` to jumpbox if `JUMPBOX_URI` set. Local copy always preserved. If `JUMPBOX_URI` unset, file remains local only.

---

## Output Format

Save to `.analysis/<job_id>/annotation_draft.json`:

```json
{
"job_id": "1234567",
"annotator": "claude_judge",
"annotated_at": "2026-03-19T12:00:00Z",
"difficulty": "easy | medium | hard",
"difficulty_score": 5,
"difficulty_justification": "Requires correlating task code (+2) with missing configs...",

"root_cause": {
"category": "configuration | infrastructure | application_bug | dependency | network | resource | cloud_api | credential | secrets | unknown",
"summary": "One sentence describing what failed and why.",
"confidence": "high | medium | low"
},

"evidence": [
{
"source": "step1 | step3 | step4",
"source_file": ".analysis/<job_id>/step1_job_context.json",
"json_path": "failed_tasks[0].duration",
"exact_value": 917.565567,
"exact_quote": "optional — literal text for code/config",
"line_number": 5,
"github_path": "owner/repo:path/to/file.yml:line",
"message": "The relevant log line or config snippet.",
"confidence": "high | medium | low",
"is_root_cause": true
}
],

"recommendations": [
{
"priority": "high | medium | low",
"action": "What should be done to fix it.",
"file": "path/to/file.yml"
}
],

"contributing_factors": [
"Factor that made the failure more likely or harder to diagnose."
],

"alternative_diagnoses": [
{
"category": "infrastructure",
"summary": "A plausible but wrong diagnosis.",
"why_wrong": "Why the evidence does not support this.",
"plausibility": "high | medium | low",
"supporting_evidence": ["long timeout", "destroy action"],
"contradicting_evidence": ["test environment", "auth retry pattern"]
}
],

"consistency_check": {
"num_passes": 3,
"agreement": {
"category_agreement": "full | partial | none",
"category_votes": {"credential": 3},
"evidence_overlap": 0.85,
"confidence_distribution": {"high": 2, "medium": 1, "low": 0},
"confidence_mode": "high"
}
}
}
```
69 changes: 69 additions & 0 deletions skills/rca-annotator/schemas/schema.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
{
"job_id": "1234567",
"annotator": "claude_judge",
"annotated_at": "2026-03-08T12:00:00Z",
"difficulty": "easy | medium | hard",
"difficulty_score": 5,
"difficulty_justification": "Requires correlating task code (+2) with missing configs (+0) and interpreting generic MODULE FAILURE (+2) and understanding default(omit) behavior (+1). Total: 5 points.",

"root_cause": {
"category": "configuration | infrastructure | application_bug | dependency | network | resource | cloud_api | credential | secrets | unknown",
"summary": "One sentence describing what failed and why.",
"confidence": "high | medium | low"
},

"evidence": [
{
"source": "step1 | step3 | step4 | aap_job | splunk_ocp | agnosticv_config | agnosticd_code",
"source_file": ".analysis/1234567/step1_job_context.json",
"json_path": "failed_tasks[0].duration",
"exact_value": 917.565567,
"exact_quote": "kubeconfig: \"{{ _bookbag_kubeconfig | default(omit) }}\"",
"line_number": 5,
"github_path": "owner/repo:path/to/file.yml:line",
"message": "The relevant log line or config snippet.",
"confidence": "high | medium | low",
"is_root_cause": true
}
],

"recommendations": [
{
"priority": "high | medium | low",
"action": "What should be done to fix it.",
"file": "path/to/file.yml"
}
],

"contributing_factors": [
"Factor that made the failure more likely or harder to diagnose."
],

"alternative_diagnoses": [
{
"category": "infrastructure",
"summary": "A plausible but wrong diagnosis an agent might produce.",
"why_wrong": "Why the evidence does not support this.",
"plausibility": "high | medium | low",
"supporting_evidence": [
"Evidence that could support this diagnosis",
"Another piece of supporting evidence"
],
"contradicting_evidence": [
"Evidence that contradicts this diagnosis",
"Another contradicting factor"
]
}
],

"consistency_check": {
"num_passes": 3,
"agreement": {
"category_agreement": "full | partial | none",
"category_votes": {"credential": 3, "infrastructure": 0},
"evidence_overlap": 0.85,
"confidence_distribution": {"high": 2, "medium": 1, "low": 0},
"confidence_mode": "high"
}
}
}
Loading
Loading