---
name: audit-agents-skills
description: Audit quality of agents, skills, and commands in a Claude Code project
argument-hint: [path] [--fix] [--verbose]
---

# Audit Agents/Skills/Commands Quality

A comprehensive quality audit for Claude Code agents, skills, and commands. Scores each file against weighted criteria and assigns a production-readiness grade.

## Arguments

- `[path]` - Directory to audit (default: the current project's `.claude/`)
- `--fix` - Generate fix suggestions for failing criteria
- `--verbose` - Show details for all criteria (not just failures)

## Usage

```
/audit-agents-skills              # Audit current project
/audit-agents-skills --fix        # Audit + fix suggestions
/audit-agents-skills ~/other-repo # Audit another project
/audit-agents-skills --verbose    # Full details for all criteria
```

## Phase 1: Discovery

**Objective**: Locate and classify all agents, skills, and commands.

### Steps

1. **Scan directories**:

   ```
   .claude/agents/
   .claude/skills/
   .claude/commands/
   examples/agents/      (if exists)
   examples/skills/      (if exists)
   examples/commands/    (if exists)
   ```

2. **Classify files** (see the sketch after this list):

   - **Agent**: file in `agents/` with YAML frontmatter containing a `tools:` field
   - **Skill**: file in `skills/`, OR named `SKILL.md`, OR frontmatter with an `allowed-tools:` field
   - **Command**: file in `commands/` with frontmatter containing `name:` and `description:`

3. **Display summary**:

   ```
   Found: X agents, Y skills, Z commands
   ```
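A minimal classification sketch, assuming Python with PyYAML available; the helper names and scan root are illustrative, and the rules mirror step 2 above:

```python
import re
from pathlib import Path

import yaml  # PyYAML (assumed available)

def read_frontmatter(path: Path) -> dict:
    """Parse the YAML frontmatter block at the top of a Markdown file."""
    text = path.read_text(encoding="utf-8")
    match = re.search(r"^---\n(.*?)\n---", text, re.DOTALL)
    return (yaml.safe_load(match.group(1)) or {}) if match else {}

def classify(path: Path) -> str | None:
    """Apply the classification rules from step 2."""
    fm = read_frontmatter(path)
    parts = path.parts
    if "agents" in parts and "tools" in fm:
        return "agent"
    if "skills" in parts or path.name == "SKILL.md" or "allowed-tools" in fm:
        return "skill"
    if "commands" in parts and "name" in fm and "description" in fm:
        return "command"
    return None

# Scan and summarize
counts = {"agent": 0, "skill": 0, "command": 0}
for md in Path(".claude").rglob("*.md"):
    kind = classify(md)
    if kind:
        counts[kind] += 1
print(f"Found: {counts['agent']} agents, {counts['skill']} skills, {counts['command']} commands")
```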

## Phase 2: Audit Individual Files

Each file type is scored on weighted criteria. Maximum scores:

- Agents: 32 points
- Skills: 32 points
- Commands: 20 points

### Agents (32 points max)

#### Identity (weight: 3x) - 12 points

| Criterion | Points | Detection |
|-----------|--------|-----------|
| Clear `name` field | 3 | Frontmatter YAML has a `name:` field that is descriptive (not generic like "agent1") |
| `description` with triggers | 3 | Description contains "when", "use", or "trigger" keywords indicating activation context |
| `model` specified | 3 | Frontmatter has a `model:` field (sonnet/haiku/opus) |
| Tools restricted appropriately | 3 | Tools list doesn't include Bash unless justified, or includes an explanation for risky tools |

Rationale: Identity determines discoverability and activation. If users can't locate or invoke the agent, downstream quality is irrelevant.

#### Prompt Quality (weight: 2x) - 8 points

| Criterion | Points | Detection |
|-----------|--------|-----------|
| Role defined | 2 | Contains a "You are" or "Your role" statement defining the agent persona |
| Output format specified | 2 | Has a section titled "Output", "Format", or "Deliverables" specifying the expected structure |
| Scope/limits defined | 2 | Has a section defining scope, triggers, or when NOT to use the agent |
| Anti-hallucination measures | 2 | Contains keywords: "verify", "cite", "source", "evidence", or warnings against hallucination |

Rationale: Prompt quality determines reliability and accuracy of agent responses.

#### Validation (weight: 1x) - 4 points

| Criterion | Points | Detection |
|-----------|--------|-----------|
| 3+ usage examples | 1 | Has an "Examples", "Usage", or "Scenarios" section with at least 3 distinct examples |
| Edge cases documented | 1 | Mentions "edge case", "error", "failure", or "limitation" scenarios |
| Integration documented | 1 | References other agents, skills, or tools it works with |
| Error handling described | 1 | Mentions "fallback", "recovery", "error handling", or failure modes |

Rationale: Validation ensures robustness through comprehensive testing scenarios.

#### Design (weight: 2x) - 8 points

| Criterion | Points | Detection |
|-----------|--------|-----------|
| Single responsibility | 2 | File size <5000 tokens AND description is focused (not "general purpose" or multiple verbs) |
| No duplication | 2 | Description doesn't overlap significantly with other agents (>50% keyword-similarity check) |
| Composable (skills references) | 2 | References skills or other agents it can invoke, showing modularity |
| Reasonable token budget | 2 | File size <8000 tokens (avoids context bloat) |

Rationale: Design patterns determine maintainability and scalability of agent architecture.


### Skills (32 points max)

#### Structure (weight: 3x) - 12 points

| Criterion | Points | Detection |
|-----------|--------|-----------|
| Valid SKILL.md or frontmatter | 3 | File named `SKILL.md` OR has YAML frontmatter with a `name:` field |
| `name` valid | 3 | Name is lowercase, 1-64 chars, matches pattern `[a-z0-9-]+` (no spaces/special chars) |
| `description` non-empty | 3 | Description field exists and is >20 characters |
| `allowed-tools` specified | 3 | Frontmatter has an `allowed-tools:` field listing tool permissions |

Rationale: Structure compliance ensures spec compatibility with Claude Code runtime.
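A minimal check for the `name` rules above (the regex and length bound come straight from the table; the function name is illustrative):

```python
import re

def skill_name_valid(name: str) -> bool:
    """Lowercase, 1-64 chars, only [a-z0-9-], per the Structure criteria."""
    return bool(re.fullmatch(r"[a-z0-9-]{1,64}", name))

assert skill_name_valid("debugging-specialist")
assert not skill_name_valid("Debug Specialist")  # uppercase and space fail
```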

#### Content (weight: 2x) - 8 points

| Criterion | Points | Detection |
|-----------|--------|-----------|
| Methodology/workflow described | 2 | Has a section titled "Methodology", "Workflow", or "Process", or numbered steps |
| Output format specified | 2 | Has a section specifying the deliverable format (Markdown, JSON, report structure) |
| Examples provided | 2 | Has an "Examples", "Usage", or "Scenarios" section with concrete instances |
| Checklists included | 2 | Contains Markdown checkbox syntax (`- [ ]` or `- [x]`) for actionable items |

Rationale: Content richness determines usability and learning curve.
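Checkbox detection can be a one-line regex; a sketch, assuming the raw Markdown body is already loaded into `content`:

```python
import re

# Matches "- [ ]" and "- [x]" list items anywhere in the body
has_checklist = bool(re.search(r"^\s*- \[( |x)\]", content, re.MULTILINE))
```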

#### Technical (weight: 1x) - 4 points

| Criterion | Points | Detection |
|-----------|--------|-----------|
| Scripts have error handling | 1 | If bundled scripts exist, they contain `set -e`, `trap`, or `\|\|` error checks |
| No hardcoded paths | 1 | No absolute paths like `/Users/`, `/home/`, `C:\` in code or instructions |
| No secrets | 1 | No keywords: "password", "secret", "token", "api_key", "credentials" in plaintext |
| Dependencies documented | 1 | If external tools are required, has a "Requirements", "Dependencies", or "Prerequisites" section |

Rationale: Technical hygiene prevents portability issues and security risks.
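A sketch of the portability and secrets scans, using the keyword lists above (the patterns are illustrative, not exhaustive):

```python
import re

HARDCODED_PATHS = re.compile(r"/Users/|/home/|C:\\")
SECRET_KEYWORDS = re.compile(r"password|secret|token|api_key|credentials", re.IGNORECASE)

def technical_flags(content: str) -> dict:
    """Return which technical-hygiene checks fail for a file body."""
    return {
        "hardcoded_paths": bool(HARDCODED_PATHS.search(content)),
        "possible_secrets": bool(SECRET_KEYWORDS.search(content)),
    }
```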

#### Design (weight: 2x) - 8 points

| Criterion | Points | Detection |
|-----------|--------|-----------|
| Single responsibility | 2 | Description is focused on one domain (not "general" or multi-purpose) |
| Clear triggers | 2 | Has a section defining "When to use", "Triggers", or "Activation criteria" |
| No overlap with other skills | 2 | Description doesn't duplicate >50% of keywords from other skills in the project |
| Portable | 2 | No Claude Code-specific extensions that break portability (check for custom APIs) |

Rationale: Design determines findability and maintainability across projects.


### Commands (20 points max)

#### Structure (weight: 3x) - 12 points

| Criterion | Points | Detection |
|-----------|--------|-----------|
| Valid frontmatter | 3 | Has YAML frontmatter with both `name:` and `description:` fields |
| `argument-hint` if it takes args | 3 | If the `$ARGUMENTS` variable is used in the body, frontmatter has an `argument-hint:` field |
| Step-by-step workflow | 3 | Body contains numbered sections (1., 2., 3.) or a clear phase structure |
| Usage examples | 3 | Has a section titled "Usage" or "Examples", or shows invocation patterns |

Rationale: Structure determines usability and learnability for command users.
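The `argument-hint` criterion is a consistency check between body and frontmatter; a sketch, reusing the `read_frontmatter` helper sketched in Phase 1:

```python
def argument_hint_ok(body: str, frontmatter: dict) -> bool:
    """Pass if $ARGUMENTS is unused, or used and hinted in frontmatter."""
    uses_args = "$ARGUMENTS" in body
    return not uses_args or "argument-hint" in frontmatter
```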

#### Quality (weight: 2x) - 8 points

| Criterion | Points | Detection |
|-----------|--------|-----------|
| Error handling | 2 | Mentions "error", "failure", "fallback", or conditional paths for failures |
| Output format defined | 2 | Specifies what the command outputs (report, file, summary) and its structure |
| Validation gates | 2 | Contains checkpoints, verification steps, or "before proceeding" checks |
| Arguments parsed properly | 2 | If it takes args, shows how to parse/validate `$ARGUMENTS` (default values, validation) |

Rationale: Quality determines reliability and production readiness.


## Phase 3: Scoring

### Individual File Score

```
Score = (Points Obtained / Max Points) × 100
```

Example: an agent scoring 26/32 points → 81%.

### Grade Assignment

| Grade | Score Range | Status |
|-------|-------------|--------|
| A | 90-100% | Production-ready ✅ |
| B | 80-89% | Good (production threshold) ⚠️ |
| C | 70-79% | Needs improvement 🔧 |
| D | 60-69% | Significant gaps ⚠️ |
| F | <60% | Critical issues ❌ |

**Production Threshold**: 80% (Grade B or higher)
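A sketch of the score-to-grade mapping, transcribing the table above:

```python
def grade(score: float) -> str:
    """Map a 0-100 score to a letter grade per the table above."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"  # production threshold
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"
```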

### Overall Project Score

The overall score is the average across all audited files, which is the same as weighting each type's average score by its file count:

```
Overall = (Avg Agent Score × Agent Count + Avg Skill Score × Skill Count + Avg Command Score × Command Count) / Total Files
```
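Equivalently, in code (a sketch; `scores` is assumed to map each audited file to its percentage score):

```python
def overall_score(scores: dict[str, float]) -> float:
    """File-count-weighted average: the sum of all file scores over total files."""
    return sum(scores.values()) / len(scores) if scores else 0.0
```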

## Phase 4: Report Generation

### Report Structure

```markdown
# Audit: Agents/Skills/Commands

**Project**: {path}
**Date**: {date}
**Overall Score**: {score}% ({grade})
**Files Audited**: {total} ({n} agents, {n} skills, {n} commands)
**Production Ready**: {count} files ({percentage}%)

---

## Summary

| Type | Files | Avg Score | Grade | Production Ready |
|------|-------|-----------|-------|------------------|
| Agents | X | Y% | Z | N/X (%) |
| Skills | X | Y% | Z | N/X (%) |
| Commands | X | Y% | Z | N/X (%) |

---

## Individual Scores

| File | Type | Score | Grade | Top Issues |
|------|------|-------|-------|------------|
| agent-name.md | Agent | 85% | B | Missing anti-hallucination measures, no edge cases |
| skill-name/ | Skill | 72% | C | Hardcoded paths, no error handling |
| command.md | Command | 95% | A | None |

---

## Top Issues (Across All Files)

1. **Missing error handling** (8 files affected)
   - Impact: Runtime failures unhandled
   - Fix: Add error handling sections, fallback strategies

2. **Hardcoded paths** (5 files affected)
   - Impact: Portability broken across systems
   - Fix: Use relative paths or environment variables

3. **No usage examples** (4 files affected)
   - Impact: Poor learnability, unclear invocation
   - Fix: Add "Examples" section with 3+ scenarios

---

## Detailed Breakdown

<details>
<summary>agent-name.md (Agent, 85%, Grade B)</summary>

### Scores by Category

| Category | Points | Max | Pass |
|----------|--------|-----|------|
| Identity | 12 | 12 | ✅ |
| Prompt Quality | 6 | 8 | ⚠️ |
| Validation | 2 | 4 | ❌ |
| Design | 6 | 8 | ⚠️ |

### Failed Criteria

- **Anti-hallucination measures** (2 pts): No keywords found for source verification
- **Edge cases documented** (1 pt): No mention of failure scenarios
- **Integration documented** (1 pt): No references to other agents/skills

### Recommendations

1. Add "Source Verification" section requiring citation of claims
2. Document edge cases: API failures, timeout scenarios, invalid input
3. List compatible skills/agents for composition patterns

</details>

---

## Recommendations (Prioritized)

### High Priority (Critical for production)

1. **Add error handling to 8 files**
   - Files: [list]
   - Action: Add error handling sections, define fallback behaviors

2. **Remove hardcoded paths from 5 files**
   - Files: [list]
   - Action: Replace with `$HOME`, relative paths, or env vars

### Medium Priority (Improves quality)

3. **Add usage examples to 4 files**
   - Files: [list]
   - Action: Create "Examples" section with 3+ scenarios

4. **Define output formats in 3 files**
   - Files: [list]
   - Action: Specify deliverable structure (Markdown/JSON/report)

### Low Priority (Polish)

5. **Add integration docs to 2 files**
   - Files: [list]
   - Action: List compatible agents/skills for composition

---

## Next Steps

1. Review failures: Focus on Grade D/F files first
2. Run with `--fix` for automated suggestions
3. Re-audit after improvements to track progress
4. Aim for 80%+ (Grade B) across all files for production readiness
```

## Phase 5: Fix Mode (Optional)

**Trigger**: `--fix` flag

For each failing criterion, generate a specific fix suggestion:

### Example Fix Suggestions

**File**: `agent-name.md`
**Issue**: Missing anti-hallucination measures (2 pts lost)

**Suggested Fix**: Add this section after the "Methodology" section:

```markdown
## Source Verification

- Always cite sources for factual claims
- Use phrases like "According to [source]..." or "Based on [documentation]..."
- If uncertain, explicitly state "I don't have verified information on..."
- Never invent statistics, version numbers, or API details
```

**File**: `skill-debugging/scripts/analyze.sh`
**Issue**: No error handling (1 pt lost)

**Suggested Fix**: Add to the top of the script:

```bash
set -e  # Exit on error
trap 'echo "Error on line $LINENO"' ERR
```

```bash
# Replace risky commands:
curl https://api.example.com        # ❌ No error check
curl https://api.example.com || {   # ✅ Error handled
    echo "API call failed"
    exit 1
}
```

## Verbose Mode (Optional)

**Trigger**: `--verbose` flag

By default, the report shows only failed criteria. Verbose mode shows all criteria with pass/fail status:

```markdown
### All Criteria (Verbose)

| Criterion | Status | Points | Notes |
|-----------|--------|--------|-------|
| Clear name | ✅ Pass | 3/3 | Name is "debugging-specialist" (descriptive) |
| Description with triggers | ✅ Pass | 3/3 | Contains "Use when debugging..." |
| Model specified | ❌ Fail | 0/3 | No `model:` field in frontmatter |
| Tools restricted | ⚠️ Partial | 2/3 | Includes Bash but no justification |
| ... | ... | ... | ... |
```

## Industry Context

**Source**: LangChain Agent Report 2026 (verified)

Key Statistics:

- 29.5% of organizations deploy agents without systematic evaluation
- 18% cite "agent bugs" as their top challenge
- Only 12% use automated quality checks

**Implication**: This audit addresses a real industry gap. Most teams deploy agents and skills without validation, leading to production issues. The 80% threshold (Grade B) aligns with industry best practice for production readiness.

**Comparison**: Manual checklists (like the Guide's Agent Validation Checklist on line 4921) are comprehensive but error-prone. Automated scoring reduces human error and provides quantitative metrics for tracking improvements over time.


## Related

- Agent Validation Checklist (guide line 4921): manual 16-criteria checklist
- Skill Validation (guide line 5491): spec-compliance documentation
- Examples: `examples/agents/`, `examples/skills/`, `examples/commands/`
- Advanced audit: use the `audit-agents-skills` skill (see `examples/skills/`) for comparative analysis against templates

## Implementation Notes

### Detection Patterns

**Frontmatter parsing**:

```python
import re
import yaml

yaml_match = re.search(r'^---\n(.*?)\n---', content, re.DOTALL)
frontmatter = yaml.safe_load(yaml_match.group(1)) if yaml_match else {}
```

**Keyword detection** (case-insensitive):

```python
has_trigger = any(word in description.lower() for word in ['when', 'use', 'trigger'])
```

**Token counting** (approximate):

```python
tokens = len(content.split()) * 1.3  # Rough estimate: 1 token ≈ 0.75 words
```

### Overlap Detection

Compare descriptions using Jaccard similarity:

```python
def jaccard_similarity(desc1, desc2):
    words1 = set(desc1.lower().split())
    words2 = set(desc2.lower().split())
    intersection = words1 & words2
    union = words1 | words2
    return len(intersection) / len(union) if union else 0

# Flag if similarity > 0.5 (50% keyword overlap)
```

### Grade Color Coding (Terminal Output)

```python
COLORS = {
    'A': '\033[92m',  # Green
    'B': '\033[93m',  # Yellow
    'C': '\033[93m',  # Yellow
    'D': '\033[91m',  # Red
    'F': '\033[91m',  # Red
}
RESET = '\033[0m'  # Restore the default color after printing
```
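A usage sketch (assuming the `grade` helper sketched in Phase 3):

```python
score = 85.0
letter = grade(score)  # "B"
print(f"{COLORS[letter]}{letter} ({score:.0f}%){RESET}")
```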

Command ready for use: `/audit-agents-skills`