---
layout: default
title: LLM Evaluation Loop
nav_order: 6
parent: Nice-to-Have
checklist_enabled: true
checklist_stage: stage-4
checklist_section: Nice-to-Have Artifact Checklist
checklist_order: 6
checklist_audit_areas:
  - maintenance-drift
---

# LLM Evaluation Loop

## Why This Matters

Guidelines are hypotheses about what the model needs to know. Some work well. Others get ignored or misinterpreted. Without a feedback loop, you can't tell the difference.

An evaluation loop tracks which guidelines are effective, which are failing, and how to iterate — turning guideline writing from guesswork into a systematic process.

## What to Include

- How to track LLM mistakes — categories, frequency, patterns
- How to trace mistakes to guidelines — was the guideline missing, unclear, or ignored?
- When to update guidelines — triggers for revision
- Prompt pattern library — recording what works for reuse

## How to Write It

1. Start simple. A shared document or issue label where the team records "the LLM got this wrong." No tooling needed.
2. Categorize failures. Wrong framework, wrong file location, wrong test type, security violation, architecture violation — each category maps to a guideline.
3. Review on defined triggers. Look at the failure patterns; if one category dominates, that guideline needs work (a tally sketch follows this list).
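
To make step 3 concrete, here is a minimal tally sketch in Python. It assumes failures are tracked as GitHub issues carrying an "llm-quality" label plus one category label (as in the example below), and that the GitHub CLI (`gh`) is installed and authenticated; adjust the category names to your own labels.

```python
# Tally "llm-quality" issues by failure-category label via the GitHub CLI.
# Category names mirror the example below and are only illustrative.
import json
import subprocess
from collections import Counter

CATEGORIES = {
    "wrong-framework", "wrong-location", "wrong-test-type",
    "security-issue", "style-mismatch", "scope-creep",
}

# `gh issue list --json labels` emits a JSON array; each issue carries
# its label objects under "labels".
raw = subprocess.run(
    ["gh", "issue", "list", "--label", "llm-quality",
     "--state", "all", "--limit", "500", "--json", "labels"],
    capture_output=True, text=True, check=True,
).stdout

counts = Counter(
    label["name"]
    for issue in json.loads(raw)
    for label in issue["labels"]
    if label["name"] in CATEGORIES
)

for category, count in counts.most_common():
    print(f"{category:20} {count}")
```

The output is just a sorted count per category, which is enough to see which guideline the next review should focus on.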

## Example

```markdown
## LLM Evaluation

Track failures in: GitHub issues with label "llm-quality"

Failure categories:
- wrong-framework  → Tech Stack guideline
- wrong-location   → Directory Structure guideline
- wrong-test-type  → Testing Strategy guideline
- security-issue   → Security Basics guideline
- style-mismatch   → Coding Standards guideline
- scope-creep      → Project Scope guideline

Trigger-based review:
- Count failures by category
- Top category → rewrite or expand that guideline
- If a guideline is followed consistently → it's working, leave it
- If a guideline is consistently ignored → it may be too vague, too buried,
  or conflicting with another guideline

Prompt patterns:
- Record prompts that produce consistently good results
- Share in docs/prompts/ as reusable snippets
- Retire prompts that stop working (model updates)
```
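
Prompt retirement is easier when staleness is visible. A small sketch, assuming prompt snippets live as Markdown files under docs/prompts/ as in the example above, lists each snippet with its last-modified date so old ones surface during review:

```python
# List prompt snippets with their last-modified dates; long-untouched
# prompts are candidates for re-testing or retirement.
from datetime import datetime
from pathlib import Path

PROMPT_DIR = Path("docs/prompts")  # assumed location, per the example above

for path in sorted(PROMPT_DIR.glob("*.md")):
    modified = datetime.fromtimestamp(path.stat().st_mtime)
    print(f"{modified:%Y-%m-%d}  {path.name}")
```

Last-modified dates are a rough proxy at best; a "last verified" note inside each snippet works just as well.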

## Common Mistakes

Not tracking at all. Without data, guideline updates are based on gut feeling. Even a simple tally is better than nothing.

Blaming the model instead of the guideline. When the LLM makes a mistake, the first question should be "is there a guideline for this?" followed by "is the guideline clear enough?" The model follows instructions — if it's not following yours, the instructions may be the problem.

Never retiring guidelines. Some guidelines become unnecessary as the model improves or the project stabilizes. Keeping them wastes token budget.

## Tool-Specific Notes

- Claude Code: CLAUDE.md supports iterative refinement — use /init to regenerate periodically and compare with your manual version.
- All tools: This is a process guideline, not a tool configuration. It works the same regardless of which LLM tool you use.
