| layout | default | |
|---|---|---|
| title | LLM Evaluation Loop | |
| nav_order | 6 | |
| parent | Nice-to-Have | |
| checklist_enabled | true | |
| checklist_stage | stage-4 | |
| checklist_section | Nice-to-Have Artifact Checklist | |
| checklist_order | 6 | |
| checklist_audit_areas | | |
Guidelines are hypotheses about what the model needs to know. Some work well. Others get ignored or misinterpreted. Without a feedback loop, you can't tell the difference.
An evaluation loop tracks which guidelines are effective, which are failing, and how to iterate — turning guideline writing from guesswork into a systematic process.
It covers:
- How to track LLM mistakes: categories, frequency, patterns
- How to trace mistakes to guidelines: was the guideline missing, unclear, or ignored?
- When to update guidelines: triggers for revision
- Prompt pattern library: recording what works for reuse

To set one up:
- Start simple. A shared document or issue label where the team records "the LLM got this wrong." No tooling needed.
- Categorize failures. Wrong framework, wrong file location, wrong test type, security violation, architecture violation: each category maps to a guideline.
- Review on defined triggers. Look at failure patterns. If one category dominates, that guideline needs work.
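
The categorize-and-review steps can be sketched in a few lines of Python. This is a minimal illustration, not required tooling: the category names and guideline mapping are taken from the template on this page, while the function names and the sample log are hypothetical.

```python
from collections import Counter

# Category-to-guideline mapping, copied from the template on this page.
CATEGORY_TO_GUIDELINE = {
    "wrong-framework": "Tech Stack",
    "wrong-location": "Directory Structure",
    "wrong-test-type": "Testing Strategy",
    "security-issue": "Security Basics",
    "style-mismatch": "Coding Standards",
    "scope-creep": "Project Scope",
}


def tally_failures(categories):
    """Count recorded failures per category (one list entry per failure)."""
    return Counter(categories)


def guideline_to_review(categories):
    """Return the guideline mapped to the most frequent category.

    Returns None when nothing has been recorded, or when the top category
    has no guideline yet (which is itself worth noting).
    """
    counts = tally_failures(categories)
    if not counts:
        return None
    top_category, _ = counts.most_common(1)[0]
    return CATEGORY_TO_GUIDELINE.get(top_category)


# Hypothetical log: one entry per "the LLM got this wrong" note.
log = ["wrong-location", "style-mismatch", "wrong-location", "scope-creep"]
print(guideline_to_review(log))  # prints "Directory Structure"
```

A shared document works just as well; the point is only that each recorded failure carries exactly one category so the counts stay comparable.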
## LLM Evaluation
Track failures in: GitHub issues with label "llm-quality"
Failure categories:
- wrong-framework → Tech Stack guideline
- wrong-location → Directory Structure guideline
- wrong-test-type → Testing Strategy guideline
- security-issue → Security Basics guideline
- style-mismatch → Coding Standards guideline
- scope-creep → Project Scope guideline
Trigger-based review:
- Count failures by category
- Top category → rewrite or expand that guideline
- If a guideline is followed consistently → it's working, leave it
- If a guideline is consistently ignored → it may be too vague, too buried, or conflicting with another guideline
Prompt patterns:
- Record prompts that produce consistently good results
- Share in docs/prompts/ as reusable snippets
- Retire prompts that stop working (model updates)

Not tracking at all. Without data, guideline updates are based on gut feeling. Even a simple tally is better than nothing.

Blaming the model instead of the guideline. When the LLM makes a mistake, the first question should be "is there a guideline for this?" followed by "is the guideline clear enough?" The model follows instructions; if it's not following yours, the instructions may be the problem.

Never retiring guidelines. Some guidelines become unnecessary as the model improves or the project stabilizes. Keeping them wastes token budget.
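
If failures are tracked as GitHub issues labeled "llm-quality", as in the template, the tally can come straight from the issue labels. The sketch below assumes each issue also carries one of the category labels, and that the input is JSON exported with something like `gh issue list --label llm-quality --state all --json number,labels`. The function name and sample data are hypothetical, and the JSON shape should be checked against your own export.

```python
import json
from collections import Counter

# Category labels from the template; assumed to be applied to each issue
# alongside the "llm-quality" label.
CATEGORY_LABELS = {
    "wrong-framework",
    "wrong-location",
    "wrong-test-type",
    "security-issue",
    "style-mismatch",
    "scope-creep",
}


def count_categories(issues_json):
    """Tally category labels from a JSON array of issues.

    Each issue is expected to have a "labels" list of objects with a
    "name" field. Labels outside CATEGORY_LABELS are ignored.
    """
    counts = Counter()
    for issue in json.loads(issues_json):
        for label in issue.get("labels", []):
            name = label.get("name")
            if name in CATEGORY_LABELS:
                counts[name] += 1
    return counts


# Hypothetical sample of exported issues.
sample = """[
  {"number": 12, "labels": [{"name": "llm-quality"}, {"name": "wrong-location"}]},
  {"number": 15, "labels": [{"name": "llm-quality"}, {"name": "scope-creep"}]},
  {"number": 19, "labels": [{"name": "llm-quality"}, {"name": "wrong-location"}]}
]"""
print(count_categories(sample).most_common())
# prints [('wrong-location', 2), ('scope-creep', 1)]
```

A category that never appears in the counts over several review cycles points to a guideline that may be working, or one that is a candidate for retirement.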
- Claude Code: CLAUDE.md supports iterative refinement. Use `/init` to regenerate it periodically and compare the result with your manual version.
- All tools: This is a process guideline, not a tool configuration. It works the same regardless of which LLM tool you use.