Prerequisite: ../02_Scientist/05_Evaluation/.
For evaluation theory and metrics, see 02_Scientist/05_Evaluation. This document addresses the practical question: how do you set up a continuous evaluation and improvement cycle for a domain LLM system?
In most domain LLM projects, the limiting factor is not training — it's knowing whether the model is actually getting better. Without rigorous evaluation:
- You can't tell if a change helped or hurt
- You optimize for the wrong things (loss goes down, but output quality doesn't improve)
- You can't communicate progress to stakeholders
- You don't know when to stop iterating
Evaluation is not a phase — it's a continuous process that runs in parallel with every other activity.
The evaluation set is the single most important artifact in your project. It must be:
- Representative: Covers all task types and difficulty levels the system will encounter
- Expert-validated: Every answer has been verified by a domain expert
- Static: Never used for training. Never modified based on model performance.
- Versioned: Track changes if you must add new examples
- Sufficient size: At least 200 examples so that accuracy differences of a few percentage points between model versions are detectable; 500+ preferred
| Category | Proportion | Examples |
|---|---|---|
| Factual recall | 20-30% | "What is the standard curing time for C40 concrete?" |
| Reasoning | 20-30% | "Given these project constraints, what is the optimal construction sequence?" |
| Multi-step | 15-20% | "If material delivery is delayed by 2 weeks, what downstream milestones are affected and what mitigation options exist?" |
| Edge cases | 10-15% | Ambiguous questions, questions with insufficient information, out-of-scope requests |
| Format compliance | 10-15% | Tasks requiring specific output structure (reports, tables, checklists) |
For each evaluation question, create:
- Reference answer: The ideal response, written by a domain expert
- Key facts: A checklist of facts/points that must appear in any correct answer
- Unacceptable elements: Things that would make an answer wrong (common misconceptions, dangerous advice)
- Difficulty rating: Easy / Medium / Hard
- Category tags: For slicing evaluation results by topic
Example:
question: "What are the key considerations for transitioning an airport from construction to operations?"
reference_answer: "The transition requires coordinating across five dimensions: (1) commissioning and testing of all systems..."
key_facts:
- mentions commissioning/testing
- mentions staff training and certification
- mentions documentation handover
- mentions regulatory approval process
- mentions phased transition (not big-bang)
unacceptable:
- suggests skipping safety certification
- ignores regulatory requirements
difficulty: medium
categories: [operations, transition, planning]

Use multiple methods in combination. No single method is sufficient.
┌─────────────────────┐
│ Human Expert │ Gold standard. Expensive. Use for final validation.
│ Evaluation │ 50-100 examples per round.
├─────────────────────┤
│ LLM-as-Judge │ Scalable proxy for human judgment.
│ │ Run on full eval set (200-500 examples).
├─────────────────────┤
│ Automated Metrics │ Fast, cheap, run on every change.
│ (key fact check, │ Catches regressions immediately.
│ format validation) │
└─────────────────────┘
Run these on every model change (fast feedback):
| Metric | What It Checks | Implementation |
|---|---|---|
| Key fact recall | Does the answer contain required facts? | String matching or embedding similarity against key_facts checklist |
| Format compliance | Does output follow required structure? | Regex or schema validation |
| Refusal accuracy | Does model correctly refuse out-of-scope questions? | Check against labeled out-of-scope test cases |
| Consistency | Same question → same answer (across runs)? | Run each question 3 times, measure variance |
| Latency | Response time within acceptable range? | Time each inference call |
| Token efficiency | Answer length reasonable? | Check against expected length range |
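The first two checks in the table can be implemented in a few lines. A minimal sketch (function names and the markdown-heading convention for sections are assumptions, not part of the source; embedding similarity would be a drop-in upgrade over substring matching for paraphrase tolerance):

```python
import re

def key_fact_recall(answer: str, key_facts: list[str]) -> float:
    """Fraction of required facts found in the answer via simple
    case-insensitive substring matching."""
    answer_lower = answer.lower()
    hits = sum(1 for fact in key_facts if fact.lower() in answer_lower)
    return hits / len(key_facts) if key_facts else 1.0

def format_compliant(answer: str, required_sections: list[str]) -> bool:
    """Check that every required section heading appears in the output
    (assumes markdown-style '#' headings)."""
    return all(
        re.search(rf"^#+\s*{re.escape(s)}", answer, re.IGNORECASE | re.MULTILINE)
        for s in required_sections
    )
```

Run these on every eval question and average the per-question scores; a sudden drop in either number is an immediate regression signal.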
Use a strong model to evaluate outputs at scale. This is the workhorse of modern LLM evaluation.
Prompt template:
You are an expert evaluator for a domain-specific AI assistant focused on infrastructure construction and operations.
Evaluate the following response on these dimensions (score 1-5 each):
1. **Accuracy**: Are all stated facts correct? (1=major errors, 5=fully accurate)
2. **Completeness**: Does the response address all aspects of the question? (1=misses key points, 5=comprehensive)
3. **Relevance**: Does the response stay on topic? (1=off-topic, 5=precisely relevant)
4. **Clarity**: Is the response well-organized and easy to understand? (1=confusing, 5=crystal clear)
5. **Safety**: Does the response avoid harmful or misleading advice? (1=dangerous, 5=fully safe)
Question: {question}
Reference answer: {reference}
Model response: {response}
Provide scores and brief justification for each dimension.
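To make judge output machine-readable, parse the per-dimension scores out of the free-text response. A sketch, assuming the judge echoes each dimension name followed by a 1-5 score (the regex tolerates markdown bold and `:`/`=` separators; failing loudly on a missing dimension is deliberate, so malformed judgments don't silently count as zeros):

```python
import re

# The five rubric dimensions from the prompt template above.
DIMENSIONS = ["Accuracy", "Completeness", "Relevance", "Clarity", "Safety"]

def parse_judge_scores(judge_output: str) -> dict[str, int]:
    """Extract 'Dimension: <1-5>' scores from free-text judge output."""
    scores = {}
    for dim in DIMENSIONS:
        m = re.search(rf"{dim}\**\s*[:=]\s*\**(\d)", judge_output, re.IGNORECASE)
        if not m:
            raise ValueError(f"missing score for {dim}")
        scores[dim] = int(m.group(1))
    return scores
```

Asking the judge for JSON output instead makes parsing trivial, at the cost of occasionally degrading judgment quality on some models; test both on your calibration set.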
Calibration: Before relying on LLM-as-judge, validate it against human judgments on 50-100 examples. If agreement (Cohen's kappa) is below 0.6, refine the evaluation prompt.
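Cohen's kappa is simple enough to compute without a dependency. A self-contained sketch for two raters over categorical labels (e.g., pass/fail verdicts or bucketed 1-5 scores); `sklearn.metrics.cohen_kappa_score` is an equivalent off-the-shelf option:

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Agreement between two raters, corrected for chance agreement.
    kappa = (p_observed - p_expected) / (1 - p_expected)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters constant and identical
    return (observed - expected) / (1 - expected)
```

Values above 0.6 are conventionally read as "substantial" agreement, which is why it works as the calibration cutoff here.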
Reserve for:
- Final validation before deployment
- Calibrating LLM-as-judge
- Evaluating subjective quality (tone, professionalism, trustworthiness)
- Catching errors that automated methods miss
Protocol:
- Blind evaluation (evaluator doesn't know which model version produced the output)
- Standardized rubric (same scoring criteria for all evaluators)
- Inter-annotator agreement check (at least 2 evaluators per example, measure agreement)
- Structured feedback form (not just scores, but specific comments on what's wrong)
| Layer | Metrics | How to Measure |
|---|---|---|
| Retrieval | Recall@k, MRR, NDCG | Compare retrieved docs against relevance labels |
| Generation | Faithfulness, relevance, completeness | LLM-as-judge against retrieved context |
| End-to-end | Answer accuracy, user satisfaction | Compare against gold-standard answers |
Critical: Evaluate retrieval and generation separately. If the final answer is wrong, you need to know whether retrieval failed (right answer wasn't retrieved) or generation failed (right context was retrieved but model ignored it).
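The retrieval-layer metrics from the table are straightforward given relevance labels. A minimal sketch of Recall@k and MRR (document IDs and the input shapes are assumptions; NDCG additionally weights by graded relevance and rank discount):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(queries: list[tuple[list[str], set[str]]]) -> float:
    """Mean reciprocal rank of the first relevant doc, over all queries.
    Each query is (ranked retrieved IDs, set of relevant IDs)."""
    if not queries:
        return 0.0
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(queries)
```

Tracking these per query category (not just as a global average) makes it obvious whether a wrong final answer traces back to the retrieval layer.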
| Dimension | Metrics | How to Measure |
|---|---|---|
| Domain performance | Accuracy on domain eval set | Automated + LLM-as-judge |
| General capability | Performance on general benchmarks (MMLU, etc.) | Automated benchmarks |
| Catastrophic forgetting | Delta between base model and fine-tuned on general tasks | Before/after comparison |
| Safety | Refusal rate on adversarial prompts | Red-teaming test set |
| Dimension | Metrics | How to Measure |
|---|---|---|
| Graph quality | Triple accuracy, completeness, consistency | Expert sampling |
| Query accuracy | Text-to-Cypher correctness | Compare generated queries against gold-standard |
| Reasoning depth | Multi-hop answer accuracy | Test cases requiring 2-3 hop traversal |
| Fallback quality | Performance when KG doesn't have the answer | Test with out-of-graph questions |
┌─────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Evaluate │────→│ Analyze │────→│ Improve │────→│ Validate │──┐
│ (full eval │ │ (categorize │ │ (targeted │ │ (re-run │ │
│ set) │ │ failures) │ │ fixes) │ │ eval set) │ │
└─────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
↑ │
└─────────────────────────────────────────────────────────────────────────┘
After each evaluation round, categorize every failure:
Total errors: 47/200 (76.5% accuracy)
Breakdown:
├── Knowledge gaps: 18 (38%)
│ ├── Missing regulation knowledge: 8
│ ├── Missing technical procedures: 6
│ └── Missing domain terminology: 4
├── Reasoning errors: 12 (26%)
│ ├── Wrong causal chain: 5
│ ├── Incomplete analysis: 4
│ └── Contradictory statements: 3
├── Format violations: 9 (19%)
│ ├── Missing required sections: 5
│ └── Wrong output structure: 4
├── Hallucinations: 5 (11%)
│ ├── Fabricated statistics: 3
│ └── Non-existent regulations cited: 2
└── Refusal failures: 3 (6%)
└── Answered out-of-scope questions: 3
This breakdown directly tells you what to fix next:
- Add training data covering missing regulations (addresses 8 errors)
- Add chain-of-thought examples for causal reasoning (addresses 5 errors)
- Add more format-compliant examples (addresses 9 errors)
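A breakdown like the one above is easy to generate automatically once each failure carries category labels. A sketch, assuming annotators record a `category` and `subcategory` per failed example (field names are illustrative):

```python
from collections import Counter

def failure_breakdown(failures: list[dict], total_evaluated: int) -> str:
    """Summarize labeled failures into counts and shares, mirroring
    the error-analysis report format above."""
    n_err = len(failures)
    lines = [f"Total errors: {n_err}/{total_evaluated} "
             f"({(total_evaluated - n_err) / total_evaluated:.1%} accuracy)"]
    by_cat = Counter(f["category"] for f in failures)
    for cat, count in by_cat.most_common():
        lines.append(f"{cat}: {count} ({count / n_err:.0%})")
        subs = Counter(f["subcategory"] for f in failures if f["category"] == cat)
        for sub, n in subs.most_common():
            lines.append(f"  {sub}: {n}")
    return "\n".join(lines)
```

Regenerating this after every evaluation round turns "what should we fix next?" into reading the top of the list.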
Every improvement risks breaking something that previously worked. Maintain a regression test set:
- Regression set: Examples the model previously answered correctly. After any change, verify these still pass.
- Improvement set: Examples the model previously failed. After targeted fixes, verify these now pass.
- Net improvement: improvement_set gains - regression_set losses. Must be positive.
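The net-improvement rule is worth enforcing as an automated gate in CI rather than by eye. A minimal sketch (example IDs mapped to pass/fail are an assumed input shape):

```python
def regression_gate(regression_pass: dict[str, bool],
                    improvement_pass: dict[str, bool]) -> tuple[int, bool]:
    """Net improvement = improvement-set gains minus regression-set losses.
    The change is accepted only if net > 0."""
    losses = sum(1 for ok in regression_pass.values() if not ok)
    gains = sum(1 for ok in improvement_pass.values() if ok)
    net = gains - losses
    return net, net > 0
```

A stricter variant also rejects any change with `losses > 0` in safety-critical categories, regardless of net gain.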
Stop when:
- You've met your predefined success criteria (Section 2 of 04_Finetuning_Playbook)
- Marginal improvement per iteration drops below a threshold (e.g., <1% accuracy gain per round)
- Remaining errors are in categories that can't be fixed with more data (fundamental model limitations)
- The cost of further iteration exceeds the value of improvement
Evaluation doesn't end at deployment. In production, monitor:
| Metric | What It Tracks | Alert Threshold |
|---|---|---|
| User feedback | Thumbs up/down ratio | < 80% positive |
| Query failure rate | % of queries with no useful response | > 10% |
| Latency P95 | 95th percentile response time | > target SLA |
| Retrieval empty rate | % of queries with no relevant documents retrieved | > 15% |
| Hallucination flags | User-reported or auto-detected fabrications | Any increase |
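The thresholds in the table translate directly into an alerting check. A sketch with a hypothetical threshold config mirroring the table (metric names and the latency SLA value are illustrative assumptions):

```python
# direction "min" = alert when value falls below threshold,
# direction "max" = alert when value rises above it.
ALERTS = {
    "positive_feedback_ratio": ("min", 0.80),
    "query_failure_rate":      ("max", 0.10),
    "latency_p95_s":           ("max", 2.0),   # stand-in for your SLA target
    "retrieval_empty_rate":    ("max", 0.15),
}

def check_alerts(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that breached their threshold.
    Metrics absent from the input are skipped, not treated as breaches."""
    breached = []
    for name, (direction, threshold) in ALERTS.items():
        value = metrics.get(name)
        if value is None:
            continue
        if (direction == "min" and value < threshold) or \
           (direction == "max" and value > threshold):
            breached.append(name)
    return breached
```

Hallucination flags are harder to thresholdize ("any increase"), so they are better handled as a trend comparison against the previous window than as a static cutoff.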
| Frequency | Activity |
|---|---|
| Weekly | Run automated metrics on eval set (catch regressions from any system changes) |
| Monthly | LLM-as-judge evaluation on full eval set + new real user queries |
| Quarterly | Human expert evaluation, update eval set with new question types |
Production usage generates the most valuable evaluation data:
User queries → Log (with consent) → Sample interesting cases → Expert annotation → Add to eval set / training set
Prioritize logging:
- Queries where the user gave negative feedback
- Queries the model refused to answer
- Queries with unusually long/short responses
- Novel query patterns not seen in training
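The prioritization rules above can be encoded as a simple predicate applied at log time. A sketch in which all field names and the length thresholds are illustrative assumptions:

```python
def should_log(record: dict) -> bool:
    """Flag queries worth routing to expert annotation, per the
    priorities above. All fields and thresholds are illustrative."""
    if record.get("feedback") == "negative":
        return True
    if record.get("refused"):
        return True
    n_tokens = record.get("response_tokens", 0)
    if n_tokens < 20 or n_tokens > 1500:  # unusually short or long response
        return True
    if record.get("novel_pattern"):  # e.g., low similarity to training queries
        return True
    return False
```

Sampling a small random fraction of *unflagged* traffic as well keeps the annotated set from skewing entirely toward failure modes.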
This closes the loop: production data improves evaluation, which improves the model, which improves production quality.
- Zheng et al. (2023): Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.