This document provides practical examples of how to use LLM Evals for Domain Experts in different scenarios.

input,llm_output
"What is 2+2?","2+2 equals 4."
"Name a red fruit","An apple is a red fruit."

input,llm_output,evaluation,labels,ideal_output,comments
"What is 2+2?","2+2 equals 4.","approved","accurate-relevant","2+2 = 4","Perfect mathematical answer"
"Name a red fruit","An apple is a red fruit.","approved","accurate-relevant,concise-direct","Apple","Good example"

input,llm_output
"What are the symptoms of diabetes?","Common symptoms include frequent urination, excessive thirst, unexplained weight loss, fatigue, and blurred vision."
"Explain how insulin works","Insulin is a hormone that helps cells absorb glucose from the bloodstream, regulating blood sugar levels."

Evaluation Focus:
- Medical accuracy
- Safety considerations
- Appropriate disclaimers
- Evidence-based information

input,llm_output
"Explain photosynthesis for 5th graders","Plants make their own food using sunlight! They take in water through roots and carbon dioxide through leaves, then use sunlight to turn these into sugar."
"What caused World War II?","World War II was caused by multiple factors including economic instability, rise of totalitarian regimes, and unresolved issues from World War I."

Evaluation Focus:
- Age-appropriate language
- Accuracy of information
- Clarity and comprehension
- Educational value

input,llm_output
"How do I reset my password?","To reset your password: 1) Click 'Forgot Password' on the login page, 2) Enter your email address, 3) Check your email for reset link, 4) Follow the instructions in the email."
"My order is delayed, what should I do?","I understand your concern about the delayed order. Let me help you track your package and provide an update on the expected delivery date."

Evaluation Focus:
- Helpfulness and clarity
- Empathy and tone
- Completeness of solution
- Professional communication

input,llm_output
"What are GDPR requirements for data processing?","GDPR requires lawful basis for processing, data minimization, purpose limitation, accuracy, storage limitation, and appropriate security measures."
"Explain contract termination rights","Contract termination rights depend on the specific terms outlined in the agreement, applicable laws, and circumstances surrounding the termination."

Evaluation Focus:
- Legal accuracy
- Appropriate disclaimers
- Risk mitigation
- Jurisdiction considerations
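
Every input file in the examples above uses the same two-column CSV layout (`input`, `llm_output`). As a minimal sketch of a local sanity check before importing a file, the snippet below parses CSV text with Python's standard `csv` module and verifies that both columns are present. `load_pairs` and `REQUIRED_COLUMNS` are hypothetical names chosen for illustration; they are not part of the tool.

```python
import csv
import io

# Column names taken from the example files above.
REQUIRED_COLUMNS = {"input", "llm_output"}


def load_pairs(csv_text: str) -> list[dict[str, str]]:
    """Parse CSV text and return its rows, checking the required columns exist."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return list(reader)


sample = '''input,llm_output
"What is 2+2?","2+2 equals 4."
"Name a red fruit","An apple is a red fruit."
'''

rows = load_pairs(sample)
print(len(rows), "rows loaded")
```

Quoted fields matter here: several example outputs contain commas, and the `csv` module handles them correctly only when the field is wrapped in double quotes.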

Error labels:
- hallucination: Factually incorrect information
- factually-incorrect: Verifiably wrong facts
- ignored-instruction: Didn't follow the prompt
- generic-response: Too vague or template-like
- lack-of-context: Missing important context
- off-topic: Doesn't address the question

Quality labels:
- gold-standard: Exceptional response quality
- accurate-relevant: Correct and on-topic
- creative-innovative: Shows creativity or novel approach
- concise-direct: Well-structured and to the point
- well-structured: Good organization and flow
- detailed-complete: Comprehensive coverage
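
In an exported file, the `labels` column can hold several labels separated by commas (for example `accurate-relevant,concise-direct`). The sketch below splits such a field and sorts the labels into groups. It assumes the first six labels listed above are error labels and the last six are quality labels; `split_labels` is a hypothetical helper, not part of the tool.

```python
# Assumed grouping of the twelve labels listed above.
ERROR_LABELS = {
    "hallucination", "factually-incorrect", "ignored-instruction",
    "generic-response", "lack-of-context", "off-topic",
}
QUALITY_LABELS = {
    "gold-standard", "accurate-relevant", "creative-innovative",
    "concise-direct", "well-structured", "detailed-complete",
}


def split_labels(field: str) -> tuple[list[str], list[str], list[str]]:
    """Split a comma-separated labels field into (error, quality, unknown)."""
    labels = [part.strip() for part in field.split(",") if part.strip()]
    errors = [label for label in labels if label in ERROR_LABELS]
    quality = [label for label in labels if label in QUALITY_LABELS]
    unknown = [
        label for label in labels
        if label not in ERROR_LABELS and label not in QUALITY_LABELS
    ]
    return errors, quality, unknown


print(split_labels("accurate-relevant,concise-direct"))
# -> ([], ['accurate-relevant', 'concise-direct'], [])
```

Anything that lands in the third list is likely a typo or a custom label and is worth reviewing before analysis.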

- Import: Production AI responses for review
- Evaluate: Mark approve/reject with error labels
- Annotate: Add ideal responses for rejected items
- Export: Generate report for development team

- Import: Responses from different models to same prompts
- Blind Evaluation: Randomize presentation order
- Score: Use quality labels consistently
- Analyze: Compare approval rates and label distributions

- Import: Diverse prompt-response pairs
- Curate: Mark gold-standard examples
- Enhance: Add ideal responses and detailed comments
- Export: Create training dataset with annotations
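
The model-comparison steps above (randomize presentation order, then compare approval rates) can be prepared outside the tool. The following is a rough sketch under stated assumptions: each response is a dict with a `model` field, verdicts use the strings `approved` and `rejected`, and `blind_order` and `approval_rates` are hypothetical helper names.

```python
import random
from collections import defaultdict


def blind_order(responses, seed=None):
    """Shuffle responses and hide the model name behind an opaque item id."""
    rng = random.Random(seed)
    shuffled = list(responses)
    rng.shuffle(shuffled)
    key = {}      # item id -> model name; keep this away from the evaluator
    blinded = []  # what the evaluator actually sees
    for index, response in enumerate(shuffled):
        item_id = f"item-{index}"
        key[item_id] = response["model"]
        blinded.append({
            "id": item_id,
            "input": response["input"],
            "llm_output": response["llm_output"],
        })
    return blinded, key


def approval_rates(verdicts, key):
    """Map each model to its fraction of approved items."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for item_id, verdict in verdicts.items():
        model = key[item_id]
        totals[model] += 1
        if verdict == "approved":
            approved[model] += 1
    return {model: approved[model] / totals[model] for model in totals}


responses = [
    {"model": "model-a", "input": "What is 2+2?", "llm_output": "2+2 equals 4."},
    {"model": "model-b", "input": "What is 2+2?", "llm_output": "The answer is 4."},
]
blinded, key = blind_order(responses, seed=0)
```

Only `blinded` is imported for evaluation; `key` is joined back in after all verdicts are recorded, so the evaluator never sees which model produced which response.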

- Use ↑/↓ for quick approve/reject
- Press 1-6 for rapid label selection
- Use G+number to jump to specific items
- Press Ctrl+F to search through data

- For LLMs: Use ||| separator to avoid comma conflicts
- For Analysis: Include metadata and statistics
- For Spreadsheets: Use standard comma separator
- For Databases: Use JSON format with metadata
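
To illustrate why the `|||` separator helps, the sketch below writes the same rows in two of the formats mentioned above. It is not the tool's exporter: the function names and the JSON layout (`metadata` and `items` keys) are assumptions made for this example, and the `|||` variant assumes no field contains `|||` or a newline.

```python
import json

SEPARATOR = "|||"


def to_separated(rows, columns):
    """Join fields with '|||' so commas inside a field need no quoting."""
    lines = [SEPARATOR.join(columns)]
    for row in rows:
        lines.append(SEPARATOR.join(row[column] for column in columns))
    return "\n".join(lines)


def to_json(rows, metadata):
    """Bundle rows with metadata in one JSON document (hypothetical layout)."""
    return json.dumps({"metadata": metadata, "items": rows}, indent=2)


rows = [
    {
        "input": "Name a red fruit",
        "llm_output": "An apple is a red fruit.",
        "labels": "accurate-relevant,concise-direct",
    },
]
columns = ["input", "llm_output", "labels"]

print(to_separated(rows, columns))
print(to_json(rows, {"evaluator": "example-evaluator", "total_items": len(rows)}))
```

The join is done by hand because Python's `csv` module accepts only a single-character delimiter. Note how the multi-label field keeps its internal comma without any quoting.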

- Approve Remaining: Quick approval of pre-screened content
- Pattern Evaluation: Apply similar labels to similar responses
- Progressive Refinement: Multiple passes with different focus areas

- Approval Rate: Percentage of responses marked as approved
- Time per Item: Average evaluation speed
- Label Distribution: Most common issues identified
- Gold Standard Rate: Percentage of exceptional responses
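
Three of the metrics above can be recomputed from an exported file. The sketch below assumes rows shaped like the export example (an `evaluation` column holding `approved` or another value, and a comma-separated `labels` column); `summarize` is a hypothetical helper. Time per item is left out because it needs timestamps, which the example export does not show.

```python
from collections import Counter


def summarize(rows):
    """Compute approval rate, gold-standard rate, and label counts."""
    total = len(rows)
    approved = sum(1 for row in rows if row["evaluation"] == "approved")
    label_counts = Counter(
        label.strip()
        for row in rows
        for label in row["labels"].split(",")
        if label.strip()
    )
    return {
        "approval_rate": approved / total if total else 0.0,
        "gold_standard_rate": label_counts["gold-standard"] / total if total else 0.0,
        "label_distribution": label_counts.most_common(),
    }


rows = [
    {"evaluation": "approved", "labels": "gold-standard,accurate-relevant"},
    {"evaluation": "approved", "labels": "accurate-relevant"},
    {"evaluation": "rejected", "labels": "generic-response"},
    {"evaluation": "rejected", "labels": "generic-response,lack-of-context"},
]
summary = summarize(rows)
print(summary["approval_rate"], summary["gold_standard_rate"])
# -> 0.5 0.25
```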

Evaluation Summary:
- Total Items: 250
- Evaluated: 250 (100%)
- Approved: 180 (72%)
- Rejected: 70 (28%)
- Gold Standard: 25 (10%)

Top Issues:
1. Generic Response: 35 items (14%)
2. Lack of Context: 20 items (8%)
3. Factually Incorrect: 15 items (6%)

Recommendations:
- Improve prompt specificity to reduce generic responses
- Add context awareness training
- Enhance fact-checking capabilities
- Sensitive Data: Use offline evaluation only
- Personal Information: Remove PII before evaluation
- Confidential Content: Ensure no external transmission
- Access Control: Limit evaluator access appropriately
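
As one possible way to approach the "remove PII before evaluation" step, the sketch below masks email addresses and phone-like numbers with regular expressions. The patterns are deliberately simple and illustrative: they will miss names, addresses, and many phone formats, so treat this as a starting point under those assumptions rather than a complete scrubber. `redact` is a hypothetical helper.

```python
import re

# Illustrative patterns only; real PII removal needs broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


print(redact("Contact jane.doe@example.com or +1 (555) 010-2345."))
# -> Contact [EMAIL] or [PHONE].
```

Running a pass like this on both the `input` and `llm_output` columns before import keeps raw identifiers out of the evaluation file.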
- Session Tracking: Record evaluation sessions
- Evaluator Identification: Include evaluator names in exports
- Version Control: Track different evaluation rounds
- Change Documentation: Note evaluation criteria changes
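
The audit-trail items above (session, evaluator, round, criteria changes) could be captured in a small append-only log kept alongside the exports. This is a sketch of one possible record shape, not a feature of the tool; every field name here is an assumption.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class EvaluationSession:
    """One evaluation session (hypothetical record shape)."""
    evaluator: str
    round_id: str
    criteria_version: str
    notes: str = ""
    started_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(session: EvaluationSession) -> str:
    """Serialize a session as one JSON line."""
    return json.dumps(asdict(session))


def append_session(path: str, session: EvaluationSession) -> None:
    """Append the session to a JSON Lines file, one record per line."""
    with open(path, "a", encoding="utf-8") as log_file:
        log_file.write(to_log_line(session) + "\n")


session = EvaluationSession(
    evaluator="example-evaluator",
    round_id="round-2",
    criteria_version="v2",
    notes="Tightened the definition of generic-response.",
)
print(to_log_line(session))
```

Because the file is only ever appended to, earlier rounds stay intact and the log can be placed under version control with the exported data.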

- Tool Overview: Basic functionality demonstration
- Domain Guidelines: Specific evaluation criteria
- Practice Session: Evaluate sample data
- Calibration: Compare evaluations with expert
- Quality Check: Review first few real evaluations

- Clear Criteria: Define what constitutes approval/rejection
- Example Bank: Maintain reference examples
- Regular Calibration: Periodic consistency checks
- Documentation: Record evaluation rationale
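
The calibration steps above compare one evaluator's verdicts with an expert's on the same items. The source does not say how to measure this; one common choice is raw percent agreement together with Cohen's kappa, which corrects for agreement expected by chance. The sketch below assumes two equal-length lists of verdicts for the same items in the same order.

```python
from collections import Counter


def percent_agreement(first, second):
    """Fraction of items on which both evaluators gave the same verdict."""
    if len(first) != len(second) or not first:
        raise ValueError("verdict lists must be non-empty and the same length")
    return sum(a == b for a, b in zip(first, second)) / len(first)


def cohens_kappa(first, second):
    """Agreement corrected for chance: (observed - expected) / (1 - expected)."""
    observed = percent_agreement(first, second)
    n = len(first)
    first_counts, second_counts = Counter(first), Counter(second)
    expected = sum(
        first_counts[verdict] * second_counts[verdict] for verdict in first_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both evaluators used a single identical verdict throughout
    return (observed - expected) / (1 - expected)


trainee = ["approved", "approved", "rejected", "approved", "rejected", "rejected"]
expert = ["approved", "rejected", "rejected", "approved", "rejected", "approved"]

print(round(percent_agreement(trainee, expert), 2))  # -> 0.67
print(round(cohens_kappa(trainee, expert), 2))       # -> 0.33
```

In this example the two evaluators agree on four of six items, but since half of that agreement would be expected by chance, kappa comes out much lower than the raw rate. Tracking kappa across periodic calibration rounds shows whether consistency is improving.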

💡 Need more examples? See CONTRIBUTING.md for guidelines on adding domain-specific examples.