
Usage Examples for LLM Evals

This document provides practical examples of how to use LLM Evals for Domain Experts in different scenarios.

📋 Basic CSV Format

Minimal Required Format

```csv
input,llm_output
"What is 2+2?","2+2 equals 4."
"Name a red fruit","An apple is a red fruit."
```

Extended Format (with previous evaluations)

```csv
input,llm_output,evaluation,labels,ideal_output,comments
"What is 2+2?","2+2 equals 4.","approved","accurate-relevant","2+2 = 4","Perfect mathematical answer"
"Name a red fruit","An apple is a red fruit.","approved","accurate-relevant,concise-direct","Apple","Good example"
```

🎯 Domain-Specific Use Cases

1. Medical/Healthcare Domain

```csv
input,llm_output
"What are the symptoms of diabetes?","Common symptoms include frequent urination, excessive thirst, unexplained weight loss, fatigue, and blurred vision."
"Explain how insulin works","Insulin is a hormone that helps cells absorb glucose from the bloodstream, regulating blood sugar levels."
```

Evaluation Focus:

  • Medical accuracy
  • Safety considerations
  • Appropriate disclaimers
  • Evidence-based information

2. Educational Content

```csv
input,llm_output
"Explain photosynthesis for 5th graders","Plants make their own food using sunlight! They take in water through roots and carbon dioxide through leaves, then use sunlight to turn these into sugar."
"What caused World War II?","World War II was caused by multiple factors including economic instability, rise of totalitarian regimes, and unresolved issues from World War I."
```

Evaluation Focus:

  • Age-appropriate language
  • Accuracy of information
  • Clarity and comprehension
  • Educational value

3. Customer Support

```csv
input,llm_output
"How do I reset my password?","To reset your password: 1) Click 'Forgot Password' on the login page, 2) Enter your email address, 3) Check your email for reset link, 4) Follow the instructions in the email."
"My order is delayed, what should I do?","I understand your concern about the delayed order. Let me help you track your package and provide an update on the expected delivery date."
```

Evaluation Focus:

  • Helpfulness and clarity
  • Empathy and tone
  • Completeness of solution
  • Professional communication

4. Legal/Compliance

```csv
input,llm_output
"What are GDPR requirements for data processing?","GDPR requires lawful basis for processing, data minimization, purpose limitation, accuracy, storage limitation, and appropriate security measures."
"Explain contract termination rights","Contract termination rights depend on the specific terms outlined in the agreement, applicable laws, and circumstances surrounding the termination."
```

Evaluation Focus:

  • Legal accuracy
  • Appropriate disclaimers
  • Risk mitigation
  • Jurisdiction considerations

🏷️ Common Label Categories

Error Labels

  • hallucination: Fabricated details presented as fact
  • factually-incorrect: Verifiably wrong facts
  • ignored-instruction: Didn't follow the prompt
  • generic-response: Too vague or template-like
  • lack-of-context: Missing important context
  • off-topic: Doesn't address the question

Quality Labels

  • gold-standard: Exceptional response quality
  • accurate-relevant: Correct and on-topic
  • creative-innovative: Shows creativity or novel approach
  • concise-direct: Well-structured and to the point
  • well-structured: Good organization and flow
  • detailed-complete: Comprehensive coverage

📊 Evaluation Workflows

Workflow 1: Quality Assurance

  1. Import: Production AI responses for review
  2. Evaluate: Mark approve/reject with error labels
  3. Annotate: Add ideal responses for rejected items
  4. Export: Generate report for development team

Workflow 2: Model Comparison

  1. Import: Responses from different models to same prompts
  2. Blind Evaluation: Randomize presentation order
  3. Score: Use quality labels consistently
  4. Analyze: Compare approval rates and label distributions
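The blind-evaluation step above can be sketched in Python. This is an illustrative helper (`blind_order` is not a tool function): it shuffles with a fixed seed so the permutation can be recorded and reversed after scoring.

```python
import random

def blind_order(items, seed=0):
    """Return items in a seeded random order plus the permutation used."""
    rng = random.Random(seed)
    order = list(range(len(items)))
    rng.shuffle(order)
    return [items[i] for i in order], order

responses = ["model-A answer", "model-B answer", "model-C answer"]
shuffled, key = blind_order(responses, seed=42)

# After evaluation, the recorded permutation maps each presented
# item back to its original (model-identified) position.
restored = [None] * len(shuffled)
for pos, original_index in enumerate(key):
    restored[original_index] = shuffled[pos]
assert restored == responses
```

Keeping the permutation key outside the evaluator's view preserves blindness while still allowing per-model analysis afterward.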

Workflow 3: Training Data Creation

  1. Import: Diverse prompt-response pairs
  2. Curate: Mark gold-standard examples
  3. Enhance: Add ideal responses and detailed comments
  4. Export: Create training dataset with annotations
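The curation step in Workflow 3 amounts to filtering for approved, gold-standard rows that carry an ideal response. A hedged sketch (the filter criteria here are one reasonable choice, not the tool's built-in behavior):

```python
def curate_gold(rows):
    """Keep approved rows labeled gold-standard that have an ideal output."""
    return [
        r for r in rows
        if r.get("evaluation") == "approved"
        and "gold-standard" in r.get("labels", "")
        and r.get("ideal_output")
    ]

rows = [
    {"evaluation": "approved", "labels": "gold-standard", "ideal_output": "Apple"},
    {"evaluation": "rejected", "labels": "generic-response", "ideal_output": ""},
]
gold = curate_gold(rows)  # keeps only the first row
```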

🔍 Advanced Tips

Keyboard Efficiency

  • Use / for quick approve/reject
  • Press 1-6 for rapid label selection
  • Use G+number to jump to specific items
  • Press Ctrl+F to search through data

Export Optimization

  • For LLMs: Use ||| separator to avoid comma conflicts
  • For Analysis: Include metadata and statistics
  • For Spreadsheets: Use standard comma separator
  • For Databases: Use JSON format with metadata
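The `|||` separator recommended for LLM-bound exports can be produced with a few lines of Python. This sketch (`export_for_llm` is an illustrative name, not a tool API) shows why the separator helps: commas inside response text survive untouched.

```python
def export_for_llm(rows, fields=("input", "llm_output", "evaluation")):
    """Join fields with a ||| separator so commas in text need no quoting."""
    lines = ["|||".join(fields)]
    for r in rows:
        lines.append("|||".join(str(r.get(f, "")) for f in fields))
    return "\n".join(lines)

rows = [{"input": "Name a red fruit",
         "llm_output": "An apple, which is red.",
         "evaluation": "approved"}]
out = export_for_llm(rows)
print(out)
```

Note that `|||` must not itself occur in the text; standard CSV quoting remains the safer choice for spreadsheet consumers.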

Batch Operations

  • Approve Remaining: Quick approval of pre-screened content
  • Pattern Evaluation: Apply similar labels to similar responses
  • Progressive Refinement: Multiple passes with different focus areas

📈 Quality Metrics

Key Performance Indicators

  • Approval Rate: Percentage of responses marked as approved
  • Time per Item: Average evaluation speed
  • Label Distribution: Most common issues identified
  • Gold Standard Rate: Percentage of exceptional responses
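These KPIs are straightforward to compute from the extended CSV columns. A minimal sketch, assuming `evaluation` holds approve/reject and `labels` is a comma-separated string as in the format above (the function itself is illustrative):

```python
from collections import Counter

def kpis(rows):
    """Compute approval rate, gold-standard rate, and label distribution."""
    total = len(rows)
    approved = sum(1 for r in rows if r.get("evaluation") == "approved")
    labels = Counter(
        label for r in rows
        for label in r.get("labels", "").split(",") if label
    )
    gold = labels.get("gold-standard", 0)
    return {
        "approval_rate": approved / total if total else 0.0,
        "gold_standard_rate": gold / total if total else 0.0,
        "label_distribution": labels,
    }

rows = [
    {"evaluation": "approved", "labels": "gold-standard"},
    {"evaluation": "approved", "labels": "accurate-relevant"},
    {"evaluation": "rejected", "labels": "generic-response"},
]
stats = kpis(rows)  # approval_rate = 2/3, gold_standard_rate = 1/3
```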

Reporting Templates

```text
Evaluation Summary:
- Total Items: 250
- Evaluated: 250 (100%)
- Approved: 180 (72%)
- Rejected: 70 (28%)
- Gold Standard: 25 (10%)

Top Issues:
1. Generic Response: 35 items (14%)
2. Lack of Context: 20 items (8%)
3. Factually Incorrect: 15 items (6%)

Recommendations:
- Improve prompt specificity to reduce generic responses
- Add context awareness training
- Enhance fact-checking capabilities
```

🔒 Security Best Practices

Data Handling

  • Sensitive Data: Use offline evaluation only
  • Personal Information: Remove PII before evaluation
  • Confidential Content: Ensure no external transmission
  • Access Control: Limit evaluator access appropriately
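A very rough sketch of the PII-removal step, assuming regex-based scrubbing of emails and US-style phone numbers only. These patterns are deliberately simple and will miss many PII forms; a production workflow should use a vetted PII-detection library instead.

```python
import re

# Hypothetical minimal scrubber; patterns cover only the most obvious cases.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def scrub_pii(text):
    """Replace detected emails and phone numbers with placeholder tokens."""
    for name, pattern in PATTERNS.items():
        text = pattern.sub(f"[{name}]", text)
    return text

cleaned = scrub_pii("Contact jane@example.com or 555-123-4567.")
# → "Contact [EMAIL] or [PHONE]."
```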

Audit Trail

  • Session Tracking: Record evaluation sessions
  • Evaluator Identification: Include evaluator names in exports
  • Version Control: Track different evaluation rounds
  • Change Documentation: Note evaluation criteria changes

🎓 Training New Evaluators

Onboarding Checklist

  1. Tool Overview: Basic functionality demonstration
  2. Domain Guidelines: Specific evaluation criteria
  3. Practice Session: Evaluate sample data
  4. Calibration: Compare evaluations with expert
  5. Quality Check: Review first few real evaluations

Consistency Tips

  • Clear Criteria: Define what constitutes approval/rejection
  • Example Bank: Maintain reference examples
  • Regular Calibration: Periodic consistency checks
  • Documentation: Record evaluation rationale

💡 Need more examples? Check the CONTRIBUTING.md for guidelines on adding domain-specific examples.