A standalone, privacy-first LLM evaluation tool for domain experts
Zero installation • No dependencies • Complete privacy • Professional UX
LLM Evals for Domain Experts is a sophisticated evaluation interface designed specifically for domain experts who need to assess Large Language Model outputs efficiently and securely. Built as a single, self-contained HTML file, it requires no installation, setup, or external dependencies.
- 🔒 Complete Privacy: All data stays on your local machine - no uploads, no tracking
- 📱 Zero Installation: Just download and open in any modern browser
- 🌍 Multilingual Support: English, Portuguese, and Spanish interfaces
- ⚡ Professional UX: Keyboard shortcuts, batch operations, and streamlined workflows
- 📊 Rich Analytics: Built-in statistics, approval rates, and performance metrics
- 💾 Smart Export: CSV/JSON export with customizable separators and metadata
- 🎨 Modern UI: Dark/light themes with responsive design for all screen sizes
- Download `llm-evals-standalone.html`
- Double-click to open in your browser
- Import your CSV file with `input` and `llm_output` columns
- Start evaluating immediately!
- No Python, Node.js, or package managers needed
- No server setup or configuration
- No internet connection required after download
- Works on Windows, macOS, Linux, and mobile devices
- Upload any CSV file with `input` and `llm_output` columns
- Supports files up to 2MB with up to 500 items
- Auto-detects previously exported evaluations for continuity
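Before importing, you can pre-flight a file against those limits with a few lines of standard-library Python. This is an optional helper sketch (the function name and messages are our own, not part of the tool):

```python
import csv

REQUIRED = {"input", "llm_output"}  # columns the tool expects

def check_eval_csv(path: str) -> list[str]:
    """Return a list of problems found in a CSV before importing it."""
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED - set(reader.fieldnames or [])
        if missing:
            problems.append(f"missing columns: {sorted(missing)}")
        rows = sum(1 for _ in reader)
        if rows > 500:
            problems.append(f"{rows} rows exceeds the 500-item limit")
    return problems
```

An empty return value means the file should import cleanly.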
- Approve/Reject: Quick ✅/❌ decisions with visual feedback
- Label Classification:
- Error Labels: Hallucination, Factually Incorrect, Ignored Instructions, etc.
- Quality Labels: Gold Standard, Accurate & Relevant, Creative & Innovative, etc.
- Rich Annotations: Add ideal responses and detailed comments
- Keyboard Shortcuts: Navigate, evaluate, and label without touching the mouse
- Real-time statistics and approval rates
- Session timing and performance metrics
- Visual progress indicators
- Auto-save functionality
- Multiple Formats: CSV (recommended) or JSON
- Custom Separators: Comma, semicolon, pipe, triple-pipe, tab, or custom
- Rich Metadata: Timestamps, evaluator info, and session statistics
- Smart Naming: Auto-generated filenames with evaluation summary
| Shortcut | Action |
|---|---|
| `←` / `→` | Navigate between items |
| `↑` / `↓` | Approve / Reject |
| `1`-`6` | Quick label selection |
| `Ctrl+E` | Export evaluations |
| `Ctrl+N` | New session |
| `G`+`[number]` | Go to specific item |
| `Ctrl+F` | Search in data |
| `Ctrl+Z` | Undo last action |
- Jump to Item: Go directly to any item number
- Search Functionality: Find specific content across inputs/outputs
- Batch Operations: Approve/reject remaining items in bulk
- Undo System: Reverse recent evaluation decisions
- Completion Rates: Track evaluation progress
- Time Metrics: Average time per item and total session time
- Quality Distribution: Breakdown of labels and decisions
- Gold Standard Tracking: Identify exceptional responses
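The same numbers can be recomputed offline from an exported CSV. A minimal sketch using the export's column names (the function itself is illustrative, not part of the tool):

```python
import csv
from collections import Counter

def summarize(path: str) -> dict:
    """Approval rate and label distribution from an exported evaluations CSV."""
    decisions, labels = Counter(), Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            decisions[row.get("evaluation", "")] += 1
            for label in filter(None, row.get("labels", "").split(",")):
                labels[label] += 1
    total = sum(decisions.values())
    return {
        "approval_rate": decisions["approved"] / total if total else 0.0,
        "labels": dict(labels),
    }
```

Useful when aggregating sessions from several evaluators.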
Full interface localization in three languages:
- 🇺🇸 English: Default interface language
- 🇧🇷 Portuguese: Complete Brazilian Portuguese translation
- 🇪🇸 Spanish: Full Spanish interface support
Language switching is instant and preserves all evaluation progress.
- No Server Communication: Everything runs locally in your browser
- No Data Upload: Your evaluations never leave your machine
- No Tracking: Zero analytics, cookies, or user tracking
- Offline Capable: Works without internet connection
- Air-Gapped Operation: Perfect for sensitive or confidential data
- GDPR Compliant: No personal data collection or processing
- Enterprise Safe: No external dependencies or security risks
- Audit Friendly: Single file makes security review trivial
🧠 Cognitive Scientists: Evaluate reasoning and logical consistency
📚 Content Specialists: Assess accuracy and domain knowledge
🎯 Product Managers: Review user experience and feature requests
🔬 Researchers: Conduct systematic LLM capability studies
👩‍💼 Business Analysts: Evaluate commercial viability of AI responses
🎓 Educators: Grade and assess AI-generated educational content
- Blind Evaluation: Evaluate without bias using randomized presentation
- Inter-Rater Reliability: Multiple experts can evaluate the same dataset
- Longitudinal Studies: Track model improvements over time
- A/B Testing: Compare different models or prompts
- Quality Assurance: Systematic review of production AI outputs
- Model Comparison: Evaluate different LLMs for specific use cases
- Quality Control: Systematic review of AI-generated content
- Performance Monitoring: Track model degradation or improvement
- Compliance Checking: Ensure AI outputs meet regulatory standards
- Academic Studies: Systematic evaluation for research papers
- Benchmark Creation: Build custom evaluation datasets
- Capability Assessment: Test specific model capabilities
- Error Analysis: Identify patterns in model failures
- User Acceptance Testing: Evaluate AI features with domain experts
- Feature Validation: Test new AI capabilities before release
- Customer Feedback: Structure evaluation of user-reported issues
- Competitive Analysis: Compare your AI against competitors
- Chrome/Edge: Full feature support including custom scrollbars
- Firefox: Complete functionality with standard scrollbars
- Safari: Full compatibility on macOS and iOS
- Mobile Browsers: Responsive design for tablet and phone evaluation
- File Size: ~4MB standalone file (includes all dependencies)
- Memory Usage: Optimized for large datasets (500+ items)
- Load Time: Instant startup, no network requests
- Responsiveness: Smooth interactions even with large datasets
```csv
input,llm_output
"Your question or prompt here","AI model response here"
"Another input","Another response"
```

Optional columns (automatically detected):

- `evaluation`: Previous evaluation status
- `labels`: Previous label classifications
- `ideal_output`: Previous ideal response annotations
- `comments`: Previous evaluation comments
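If you are assembling the file programmatically, Python's `csv` module handles the quoting for you. A small sketch (the output filename is illustrative):

```python
import csv

rows = [
    {"input": "Your question or prompt here", "llm_output": "AI model response here"},
    {"input": "Another input", "llm_output": "Another response"},
]

# QUOTE_ALL keeps commas and newlines inside prompts/responses safe.
with open("evals.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "llm_output"], quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(rows)
```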
- Separators: Comma, semicolon, pipe, triple-pipe (`|||`), tab, or custom
- Metadata: Optional timestamps and session statistics
- Filtering: Export only evaluated items or include all
- Naming: Automatic filename generation with evaluation summary
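Note that multi-character separators like `|||` are not supported by Python's `csv` module (it accepts single-character delimiters only), so a plain string split works instead. A sketch that assumes no field itself contains `|||`:

```python
def read_triple_pipe(path: str) -> list[dict]:
    """Parse a '|||'-separated export; assumes fields never contain '|||'."""
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    header = lines[0].split("|||")
    return [dict(zip(header, line.split("|||"))) for line in lines[1:]]
```

The triple-pipe separator is a good default precisely because it is unlikely to occur inside prompts or responses.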
- Themes: Professional light and dark modes
- Font Scaling: Adjustable text size (80% to 150%)
- Language: Switch between English, Portuguese, and Spanish
- Layout: Responsive design adapts to screen size
```csv
input,llm_output,evaluation,labels,ideal_output,comments,evaluation_timestamp
"Input text","Output text","approved","accurate-relevant,well-structured","Ideal response","Great answer","2024-12-30T10:30:00Z"
```

```json
[
  {
    "input": "Input text",
    "llm_output": "Output text",
    "evaluation": "approved",
    "labels": "accurate-relevant,well-structured",
    "ideal_output": "Ideal response",
    "comments": "Great answer",
    "evaluation_timestamp": "2024-12-30T10:30:00Z"
  }
]
```

We welcome contributions to make this tool even better for domain experts!
- New Languages: Add translations for additional languages
- Label Categories: Suggest domain-specific evaluation labels
- Export Formats: Add support for specialized export formats
- UI Improvements: Enhance user experience and accessibility
- Documentation: Improve guides and examples
- Fork this repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes to `llm-evals-standalone.html`
- Test thoroughly across different browsers
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- ✅ Commercial Use: Use in commercial projects
- ✅ Modification: Modify and adapt to your needs
- ✅ Distribution: Share with colleagues and teams
- ✅ Private Use: Use internally within organizations
- ❌ Liability: No warranty or liability from authors
- 📖 Documentation: This README covers most use cases
- 🐛 Bug Reports: Open an issue with detailed reproduction steps
- 💡 Feature Requests: Share your ideas for improvements
- ❓ Questions: Ask in GitHub Discussions
- File Won't Load: Ensure your CSV has `input` and `llm_output` columns
- Slow Performance: Try smaller batches (under 500 items) for optimal speed
- Export Issues: Check that you have evaluation data before exporting
- Browser Issues: Use Chrome/Edge for the best experience
Built for domain experts who need:
- Privacy: Complete data control and security
- Simplicity: No technical barriers to evaluation
- Efficiency: Professional workflows and keyboard shortcuts
- Flexibility: Customizable labels and export options
- Reliability: Offline operation and data integrity
Made with ❤️ for the AI evaluation community
⭐ Star this repository if it helps your evaluation workflows!