A standalone, privacy-first LLM evaluation tool for domain experts
Zero installation • No dependencies • Complete privacy • Professional UX
LLM Evals for Domain Experts is a sophisticated evaluation interface designed specifically for domain experts who need to assess Large Language Model outputs efficiently and securely. Built as a single, self-contained HTML file, it requires no installation, setup, or external dependencies.
- 🔒 Complete Privacy: All data stays on your local machine - no uploads, no tracking
- 📱 Zero Installation: Just download and open in any modern browser
- 🌍 Multilingual Support: English, Portuguese, and Spanish interfaces
- ⚡ Professional UX: Keyboard shortcuts, batch operations, and streamlined workflows
- 📊 Rich Analytics: Built-in statistics, approval rates, and performance metrics
- 💾 Smart Export: CSV/JSON export with customizable separators and metadata
- 🎨 Modern UI: Dark/light themes with responsive design for all screen sizes
- Download `llm-evals-standalone.html`
- Double-click to open in your browser
- Import your CSV file with `input` and `llm_output` columns
- Start evaluating immediately!
- No Python, Node.js, or package managers needed
- No server setup or configuration
- No internet connection required after download
- Works on Windows, macOS, Linux, and mobile devices
- Upload any CSV file with `input` and `llm_output` columns
- Supports files up to 2MB with up to 500 items
- Auto-detects previously exported evaluations for continuity
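Before importing, you can pre-flight a file against those limits with a few lines of standard-library Python. This is an optional helper sketch (the function name and messages are our own, not part of the tool):

```python
import csv

REQUIRED = {"input", "llm_output"}  # columns the tool expects

def check_eval_csv(path: str) -> list[str]:
    """Return a list of problems found in a CSV before importing it."""
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED - set(reader.fieldnames or [])
        if missing:
            problems.append(f"missing columns: {sorted(missing)}")
        rows = sum(1 for _ in reader)
        if rows > 500:
            problems.append(f"{rows} rows exceeds the 500-item limit")
    return problems
```

An empty return value means the file should import cleanly.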
- Approve/Reject: Quick ✅/❌ decisions with visual feedback
- Label Classification:
- Error Labels: Hallucination, Factually Incorrect, Ignored Instructions, etc.
- Quality Labels: Gold Standard, Accurate & Relevant, Creative & Innovative, etc.
- Rich Annotations: Add ideal responses and detailed comments
- Keyboard Shortcuts: Navigate, evaluate, and label without touching the mouse
- Real-time statistics and approval rates
- Session timing and performance metrics
- Visual progress indicators
- Auto-save functionality
- Multiple Formats: CSV (recommended) or JSON
- Custom Separators: Comma, semicolon, pipe, triple-pipe, tab, or custom
- Rich Metadata: Timestamps, evaluator info, and session statistics
- Smart Naming: Auto-generated filenames with evaluation summary
| Shortcut | Action |
|---|---|
| `←` / `→` | Navigate between items |
| `↑` / `↓` | Approve / Reject |
| `1`-`6` | Quick label selection |
| `Ctrl+E` | Export evaluations |
| `Ctrl+N` | New session |
| `G`+`[number]` | Go to specific item |
| `Ctrl+F` | Search in data |
| `Ctrl+Z` | Undo last action |
- Jump to Item: Go directly to any item number
- Search Functionality: Find specific content across inputs/outputs
- Batch Operations: Approve/reject remaining items in bulk
- Undo System: Reverse recent evaluation decisions
- Completion Rates: Track evaluation progress
- Time Metrics: Average time per item and total session time
- Quality Distribution: Breakdown of labels and decisions
- Gold Standard Tracking: Identify exceptional responses
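The same numbers can be recomputed offline from an exported CSV. A minimal sketch using the export's column names (the function itself is illustrative, not part of the tool):

```python
import csv
from collections import Counter

def summarize(path: str) -> dict:
    """Approval rate and label distribution from an exported evaluations CSV."""
    decisions, labels = Counter(), Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            decisions[row.get("evaluation", "")] += 1
            for label in filter(None, row.get("labels", "").split(",")):
                labels[label] += 1
    total = sum(decisions.values())
    return {
        "approval_rate": decisions["approved"] / total if total else 0.0,
        "labels": dict(labels),
    }
```

Useful when aggregating sessions from several evaluators.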
Full interface localization in three languages:
- 🇺🇸 English: Default interface language
- 🇧🇷 Portuguese: Complete Brazilian Portuguese translation
- 🇪🇸 Spanish: Full Spanish interface support
Language switching is instant and preserves all evaluation progress.
- No Server Communication: Everything runs locally in your browser
- No Data Upload: Your evaluations never leave your machine
- No Tracking: Zero analytics, cookies, or user tracking
- Offline Capable: Works without internet connection
- Air-Gapped Operation: Perfect for sensitive or confidential data
- GDPR Compliant: No personal data collection or processing
- Enterprise Safe: No external dependencies or security risks
- Audit Friendly: Single file makes security review trivial
🧠 Cognitive Scientists: Evaluate reasoning and logical consistency
📚 Content Specialists: Assess accuracy and domain knowledge
🎯 Product Managers: Review user experience and feature requests
🔬 Researchers: Conduct systematic LLM capability studies
👩‍💼 Business Analysts: Evaluate commercial viability of AI responses
🎓 Educators: Grade and assess AI-generated educational content
- Blind Evaluation: Evaluate without bias using randomized presentation
- Inter-Rater Reliability: Multiple experts can evaluate the same dataset
- Longitudinal Studies: Track model improvements over time
- A/B Testing: Compare different models or prompts
- Quality Assurance: Systematic review of production AI outputs
- Model Comparison: Evaluate different LLMs for specific use cases
- Quality Control: Systematic review of AI-generated content
- Performance Monitoring: Track model degradation or improvement
- Compliance Checking: Ensure AI outputs meet regulatory standards
- Academic Studies: Systematic evaluation for research papers
- Benchmark Creation: Build custom evaluation datasets
- Capability Assessment: Test specific model capabilities
- Error Analysis: Identify patterns in model failures
- User Acceptance Testing: Evaluate AI features with domain experts
- Feature Validation: Test new AI capabilities before release
- Customer Feedback: Structure evaluation of user-reported issues
- Competitive Analysis: Compare your AI against competitors
- Chrome/Edge: Full feature support including custom scrollbars
- Firefox: Complete functionality with standard scrollbars
- Safari: Full compatibility on macOS and iOS
- Mobile Browsers: Responsive design for tablet and phone evaluation
- File Size: ~4MB standalone file (includes all dependencies)
- Memory Usage: Optimized for large datasets (500+ items)
- Load Time: Instant startup, no network requests
- Responsiveness: Smooth interactions even with large datasets
```csv
input,llm_output
"Your question or prompt here","AI model response here"
"Another input","Another response"
```

Optional columns (automatically detected):

- `evaluation`: Previous evaluation status
- `labels`: Previous label classifications
- `ideal_output`: Previous ideal response annotations
- `comments`: Previous evaluation comments
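If you are assembling the file programmatically, Python's `csv` module handles the quoting for you. A small sketch (the output filename is illustrative):

```python
import csv

rows = [
    {"input": "Your question or prompt here", "llm_output": "AI model response here"},
    {"input": "Another input", "llm_output": "Another response"},
]

# QUOTE_ALL keeps commas and newlines inside prompts/responses safe.
with open("evals.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "llm_output"], quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(rows)
```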
- Separators: Comma, semicolon, pipe, triple-pipe (`|||`), tab, or custom
- Metadata: Optional timestamps and session statistics
- Filtering: Export only evaluated items or include all
- Naming: Automatic filename generation with evaluation summary
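Note that multi-character separators like `|||` are not supported by Python's `csv` module (it accepts single-character delimiters only), so a plain string split works instead. A sketch that assumes no field itself contains `|||`:

```python
def read_triple_pipe(path: str) -> list[dict]:
    """Parse a '|||'-separated export; assumes fields never contain '|||'."""
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    header = lines[0].split("|||")
    return [dict(zip(header, line.split("|||"))) for line in lines[1:]]
```

The triple-pipe separator is a good default precisely because it is unlikely to occur inside prompts or responses.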
- Themes: Professional light and dark modes
- Font Scaling: Adjustable text size (80% to 150%)
- Language: Switch between English, Portuguese, and Spanish
- Layout: Responsive design adapts to screen size
```csv
input,llm_output,evaluation,labels,ideal_output,comments,evaluation_timestamp
"Input text","Output text","approved","accurate-relevant,well-structured","Ideal response","Great answer","2024-12-30T10:30:00Z"
```

```json
[
  {
    "input": "Input text",
    "llm_output": "Output text",
    "evaluation": "approved",
    "labels": "accurate-relevant,well-structured",
    "ideal_output": "Ideal response",
    "comments": "Great answer",
    "evaluation_timestamp": "2024-12-30T10:30:00Z"
  }
]
```

We welcome contributions to make this tool even better for domain experts!
- New Languages: Add translations for additional languages
- Label Categories: Suggest domain-specific evaluation labels
- Export Formats: Add support for specialized export formats
- UI Improvements: Enhance user experience and accessibility
- Documentation: Improve guides and examples
- Fork this repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes to `llm-evals-standalone.html`
- Test thoroughly across different browsers
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- ✅ Commercial Use: Use in commercial projects
- ✅ Modification: Modify and adapt to your needs
- ✅ Distribution: Share with colleagues and teams
- ✅ Private Use: Use internally within organizations
- ❌ Liability: No warranty or liability from authors
- 📖 Documentation: This README covers most use cases
- 🐛 Bug Reports: Open an issue with detailed reproduction steps
- 💡 Feature Requests: Share your ideas for improvements
- ❓ Questions: Ask in GitHub Discussions
- File Won't Load: Ensure your CSV has `input` and `llm_output` columns
- Slow Performance: Try smaller batches (under 500 items) for optimal speed
- Export Issues: Check that you have evaluation data before exporting
- Browser Issues: Use Chrome/Edge for the best experience
Built for domain experts who need:
- Privacy: Complete data control and security
- Simplicity: No technical barriers to evaluation
- Efficiency: Professional workflows and keyboard shortcuts
- Flexibility: Customizable labels and export options
- Reliability: Offline operation and data integrity
Made with ❤️ for the AI evaluation community
⭐ Star this repository if it helps your evaluation workflows!