EvalGuard is an open metadata and schema repository for organizing, interpreting, and standardizing the evaluation of large language models (LLMs). It provides configurable schema definitions, metric descriptions, task annotations, and interpretability thresholds to support reproducible and meaningful evaluation practices.
EvalGuard is tool-agnostic but compatible with evaluation outputs from systems such as:
- lm-evaluation-harness
- Hugging Face `evaluate`
- Custom or enterprise evaluation pipelines
EvalGuard does not perform evaluations itself. Instead, it serves as a structured reference for interpreting model evaluation data and aligning it with categories, tags, thresholds, and recommended guardrails. The repository provides:
- REST API specification for accessing evaluation data
- Schemas for evaluation reports, tasks, metrics, policies, and guardrails
- Configuration files for model descriptions, task definitions, metric types, and operational guardrails
- CLI tools for schema validation, data generation, and API model generation
This enables:
- Consistent comparison across evaluations
- Configurable guidance on model strengths and limitations
- Operational guardrails and policy frameworks
- Risk mitigation and quality enforcement
evalguard/
├── schemas/ # Schema definitions (see SPECIFICATION.md)
│ └── v1/ # Version 1 schemas
├── config/ # Configuration files for interpretation
│ ├── tasks/ # Task definitions and metadata
│ ├── metrics/ # Metric definitions and types
│ ├── policies/ # Policy definitions with embedded thresholds
│ └── guardrails/ # Operational guardrails and policies
├── reports/ # Community-contributed model evaluation reports
├── tools/ # CLI tool for schema management
├── api-models/ # Generated language-specific models
├── SPECIFICATION.md # Complete schema specification
└── README.md
Policies define evaluation contexts and performance thresholds. Each policy groups related thresholds and evaluation criteria, with the thresholds embedded directly in the policy definition.
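For illustration, a policy with embedded thresholds might be sketched roughly as follows; the file name, field names, and threshold values here are assumptions made for this example, not the authoritative schema (see SPECIFICATION.md and `schemas/v1/` for that):
# A minimal, hypothetical policy definition (fields and values are illustrative only)
cat > config/policies/default.json <<'EOF'
{
  "id": "default",
  "name": "Default evaluation policy",
  "thresholds": [
    { "task": "truthfulqa_mc1", "metric": "acc", "min": 0.40 },
    { "task": "winogender_schemas", "metric": "acc", "min": 0.55 }
  ]
}
EOF
# Check the result against the published schemas
evalguard config validate -t policies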
Guardrails define operational constraints and policies for model deployment, helping mitigate risks and enforce quality standards.
Model cards provide comprehensive documentation of a model's capabilities and evaluation results, with the results contextualized by policy.
The EvalGuard schema system is formally defined in the EvalGuard Schema Specification. This specification provides:
- Schema Definitions: Formal definitions for tasks, metrics, thresholds, and reports
- Validation Rules: Comprehensive validation requirements and constraints
- File Organization: Schema versioning and file structure guidelines
- Implementation Requirements: Conformance requirements for implementations
- API Specification: Complete REST API interface definition
- Examples: Detailed configuration examples and usage patterns
The specification follows industry standards and uses RFC 2119 terminology for clarity and precision.
EvalGuard defines a REST API specification for accessing evaluation data. The API supports filtering by model, task, metric, and policy context.
# Get reports for a specific model
curl "https://api.evalguard.org/v1/reports?model_name=meta-llama/Llama-3.1-8B-Instruct"
# Get model card with specific policy evaluation results
curl "https://api.evalguard.org/v1/models/llama-3.1-8b-instruct/card?policy_id=default"
# Get policies with embedded thresholds
curl "https://api.evalguard.org/v1/policies?tasks=truthfulqa_mc1,winogender_schemas"
Note: This is a specification only. The API is not implemented in this repository. Anyone interested in providing EvalGuard API services can implement this specification.
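Against a conforming implementation, responses can be expected to be JSON, so standard tooling applies; for example, assuming `jq` is installed:
# Pretty-print a reports response for inspection
curl -s "https://api.evalguard.org/v1/reports?model_name=meta-llama/Llama-3.1-8B-Instruct" | jq .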
EvalGuard thrives on community contributions! We welcome evaluation reports, configuration files, and improvements to the schemas and tools.
Reports must follow the directory structure:
`reports/<model_org>/<model_name>/<report_format>/report.json`
Example: `reports/meta-llama/Llama-3.1-8B-Instruct/lm-eval/report.json`
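For example, staging an lm-evaluation-harness output locally before opening a pull request might look like this (the source path below is a placeholder):
# Create the expected directory and copy the evaluation output into place
mkdir -p reports/meta-llama/Llama-3.1-8B-Instruct/lm-eval
cp /path/to/lm-eval-output.json reports/meta-llama/Llama-3.1-8B-Instruct/lm-eval/report.json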
- lm-evaluation-harness (`lm-eval`): Currently supported with full CLI integration
- Other formats: If you have evaluation reports in other formats, please create an issue to discuss integration
When adding a new report, you can use the CLI to automatically generate missing tasks, models, and metrics. Bear in mind that some fields, such as the friendly name or description, will not be filled in automatically, so take some time to review what is missing:
# Generate missing configuration from a single report
evalguard lm-eval gen -f report.json
# Generate missing configuration from multiple reports in a directory
evalguard lm-eval gen -d reports/
This will create the necessary configuration files in the appropriate `config/` subdirectories.
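A quick way to review what the generator produced before committing:
# Inspect generated or modified configuration files and fill in any missing
# friendly names or descriptions by hand
git status config/
git diff config/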
🚀 Create a Pull Request using our Report Contribution Template to submit your evaluation reports!
The template will guide you through:
- Proper report structure and format
- Configuration generation and validation
- Quality assurance steps
- Required checklist items
Simply copy the template content when creating your PR to ensure all requirements are met.
You can add new configuration files directly to the appropriate folders:
- Tasks: `config/tasks/` - Task definitions and metadata
- Metrics: `config/metrics/` - Metric definitions and types
- Policies: `config/policies/` - Policy definitions with embedded thresholds
- Guardrails: `config/guardrails/` - Operational guardrails and policies
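As a rough illustration, a task entry could look something like the sketch below; the field names are assumptions for this example, and the real structure is defined by the schemas in `schemas/v1/`:
# A hypothetical task definition sketch; adjust it to the actual schema
cat > config/tasks/truthfulqa_mc1.json <<'EOF'
{
  "id": "truthfulqa_mc1",
  "name": "TruthfulQA (MC1)",
  "description": "Multiple-choice benchmark measuring the truthfulness of model answers.",
  "metrics": ["acc"]
}
EOF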
Always validate your contributions using the CLI:
# Validate all configuration files
evalguard config validate
# Validate specific types
evalguard config validate -t tasks
evalguard config validate -t metrics
evalguard config validate -t policies
evalguard config validate -t guardrails
All configuration files must follow the schemas defined in `schemas/v1/`. See SPECIFICATION.md for detailed schema definitions and examples.
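If you want to check a file against the raw schema directly (assuming the files under `schemas/v1/` are standard JSON Schema documents; the file names below are illustrative), a generic validator such as check-jsonschema also works:
# Validate a single config file against its schema with a generic JSON Schema tool
check-jsonschema --schemafile schemas/v1/task.schema.json config/tasks/my_task.json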
⚙️ Create a Pull Request using our Configuration Update Template to submit your configuration changes!
The template will guide you through:
- Proper configuration file structure
- Validation and schema compliance
- Quality assurance for configuration updates
- Required checklist items
Simply copy the template content when creating your PR to ensure all requirements are met.
# Install dependencies and build standalone binary
cd tools
# Build binaries for multiple platforms
make package
# Available platforms: macOS, Linux, Windows
# Output: bin/evalguard-{platform}
# Move the binary to a convenient location and add to PATH
# For macOS:
cp bin/evalguard-macos ~/.local/bin/evalguard
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc
# For Linux:
cp bin/evalguard-linux ~/.local/bin/evalguard
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
# For Windows (PowerShell):
# Copy evalguard-win.exe to a directory in your PATH
# Or add the bin directory to your system PATH
# Validate all configuration files
evalguard config validate
# Validate specific types
evalguard config validate -t tasks
evalguard config validate -t metrics
evalguard config validate -t policies
evalguard config validate -t guardrails
# Generate tasks/metrics/model_info from evaluation reports
evalguard lm-eval gen -f report.json
evalguard lm-eval gen -d reports/
# Generate API models from schemas
evalguard api gen --type js --spec-version v1
- SPECIFICATION.md: Complete schema specification with detailed examples
- tools/README.md: CLI tool development and usage details
- api-models/: Generated language-specific models and usage examples