A curated list of tools, methods & platforms for evaluating AI quality in real applications.
Evaluation is how you know if your AI actually works (and not hallucinating). This list covers the frameworks, benchmarks, datasets, and platforms you need to test LLMs, debug RAG pipelines, and monitor autonomous agents in production, organized by what you're trying to measure and how.
Aleph Alpha Eval Framework - Production-ready evaluation framework with 90+ pre-loaded benchmarks for reasoning, coding, and safety.
Anthropic Model Evals - Anthropic's evaluation suite for safety, capabilities, and alignment testing of language models.
Bloom - Anthropic's open-source agentic framework for automated behavioral evaluations of frontier AI models.
ColossalEval - Unified pipeline for classic metrics plus GPT-assisted scoring across public datasets.
DeepEval - Python unit-test style metrics for hallucination, relevance, toxicity, and bias.
Hugging Face lighteval - Toolkit powering HF leaderboards with 1k+ tasks and pluggable metrics.
Inspect AI - UK AI Safety Institute framework for scripted eval plans, tool calls, and model-graded rubrics.
lmms-eval - One-for-all multimodal evaluation toolkit supporting 100+ tasks across text, image, video, and audio.
MLflow Evaluators - Eval API that logs LLM scores next to classic experiment tracking runs.
OpenAI Evals - Reference harness plus registry spanning reasoning, extraction, and safety evals.
OpenCompass - Research harness with CascadeEvaluator, CompassRank syncing, and LLM-as-judge utilities.
Prompt Flow - Flow builder with built-in evaluation DAGs, dataset runners, and CI hooks.
Promptfoo - Local-first CLI and dashboard for evaluating prompts, RAG flows, and agents with cost tracking and regression detection.
Ragas - Evaluation library that grades answers, context, and grounding with pluggable scorers.
TruLens - Feedback function framework for chains and agents with customizable judge models.
W&B Weave Evaluations - Managed evaluation orchestrator with dataset versioning and dashboards.
ZenML - Pipeline framework that bakes evaluation steps and guardrail metrics into LLM workflows.
Athina AI - SOC-2 compliant LLM evaluation and monitoring platform with 50+ preset evaluations and VPC deployment.
Braintrust - Hosted evaluation workspace with CI-style regression tests, agent sandboxes, and token cost tracking.
LangSmith - Hosted tracing plus datasets, batched evals, and regression gating for LangChain apps.
Parea AI - Developer tools for evaluating, testing, and monitoring LLM-powered applications with actionable insights.
Patronus AI - Evaluation platform with multimodal LLM-as-judge, hallucination detection, and industry benchmarks like FinanceBench.
W&B Prompt Registry - Prompt evaluation templates with reproducible scoring and reviews.
EvalScope RAG - Guides and templates that extend Ragas-style metrics with domain rubrics.
LlamaIndex Evaluation - Modules for replaying queries, scoring retrievers, and comparing query engines.
Open RAG Eval - Vectara harness with UMBRELA and AutoNuggetizer metrics that don't require golden answers.
RAGEval - Framework that auto-generates corpora, questions, and RAG rubrics for completeness.
R-Eval - Toolkit for robust RAG scoring aligned with the Evaluation of RAG survey taxonomy.
UltraRAG - MCP-based RAG development framework with built-in evaluation workflows and multimodal support.
BEIR - Benchmark suite covering dense, sparse, and hybrid retrieval tasks.
ColBERT - Late-interaction dense retriever with evaluation scripts for IR datasets.
MTEB - Embeddings benchmark measuring retrieval, reranking, and similarity quality.
Awesome-RAG-Evaluation - Curated catalog of RAG evaluation metrics, datasets, and leaderboards.
Awesome-RAG-Reasoning - EMNLP 2025 collection of RAG + reasoning benchmarks, datasets, and implementations.
Comparing LLMs on Real-World Retrieval - Empirical analysis of how language models perform on practical retrieval tasks.
RAG Evaluation Survey - Comprehensive paper covering metrics, judgments, and open problems for RAG.
RAGTruth - Human-annotated dataset for measuring hallucinations and faithfulness in RAG answers.
AlpacaEval - Automated instruction-following evaluator with length-controlled LLM judge scoring.
ChainForge - Visual IDE for comparing prompts, sampling models, and scoring batches with rubrics.
Guardrails AI - Declarative validation framework that enforces schemas, correction chains, and judgments.
Lakera Guard - Hosted prompt security platform with red-team datasets for jailbreak and injection testing.
PromptBench - Benchmark suite for adversarial prompt stress tests across diverse tasks.
Red Teaming Handbook - Microsoft playbook for adversarial prompt testing and mitigation patterns.
ARTKIT - Automated multi-turn red teaming framework that simulates attacker-target interactions for jailbreak testing.
DeepTeam - Open-source LLM red teaming framework testing for bias, data exposure, and prompt injection vulnerabilities.
Garak - NVIDIA's adversarial testing toolkit with 100+ attack modules for prompt injection and data extraction.
PyRIT - Microsoft's Python Risk Identification Toolkit for orchestrating LLM attack suites and red team automation.
Deepchecks Evaluation Playbook - Survey of evaluation metrics, failure modes, and platform comparisons.
HELM - Holistic Evaluation of Language Models methodology emphasizing multi-criteria scoring.
Instruction-Following Evaluation (IFEval) - Constraint-verification prompts for automatically checking instruction compliance.
OpenAI Cookbook Evals - Practical notebooks showing how to build custom evals.
Safety Evaluation Guides - Cloud vendor recipes for testing quality, safety, and risk.
Who Validates the Validators? - EvalGen workflow aligning LLM judges with human rubrics via mixed-initiative criteria design.
ZenML Evaluation Playbook - Playbook for embedding eval gates into pipelines and deployments.
Agenta - End-to-end LLM developer platform for prompt engineering, evaluation, and deployment.
Arize Phoenix - OpenTelemetry-native observability and evaluation toolkit for RAG, LLMs, and agents.
DocETL - ETL system for complex document processing with LLMs and built-in quality checks.
Giskard - Testing framework for ML models with vulnerability scanning and LLM-specific detectors.
Helicone - Open-source LLM observability platform with cost tracking, caching, and evaluation tools.
Langfuse - Open-source LLM engineering platform providing tracing, eval dashboards, and prompt analytics.
Lilac - Data curation tool for exploring and enriching datasets with semantic search and clustering.
LiteLLM - Unified API for 100+ LLM providers with cost tracking, fallbacks, and load balancing.
Lunary - Production toolkit for LLM apps with tracing, prompt management, and evaluation pipelines.
Mirascope - Python toolkit for building LLM applications with structured outputs and evaluation utilities.
OpenLIT - Telemetry instrumentation for LLM apps with built-in quality metrics and guardrail hooks.
OpenLLMetry - OpenTelemetry instrumentation for LLM traces that feed any backend or custom eval logic.
Opik - Self-hostable evaluation and observability hub with datasets, scoring jobs, and interactive traces.
Rhesis - Collaborative testing platform with automated test generation and multi-turn conversation simulation for LLM and agentic applications.
traceAI - Open-source multi-modal tracing and diagnostics framework for LLM, RAG, and agent workflows built on OpenTelemetry.
UpTrain - OSS/hosted evaluation suite with 20+ checks, RCA tooling, and LlamaIndex integrations.
VoltAgent - TypeScript agent framework paired with VoltOps for trace inspection and regression testing.
Zeno - Data-centric evaluation UI for slicing failures, comparing prompts, and debugging retrieval quality.
ChatIntel - Conversation analytics platform for evaluating chatbot quality, sentiment, and user satisfaction.
Confident AI - DeepEval-backed platform for scheduled eval suites, guardrails, and production monitors.
Datadog LLM Observability - Datadog module capturing LLM traces, metrics, and safety signals.
Deepchecks LLM Evaluation - Managed eval suites with dataset versioning, dashboards, and alerting.
Eppo - Experimentation platform with AI-specific evaluation metrics and statistical rigor for LLM A/B testing.
Future AGI - Multi-modal evaluation, simulation, and optimization platform for reliable AI systems across software and hardware.
Galileo - Evaluation and data-curation studio with labeling, slicing, and issue triage.
HoneyHive - Evaluation and observability platform with prompt versioning, A/B testing, and fine-tuning workflows.
Humanloop - Production prompt management with human-in-the-loop evals and annotation queues.
Maxim AI - Evaluation and observability platform focusing on agent simulations and monitoring.
Orq.ai - LLM operations platform with prompt management, evaluation workflows, and deployment pipelines.
PostHog LLM Analytics - Product analytics toolkit extended to track custom LLM events and metrics.
PromptLayer - Prompt engineering platform with version control, evaluation tracking, and team collaboration.
Amazon Bedrock Evaluations - Managed service for scoring foundation models and RAG pipelines.
Amazon Bedrock Guardrails - Safety layer that evaluates prompts and responses for policy compliance.
Azure AI Foundry Evaluations - Evaluation flows and risk reports wired into Prompt Flow projects.
Vertex AI Generative AI Evaluation - Adaptive rubric-based evaluation with agent assessment, LangChain/CrewAI support, and test-driven evaluation framework.
AGIEval - Human-centric standardized exams spanning entrance tests, legal, and math scenarios.
BIG-bench - Collaborative benchmark probing reasoning, commonsense, and long-tail tasks.
CommonGen-Eval - GPT-4 judged CommonGen-lite suite for constrained commonsense text generation.
DyVal - Dynamic reasoning benchmark that varies difficulty and graph structure to stress models.
LM Evaluation Harness - Standard harness for scoring autoregressive models on dozens of tasks.
LLM-Uncertainty-Bench - Adds uncertainty-aware scoring across QA, RC, inference, dialog, and summarization.
LLMBar - Meta-eval testing whether LLM judges can spot instruction-following failures.
MMLU - Massive multitask language understanding benchmark for academic and professional subjects.
MMLU-Pro - Harder 10-choice extension focused on reasoning-rich, low-leakage questions.
PertEval - Knowledge-invariant perturbations to debias multiple-choice accuracy inflation.
SimpleBench - Fundamental reasoning benchmark where humans (83.7%) significantly outperform best AI models (62.4%).
InfiniteBench - First LLM benchmark with average data length surpassing 100K tokens across 12 tasks.
LongBench v2 - Long-context benchmark with 8k-2M word contexts and 503 challenging questions across six task categories.
LongGenBench - ICLR 2025 benchmark evaluating 16K-32K token long-form text generation quality.
LV-Eval - Long-context suite with five length tiers up to 256K tokens and distraction controls.
RULER - NVIDIA's synthetic long-context benchmark with configurable sequence length and 13 tasks across 4 categories.
FinanceBench - Industry benchmark for LLM performance on financial questions and reasoning.
FinEval - Chinese financial QA and reasoning benchmark across regulation, accounting, and markets.
HumanEval - Unit-test-based benchmark for code synthesis and docstring reasoning.
LAiW - Legal benchmark covering retrieval, foundation inference, and complex case applications in Chinese law.
MATH - Competition-level math benchmark targeting multi-step symbolic reasoning.
MBPP - Mostly Basic Programming Problems benchmark for small coding tasks.
MedHELM - Comprehensive medical LLM benchmark with 121 clinician-validated tasks and LLM-jury evaluation protocol.
AgentBench - Evaluates LLMs acting as agents across simulated domains like games and coding.
AstaBench - AI2 benchmark for scientific research AI agents covering literature review, experiment replication, and data analysis.
BrowseComp - OpenAI benchmark of 1,266 problems measuring AI agents' ability to find entangled information on the web.
ColBench - Multi-turn benchmark evaluating LLMs as collaborative coding agents with simulated human partners.
Context-Bench - Letta's benchmark for evaluating AI agent context management and memory capabilities.
DPAI Arena - JetBrains benchmark evaluating full multi-workflow, multi-language developer agents across the engineering lifecycle.
GAIA - Tool-use benchmark requiring grounded reasoning with live web access and planning.
MetaTool Tasks - Tool-calling benchmark and eval harness for agents built around LLaMA models.
SuperCLUE-Agent - Chinese agent eval covering tool use, planning, long/short-term memory, and APIs.
SWE-bench - Real-world GitHub issue resolution benchmark for coding agents.
SWE-bench Live - Continuously updated benchmark with monthly refreshes for contamination-free evaluation.
SWE-bench Pro - Enterprise-level coding benchmark with 1,865 problems across 41 repos requiring hours-to-days solutions.
Terminal-Bench - Stanford/Laude benchmark evaluating AI agents operating in sandboxed command-line environments.
ARC-AGI-2 - Next-generation reasoning benchmark where pure LLMs score 0% but humans can solve every task.
JudgeBench - ICLR 2025 benchmark for evaluating LLM-based judges on challenging response pairs across knowledge, reasoning, math, and coding.
MERLIM - 300K+ image-question pairs with focus on detecting cross-modal hallucination and hidden hallucinations.
MME - Comprehensive MLLM evaluation measuring perception and cognition across 14 subtasks.
MMMU-Pro - Harder extension of MMMU benchmark for multimodal understanding with expert-level questions.
MMT-Bench - 31K+ questions across image, text, video, and point cloud modalities with 162 subtasks.
Video-MME - CVPR 2025 benchmark for comprehensive evaluation of multimodal LLMs in video analysis.
VisualToolBench - First "think with image" benchmark evaluating MLLMs on tasks requiring active visual interaction.
AdvBench - Adversarial prompt benchmark for jailbreak and misuse resistance measurement.
BBQ - Bias-sensitive QA sets measuring stereotype reliance and ambiguous cases.
SimpleSafetyTests - 100-prompt test suite for identifying critical safety risks across five harm areas.
ToxiGen - Toxic language generation and classification benchmark for robustness checks.
TruthfulQA - Measures factuality and hallucination propensity via adversarially written questions.
ARC Prize Leaderboard - AGI reasoning leaderboard tracking ARC-AGI-2 performance across frontier models and open submissions.
CompassRank - OpenCompass leaderboard comparing frontier and research models across multi-domain suites.
LLM Agents Benchmark Collections - Aggregated leaderboard comparing multi-agent safety and reliability suites.
LMArena - Crowdsourced LLM comparison platform (formerly LMSYS Chatbot Arena) with 6M+ user votes for Elo ratings.
Open LLM Leaderboard - Hugging Face benchmark board with IFEval, MMLU-Pro, GPQA, and more.
Open Medical-LLM Leaderboard - Hugging Face leaderboard for medical domain LLM performance across healthcare benchmarks.
OpenAI Evals Registry - Community suites and scores covering accuracy, safety, and instruction following.
Scale SEAL Leaderboard - Expert-rated leaderboard covering reasoning, coding, and safety via SEAL evaluations.
AI Evals for Engineers & PMs - Cohort course from Hamel & Shreya with lifetime reader, Discord, AI Eval Assistant, and live office hours.
AlignEval - Eugene Yan's guide on building LLM judges by following methodical alignment processes.
Applied LLMs - Practical lessons from a year of building with LLMs, emphasizing evaluation as a core practice.
Data Flywheels for LLM Applications - Iterative data improvement processes for building better LLM systems.
Error Analysis & Prioritizing Next Steps - Andrew Ng walkthrough showing how to slice traces and focus eval work via classic ML techniques.
Error Analysis Before Tests - Office hours notes on why error analysis should precede writing automated tests.
Eval Tools Comparison - Detailed comparison of evaluation tools including Braintrust, LangSmith, and Promptfoo.
Evals for AI Engineers - O'Reilly book by Shreya Shankar & Hamel Husain on systematic error analysis, evaluation pipelines, and LLM-as-a-judge.
Evaluating RAG Systems - Practical guidance on RAG evaluation covering retrieval quality and generation assessment.
Field Guide to Rapidly Improving AI Products - Comprehensive guide on error analysis, data viewers, and systematic improvement from 30+ implementations.
Inspect AI Deep Dive - Technical deep dive into Inspect AI framework with hands-on examples.
KDD 2025 Tutorial: Evaluation & Benchmarking of LLM Agents - Academic tutorial covering LLM agent evaluation methodology and best practices.
LLM Evals FAQ - Comprehensive FAQ with 45+ articles covering evaluation questions from practitioners.
LLM Evaluators Survey - Survey of LLM-as-judge use cases and approaches with practical implementation patterns.
LLM-as-a-Judge Survey - Comprehensive 2025 survey on building reliable LLM-as-a-Judge systems with bias mitigation strategies.
LLM-as-a-Judge Guide - In-depth guide on using LLMs as judges for automated evaluation with calibration tips.
Mastering LLMs Open Course - Free 40+ hour course covering evals, RAG, and fine-tuning taught by 25+ industry practitioners.
Modern IR Evals For RAG - Why traditional IR evals are insufficient for RAG, covering BEIR and modern approaches.
Multi-Turn Chat Evals - Strategies for evaluating multi-turn conversational AI systems.
Open Source LLM Tools Comparison - PostHog comparison of open-source LLM observability and evaluation tools.
Scoping LLM Evals - Case study on managing evaluation complexity through proper scoping and topic distribution.
Why AI evals are the hottest new skill - Lenny's interview covering error analysis, axial coding, eval prompts, and PRD alignment.
Your AI Product Needs Evals - Foundational article on why every AI product needs systematic evaluation.
Arize Phoenix AI Chatbot - Next.js chatbot with Phoenix tracing, dataset replays, and evaluation jobs.
Azure LLM Evaluation Samples - Prompt Flow and Azure AI Foundry projects demonstrating hosted evals.
Deepchecks QA over CSV - Example agent wired to Deepchecks scoring plus tracing dashboards.
OpenAI Evals Demo Evals - Templates for extending OpenAI Evals with custom datasets.
Promptfoo Examples - Ready-made prompt regression suites for RAG, summarization, and agents.
ZenML Projects - End-to-end pipelines showing how to weave evaluation steps into LLMOps stacks.
Awesome ChainForge - Ecosystem list centered on ChainForge experiments and extensions.
Awesome-LLM-Eval - Cross-lingual (Chinese) compendium of eval tooling, papers, datasets, and leaderboards.
Awesome LLMOps - Curated tooling for training, deployment, and monitoring of LLM apps.
Awesome Machine Learning - Language-specific ML resources that often host evaluation building blocks.
Awesome-Multimodal-Large-Language-Models - Latest advances on multimodal LLMs including evaluation benchmarks and surveys.
Awesome RAG - Broad coverage of retrieval-augmented generation techniques and tools.
Awesome Self-Hosted - Massive catalog of self-hostable software, including observability stacks.
GenAI Notes - Continuously updated notes and resources on GenAI systems, evaluation, and operations.
Contributions are welcome—please read CONTRIBUTING.md for scope, entry rules, and the pull-request checklist before submitting updates.