# Awesome AI Eval

A curated list of tools, methods & platforms for evaluating AI quality in real applications.


Evaluation is how you know whether your AI actually works (and isn't hallucinating). This list covers the frameworks, benchmarks, datasets, and platforms you need to test LLMs, debug RAG pipelines, and monitor autonomous agents in production, organized by what you're trying to measure and how.

## Contents


## Tools

### Evaluators and Test Harnesses

#### Core Frameworks

- Aleph Alpha Eval Framework - Production-ready evaluation framework with 90+ pre-loaded benchmarks for reasoning, coding, and safety.
- Anthropic Model Evals - Anthropic's evaluation suite for safety, capabilities, and alignment testing of language models.
- Bloom - Anthropic's open-source agentic framework for automated behavioral evaluations of frontier AI models.
- ColossalEval - Unified pipeline for classic metrics plus GPT-assisted scoring across public datasets.
- DeepEval - Python unit-test style metrics for hallucination, relevance, toxicity, and bias.
- Hugging Face lighteval - Toolkit powering HF leaderboards with 1k+ tasks and pluggable metrics.
- Inspect AI - UK AI Safety Institute framework for scripted eval plans, tool calls, and model-graded rubrics.
- lmms-eval - One-for-all multimodal evaluation toolkit supporting 100+ tasks across text, image, video, and audio.
- MLflow Evaluators - Eval API that logs LLM scores next to classic experiment tracking runs.
- OpenAI Evals - Reference harness plus registry spanning reasoning, extraction, and safety evals.
- OpenCompass - Research harness with CascadeEvaluator, CompassRank syncing, and LLM-as-judge utilities.
- Prompt Flow - Flow builder with built-in evaluation DAGs, dataset runners, and CI hooks.
- Promptfoo - Local-first CLI and dashboard for evaluating prompts, RAG flows, and agents with cost tracking and regression detection.
- Ragas - Evaluation library that grades answers, context, and grounding with pluggable scorers.
- TruLens - Feedback function framework for chains and agents with customizable judge models.
- W&B Weave Evaluations - Managed evaluation orchestrator with dataset versioning and dashboards.
- ZenML - Pipeline framework that bakes evaluation steps and guardrail metrics into LLM workflows.
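
Most of these frameworks, DeepEval's "unit-test style" in particular, boil down to the same pattern: run a model over a dataset, score each output with a metric, and assert a pass threshold. A minimal framework-free sketch of that pattern, where `fake_model` is a purely illustrative stand-in for a real LLM call:

```python
# Minimal "unit-test style" eval loop: score model outputs against
# references and assert a pass threshold -- the pattern frameworks like
# DeepEval or OpenAI Evals formalize. `fake_model` is a hypothetical
# stand-in for an LLM API call.

def fake_model(prompt: str) -> str:
    # Canned answers for illustration only.
    return {"capital of France?": "Paris", "2 + 2?": "4"}.get(prompt, "unknown")

def exact_match(prediction: str, reference: str) -> float:
    # Simplest possible metric; real suites add judges, rubrics, etc.
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0

def run_eval(model, dataset, metric, threshold=0.8):
    scores = [metric(model(case["input"]), case["expected"]) for case in dataset]
    mean = sum(scores) / len(scores)
    return {"mean": mean, "passed": mean >= threshold, "scores": scores}

dataset = [
    {"input": "capital of France?", "expected": "Paris"},
    {"input": "2 + 2?", "expected": "4"},
]
report = run_eval(fake_model, dataset, exact_match)
```

Swapping `exact_match` for an LLM-graded scorer is exactly the step where these frameworks earn their keep (prompt templates, caching, cost tracking).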

#### Application and Agent Harnesses

- Athina AI - SOC-2 compliant LLM evaluation and monitoring platform with 50+ preset evaluations and VPC deployment.
- Braintrust - Hosted evaluation workspace with CI-style regression tests, agent sandboxes, and token cost tracking.
- LangSmith - Hosted tracing plus datasets, batched evals, and regression gating for LangChain apps.
- Parea AI - Developer tools for evaluating, testing, and monitoring LLM-powered applications with actionable insights.
- Patronus AI - Evaluation platform with multimodal LLM-as-judge, hallucination detection, and industry benchmarks like FinanceBench.
- W&B Prompt Registry - Prompt evaluation templates with reproducible scoring and reviews.

### RAG and Retrieval

#### RAG Frameworks

- EvalScope RAG - Guides and templates that extend Ragas-style metrics with domain rubrics.
- LlamaIndex Evaluation - Modules for replaying queries, scoring retrievers, and comparing query engines.
- Open RAG Eval - Vectara harness with UMBRELA and AutoNuggetizer metrics that don't require golden answers.
- RAGEval - Framework that auto-generates corpora, questions, and RAG rubrics for completeness.
- R-Eval - Toolkit for robust RAG scoring aligned with the Evaluation of RAG survey taxonomy.
- UltraRAG - MCP-based RAG development framework with built-in evaluation workflows and multimodal support.

#### Retrieval Benchmarks

- BEIR - Benchmark suite covering dense, sparse, and hybrid retrieval tasks.
- ColBERT - Late-interaction dense retriever with evaluation scripts for IR datasets.
- MTEB - Embeddings benchmark measuring retrieval, reranking, and similarity quality.
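
Suites like BEIR and MTEB report rank-based metrics over each query's retrieved list. A toy sketch of two of the most common ones, recall@k and mean reciprocal rank (the data here is invented; real harnesses read qrels files and handle graded relevance):

```python
# Recall@k and MRR (mean reciprocal rank) over ranked retrieval results,
# the kind of metrics retrieval benchmarks report. Toy data only.

def recall_at_k(ranked_ids, relevant_ids, k):
    # Fraction of relevant docs that appear in the top k results.
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mrr(runs):
    # runs: list of (ranked_ids, relevant_ids) pairs, one per query.
    total = 0.0
    for ranked_ids, relevant_ids in runs:
        rr = 0.0
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                rr = 1.0 / rank  # reciprocal rank of the first hit
                break
        total += rr
    return total / len(runs)

runs = [
    (["d3", "d1", "d7"], {"d1"}),        # relevant doc at rank 2
    (["d2", "d5", "d9"], {"d9", "d4"}),  # first relevant doc at rank 3
]
r_at_3 = recall_at_k(runs[0][0], runs[0][1], k=3)  # 1.0
score = mrr(runs)  # (1/2 + 1/3) / 2
```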

#### RAG Datasets and Surveys

### Prompt Evaluation & Safety

- AlpacaEval - Automated instruction-following evaluator with length-controlled LLM judge scoring.
- ChainForge - Visual IDE for comparing prompts, sampling models, and scoring batches with rubrics.
- Guardrails AI - Declarative validation framework that enforces schemas, correction chains, and judgments.
- Lakera Guard - Hosted prompt security platform with red-team datasets for jailbreak and injection testing.
- PromptBench - Benchmark suite for adversarial prompt stress tests across diverse tasks.
- Red Teaming Handbook - Microsoft playbook for adversarial prompt testing and mitigation patterns.

### Red Teaming & Adversarial Testing

- ARTKIT - Automated multi-turn red teaming framework that simulates attacker-target interactions for jailbreak testing.
- DeepTeam - Open-source LLM red teaming framework testing for bias, data exposure, and prompt injection vulnerabilities.
- Garak - NVIDIA's adversarial testing toolkit with 100+ attack modules for prompt injection and data extraction.
- PyRIT - Microsoft's Python Risk Identification Toolkit for orchestrating LLM attack suites and red team automation.

### Datasets and Methodology


## Platforms

### Open Source Platforms

- Agenta - End-to-end LLM developer platform for prompt engineering, evaluation, and deployment.
- Arize Phoenix - OpenTelemetry-native observability and evaluation toolkit for RAG, LLMs, and agents.
- DocETL - ETL system for complex document processing with LLMs and built-in quality checks.
- Giskard - Testing framework for ML models with vulnerability scanning and LLM-specific detectors.
- Helicone - Open-source LLM observability platform with cost tracking, caching, and evaluation tools.
- Langfuse - Open-source LLM engineering platform providing tracing, eval dashboards, and prompt analytics.
- Lilac - Data curation tool for exploring and enriching datasets with semantic search and clustering.
- LiteLLM - Unified API for 100+ LLM providers with cost tracking, fallbacks, and load balancing.
- Lunary - Production toolkit for LLM apps with tracing, prompt management, and evaluation pipelines.
- Mirascope - Python toolkit for building LLM applications with structured outputs and evaluation utilities.
- OpenLIT - Telemetry instrumentation for LLM apps with built-in quality metrics and guardrail hooks.
- OpenLLMetry - OpenTelemetry instrumentation for LLM traces that feed any backend or custom eval logic.
- Opik - Self-hostable evaluation and observability hub with datasets, scoring jobs, and interactive traces.
- Rhesis - Collaborative testing platform with automated test generation and multi-turn conversation simulation for LLM and agentic applications.
- traceAI - Open-source multi-modal tracing and diagnostics framework for LLM, RAG, and agent workflows built on OpenTelemetry.
- UpTrain - OSS/hosted evaluation suite with 20+ checks, RCA tooling, and LlamaIndex integrations.
- VoltAgent - TypeScript agent framework paired with VoltOps for trace inspection and regression testing.
- Zeno - Data-centric evaluation UI for slicing failures, comparing prompts, and debugging retrieval quality.

### Hosted Platforms

- ChatIntel - Conversation analytics platform for evaluating chatbot quality, sentiment, and user satisfaction.
- Confident AI - DeepEval-backed platform for scheduled eval suites, guardrails, and production monitors.
- Datadog LLM Observability - Datadog module capturing LLM traces, metrics, and safety signals.
- Deepchecks LLM Evaluation - Managed eval suites with dataset versioning, dashboards, and alerting.
- Eppo - Experimentation platform with AI-specific evaluation metrics and statistical rigor for LLM A/B testing.
- Future AGI - Multi-modal evaluation, simulation, and optimization platform for reliable AI systems across software and hardware.
- Galileo - Evaluation and data-curation studio with labeling, slicing, and issue triage.
- HoneyHive - Evaluation and observability platform with prompt versioning, A/B testing, and fine-tuning workflows.
- Humanloop - Production prompt management with human-in-the-loop evals and annotation queues.
- Maxim AI - Evaluation and observability platform focusing on agent simulations and monitoring.
- Orq.ai - LLM operations platform with prompt management, evaluation workflows, and deployment pipelines.
- PostHog LLM Analytics - Product analytics toolkit extended to track custom LLM events and metrics.
- PromptLayer - Prompt engineering platform with version control, evaluation tracking, and team collaboration.

### Cloud Platforms


## Benchmarks

### General

- AGIEval - Human-centric standardized exams spanning entrance tests, legal, and math scenarios.
- BIG-bench - Collaborative benchmark probing reasoning, commonsense, and long-tail tasks.
- CommonGen-Eval - GPT-4 judged CommonGen-lite suite for constrained commonsense text generation.
- DyVal - Dynamic reasoning benchmark that varies difficulty and graph structure to stress models.
- LM Evaluation Harness - Standard harness for scoring autoregressive models on dozens of tasks.
- LLM-Uncertainty-Bench - Adds uncertainty-aware scoring across QA, reading comprehension, inference, dialogue, and summarization.
- LLMBar - Meta-eval testing whether LLM judges can spot instruction-following failures.
- MMLU - Massive multitask language understanding benchmark for academic and professional subjects.
- MMLU-Pro - Harder 10-choice extension focused on reasoning-rich, low-leakage questions.
- PertEval - Knowledge-invariant perturbations to debias multiple-choice accuracy inflation.
- SimpleBench - Fundamental reasoning benchmark where humans (83.7%) significantly outperform the best AI models (62.4%).

### Long Context

- InfiniteBench - First LLM benchmark with average data length surpassing 100K tokens across 12 tasks.
- LongBench v2 - Long-context benchmark with 8k-2M word contexts and 503 challenging questions across six task categories.
- LongGenBench - ICLR 2025 benchmark evaluating 16K-32K token long-form text generation quality.
- LV-Eval - Long-context suite with five length tiers up to 256K tokens and distraction controls.
- RULER - NVIDIA's synthetic long-context benchmark with configurable sequence length and 13 tasks across 4 categories.

### Domain

- FinanceBench - Industry benchmark for LLM performance on financial questions and reasoning.
- FinEval - Chinese financial QA and reasoning benchmark across regulation, accounting, and markets.
- HumanEval - Unit-test-based benchmark for code synthesis and docstring reasoning.
- LAiW - Legal benchmark covering retrieval, foundation inference, and complex case applications in Chinese law.
- MATH - Competition-level math benchmark targeting multi-step symbolic reasoning.
- MBPP - Mostly Basic Programming Problems benchmark for small coding tasks.
- MedHELM - Comprehensive medical LLM benchmark with 121 clinician-validated tasks and LLM-jury evaluation protocol.
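
Code benchmarks like HumanEval and MBPP are usually reported as pass@k: the probability that at least one of k sampled solutions passes the unit tests. The standard unbiased estimator (from the paper that introduced HumanEval) is `1 - C(n-c, k) / C(n, k)` for n samples with c passing; a short sketch:

```python
# Unbiased pass@k estimator for code benchmarks: given n generated
# samples per problem with c passing the tests, estimate the chance
# that a random draw of k samples contains at least one pass.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        # Fewer failures than draws: every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# One problem, 10 samples, 3 of which pass:
p1 = pass_at_k(n=10, c=3, k=1)  # 0.3
p5 = pass_at_k(n=10, c=3, k=5)
```

Averaging `pass_at_k` over all problems gives the headline number; computing it naively as `1 - (1 - c/n)**k` is biased, which is why harnesses use the combinatorial form.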

### Agent

- AgentBench - Evaluates LLMs acting as agents across simulated domains like games and coding.
- AstaBench - AI2 benchmark for scientific research AI agents covering literature review, experiment replication, and data analysis.
- BrowseComp - OpenAI benchmark of 1,266 problems measuring AI agents' ability to find entangled information on the web.
- ColBench - Multi-turn benchmark evaluating LLMs as collaborative coding agents with simulated human partners.
- Context-Bench - Letta's benchmark for evaluating AI agent context management and memory capabilities.
- DPAI Arena - JetBrains benchmark evaluating full multi-workflow, multi-language developer agents across the engineering lifecycle.
- GAIA - Tool-use benchmark requiring grounded reasoning with live web access and planning.
- MetaTool Tasks - Tool-calling benchmark and eval harness for agents built around LLaMA models.
- SuperCLUE-Agent - Chinese agent eval covering tool use, planning, long/short-term memory, and APIs.
- SWE-bench - Real-world GitHub issue resolution benchmark for coding agents.
- SWE-bench Live - Continuously updated benchmark with monthly refreshes for contamination-free evaluation.
- SWE-bench Pro - Enterprise-level coding benchmark with 1,865 problems across 41 repos requiring hours-to-days solutions.
- Terminal-Bench - Stanford/Laude benchmark evaluating AI agents operating in sandboxed command-line environments.

### Reasoning

- ARC-AGI-2 - Next-generation reasoning benchmark where pure LLMs score 0% but humans can solve every task.
- JudgeBench - ICLR 2025 benchmark for evaluating LLM-based judges on challenging response pairs across knowledge, reasoning, math, and coding.
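
Judge meta-evals like LLMBar and JudgeBench ultimately reduce to one statistic: how often the judge's pick on a response pair agrees with the gold label, overall and per category. A minimal sketch with made-up verdicts:

```python
# Judge meta-evaluation: agreement rate between an LLM judge's pairwise
# picks and gold labels, broken down by category. The records below are
# invented for illustration.
from collections import defaultdict

def judge_agreement(records):
    # records: dicts with "category", "judge_pick", "gold_pick" ("A"/"B").
    per_cat = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for r in records:
        per_cat[r["category"]][0] += int(r["judge_pick"] == r["gold_pick"])
        per_cat[r["category"]][1] += 1
    overall = sum(c for c, _ in per_cat.values()) / sum(t for _, t in per_cat.values())
    return overall, {cat: c / t for cat, (c, t) in per_cat.items()}

records = [
    {"category": "math",   "judge_pick": "A", "gold_pick": "A"},
    {"category": "math",   "judge_pick": "B", "gold_pick": "A"},
    {"category": "coding", "judge_pick": "B", "gold_pick": "B"},
    {"category": "coding", "judge_pick": "B", "gold_pick": "B"},
]
overall, by_category = judge_agreement(records)
```

Real meta-evals also control for position bias by swapping response order and counting inconsistent verdicts as errors.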

### Multimodal

- MERLIM - 300K+ image-question pairs with a focus on detecting cross-modal and hidden hallucinations.
- MME - Comprehensive MLLM evaluation measuring perception and cognition across 14 subtasks.
- MMMU-Pro - Harder extension of the MMMU benchmark for multimodal understanding with expert-level questions.
- MMT-Bench - 31K+ questions across image, text, video, and point cloud modalities with 162 subtasks.
- Video-MME - CVPR 2025 benchmark for comprehensive evaluation of multimodal LLMs in video analysis.
- VisualToolBench - First "think with image" benchmark evaluating MLLMs on tasks requiring active visual interaction.

### Safety

- AdvBench - Adversarial prompt benchmark for jailbreak and misuse resistance measurement.
- BBQ - Bias-sensitive QA sets measuring stereotype reliance and ambiguous cases.
- SimpleSafetyTests - 100-prompt test suite for identifying critical safety risks across five harm areas.
- ToxiGen - Toxic language generation and classification benchmark for robustness checks.
- TruthfulQA - Measures factuality and hallucination propensity via adversarially written questions.
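
Jailbreak suites like AdvBench typically report attack success rate (ASR): the fraction of adversarial prompts whose responses are not refusals. A toy sketch using a naive keyword refusal check (real evaluations use trained classifiers or LLM judges, and the responses here are invented):

```python
# Attack success rate (ASR): share of responses to adversarial prompts
# that are NOT refusals. The keyword heuristic below is a deliberately
# crude stand-in for a refusal classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "as an ai")

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def attack_success_rate(responses) -> float:
    successes = sum(1 for r in responses if not is_refusal(r))
    return successes / len(responses)

responses = [
    "I can't help with that request.",
    "Sure, here is how you would do it...",
    "As an AI, I won't assist with this.",
    "Step 1: ...",
]
asr = attack_success_rate(responses)  # 2 of 4 responses are non-refusals
```

Keyword matching over-counts both ways (polite partial answers, refusals phrased unusually), which is why red-teaming toolkits ship dedicated scorers for this step.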

## Leaderboards

- ARC Prize Leaderboard - AGI reasoning leaderboard tracking ARC-AGI-2 performance across frontier models and open submissions.
- CompassRank - OpenCompass leaderboard comparing frontier and research models across multi-domain suites.
- LLM Agents Benchmark Collections - Aggregated leaderboard comparing multi-agent safety and reliability suites.
- LMArena - Crowdsourced LLM comparison platform (formerly LMSYS Chatbot Arena) with 6M+ user votes for Elo ratings.
- Open LLM Leaderboard - Hugging Face benchmark board with IFEval, MMLU-Pro, GPQA, and more.
- Open Medical-LLM Leaderboard - Hugging Face leaderboard for medical domain LLM performance across healthcare benchmarks.
- OpenAI Evals Registry - Community suites and scores covering accuracy, safety, and instruction following.
- Scale SEAL Leaderboard - Expert-rated leaderboard covering reasoning, coding, and safety via SEAL evaluations.

## Resources

### Guides & Training

### Examples

### Related Collections

- Awesome ChainForge - Ecosystem list centered on ChainForge experiments and extensions.
- Awesome-LLM-Eval - Cross-lingual (Chinese) compendium of eval tooling, papers, datasets, and leaderboards.
- Awesome LLMOps - Curated tooling for training, deployment, and monitoring of LLM apps.
- Awesome Machine Learning - Language-specific ML resources that often host evaluation building blocks.
- Awesome-Multimodal-Large-Language-Models - Latest advances on multimodal LLMs including evaluation benchmarks and surveys.
- Awesome RAG - Broad coverage of retrieval-augmented generation techniques and tools.
- Awesome Self-Hosted - Massive catalog of self-hostable software, including observability stacks.
- GenAI Notes - Continuously updated notes and resources on GenAI systems, evaluation, and operations.

## Contributing

Contributions are welcome! Please read CONTRIBUTING.md for scope, entry rules, and the pull-request checklist before submitting updates.

✌️