Thanks for helping improve this list! Follow the guidance below to keep the repo ready for sindresorhus/awesome review.
This list only includes:
- Actively maintained evaluation frameworks, datasets, or platforms focused on LLMs, RAG, agents, or prompt safety.
- Publicly documented services (open-source or commercial) with clear usage instructions.
- Research papers, surveys, or datasets that directly inform evaluation methodology.
It explicitly excludes:
- Generic ML libraries without evaluation components.
- Vendor landing pages without actionable docs or public access.
- Dead, archived, or unmaintained projects.
Each entry follows this format (badge first for visual alignment):
- ![Badge](badge-url) [**Name**](https://link) - Description.
- Keep every bullet to one sentence ending with a period.
- Use bold for the title inside the link: [**Name**](url).
- Place the badge before the link for consistent visual alignment.
- Alphabetize entries within every subsection.
- Prefer HTTPS links without tracking parameters.
- Avoid marketing adjectives ("best", "awesome", "super powerful", etc.).
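The formatting rules above can be checked mechanically. Here is a minimal sketch of such a check; the `ENTRY_RE` pattern and `check_entries` helper are illustrative, not part of this repo's tooling, and assume the badge-then-link entry shape described above.

```python
import re

# Illustrative pattern for one list entry: non-clickable shields.io badge,
# bold linked title, " - " separator, one sentence ending with a period.
ENTRY_RE = re.compile(
    r"^- !\[[^\]]*\]\(https://img\.shields\.io/[^)]+\) "  # badge image first
    r"\[\*\*(?P<name>[^*]+)\*\*\]\(https://[^)]+\) "      # bold title inside an HTTPS link
    r"- (?P<desc>[^.]+\.)$"                               # one sentence, ends with a period
)

def check_entries(lines):
    """Return a list of problems found in one subsection's bullet lines."""
    problems = []
    names = []
    for line in lines:
        m = ENTRY_RE.match(line)
        if not m:
            problems.append(f"bad format: {line}")
            continue
        names.append(m.group("name"))
    # Entries must be alphabetized within the subsection.
    if names != sorted(names, key=str.lower):
        problems.append("entries are not alphabetized")
    return problems
```

A real linter (awesome-lint) covers more ground; this only illustrates the entry shape and alphabetization rules.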
Every entry includes a non-clickable badge before the title. Here's how to create them:
Use the shields.io GitHub stars badge with a `github.com` label:

`![GitHub stars](https://img.shields.io/github/stars/<owner>/<repo>?label=github.com)`

Example: Adding https://github.com/anthropics/evals
- Extract owner and repo: `anthropics/evals`
- Build badge: `![GitHub stars](https://img.shields.io/github/stars/anthropics/evals?label=github.com)`
- Final entry:
- ![GitHub stars](https://img.shields.io/github/stars/anthropics/evals?label=github.com) [**Anthropic Model Evals**](https://github.com/anthropics/evals) - Evaluation suite for safety, capabilities, and alignment testing.
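The owner/repo extraction and badge construction above can be sketched in a few lines; `github_badge` is an illustrative helper name, not a script shipped with this repo.

```python
from urllib.parse import urlparse

def github_badge(repo_url: str) -> str:
    """Build a shields.io GitHub-stars badge (Markdown image) from a repo URL."""
    # Path of https://github.com/<owner>/<repo> is "/<owner>/<repo>"
    path = urlparse(repo_url).path.strip("/")
    owner, repo = path.split("/")[:2]
    return (f"![GitHub stars](https://img.shields.io/github/stars/"
            f"{owner}/{repo}?label=github.com)")
```

For example, `github_badge("https://github.com/anthropics/evals")` produces the badge shown in the final entry above.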
Use a static shields.io badge with the domain name:

`![Docs](https://img.shields.io/badge/<domain>-blue)`

Example: Adding https://docs.llamaindex.ai/en/stable/module_guides/evaluating/
- Extract domain: `docs.llamaindex.ai`
- Build badge: `![Docs](https://img.shields.io/badge/docs.llamaindex.ai-blue)`
- Final entry:
- ![Docs](https://img.shields.io/badge/docs.llamaindex.ai-blue) [**LlamaIndex Evaluation**](https://docs.llamaindex.ai/en/stable/module_guides/evaluating/) - Modules for replaying queries and comparing query engines.
| Link type | Badge template |
|---|---|
| GitHub repo | `![GitHub stars](https://img.shields.io/github/stars/<owner>/<repo>?label=github.com)` |
| Documentation | `![Docs](https://img.shields.io/badge/<domain>-blue)` |
| arXiv paper | `![arXiv](https://img.shields.io/badge/arXiv-<paper-id>-b31b1b)` |
| Hugging Face | `![Hugging Face](https://img.shields.io/badge/huggingface.co-yellow)` |
Before opening a PR, confirm each item:
- The new or updated entry falls within scope and is currently maintained.
- The entry is placed in the single most relevant section and alphabetized.
- The description is under ~120 characters, single-sentence, and free of fluff.
- All sections listed in `## Contents` exist and match the heading text exactly.
- No duplicate entries exist elsewhere in the README.
- `pnpm lint` (awesome-lint) exits with no errors.
- Edit `README.md` and keep sections balanced; merge or split sections sparingly.
- Verify the project's docs and repo are reachable (no 404s) and HTTPS-only.
- Keep descriptions concrete: mention metrics, modalities, or differentiating features.
- Run the lint command below and fix any reported issues.
- Update `CONTRIBUTING.md` or `README.md` if rules evolve.
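Two checklist items above (descriptions under ~120 characters, no duplicate entries) are easy to pre-check before opening a PR. A minimal sketch, assuming the entry format described earlier; `check_readme` and `LINK_RE` are illustrative names, not part of the repo's lint setup.

```python
import re

# Matches the bold linked title and captures its URL.
LINK_RE = re.compile(r"\[\*\*[^*]+\*\*\]\((https://[^)]+)\)")

def check_readme(markdown: str, max_desc_len: int = 120):
    """Flag over-long descriptions and duplicate entry links."""
    problems = []
    seen = set()
    for line in markdown.splitlines():
        m = LINK_RE.search(line)
        if not m:
            continue
        url = m.group(1)
        if url in seen:
            problems.append(f"duplicate entry: {url}")
        seen.add(url)
        # Everything after ") - " is the description.
        desc = line.split(") - ", 1)[-1]
        if len(desc) > max_desc_len:
            problems.append(f"description too long ({len(desc)} chars): {url}")
    return problems
```

This complements, rather than replaces, the `pnpm lint` run required by the checklist.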
Here's a complete example of adding a new tool to the "Core Frameworks" section:
Before (existing entries):
#### Core Frameworks
- ![GitHub stars](https://img.shields.io/github/stars/confident-ai/deepeval?label=github.com) [**DeepEval**](https://github.com/confident-ai/deepeval) - Python unit-test style metrics for hallucination, relevance, toxicity, and bias.
- ![GitHub stars](https://img.shields.io/github/stars/explodinggradients/ragas?label=github.com) [**Ragas**](https://github.com/explodinggradients/ragas) - Evaluation library that grades answers, context, and grounding with pluggable scorers.

After (with your new entry alphabetized):
#### Core Frameworks
- ![GitHub stars](https://img.shields.io/github/stars/confident-ai/deepeval?label=github.com) [**DeepEval**](https://github.com/confident-ai/deepeval) - Python unit-test style metrics for hallucination, relevance, toxicity, and bias.
- ![GitHub stars](https://img.shields.io/github/stars/org/mynewtool?label=github.com) [**MyNewTool**](https://github.com/org/mynewtool) - Brief description of what makes this tool useful for evaluation.
- ![GitHub stars](https://img.shields.io/github/stars/explodinggradients/ragas?label=github.com) [**Ragas**](https://github.com/explodinggradients/ragas) - Evaluation library that grades answers, context, and grounding with pluggable scorers.

The `pnpm lint` script runs awesome-lint locally and enforces Markdown style, heading order, link validity, and Awesome-specific rules. Fix every reported issue before pushing.
- Re-run `pnpm lint` when dependencies change or Markdown tooling is upgraded.
- Sweep for abandoned repos (no commits for 12 months) and remove them promptly.
- Close stale PRs or issues that add low-signal items.
- Prefer PRs that include supporting evidence (blog post, release notes, or adoption proof).
- Encourage contributors to add datasets, not just tools, to keep the list balanced.