Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of multi-modal AI agents.
An agent benchmark with tasks in a simulated software company.
Frontier Models playing the board game Diplomacy.
Ranking LLMs on agentic tasks
🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems
Benchmark for evaluating AI epistemic reliability - testing how well LLMs handle uncertainty, avoid hallucinations, and acknowledge what they don't know.
GTA (Guess The Algorithm) Benchmark - A tool for testing AI reasoning capabilities
This repository contains the results and code for the MLPerf™ Storage v2.0 benchmark.
MindTrial: Evaluate and compare AI language models (LLMs) on text-based tasks with optional file/image attachments and tool use. Supports multiple providers (OpenAI, Google, Anthropic, DeepSeek, Mistral AI, xAI, Alibaba, Moonshot AI, OpenRouter), custom tasks in YAML, and HTML/CSV reports.
The first comprehensive benchmark for evaluating AI coding agents on Salesforce development tasks. Tests Apex, LWC, Flows, and more.
La Perf is a framework for AI performance benchmarking, covering LLMs, VLMs, and embeddings, with power-metrics collection.
Python Performance Tester & More...
A curated list of evaluation tools, benchmark datasets, leaderboards, frameworks, and resources for assessing model performance.
TrustyAI's LMEval provider for Llama Stack
PlayBench is a platform that evaluates AI models by having them compete in various games and creative tasks. Unlike traditional benchmarks that focus on text generation quality or factual knowledge, PlayBench tests models on skills like strategic thinking, pattern recognition, and creative problem-solving.
🥗 Scramble text while keeping first and last letters intact (Cambridge effect). Includes CLI, scoring tools, and AI decoding benchmarks.
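As an illustration of the "Cambridge effect" scramble that description refers to, here is a minimal Python sketch that shuffles only the interior letters of each word; the function names and seeding approach are illustrative, not taken from the repository.

```python
import random
import re

def scramble_word(word: str, rng: random.Random) -> str:
    """Shuffle the interior letters of a word, keeping the first and last fixed."""
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def scramble_text(text: str, seed: int = 0) -> str:
    """Apply the scramble to every alphabetic token in the text."""
    rng = random.Random(seed)
    return re.sub(r"[A-Za-z]+", lambda m: scramble_word(m.group(0), rng), text)

print(scramble_text("Scramble text while keeping first and last letters intact"))
```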
Evaluation system for biomedical relation extraction using the BioRED dataset schema. The system sends document passages to OpenRouter-hosted LLMs, extracts relations, compares results against ground-truth annotations, and maintains a persistent CSV database of results.
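A hedged sketch of the evaluation loop described above, using OpenRouter's OpenAI-compatible chat-completions endpoint; the prompt wording, model name, and CSV columns are assumptions for illustration, not taken from the repository.

```python
import csv
import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"  # OpenAI-compatible endpoint

def extract_relations(passage: str, model: str = "openai/gpt-4o-mini") -> str:
    """Send one document passage to an OpenRouter-hosted LLM and return its raw answer.
    The prompt and model name here are placeholders."""
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,
            "messages": [
                {"role": "system", "content": "Extract biomedical relations as (entity1, relation, entity2) triples."},
                {"role": "user", "content": passage},
            ],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def append_result(path: str, doc_id: str, prediction: str, matches_gold: bool) -> None:
    """Persist one evaluation row to the running CSV database (illustrative schema)."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([doc_id, prediction, matches_gold])
```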
NQMP is a tiny, deterministic LLM benchmark focused on logical sensitivity to small prompt flips.
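The idea of measuring sensitivity to small prompt flips can be illustrated with a toy consistency check; `ask_model` and the example flip pairs below are hypothetical stand-ins, not NQMP's actual tasks or metric.

```python
from typing import Callable

# Pairs of prompts that differ by one small logical "flip"; a robust model
# should change its answer exactly when the flip logically requires it.
FLIP_PAIRS = [
    ("Is 17 an even number? Answer yes or no.",
     "Is 17 an odd number? Answer yes or no."),
    ("If all cats are mammals, is a cat necessarily a mammal? Answer yes or no.",
     "If no cats are mammals, is a cat necessarily a mammal? Answer yes or no."),
]

def flip_sensitivity(ask_model: Callable[[str], str]) -> float:
    """Fraction of pairs where the model's answer actually changes after the flip."""
    flipped = 0
    for prompt_a, prompt_b in FLIP_PAIRS:
        if ask_model(prompt_a).strip().lower() != ask_model(prompt_b).strip().lower():
            flipped += 1
    return flipped / len(FLIP_PAIRS)
```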
🌐 Manage AI coding sessions seamlessly with AgentOS, a mobile-first web UI designed for efficient and user-friendly interaction.
WordleBench — Deterministic AI Wordle benchmark. Compare 34+ LLMs (GPT-5, Claude 4.5, Gemini, Grok, Llama) head-to-head on accuracy, speed, and cost across 50 standardized words.
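For context, the deterministic core of any Wordle benchmark is the feedback rule itself; below is a standard Python implementation of the green/yellow/gray scoring with duplicate-letter handling, not code taken from WordleBench.

```python
from collections import Counter

def score_guess(guess: str, answer: str) -> str:
    """Return Wordle feedback: G = right letter/right spot, Y = right letter/wrong spot, . = absent."""
    feedback = ["."] * len(guess)
    remaining = Counter()
    # First pass: mark greens and count answer letters not consumed by them.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            feedback[i] = "G"
        else:
            remaining[a] += 1
    # Second pass: mark yellows only while unconsumed copies of the letter remain.
    for i, g in enumerate(guess):
        if feedback[i] == "." and remaining[g] > 0:
            feedback[i] = "Y"
            remaining[g] -= 1
    return "".join(feedback)

assert score_guess("crane", "caber") == "GYY.Y"
```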