Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of multi-modal AI agents.
An agent benchmark with tasks in a simulated software company.
Frontier Models playing the board game Diplomacy.
Ranking LLMs on agentic tasks
🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems
Benchmark for evaluating AI epistemic reliability - testing how well LLMs handle uncertainty, avoid hallucinations, and acknowledge what they don't know.
GTA (Guess The Algorithm) Benchmark - A tool for testing AI reasoning capabilities
This repository contains the results and code for the MLPerf™ Storage v2.0 benchmark.
MindTrial: Evaluate and compare AI language models (LLMs) on text-based tasks with optional file/image attachments and tool use. Supports multiple providers (OpenAI, Google, Anthropic, DeepSeek, Mistral AI, xAI, Alibaba, Moonshot AI, OpenRouter), custom tasks in YAML, and HTML/CSV reports.
The first comprehensive benchmark for evaluating AI coding agents on Salesforce development tasks. Tests Apex, LWC, Flows, and more.
La Perf is a framework for AI performance benchmarking, covering LLMs, VLMs, and embeddings, with power-metrics collection.
Python Performance Tester & More...
A curated list of evaluation tools, benchmark datasets, leaderboards, frameworks, and resources for assessing model performance.
TrustyAI's LMEval provider for Llama Stack
PlayBench is a platform that evaluates AI models by having them compete in various games and creative tasks. Unlike traditional benchmarks that focus on text generation quality or factual knowledge, PlayBench tests models on skills like strategic thinking, pattern recognition, and creative problem-solving.
🥗 Scramble text while keeping first and last letters intact (Cambridge effect). Includes CLI, scoring tools, and AI decoding benchmarks.
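As an illustration of the "Cambridge effect" scramble that description refers to, here is a minimal Python sketch that shuffles only the interior letters of each word; the function names and seeding approach are illustrative, not taken from the repository.

```python
import random
import re

def scramble_word(word: str, rng: random.Random) -> str:
    """Shuffle the interior letters of a word, keeping the first and last fixed."""
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def scramble_text(text: str, seed: int = 0) -> str:
    """Apply the scramble to every alphabetic token in the text."""
    rng = random.Random(seed)
    return re.sub(r"[A-Za-z]+", lambda m: scramble_word(m.group(0), rng), text)

print(scramble_text("Scramble text while keeping first and last letters intact"))
```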
Evaluation system for biomedical relation extraction using the BioRED dataset schema. The system sends document passages to OpenRouter-hosted LLMs, extracts relations, compares results against ground-truth annotations, and maintains a persistent CSV database of results.
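A hedged sketch of the evaluation loop described above, using OpenRouter's OpenAI-compatible chat-completions endpoint; the prompt wording, model name, and CSV columns are assumptions for illustration, not taken from the repository.

```python
import csv
import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"  # OpenAI-compatible endpoint

def extract_relations(passage: str, model: str = "openai/gpt-4o-mini") -> str:
    """Send one document passage to an OpenRouter-hosted LLM and return its raw answer.
    The prompt and model name here are placeholders."""
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,
            "messages": [
                {"role": "system", "content": "Extract biomedical relations as (entity1, relation, entity2) triples."},
                {"role": "user", "content": passage},
            ],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def append_result(path: str, doc_id: str, prediction: str, matches_gold: bool) -> None:
    """Persist one evaluation row to the running CSV database (illustrative schema)."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([doc_id, prediction, matches_gold])
```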
NQMP is a tiny, deterministic LLM benchmark focused on logical sensitivity to small prompt flips.
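The idea of measuring sensitivity to small prompt flips can be illustrated with a toy consistency check; `ask_model` and the example flip pairs below are hypothetical stand-ins, not NQMP's actual tasks or metric.

```python
from typing import Callable

# Pairs of prompts that differ by one small logical "flip"; a robust model
# should change its answer exactly when the flip logically requires it.
FLIP_PAIRS = [
    ("Is 17 an even number? Answer yes or no.",
     "Is 17 an odd number? Answer yes or no."),
    ("If all cats are mammals, is a cat necessarily a mammal? Answer yes or no.",
     "If no cats are mammals, is a cat necessarily a mammal? Answer yes or no."),
]

def flip_sensitivity(ask_model: Callable[[str], str]) -> float:
    """Fraction of pairs where the model's answer actually changes after the flip."""
    flipped = 0
    for prompt_a, prompt_b in FLIP_PAIRS:
        if ask_model(prompt_a).strip().lower() != ask_model(prompt_b).strip().lower():
            flipped += 1
    return flipped / len(FLIP_PAIRS)
```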
🌐 Manage AI coding sessions seamlessly with AgentOS, a mobile-first web UI designed for efficient and user-friendly interaction.
WordleBench — Deterministic AI Wordle benchmark. Compare 34+ LLMs (GPT-5, Claude 4.5, Gemini, Grok, Llama) head-to-head on accuracy, speed, and cost across 50 standardized words.
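For context, the deterministic core of any Wordle benchmark is the feedback rule itself; below is a standard Python implementation of the green/yellow/gray scoring with duplicate-letter handling, not code taken from WordleBench.

```python
from collections import Counter

def score_guess(guess: str, answer: str) -> str:
    """Return Wordle feedback: G = right letter/right spot, Y = right letter/wrong spot, . = absent."""
    feedback = ["."] * len(guess)
    remaining = Counter()
    # First pass: mark greens and count answer letters not consumed by them.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            feedback[i] = "G"
        else:
            remaining[a] += 1
    # Second pass: mark yellows only while unconsumed copies of the letter remain.
    for i, g in enumerate(guess):
        if feedback[i] == "." and remaining[g] > 0:
            feedback[i] = "Y"
            remaining[g] -= 1
    return "".join(feedback)

assert score_guess("crane", "caber") == "GYY.Y"
```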