A benchmark that challenges language models to code solutions for scientific problems
A quick view of high-performance convolutional neural network (CNN) inference engines on mobile devices.
AI coding models, agents, CLIs, IDEs, AI app builders, open source tooling, benchmarks
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.
[NeurIPS 2025] AGI-Elo: How Far Are We From Mastering A Task?
A fully open-source database of AI models with benchmark scores, prices, and capabilities.
Qualitative benchmark suite for evaluating AI coding agents and orchestration paradigms on realistic, complex development tasks
A curated list of evaluation tools, benchmark datasets, leaderboards, frameworks, and resources for assessing model performance.
Privacy-first AI model testing - no API keys required
LlamaEval is a rapid prototype developed during a hackathon to provide a user-friendly dashboard for evaluating and comparing Llama models using the TogetherAI API.
A production-grade benchmarking suite that evaluates vector databases (Qdrant, Milvus, Weaviate, ChromaDB, Pinecone, SQLite, TopK) for music semantic search applications. Features automated performance testing, statistical analysis across 15-20 iterations, a real-time web UI for database comparison, and comprehensive reporting.
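As an illustration of the kind of timing harness such a suite relies on, here is a minimal sketch (not the repository's actual code) that measures query latency over repeated iterations and reports summary statistics; the `run_query` callable, the iteration count, and the commented usage names are assumptions for illustration only.

```python
import statistics
import time
from typing import Callable, List


def benchmark_query(run_query: Callable[[], object], iterations: int = 20) -> dict:
    """Time a search callable over several iterations and summarize latency.

    `run_query` is a hypothetical stand-in for a single semantic-search call
    against whichever database is under test (Qdrant, Milvus, Weaviate, ...).
    """
    latencies_ms: List[float] = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_query()  # one search round-trip against the database under test
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "iterations": iterations,
        "mean_ms": statistics.mean(latencies_ms),
        "stdev_ms": statistics.stdev(latencies_ms) if iterations > 1 else 0.0,
        "p95_ms": sorted(latencies_ms)[int(0.95 * (iterations - 1))],
    }


# Example usage (hypothetical callables wrapping each engine's search API):
# results = {name: benchmark_query(fn)
#            for name, fn in {"qdrant": qdrant_search, "milvus": milvus_search}.items()}
```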
Performance benchmarking of ML/AI workloads: ResNet, CosmoFlow, and DeepCam.
A data-driven audit of AI judge reliability using MT-Bench human annotations. This project analyzes 3,500+ model comparisons across 6 LLMs and 8 task categories to measure how well GPT-4 evaluations align with human judgment. Includes a Python workflow, disagreement metrics, and a Power BI dashboard for insights.
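For context on what a judge-vs-human disagreement metric can look like, below is a minimal pandas sketch; the column names (`category`, `human_winner`, `gpt4_winner`) and the CSV file name are hypothetical and not taken from the repository.

```python
import pandas as pd

# Hypothetical schema: one row per pairwise comparison, recording the
# human-preferred model and the GPT-4-judge-preferred model per task category.
df = pd.read_csv("mt_bench_judgments.csv")  # assumed file name

df["agree"] = df["human_winner"] == df["gpt4_winner"]

# Overall agreement rate between the GPT-4 judge and human annotators.
overall = df["agree"].mean()

# Per-category agreement and disagreement rates.
by_category = (
    df.groupby("category")["agree"]
    .agg(agreement_rate="mean", comparisons="size")
    .assign(disagreement_rate=lambda t: 1.0 - t["agreement_rate"])
    .sort_values("agreement_rate")
)

print(f"Overall judge-human agreement: {overall:.1%}")
print(by_category)
```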