A benchmark that challenges language models to code solutions for scientific problems
A quick view of high-performance convolutional neural network (CNN) inference engines on mobile devices.
AI coding models, agents, CLIs, IDEs, AI app builders, open source tooling, benchmarks
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.
[NeurIPS 2025] AGI-Elo: How Far Are We From Mastering A Task?
A fully open-source database of AI models with benchmark scores, prices, and capabilities.
Qualitative benchmark suite for evaluating AI coding agents and orchestration paradigms on realistic, complex development tasks
A curated list of evaluation tools, benchmark datasets, leaderboards, frameworks, and resources for assessing model performance.
Privacy-first AI model testing - no API keys required
LlamaEval is a rapid prototype developed during a hackathon to provide a user-friendly dashboard for evaluating and comparing Llama models using the TogetherAI API.
A production-grade benchmarking suite that evaluates vector databases (Qdrant, Milvus, Weaviate, ChromaDB, Pinecone, SQLite, TopK) for music semantic search applications. Features automated performance testing, statistical analysis across 15-20 iterations, a real-time web UI for database comparison, and comprehensive reporting.
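As an illustration of the kind of timing harness such a suite relies on, here is a minimal sketch (not the repository's actual code) that measures query latency over repeated iterations and reports summary statistics; the `run_query` callable, the iteration count, and the commented usage names are assumptions for illustration only.

```python
import statistics
import time
from typing import Callable, List


def benchmark_query(run_query: Callable[[], object], iterations: int = 20) -> dict:
    """Time a search callable over several iterations and summarize latency.

    `run_query` is a hypothetical stand-in for a single semantic-search call
    against whichever database is under test (Qdrant, Milvus, Weaviate, ...).
    """
    latencies_ms: List[float] = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_query()  # one search round-trip against the database under test
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "iterations": iterations,
        "mean_ms": statistics.mean(latencies_ms),
        "stdev_ms": statistics.stdev(latencies_ms) if iterations > 1 else 0.0,
        "p95_ms": sorted(latencies_ms)[int(0.95 * (iterations - 1))],
    }


# Example usage (hypothetical callables wrapping each engine's search API):
# results = {name: benchmark_query(fn)
#            for name, fn in {"qdrant": qdrant_search, "milvus": milvus_search}.items()}
```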
Performance benchmarking of ML/AI workloads: ResNet, CosmoFlow, and DeepCam.
A data-driven audit of AI judge reliability using MT-Bench human annotations. This project analyzes 3,500+ model comparisons across 6 LLMs and 8 task categories to measure how well GPT-4 evaluations align with human judgment. Includes a Python workflow, disagreement metrics, and a Power BI dashboard for insights.
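For context on what a judge-vs-human disagreement metric can look like, below is a minimal pandas sketch; the column names (`category`, `human_winner`, `gpt4_winner`) and the CSV file name are hypothetical and not taken from the repository.

```python
import pandas as pd

# Hypothetical schema: one row per pairwise comparison, recording the
# human-preferred model and the GPT-4-judge-preferred model per task category.
df = pd.read_csv("mt_bench_judgments.csv")  # assumed file name

df["agree"] = df["human_winner"] == df["gpt4_winner"]

# Overall agreement rate between the GPT-4 judge and human annotators.
overall = df["agree"].mean()

# Per-category agreement and disagreement rates.
by_category = (
    df.groupby("category")["agree"]
    .agg(agreement_rate="mean", comparisons="size")
    .assign(disagreement_rate=lambda t: 1.0 - t["agreement_rate"])
    .sort_values("agreement_rate")
)

print(f"Overall judge-human agreement: {overall:.1%}")
print(by_category)
```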