open-compass repositories

SWE-bench-server

Public

Python • 0 forks • 0 stars • 0 issues • 1 pull request • Updated Mar 2, 2026

VLMEvalKit

Public

Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks

computer-vision evaluation pytorch gemini openai vqa vit gpt multi-modal clip

Python • Apache License 2.0 • 640 forks • 3.9k stars • 200 issues • 27 pull requests • Updated Mar 2, 2026

opencompass

Public

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets…

benchmark evaluation openai llm chatgpt large-language-model llama2 llama3

Python • Apache License 2.0 • 737 forks • 6.7k stars • 367 issues • 66 pull requests • Updated Feb 27, 2026

GTA

Public

[NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents

llm-agent llm-evaluation

Python • Apache License 2.0 • 9 forks • 135 stars • 0 issues • 0 pull requests • Updated Feb 16, 2026

MiroFlow

Public

MiroMind Research Agent: Fully Open-Source Deep Research Agent with Reproducible State-of-the-Art Performance on FutureX, GAIA, HLE, BrowserComp and xBench.

Python • Apache License 2.0 • 264 forks • 0 stars • 0 issues • 0 pull requests • Updated Dec 30, 2025

RePro

Public

[ICLR 2026] Rectifying LLM Thought From Lens of Optimization

reinforcement-learning large-language-model large-language-model-reasoning

Python • MIT License • 4 forks • 14 stars • 1 issue • 0 pull requests • Updated Dec 5, 2025

SAGA

Public

The code repository for the NeurIPS 2025 paper "Rethinking Verification for LLM Code Generation: From Generation to Testing."

0 forks • 10 stars • 1 issue • 0 pull requests • Updated Nov 27, 2025

ATLAS

Public

ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning

1 fork • 6 stars • 0 issues • 0 pull requests • Updated Nov 20, 2025

OASIS

Public

Python • 0 forks • 3 stars • 0 issues • 0 pull requests • Updated Nov 12, 2025

InteractScience

Public

JavaScript • Apache License 2.0 • 0 forks • 8 stars • 0 issues • 0 pull requests • Updated Oct 31, 2025

CognitiveKernel-Pro

Public

Deep Research Agent CognitiveKernel-Pro from Tencent AI Lab. Paper: https://arxiv.org/pdf/2508.00414

Python • Other • 48 forks • 0 stars • 0 issues • 0 pull requests • Updated Oct 27, 2025

GAOKAO-Eval

Public

Jupyter Notebook • 7 forks • 114 stars • 5 issues • 0 pull requests • Updated Oct 7, 2025

.github

Public

1 fork • 0 stars • 0 issues • 0 pull requests • Updated Sep 9, 2025

MMBench-GUI

Public

Official repo of "MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents". It can be used to evaluate a GUI agent with a hierarchical mann…

benchmark-framework vision-language-model computer-use gui-agent

Python • 6 forks • 100 stars • 5 issues • 0 pull requests • Updated Sep 8, 2025

ReasonZoo

Public

Python • Apache License 2.0 • 0 forks • 3 stars • 0 issues • 0 pull requests • Updated Aug 27, 2025

CompassVerifier

Public

[EMNLP 2025] CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

Jupyter Notebook • 2 forks • 63 stars • 0 issues • 0 pull requests • Updated Aug 10, 2025

GPassK

Public

[ACL 2025] Are Your LLMs Capable of Stable Reasoning?

large-language-model-evaluation reasoning-stability

Python • 2 forks • 32 stars • 2 issues • 0 pull requests • Updated Aug 5, 2025

Creation-MMBench

Public

Assessing Context-Aware Creative Intelligence in MLLMs

JavaScript • 0 forks • 23 stars • 1 issue • 0 pull requests • Updated Jul 22, 2025

CompassJudger

Public

The all-in-one judge models introduced by OpenCompass

Apache License 2.0 • 6 forks • 116 stars • 1 issue • 0 pull requests • Updated Jul 15, 2025

RaML

Public

[Preprint 2025] Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective

Jupyter Notebook • 2 forks • 7 stars • 0 issues • 0 pull requests • Updated May 27, 2025

BotChat

Public

Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances.

Jupyter Notebook • Apache License 2.0 • 7 forks • 161 stars • 2 issues • 0 pull requests • Updated May 22, 2025

Ada-LEval

Public

The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"

gpt4 llm long-context

Python • 3 forks • 56 stars • 0 issues • 0 pull requests • Updated May 22, 2025

MathBench

Public

[ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset

Apache License 2.0 • 1 fork • 111 stars • 5 issues • 0 pull requests • Updated May 22, 2025

MMBench

Public

Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"

Apache License 2.0 • 15 forks • 288 stars • 13 issues • 0 pull requests • Updated May 22, 2025

ProSA

Public

[EMNLP 2024 Findings] ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs

Python • Apache License 2.0 • 2 forks • 29 stars • 0 issues • 0 pull requests • Updated May 22, 2025

ANAH

Public

[ACL 2024] ANAH & [NeurIPS 2024] ANAH-v2 & [ICLR 2025] Mask-DPO

acl alignment gpt iclr neurips llms hallucination-detection hallucination-mitigation

Python • Apache License 2.0 • 4 forks • 63 stars • 1 issue • 0 pull requests • Updated Apr 30, 2025

oc_doc_website

Public

0 forks • 0 stars • 0 issues • 0 pull requests • Updated Feb 12, 2025

CriticEval

Public

[NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs

Python • Apache License 2.0 • 2 forks • 49 stars • 2 issues • 0 pull requests • Updated Nov 29, 2024

lagent-cibench

Public

Python • Apache License 2.0 • 1 fork • 2 stars • 0 issues • 0 pull requests • Updated Sep 23, 2024

hinode

Public

A clean documentation and blog theme for your Hugo site based on Bootstrap 5

HTML • MIT License • 62 forks • 0 stars • 0 issues • 0 pull requests • Updated Sep 1, 2024

43 repositories
