    Repositories list

    • SWE-bench-server

      Public
      Python
      0001 Updated Feb 28, 2026
    • VLMEvalKit

      Public
      Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
      Python
      Apache License 2.0
      6403.9k20028 Updated Feb 27, 2026
    • OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets…
      Python
      Apache License 2.0
      7376.7k36766 Updated Feb 27, 2026
    • GTA

      Public
      [NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents
      Python
      Apache License 2.0
      913500 Updated Feb 16, 2026
    • MiroFlow

      Public
      MiroMind Research Agent: Fully Open-Source Deep Research Agent with Reproducible State-of-the-Art Performance on FutureX, GAIA, HLE, BrowserComp and xBench.
      Python
      Apache License 2.0
      261000 Updated Dec 30, 2025
    • RePro

      Public
      [ICLR 2026] Rectifying LLM Thought From Lens of Optimization
      Python
      MIT License
      41410 Updated Dec 5, 2025
    • SAGA

      Public
      The code repository for the NeurIPS 2025 paper "Rethinking Verification for LLM Code Generation: From Generation to Testing."
      01010 Updated Nov 27, 2025
    • ATLAS

      Public
      ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
      1600 Updated Nov 20, 2025
    • OASIS

      Public
      Python
      0300 Updated Nov 12, 2025
    • JavaScript
      Apache License 2.0
      0800 Updated Oct 31, 2025
    • Deep Research Agent CognitiveKernel-Pro from Tencent AI Lab. Paper: https://arxiv.org/pdf/2508.00414
      Python
      Other
      48000 Updated Oct 27, 2025
    • Jupyter Notebook
      711450 Updated Oct 7, 2025
    • .github

      Public
      1000 Updated Sep 9, 2025
    • Official repo of "MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents". It can be used to evaluate a GUI agent with a hierarchical mann…
      Python
      610050 Updated Sep 8, 2025
    • ReasonZoo

      Public
      Python
      Apache License 2.0
      0300 Updated Aug 27, 2025
    • [EMNLP 2025] CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward
      Jupyter Notebook
      26300 Updated Aug 10, 2025
    • GPassK

      Public
      [ACL 2025] Are Your LLMs Capable of Stable Reasoning?
      Python
      23220 Updated Aug 5, 2025
    • Assessing Context-Aware Creative Intelligence in MLLMs
      JavaScript
      02310 Updated Jul 22, 2025
    • The All-in-one Judge Models introduced by OpenCompass
      Apache License 2.0
      611610 Updated Jul 15, 2025
    • RaML

      Public
      [Preprint 2025] Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective
      Jupyter Notebook
      2700 Updated May 27, 2025
    • BotChat

      Public
      Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances.
      Jupyter Notebook
      Apache License 2.0
      716120 Updated May 22, 2025
    • Ada-LEval

      Public
      The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
      Python
      35600 Updated May 22, 2025
    • MathBench

      Public
      [ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
      Apache License 2.0
      111150 Updated May 22, 2025
    • MMBench

      Public
      Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"
      Apache License 2.0
      15287130 Updated May 22, 2025
    • ProSA

      Public
      [EMNLP 2024 Findings] ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
      Python
      Apache License 2.0
      22900 Updated May 22, 2025
    • ANAH

      Public
      [ACL 2024] ANAH & [NeurIPS 2024] ANAH-v2 & [ICLR 2025] Mask-DPO
      Python
      Apache License 2.0
      46210 Updated Apr 30, 2025
    • 0000 Updated Feb 12, 2025
    • [NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs
      Python
      Apache License 2.0
      24920 Updated Nov 29, 2024
    • Python
      Apache License 2.0
      1200 Updated Sep 23, 2024
    • hinode

      Public
      A clean documentation and blog theme for your Hugo site based on Bootstrap 5
      HTML
      MIT License
      62000 Updated Sep 1, 2024