    Repositories list

    • Vlaser

      Public
      Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
      Python
      04100 Updated Feb 16, 2026
    • UMMEvalKit

      Public
      A unified, efficient, and extensible evaluation toolkit for unified multimodal models
      Jupyter Notebook
      1500 Updated Feb 12, 2026
    • VKnowU

      Public
      Python
      11100 Updated Feb 3, 2026
    • GenExam

      Public
      GenExam: A Multidisciplinary Text-to-Image Exam
      Python
      45600 Updated Jan 29, 2026
    • MetaCaptioner

      Public
      Python
      34410 Updated Jan 27, 2026
    • ScaleCUA

      Public
      ScaleCUA provides open-source computer use agents that can operate in cross-platform environments (Windows, macOS, Ubuntu, Android).
      Python
      741.1k140 Updated Jan 7, 2026
    • [ICCV 2025] GUIOdyssey is a comprehensive dataset for training and evaluating cross-app navigation agents. GUIOdyssey consists of 8,834 episodes from 6 mobile d…
      Python
      8147100 Updated Jan 3, 2026
    • SDLM

      Public
      Sequential Diffusion Language Model (SDLM) enhances pre-trained autoregressive language models by adaptively determining generation length and maintaining KV-ca…
      Python
      39000 Updated Dec 27, 2025
    • InternVideo

      Public
      [ECCV2024] Video Foundation Models & Data for Multimodal Understanding
      Python
      1392.2k1344 Updated Dec 15, 2025
    • SID-VLN

      Public
      Official implementation of: Learning Goal-Oriented Language-Guided Navigation with Self-Improving Demonstrations at Scale
      Python
      21100 Updated Nov 29, 2025
    • vinci

      Public
      Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model
      Python
      28120 Updated Nov 27, 2025
    • OmniQuant

      Public
      [ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
      Python
      76887291 Updated Nov 26, 2025
    • [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
      Python
      27327130 Updated Nov 26, 2025
    • VideoChat-Flash

      Public
      [ICLR2026] VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
      Python
      16507110 Updated Nov 18, 2025
    • ExpVid

      Public
      0800 Updated Oct 28, 2025
    • VideoChat-R1

      Public
      [NIPS2025] VideoChat-R1 & R1.5: Enhancing Spatio-Temporal Perception and Reasoning via Reinforcement Fine-Tuning
      Python
      10257240 Updated Oct 18, 2025
    • NaViL

      Public
      Python
      78900 Updated Oct 10, 2025
    • PonderV2

      Public
      [T-PAMI 2025] PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm
      Python
      837000 Updated Sep 30, 2025
    • InternVL

      Public
      [CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model approaching GPT-4o performance.
      Python
      7579.8k29711 Updated Sep 22, 2025
    • EgoExoLearn

      Public
      [CVPR 2024] Data and benchmark code for the EgoExoLearn dataset
      Python
      27930 Updated Aug 26, 2025
    • VRBench

      Public
      [ICCV 2025] A Benchmark for Multi-Step Reasoning in Long Narrative Videos
      Python
      02410 Updated Aug 8, 2025
    • PIIP

      Public
      [NeurIPS 2024 Spotlight ⭐️ & TPAMI 2025] Parameter-Inverted Image Pyramid Networks (PIIP)
      Python
      510820 Updated Aug 5, 2025
    • LORIS

      Public
      [ICML2023] Long-Term Rhythmic Video Soundtracker
      Python
      16210 Updated Jul 28, 2025
    • TPO

      Public
      Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
      Jupyter Notebook
      66410 Updated Jul 22, 2025
    • Docopilot

      Public
      [CVPR 2025] Docopilot: Improving Multimodal Models for Document-Level Understanding
      Python
      13620 Updated Jul 22, 2025
    • Mono-InternVL

      Public
      [CVPR 2025] Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
      Python
      010370 Updated Jul 18, 2025
    • ZeroGUI

      Public
      ZeroGUI: Automating Online GUI Learning at Zero Human Cost
      Python
      810900 Updated Jul 17, 2025
    • MUTR

      Public
      [AAAI 2024] Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation
      Python
      78230 Updated Jun 13, 2025
    • PVC

      Public
      [CVPR 2025] PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
      Python
      25140 Updated Jun 12, 2025
    • FluxViT

      Public
      Make Your Training Flexible: Towards Deployment-Efficient Video Models
      Python
      03710 Updated Jun 11, 2025