Understanding human behavior requires measuring behavioral actions. Due to its complexity, behavior is best mapped onto a rich, semantic structure such as language. Recently developed multi-modal large language models (MLLMs) are promising candidates for a wide range of action-understanding tasks. In this work, we focus on evaluating and then improving MLLMs to perform action recognition. We reformulate EPIC-KITCHENS-100, one of the largest and most challenging egocentric action datasets, into the form of video multiple question answering (EPIC-KITCHENS-100-MQA). We show that when we sample difficult incorrect answers as distractors, leading MLLMs struggle to recognize the correct actions. We propose a series of methods that greatly improve the MLLMs' ability to perform action recognition, achieving state-of-the-art performance on the EPIC-KITCHENS-100 Challenge and outperforming GPT-4o by 21 points in accuracy on EPIC-KITCHENS-100-MQA. Lastly, we show improvements on other action-related video benchmarks such as VideoMME, PerceptionTest, and MVBench.
## Code
This repository contains the implementation for our preprint on evaluating and training multi-modal large language models for action recognition.
Our code is built on [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT), and the files in the directory `llavaction/action` are related to our work. We thank the authors of LLaVA-NeXT for making their code publicly available.
The files in the `/eval`, `/model`, `/serve`, and `/train` directories are taken directly from [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT), unless modified and noted below.
The modified files are:
- `/model/llava_arch.py`
- `/train/llava_trainer.py`
- `/utils.py`
A diff can be generated against commit `79ef45a6d8b89b92d7a8525f077c3a3a9894a87d` of LLaVA-NeXT to see our modifications, for example as sketched below.
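One way to generate that diff locally is shown here as a sketch only: it assumes this repository is checked out next to LLaVA-NeXT as `../llavaction` and that our package directory is `llavaction/`; adjust the paths to your setup.

```bash
# Sketch: compare the upstream LLaVA-NeXT code at the pinned commit with this
# repository. The relative path ../llavaction/llavaction is an assumption
# about where this repository is checked out.
git clone https://github.com/LLaVA-VL/LLaVA-NeXT.git
cd LLaVA-NeXT
git checkout 79ef45a6d8b89b92d7a8525f077c3a3a9894a87d
git diff --no-index llava ../llavaction/llavaction
```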
The code will be made publicly available upon publication. For review purposes, the provided code and model are currently distributed under [no license](https://choosealicense.com/no-permission/).
## Demo
Currently, we provide code to run video inference in a Jupyter Notebook (which can be run on Google Colaboratory).
### Installation guide for video inference
```bash
conda create -n llavaction python=3.10 -y
conda activate llavaction
pip install -e .
```
Please see the `/example` directory for a demo notebook.
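If you prefer to run the demo locally rather than on Colaboratory, the sketch below shows one way to launch it after completing the installation above; the exact notebook filename may differ, so the command simply opens the `/example` directory in Jupyter.

```bash
# Sketch: open the demo notebook locally (the filename under example/ may vary).
conda activate llavaction
pip install notebook        # Jupyter is not installed by the steps above
jupyter notebook example/   # then open the demo notebook from the file browser
```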
## EPIC-KITCHENS-100-MQA
In our work, we introduce a new way to evaluate MLLMs for action recognition by casting EPIC-KITCHENS-100 into a multiple-question-answering benchmark (EPIC-KITCHENS-100-MQA). The benchmark has not yet been released (as of March 2025); please check the issues, or open one, if you are interested in accessing this resource before the paper is published. We also plan to integrate it into the package [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).
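Purely as an illustration of what the planned lmms-eval integration could look like, a hypothetical evaluation command is sketched below; the task name `epic_kitchens_100_mqa` and the model arguments are placeholders, not released identifiers.

```bash
# Hypothetical sketch of an lmms-eval run once the benchmark is integrated.
# The task name and pretrained path are placeholders, not released identifiers.
accelerate launch --num_processes=1 -m lmms_eval \
    --model llava \
    --model_args pretrained="<path-or-hub-id-of-your-model>" \
    --tasks epic_kitchens_100_mqa \
    --batch_size 1 \
    --output_path ./logs/
```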