
Commit 2a7f063

Author: heyinan

[Add] ViCLIP

1 parent 7db6c4d · commit 2a7f063

File tree: 475 files changed (+79627 −0 lines)


Pretrain/ViCLIP/LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2022 Jie Lei

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Pretrain/ViCLIP/README.md

Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
# [ViCLIP](https://arxiv.org/pdf/2307.06942.pdf): a video-text representation learning model trained on [InternVid](https://arxiv.org/pdf/2307.06942.pdf)

[![Dataset meta](https://img.shields.io/badge/%F0%9F%A4%97%20InternVid-Dataset-blue)](https://huggingface.co/datasets/OpenGVLab/InternVid) | [![Model Checkpoint](https://img.shields.io/badge/%F0%9F%A4%97%20ViCLIP-Model-purple)](https://huggingface.co/OpenGVLab/ViCLIP)

# :fire: News
- Training code is released.
- InternVid has been accepted as a spotlight presentation at ICLR 2024.
- We release a subset, [InternVid-Aesthetics-18M](https://huggingface.co/datasets/OpenGVLab/InternVid/viewer/InternVid-10M/AES), consisting of 18 million video clips with high aesthetic scores. For details on the aesthetic scoring, please refer to the [LAION aesthetic predictor](https://github.com/LAION-AI/aesthetic-predictor).
- We enhance the InternVid-10M-FLT annotations by incorporating video language and type information sourced from YouTube's metainfo. You can find the updated annotations at [this link](https://huggingface.co/datasets/OpenGVLab/InternVid-10M-FLT-INFO).
- We release ViCLIP models trained on different subsets of InternVid. Check their performance [here](#model-performance) and download them [here](#pretrained-data--model).
- We are excited to announce the partial release of a large-scale video-text dataset aimed at facilitating multimodal understanding and generation. This release includes [InternVid-10M-FLT](https://huggingface.co/datasets/OpenGVLab/InternVid), a subset comprising 10 million video clips, together with a [ViCLIP](https://huggingface.co/OpenGVLab/ViCLIP) model (ViT-L architecture) trained on it, which achieves SOTA zero-shot action recognition performance on Kinetics.
- We provide step-by-step instructions for accessing and using ViCLIP in [demo.ipynb](https://github.com/OpenGVLab/InternVideo/blob/main/Data/InternVid/demo.ipynb).
- Some model weights and the corresponding data are released at [Pretrained Data & Model](#pretrained-data--model). Their performance is given at [Model Performance](#model-performance).

Stay tuned for updates!
# Introduction

### ViCLIP: a simple video CLIP for transferable video-text representation

Built upon <a href="https://github.com/openai/CLIP">CLIP</a>, ViCLIP is a simple video-text pretraining baseline. It consists of a video encoder (ViT) and a text encoder, as shown below. Both modules are initialized from the corresponding CLIP components. We replace the native attention in the video encoder with spatiotemporal attention while keeping the other design elements unchanged. For efficient learning, we apply masking to videos during pre-training.

<img width="633" alt="ViCLIP architecture: a video encoder (ViT) and a text encoder initialized from CLIP" src="https://github.com/OpenGVLab/InternVideo/assets/43169235/1e540a2b-f503-4036-b2a8-ba99401fc5b0">
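To make the dual-encoder design more concrete, here is a minimal, self-contained sketch of a ViCLIP-style forward pass in PyTorch. This is an illustration, not the repository's implementation: the class names (`VideoEncoder`, `TextEncoder`), the dimensions, the mean pooling, and the simple random token masking are all assumptions; the actual encoders live in `models/backbones/clip/clip_vision.py` and `clip_text.py`.

```python
# Minimal, illustrative ViCLIP-style dual encoder (NOT the repository code).
# Assumptions: ViT-style patch embedding, joint spatiotemporal attention over
# all frame patches, random token masking for efficiency, and a CLIP-style
# similarity between pooled video and text embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoEncoder(nn.Module):
    def __init__(self, dim=512, patch=16, frames=8, img=224, depth=6, heads=8, mask_ratio=0.5):
        super().__init__()
        self.patch_embed = nn.Conv3d(3, dim, kernel_size=(1, patch, patch), stride=(1, patch, patch))
        num_tokens = frames * (img // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.mask_ratio = mask_ratio

    def forward(self, video):                        # video: (B, 3, T, H, W)
        x = self.patch_embed(video)                  # (B, D, T, H/p, W/p)
        x = x.flatten(2).transpose(1, 2) + self.pos  # (B, N, D) spatiotemporal tokens
        if self.training and self.mask_ratio > 0:    # drop a random subset of tokens
            keep = int(x.size(1) * (1 - self.mask_ratio))
            idx = torch.rand(x.size(0), x.size(1), device=x.device).argsort(1)[:, :keep]
            x = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        x = self.blocks(x)                           # joint attention over space and time
        return F.normalize(x.mean(dim=1), dim=-1)    # pooled, unit-norm video embedding


class TextEncoder(nn.Module):
    def __init__(self, vocab=49408, dim=512, depth=6, heads=8, max_len=77):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, ids):                          # ids: (B, L) token ids
        x = self.blocks(self.tok(ids) + self.pos[:, : ids.size(1)])
        return F.normalize(x.mean(dim=1), dim=-1)    # pooled, unit-norm text embedding


video_enc, text_enc = VideoEncoder(), TextEncoder()
logit_scale = nn.Parameter(torch.tensor(1 / 0.07).log())
v = video_enc(torch.randn(2, 3, 8, 224, 224))        # two 8-frame clips
t = text_enc(torch.randint(0, 49408, (2, 77)))       # two tokenized captions
sim = logit_scale.exp() * (v @ t.t())                # (2, 2) video-text similarity matrix
```

The point the sketch illustrates is that patches from all frames attend to one another in a single transformer (spatiotemporal attention), and the pooled video and text embeddings are compared in a shared space, as in image CLIP.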
### Model Performance

**Table 1: Zero-shot action recognition results on Kinetics 400/600/700. We report the top-1 accuracy of the compared methods on each dataset.**

| Method | Training Data | K400 top-1 | K400 AVG | K600 top-1 | K600 AVG | K700 top-1 | K700 AVG |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| CLIP | CLIP400M | 58.42 | 70.14 | 55.11 | 67.16 | 46.12 | 58.38 |
| CLIP | DataComp-1B | 56.14 | 67.67 | 54.15 | 65.83 | 45.36 | 57.01 |
| EVA-CLIP-L | Merged-2B | - | 65.00 | - | 64.90 | - | 59.10 |
| EVA-CLIP-E | LAION-2B | - | 69.80 | - | 69.30 | - | 63.40 |
| ViCLIP-B | +InternVid-10M-FLT | 58.52 | 71.11 | 55.37 | 68.27 | 47.09 | 59.98 |
| ViCLIP-B | +InternVid-200M | 56.58 | 69.20 | 53.57 | 66.20 | 45.82 | 58.28 |
| ViCLIP-L | +WebVid10M | 59.88 | 71.03 | 58.66 | 69.84 | 50.23 | 61.86 |
| ViCLIP-L | +InternVid-10M-DIV | 63.00 | 74.15 | 60.68 | 72.07 | 52.50 | 64.59 |
| ViCLIP-L | +InternVid-10M-FLT | **64.80** | **75.70** | **62.20** | **73.53** | **54.30** | **66.38** |
| ViCLIP-L | +InternVid-200M | 59.80 | 71.09 | 57.80 | 69.34 | 49.30 | 61.25 |
**Table 2: Fine-tuned action recognition results on Kinetics 400 and SomethingSomethingV2.**

| Method | Training Data | K400 top-1 | K400 top-5 | SthSthV2 top-1 | SthSthV2 top-5 |
|:---:|:---:|:---:|:---:|:---:|:---:|
| CLIP | CLIP400M | 86.7 | 97.2 | 70.1 | 92.5 |
| CLIP | DataComp-1B | 85.6 | 96.8 | 68.9 | 91.8 |
| ViCLIP-L | +WebVid10M | 85.0 | 96.8 | 68.7 | 91.9 |
| ViCLIP-L | +InternVid-10M-FLT | 86.8 | 97.5 | 71.2 | 93.2 |
| ViCLIP-L | +InternVid-10M-FLT+K710 | 88.0 | 97.8 | 71.8 | 93.6 |
| ViCLIP-L | +InternVid-200M | 87.9 | 97.9 | 73.6 | 94.9 |
| ViCLIP-L | +InternVid-200M+K710 | **88.7** | **98.2** | **74.2** | **95.0** |

# Installation

### Requirements

```shell
# create the environment
conda env create -f viclip.yaml
# activate it
conda activate viclip
```

### Note

To run pretraining, you have to prepare the weights of the CLIP visual encoder as described in [`extract.ipynb`](preprocess/extract_hfclip.ipynb), and set the `MODEL_PATH` in [`clip_vision.py`](models/backbones/clip/clip_vision.py) and [`clip_text.py`](models/backbones/clip/clip_text.py).
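For orientation, the snippet below sketches what the weight-extraction step might look like, assuming the notebook simply pulls the vision and text towers out of a Hugging Face CLIP checkpoint and saves their state dicts. The model name, output file names, and the state-dict layout expected by `MODEL_PATH` are assumptions here; follow [`extract.ipynb`](preprocess/extract_hfclip.ipynb) for the authoritative procedure.

```python
# Hypothetical sketch of extracting CLIP tower weights (see extract.ipynb for
# the real procedure; file names and key layout here are assumptions).
import torch
from transformers import CLIPModel

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")

# Save the visual and text towers separately; MODEL_PATH in clip_vision.py /
# clip_text.py would then point at files like these.
torch.save(clip.vision_model.state_dict(), "vit-l-14_visual.pth")
torch.save(clip.text_model.state_dict(), "vit-l-14_text.pth")
```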
# Pre-Training

We use [CLIP](https://github.com/openai/CLIP) and [OpenCLIP](https://github.com/mlfoundations/open_clip) pretrained models as the unmasked teachers by default:
- Follow [extract.ipynb](preprocess/extract_hfclip.ipynb) to extract the visual encoder from CLIP.
- Change `MODEL_PATH` in [`clip_vision.py`](models/backbones/clip/clip_vision.py) and [`clip_text.py`](models/backbones/clip/clip_text.py).

For training, you can simply run the pretraining scripts in `exp/exp_pretrain_ViCLIP` as follows:
```shell
bash exp/exp_pretrain_ViCLIP/viclip_base/run.sh
```
:warning: **Notes:**
1. Set `data_dir` and `your_data_path` (e.g. `your_webvid_path`) in [data.py](./configs/data.py) before running the scripts.
2. Set `vision_encoder.pretrained` in the corresponding config files.
3. Set `--rdzv_endpoint` to your `MASTER_NODE:MASTER_PORT`. You can also use the following command to set it automatically:
   ```shell
   MASTER_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
   ALL_NODES=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
   MASTER_PORT=$((10000 + $RANDOM % 100))
   torchrun --rdzv_endpoint=${MASTER_NODE}:${MASTER_PORT} $@
   ```
4. `save_latest=True` will automatically save the latest checkpoint during training.
5. `auto_resume=True` will automatically load the best or latest checkpoint when training resumes.
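
To recap what pretraining optimizes, the sketch below shows a generic CLIP-style symmetric video-text contrastive (InfoNCE) loss over a batch of paired, L2-normalized embeddings. It illustrates the objective family only; the repository's actual loss, masking schedule, and hyperparameters are defined by the configs and scripts above.

```python
# Illustrative CLIP-style symmetric InfoNCE loss for video-text pairs
# (a sketch of the objective, not the repository's loss module).
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, logit_scale):
    """video_emb, text_emb: (B, D) L2-normalized embeddings of matched pairs."""
    logits = logit_scale * video_emb @ text_emb.t()      # (B, B) similarity matrix
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    loss_v2t = F.cross_entropy(logits, targets)          # match each video to its caption
    loss_t2v = F.cross_entropy(logits.t(), targets)      # and each caption to its video
    return 0.5 * (loss_v2t + loss_t2v)

# Example with random unit-norm embeddings:
v = F.normalize(torch.randn(4, 512), dim=-1)
t = F.normalize(torch.randn(4, 512), dim=-1)
print(video_text_contrastive_loss(v, t, logit_scale=100.0))
```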

# Data & Model Zoo

### Pretrained Data & Model
<div>

| Model | Training Data | Descriptions |
| :---: | :---: | :---: |
| ViCLIP-L-14 \[[HuggingFace](https://huggingface.co/OpenGVLab/ViCLIP) \| [Aliyun](https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/internvideo/viclip/ViClip-InternVid-10M-FLT.pth)\] | InternVid-10M-FLT \[[HuggingFace](https://huggingface.co/datasets/OpenGVLab/InternVid) \| [OpenDataLab](https://opendatalab.com/shepshep/InternVid)\] | - |
| ViCLIP-L-14 \[[Aliyun](https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/internvideo/viclip/ViCLIP-L_InternVid-DIV-10M.pth)\] | InternVid-10M-DIV | - |
| ViCLIP-L-14 \[[Aliyun](https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/internvideo/viclip/ViCLIP-L_WebVid-10M.pth)\] | WebVid-10M | - |
| ViCLIP-L-14 \[[Aliyun](https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/internvideo/viclip/ViCLIP-L_InternVid-10M.pth)\] | InternVid-10M | - |
| ViCLIP-L-14 \[[Aliyun](https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/internvideo/viclip/ViCLIP-L_InternVid-50M.pth)\] | InternVid-50M | - |
| ViCLIP-L-14 \[[Aliyun](https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/internvideo/viclip/ViCLIP-L_InternVid-200M.pth)\] | InternVid-200M | - |
| ViCLIP-B-16 \[[OneDrive](https://pjlab-my.sharepoint.cn/:u:/g/personal/wangyi_pjlab_org_cn/EY6ac22ZVzJLm1-wm_9gPaMBm5MFg36GKTxlkwTemgmKzQ?e=mH6u6A)\] | InternVid-10M-FLT | - |
| ViCLIP-B-16 \[[OneDrive](https://pjlab-my.sharepoint.cn/:u:/g/personal/wangyi_pjlab_org_cn/EVGBg6kq4M1MjbeSdqiXsaMBaBduhR7CQCT11JR4edmZ8Q?e=ILtTfM)\] | InternVid-200M | - |

</div>
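If you want to sanity-check a downloaded checkpoint before following [demo.ipynb](https://github.com/OpenGVLab/InternVideo/blob/main/Data/InternVid/demo.ipynb), a quick, hedged way to inspect it is shown below; the file name matches the InternVid-10M-FLT Aliyun release above, and the exact state-dict layout may differ between releases.

```python
# Quick inspection of a downloaded ViCLIP checkpoint (layout may vary).
import torch

ckpt = torch.load("ViClip-InternVid-10M-FLT.pth", map_location="cpu")
state = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"{len(state)} entries")
for name in list(state)[:8]:            # peek at the first few parameter names
    value = state[name]
    print(name, tuple(value.shape) if hasattr(value, "shape") else type(value))
```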
## Citation

If you find this work useful for your research, please consider citing InternVid. Your acknowledgement would greatly help us continue contributing resources to the research community.

```bibtex
@article{wang2023internvid,
  title={InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation},
  author={Wang, Yi and He, Yinan and Li, Yizhuo and Li, Kunchang and Yu, Jiashuo and Ma, Xin and Chen, Xinyuan and Wang, Yaohui and Luo, Ping and Liu, Ziwei and Wang, Yali and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2307.06942},
  year={2023}
}

@article{wang2022internvideo,
  title={InternVideo: General Video Foundation Models via Generative and Discriminative Learning},
  author={Wang, Yi and Li, Kunchang and Li, Yizhuo and He, Yinan and Huang, Bingkun and Zhao, Zhiyu and Zhang, Hongjie and Xu, Jilan and Liu, Yi and Wang, Zun and Xing, Sen and Chen, Guo and Pan, Junting and Yu, Jiashuo and Wang, Yali and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2212.03191},
  year={2022}
}
```

# Acknowledgement

This repository is built upon the [VINDLU](https://github.com/klauscc/VindLU), [UniFormer](https://github.com/Sense-X/UniFormer), and [VideoMAE](https://github.com/MCG-NJU/VideoMAE) repositories.

# Discussion Group

If you have any questions about trying, running, or deploying ViCLIP, or any ideas and suggestions for the project, feel free to join our WeChat discussion group!

![image](https://github.com/OpenGVLab/Ask-Anything/assets/43169235/c3020408-4d53-490b-8060-7fd54b0ef09c)
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
{
  "note": "this file is a copy of the BEiT model config, not used directly",
  "architectures": [
    "BeitForImageClassification"
  ],
  "url": "https://huggingface.co/microsoft/beit-base-patch16-224-pt22k-ft22k/raw/main/config.json",
  "attention_probs_dropout_prob": 0.0,
  "drop_path_rate": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 768,
  "image_size": 224,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "layer_scale_init_value": 0.1,
  "model_type": "beit",
  "num_attention_heads": 12,
  "num_channels": 3,
  "num_hidden_layers": 12,
  "patch_size": 16,
  "torch_dtype": "float32",
  "transformers_version": "4.11.0.dev0",
  "use_absolute_position_embeddings": false,
  "use_mask_token": false,
  "use_mean_pooling": true,
  "use_relative_position_bias": true,
  "use_shared_relative_position_bias": false,
  "vocab_size": 8192
}
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 30522,
  "fusion_layer": 9,
  "encoder_width": 768,
  "cross_module": "ca"
}
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522,
  "fusion_layer": 19,
  "encoder_width": 768,
  "cross_module": "ca"
}
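
The configuration files above follow the Hugging Face `transformers` JSON format (a BEiT image config kept for reference, per its own "note" field, plus BERT text configs in base and large sizes), extended with custom keys (`fusion_layer`, `encoder_width`, `cross_module`) that appear to be read by the repository's own cross-modal BERT variant rather than by stock `transformers`. As a hedged illustration (the file name below is a placeholder), stock `transformers` will still load such a file and expose the extra keys as attributes:

```python
# Loading one of these extended BERT configs with stock transformers
# (illustrative; "config_bert_large.json" is a placeholder file name, and the
# custom keys are consumed by the repository's model code, not by transformers).
from transformers import BertConfig

cfg = BertConfig.from_json_file("config_bert_large.json")
print(cfg.hidden_size, cfg.num_hidden_layers)   # 1024, 24 for the large config
print(cfg.fusion_layer, cfg.cross_module)       # extra keys are kept as attributes
```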
