Releases: innat/VideoSwin
v2.0
Summary
Keras 3 implementation of the Video Swin Transformer. The official PyTorch weights have been converted to be Keras 3 compatible. This implementation supports running the model on multiple backends, i.e. TensorFlow, PyTorch, and JAX.
Full Changelog: v1.1...v2.0
v1.1
v1.0
Checkpoints of VideoSwin in Keras
Checkpoints of VideoSwin: the Video Swin Transformer model in Keras. The pretrained weights are ported from the official PyTorch models. The following is the list of all available models in .h5 format.
Checkpoint Naming Style
To encode the model variant compactly, the general format is:
```python
dataset = 'K400'             # 'K400', 'SSV2'
pretrained_dataset = 'IN1K'  # 'IN1K', 'IN22K'
size = 'B'                   # 'T', 'S', 'B'
patch_size = '244'           # patch size (2, 4, 4)
window_size = '877'          # window size (8, 7, 7) or '1677' for (16, 7, 7)
num_frames = 32
input_size = 224

>>> checkpoint_name = (
...     f'TFVideoSwin{size}_'
...     f'{dataset}_'
...     f'{pretrained_dataset}_'
...     f'P{patch_size}_'
...     f'W{window_size}_'
...     f'{num_frames}x{input_size}.h5'
... )
>>> checkpoint_name
'TFVideoSwinB_K400_IN1K_P244_W877_32x224.h5'
```

Here, `size` represents tiny (T), small (S), and base (B). The `pretrained_dataset` refers to the pretrained weights used to initialize the video swin model; for example, IN22K (ImageNet-22K) pretrained 2D Swin image models are used to initialize the 3D video swin model. The `dataset` refers to the benchmark dataset, i.e., Kinetics or Something-Something-V2. The `patch_size` and `window_size` refer to internal parameters of the model architecture. The `num_frames` and `input_size` for video swin are 32 and 224, respectively. In the Keras implementation, the checkpoints are also available in SavedModel format in addition to h5. Check the release page of v1.1 for the SavedModel checkpoints.
| Model Name |
|---|
| TFVideoSwinT_K400_IN1K_P244_W877_32x224.h5 |
| TFVideoSwinS_K400_IN1K_P244_W877_32x224.h5 |
| TFVideoSwinB_SSV2_K400_P244_W1677_32x224.h5 |
| TFVideoSwinB_K600_IN22K_P244_W877_32x224.h5 |
| TFVideoSwinB_K400_IN22K_P244_W877_32x224.h5 |
| TFVideoSwinB_K400_IN1K_P244_W877_32x224.h5 |
Here, IN1K and IN22K refer to ImageNet-1K and ImageNet-22K. P244 refers to a patch_size of [2,4,4], and W877 refers to a window_size of [8,7,7] (W1677 to [16,7,7]). All these models output logits, which makes it easy to add a custom head on top for further downstream tasks. Check the notebook.
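The naming scheme above can be inverted. Below is a minimal sketch (a hypothetical helper, not part of the repo) that unpacks one of the released checkpoint filenames into its components:

```python
def parse_checkpoint_name(name: str) -> dict:
    """Split a checkpoint filename such as
    'TFVideoSwinB_K400_IN1K_P244_W877_32x224.h5' into its parts."""
    stem = name.removesuffix('.h5')
    prefix, dataset, pretrained, patch, window, shape = stem.split('_')
    frames, size = shape.split('x')
    return {
        'variant': prefix.removeprefix('TFVideoSwin'),    # 'T', 'S', or 'B'
        'dataset': dataset,                               # e.g. 'K400'
        'pretrained_dataset': pretrained,                 # e.g. 'IN1K'
        'patch_size': tuple(int(c) for c in patch[1:]),   # 'P244' -> (2, 4, 4)
        # The first window dim may be two digits ('W1677' -> (16, 7, 7)),
        # so split it as [head, last-but-one, last] rather than per-digit.
        'window_size': (int(window[1:-2]), int(window[-2]), int(window[-1])),
        'num_frames': int(frames),                        # e.g. 32
        'input_size': int(size),                          # e.g. 224
    }
```

For example, `parse_checkpoint_name('TFVideoSwinB_SSV2_K400_P244_W1677_32x224.h5')` recovers the base variant with a (16, 7, 7) window; note that for the SSV2 checkpoint the third field is the Kinetics-400 initialization rather than an ImageNet one.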