Patch Size in Swin Transformer SSL Pre-training #4683
-
Hi all, I'm reading the paper "Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis" these days, but I'm not entirely clear about the patch size. My understanding is that during training the SwinViT takes in image patches of 96x96x96, and inference is done with a sliding window at test time. I noticed that the Swin UNETR pre-training train_transform doesn't include a Spacing step (which, as I understand it, resamples/downsamples the volume) the way the MONAI tutorial for UNETR SSL pre-training does. In earlier experiments I found that UNETR's performance on liver segmentation degrades a lot without Spacing(2, 2, 1.5), and I think that's why the tutorial applies spacing in both the training and validation transforms. So I'm a little confused about why the Swin Transformer pipeline doesn't use a spacing transform. I'm a Master's student currently learning about the Swin Transformer and its applications in medical imaging, so please feel free to correct me if anything in my question is wrong. I would really appreciate any insights from the community :)
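For context, here's a rough sketch of the kind of pipeline I have in mind, in the style of the MONAI tutorials (the pixdim, intensity window, and key names are just my illustrative values, not the released recipe): a transform chain that resamples to a fixed spacing before cropping 96x96x96 patches, plus sliding-window inference over the whole volume at test time.

```python
# Illustrative sketch only: MONAI-style transforms with a Spacingd step,
# followed by 96x96x96 random crops for training and sliding-window inference.
from monai.transforms import (
    Compose, LoadImaged, EnsureChannelFirstd, Orientationd,
    Spacingd, ScaleIntensityRanged, RandSpatialCropSamplesd,
)
from monai.inferers import sliding_window_inference

train_transform = Compose([
    LoadImaged(keys=["image"]),
    EnsureChannelFirstd(keys=["image"]),
    Orientationd(keys=["image"], axcodes="RAS"),
    # Resample to a consistent voxel spacing; (2.0, 2.0, 1.5) is the value
    # I used in my own experiments, following the UNETR SSL tutorial style.
    Spacingd(keys=["image"], pixdim=(2.0, 2.0, 1.5), mode="bilinear"),
    ScaleIntensityRanged(keys=["image"], a_min=-175, a_max=250,
                         b_min=0.0, b_max=1.0, clip=True),
    # Random 96x96x96 patches fed to the SwinViT encoder during training.
    RandSpatialCropSamplesd(keys=["image"], roi_size=(96, 96, 96),
                            num_samples=2, random_size=False),
])

def infer(volume, model):
    # At test time the full volume is covered with overlapping 96x96x96 windows;
    # `model` stands for whatever trained network is being evaluated.
    return sliding_window_inference(
        inputs=volume, roi_size=(96, 96, 96),
        sw_batch_size=4, predictor=model, overlap=0.5,
    )
```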
Replies: 1 comment
-
Hi fengling0410, thank you very much for your interest in Swin UNETR and its pre-training. Regarding spacing, you are right: a consistent spacing such as an isotropic 1x1x1 or 2x2x1.5 is better. For example, the Swin UNETR paper performed MSD pancreas/liver tumor segmentation with a 1x1x1 spacing in the training/validation transforms. As for why spacing is not used during pre-training: the collected pre-training data can be heterogeneous (e.g., in spacing, slice thickness, contrast, and regions of interest), and we aim to model that natural heterogeneity in head/neck/lung/abdomen/pelvis CT images. The Swin Transformer benefits from large-scale pre-training and can then be used for different downstream tasks with varying spacing resolutions. Of course, if you have a single fixed downstream task, pre-training the Swin Transformer with the same spacing can work better; that would be a task-dependent training setup. Hope this helps, thanks.
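As a rough sketch of the task-dependent case (values and checkpoint names below are purely illustrative, not the released recipe), a downstream fine-tuning setup might fix the spacing in its own transforms while reusing the pre-trained Swin weights:

```python
# Hedged sketch: fine-tuning with a fixed spacing (e.g., 1x1x1 as in the MSD
# liver/pancreas experiments) while loading self-supervised Swin weights.
# "model_swinvit.pt" is a placeholder name, not the official checkpoint.
import torch
from monai.networks.nets import SwinUNETR
from monai.transforms import (
    Compose, LoadImaged, EnsureChannelFirstd, Orientationd, Spacingd,
)

finetune_transform = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
    Orientationd(keys=["image", "label"], axcodes="RAS"),
    # Consistent spacing chosen for the downstream segmentation task.
    Spacingd(keys=["image", "label"], pixdim=(1.0, 1.0, 1.0),
             mode=("bilinear", "nearest")),
])

# Depending on the MONAI version, img_size may be required, deprecated, or
# removed; adjust the constructor call to your installed release.
model = SwinUNETR(img_size=(96, 96, 96), in_channels=1,
                  out_channels=3, feature_size=48)

# Load the pre-trained encoder weights; strict=False lets the decoder/head
# layers that were not pre-trained keep their random initialization.
checkpoint = torch.load("model_swinvit.pt", map_location="cpu")
model.load_state_dict(checkpoint.get("state_dict", checkpoint), strict=False)
```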