Patch Size in Swin Transformer SSL Pre-training #4683
-
Hi all, I'm reading the paper "Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis" these days, but I'm not entirely clear about the patch size. My understanding is that during training the SwinViT takes in image patches of 96x96x96, and inference is done with a sliding window at test time. I noticed that the Swin UNETR pre-training train_transform doesn't include a Spacing step (which, as I understand it, resamples/downsamples the volume) the way the MONAI tutorial for UNETR SSL pre-training does. In earlier experiments I found that UNETR's performance on liver segmentation degrades a lot without Spacing(2, 2, 1.5), and I think that's why the tutorial applies spacing in both the training and validation transforms. So I'm a little confused about why the Swin Transformer pipeline doesn't use a spacing transform. I'm a Master's student currently learning about the Swin Transformer and its applications in medical imaging, so please feel free to correct me if anything in my question is wrong. I would really appreciate any insights from the community :)
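For context, here's a rough sketch of the kind of pipeline I have in mind, in the style of the MONAI tutorials (the pixdim, intensity window, and key names are just my illustrative values, not the released recipe): a transform chain that resamples to a fixed spacing before cropping 96x96x96 patches, plus sliding-window inference over the whole volume at test time.

```python
# Illustrative sketch only: MONAI-style transforms with a Spacingd step,
# followed by 96x96x96 random crops for training and sliding-window inference.
from monai.transforms import (
    Compose, LoadImaged, EnsureChannelFirstd, Orientationd,
    Spacingd, ScaleIntensityRanged, RandSpatialCropSamplesd,
)
from monai.inferers import sliding_window_inference

train_transform = Compose([
    LoadImaged(keys=["image"]),
    EnsureChannelFirstd(keys=["image"]),
    Orientationd(keys=["image"], axcodes="RAS"),
    # Resample to a consistent voxel spacing; (2.0, 2.0, 1.5) is the value
    # I used in my own experiments, following the UNETR SSL tutorial style.
    Spacingd(keys=["image"], pixdim=(2.0, 2.0, 1.5), mode="bilinear"),
    ScaleIntensityRanged(keys=["image"], a_min=-175, a_max=250,
                         b_min=0.0, b_max=1.0, clip=True),
    # Random 96x96x96 patches fed to the SwinViT encoder during training.
    RandSpatialCropSamplesd(keys=["image"], roi_size=(96, 96, 96),
                            num_samples=2, random_size=False),
])

def infer(volume, model):
    # At test time the full volume is covered with overlapping 96x96x96 windows;
    # `model` stands for whatever trained network is being evaluated.
    return sliding_window_inference(
        inputs=volume, roi_size=(96, 96, 96),
        sw_batch_size=4, predictor=model, overlap=0.5,
    )
```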
Replies: 1 comment
-
Hi fengling0410, thank you very much for your interest in Swin UNETR and its pre-training. Regarding spacing, you are right: a consistent spacing such as an isotropic 1x1x1 or 2x2x1.5 is better. For example, the Swin UNETR paper performed MSD pancreas/liver tumor segmentation with a 1x1x1 spacing in the training/validation transforms. As for why spacing is not used during pre-training: the collected pre-training data can be heterogeneous (e.g., in spacing, slice thickness, contrast, and regions of interest), and we aim to model that natural heterogeneity in head/neck/lung/abdomen/pelvis CT images. The Swin Transformer benefits from large-scale pre-training and can then be used for different downstream tasks with varying spacing resolutions. Of course, if you have a single fixed downstream task, pre-training the Swin Transformer with the same spacing can work better; that would be a task-dependent training setup. Hope this helps, thanks.
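As a rough sketch of the task-dependent case (values and checkpoint names below are purely illustrative, not the released recipe), a downstream fine-tuning setup might fix the spacing in its own transforms while reusing the pre-trained Swin weights:

```python
# Hedged sketch: fine-tuning with a fixed spacing (e.g., 1x1x1 as in the MSD
# liver/pancreas experiments) while loading self-supervised Swin weights.
# "model_swinvit.pt" is a placeholder name, not the official checkpoint.
import torch
from monai.networks.nets import SwinUNETR
from monai.transforms import (
    Compose, LoadImaged, EnsureChannelFirstd, Orientationd, Spacingd,
)

finetune_transform = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
    Orientationd(keys=["image", "label"], axcodes="RAS"),
    # Consistent spacing chosen for the downstream segmentation task.
    Spacingd(keys=["image", "label"], pixdim=(1.0, 1.0, 1.0),
             mode=("bilinear", "nearest")),
])

# Depending on the MONAI version, img_size may be required, deprecated, or
# removed; adjust the constructor call to your installed release.
model = SwinUNETR(img_size=(96, 96, 96), in_channels=1,
                  out_channels=3, feature_size=48)

# Load the pre-trained encoder weights; strict=False lets the decoder/head
# layers that were not pre-trained keep their random initialization.
checkpoint = torch.load("model_swinvit.pt", map_location="cpu")
model.load_state_dict(checkpoint.get("state_dict", checkpoint), strict=False)
```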