This repository contains a PyTorch implementation of PSPNet (Pyramid Scene Parsing Network), based on the CVPR 2017 paper of the same name.
Semantic segmentation classifies every pixel in an image into a specific category. While Fully Convolutional Networks (FCNs) have revolutionized this field, their structure limits how well they capture the global context of a scene.
This project aims to implement PSPNet to overcome these limitations, focusing on how the Pyramid Pooling Module (PPM) effectively aggregates global context information to improve segmentation performance.
Traditional FCN-based models excel at extracting local features (texture, shape) but suffer from a limited receptive field. Even with deep layers, the model may fail to understand the overall scene context.
This limitation often leads to the "Mismatched Relationship" problem:
Example: If a boat is floating on a river, an FCN might misclassify the boat as a "car" based solely on its metallic texture and shape, ignoring the surrounding "water" context, which makes the presence of a car logically impossible.
PSPNet introduces the Pyramid Pooling Module to incorporate global context information into the local features.
Mechanism: The PPM aggregates context at four different pyramid scales to capture both global and local information:
- Multi-Scale Pooling: The feature map is pooled at four different scales (e.g., `1x1`, `2x2`, `3x3`, `6x6`).
  - `1x1` bin: Captures the global prior (the overall scene context).
  - Larger bins: Capture sub-region context.
- Dimension Reduction: A `1x1` convolution is applied to each pooled map to reduce the channel depth (bottleneck).
- Upsampling: The low-dimension feature maps are upsampled back to the size of the original feature map via bilinear interpolation.
- Concatenation: These upsampled context features are concatenated with the original local feature map.
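The four steps above can be sketched as a single module. This is a minimal illustration following the paper's defaults (bin sizes 1/2/3/6, channel depth reduced to `in_channels / 4` per branch); the class and argument names are ours, not necessarily those used in this repository:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolingModule(nn.Module):
    """Sketch of the PPM: pool at several scales, reduce channels with a
    1x1 convolution, upsample, and concatenate with the input features."""

    def __init__(self, in_channels=2048, bin_sizes=(1, 2, 3, 6)):
        super().__init__()
        reduced = in_channels // len(bin_sizes)  # bottleneck depth per branch
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(size),                      # pool to size x size bins
                nn.Conv2d(in_channels, reduced, 1, bias=False),  # 1x1 dimension reduction
                nn.BatchNorm2d(reduced),
                nn.ReLU(inplace=True),
            )
            for size in bin_sizes
        ])
        self.out_channels = in_channels + reduced * len(bin_sizes)

    def forward(self, x):
        h, w = x.shape[2:]
        # Upsample each pooled branch back to the input resolution.
        pyramids = [
            F.interpolate(branch(x), size=(h, w), mode="bilinear", align_corners=False)
            for branch in self.branches
        ]
        # Fuse the original local features with the multi-scale context.
        return torch.cat([x] + pyramids, dim=1)
```

Note that the `1x1`-bin branch collapses the whole feature map into a single vector per channel, which is exactly the "global prior" described above.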
Conclusion: By fusing local features with global priors, PSPNet can reason that "since the surrounding area is water, the object inside must be a boat, not a car," significantly boosting segmentation accuracy.
The implementation follows the architecture described in the original paper:
- Backbone: `ResNet-50` or `ResNet-101`
  - Dilated Convolution (Atrous Conv): Applied to the last two blocks (`layer3`, `layer4`) to maintain a larger spatial resolution (Output Stride = 8) without shrinking the receptive field.
- Neck (PPM): Fuses features from four different pool scales.
- Head: Final convolution layer for pixel-wise classification.
- Loss Function: Standard Cross-Entropy Loss (optionally combined with Auxiliary Loss for stable training).