Commit 69421c7

No public description
PiperOrigin-RevId: 665358252
1 parent 7f239d8 commit 69421c7

43 files changed: +7305 -0 lines changed
Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
# MaskConver: Revisiting Pure Convolution Model for Panoptic Segmentation (WACV 2024)

[![Paper](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2312.06052)

[MaskConver](https://arxiv.org/abs/2312.06052) is a pure convolutional panoptic
architecture. MaskConver proposes to fully unify the representation of things
and stuff by predicting their centers. To that end, it creates a lightweight
class embedding module that can break ties when multiple centers co-exist at
the same location. Furthermore, our study shows that the decoder design is
critical to ensuring that the model has sufficient context for accurate
detection and segmentation. We introduce a powerful ConvNeXt-UNet decoder that
closes the performance gap between convolution- and transformer-based models.
With a ResNet50 backbone, our MaskConver achieves 53.6% PQ on the COCO panoptic
val set, outperforming the modern convolution-based model Panoptic FCN by 9.3%
PQ, as well as transformer-based models such as Mask2Former (+1.7% PQ) and
kMaX-DeepLab (+0.6% PQ). Additionally, MaskConver with a MobileNet backbone
reaches 37.2% PQ, improving over Panoptic-DeepLab by +6.4% PQ under the same
FLOPs/latency constraints. A further optimized version of MaskConver achieves
29.7% PQ while running in real time on mobile devices.

MaskConver meta-architecture:

<p align="center">
   <img src="./docs/maskconver_architecture.png" width="80%">
</p>

The meta-architecture of MaskConver contains four components: backbone (gray),
pixel decoder (pink), prediction heads (light blue), and mask embedding
generator (green). The backbone is any commonly deployed neural network, e.g.,
ResNet50. We propose a novel ConvNeXt-UNet for the pixel decoder, which
effectively captures long-range context and high-level semantics by stacking
many ConvNeXt blocks at the highest level of the backbone. We propose three
prediction heads: the Center Heatmap Head (predicting center point heatmaps),
the Center Embedding Head (predicting embeddings for the center points), and
the Mask Feature Head (generating mask features). The Mask Embedding Generator
first produces class embeddings via a lookup table (the Class Embedding Lookup
Table module), using the predicted semantic classes of the top-K center points.
The output mask embeddings are obtained by modulating the class embeddings with
the center embeddings (via addition and an MLP) to mitigate center point
collisions between instances of different classes. In the end, the mask
features are multiplied with the mask embeddings to generate the final binary
masks. Unlike transformer-based methods, MaskConver exploits only convolutions,
without any self- or cross-attention.
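To make the mask-embedding path concrete, the snippet below is a minimal,
self-contained sketch of the computation described above. The tensor shapes,
layer sizes, and the specific `tf.keras` layers are assumptions for
illustration, not the repository's exact implementation.

```python
import tensorflow as tf

num_classes, embed_dim, top_k = 134, 256, 50  # assumed sizes
h = w = 64                                    # assumed mask-feature resolution

# Class Embedding Lookup Table: one learned vector per semantic class.
class_embedding_table = tf.keras.layers.Embedding(num_classes, embed_dim)
# Small MLP that fuses class embeddings with center embeddings.
mlp = tf.keras.Sequential([
    tf.keras.layers.Dense(embed_dim, activation='relu'),
    tf.keras.layers.Dense(embed_dim),
])


def mask_logits(top_k_classes, center_embeddings, mask_features):
  """top_k_classes: [B, K]; center_embeddings: [B, K, D]; mask_features: [B, H, W, D]."""
  class_embeddings = class_embedding_table(top_k_classes)      # [B, K, D]
  # Modulate class embeddings with center embeddings (addition + MLP) so that
  # centers of different classes colliding at one location stay separable.
  mask_embeddings = mlp(class_embeddings + center_embeddings)  # [B, K, D]
  # Multiply mask features with mask embeddings to get per-instance mask logits.
  return tf.einsum('bhwd,bkd->bkhw', mask_features, mask_embeddings)


logits = mask_logits(
    tf.random.uniform([2, top_k], maxval=num_classes, dtype=tf.int32),
    tf.random.normal([2, top_k, embed_dim]),
    tf.random.normal([2, h, w, embed_dim]))
print(logits.shape)  # (2, 50, 64, 64); sigmoid + threshold yields binary masks
```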
<p align="center">
   <img src="./docs/maskconver_plot.png" width="80%">
</p>

<p align="center">
   <img src="./docs/Table1.png" width="80%">
</p>

## Performance Reference

| Backbone     | Image Size | Params | FLOPs | PQ   | Latency  | Config |
| ------------ | ---------- | ------ | ----- | ---- | -------- | ------ |
| MobileNet-MH | 256x256    | 3.4M   | 1.5B  | 29.7 | 17.12 ms | maskconver_mobilenetv3p5_rf_256_coco.yaml |
| MobileNet-MH | 640x640    | 3.4M   | 9.58B | 37.2 | 24.93 ms | maskconver_mobilenetv3p5_rf_640_coco.yaml |
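The Config column lists the experiment YAML file for each entry; the 256x256
MobileNet configuration appears to correspond to the last config reproduced in
this commit. As a quick illustration, such a file can be inspected directly
with PyYAML (the relative path below is an assumption about where the file
sits in your checkout):

```python
import yaml  # PyYAML

# Path is assumed; point it at the experiment YAML from this commit.
with open('maskconver_mobilenetv3p5_rf_256_coco.yaml') as f:
  params = yaml.safe_load(f)

print(params['task']['model']['input_size'])  # [256, 256, 3]
print(params['trainer']['train_steps'])       # 500000
```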
### Citation

Should you find this repository useful, please consider citing:

```
@inproceedings{rashwan2024maskconver,
  title={MaskConver: Revisiting Pure Convolution Model for Panoptic Segmentation},
  author={Abdullah Rashwan and Jiageng Zhang and Ali Taalimi and Fan Yang and Xingyi Zhou and Chaochao Yan and Liang-Chieh Chen and Yeqing Li},
  year={2024},
  booktitle={2024 IEEE Winter Conference on Applications of Computer Vision (WACV)},
  organization={IEEE}
}
```
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
# Copyright 2024 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
# Copyright 2024 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
# Copyright 2024 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Backbones configurations."""
import dataclasses

from typing import List, Optional
from official.modeling import hyperparams
from official.vision.configs.google import backbones


@dataclasses.dataclass
class ResNetUNet(hyperparams.Config):
  """ResNetUNet config."""
  model_id: int = 50
  depth_multiplier: float = 1.0
  stem_type: str = 'v0'
  se_ratio: float = 0.0
  stochastic_depth_drop_rate: float = 0.0
  scale_stem: bool = True
  resnetd_shortcut: bool = False
  replace_stem_max_pool: bool = False
  bn_trainable: bool = True
  classification_output: bool = False
  upsample_kernel_sizes: Optional[List[int]] = None
  upsample_repeats: Optional[List[int]] = None
  upsample_filters: Optional[List[int]] = None


@dataclasses.dataclass
class Backbone(backbones.Backbone):
  resnet_unet: ResNetUNet = dataclasses.field(default_factory=ResNetUNet)
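As a usage sketch (not part of the commit), the config classes above can be
constructed directly in Python. The values below mirror the `resnet_unet`
block of the ImageNet pretraining YAML further down in this commit:

```python
backbone = Backbone(
    type='resnet_unet',
    resnet_unet=ResNetUNet(
        model_id=50,
        stochastic_depth_drop_rate=0.1,
        classification_output=True,
        upsample_kernel_sizes=[7, 7, 7],
        upsample_repeats=[18, 1, 1],
        upsample_filters=[384, 384, 384],
    ),
)
print(backbone.as_dict()['resnet_unet']['upsample_repeats'])  # [18, 1, 1]
```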
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
# Copyright 2024 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Decoders configurations."""
import dataclasses

from official.modeling import hyperparams
from official.vision.configs import decoders


@dataclasses.dataclass
class MaskConverFPN(hyperparams.Config):
  """FPN config."""
  num_filters: int = 256
  fusion_type: str = 'sum'
  use_separable_conv: bool = False
  use_keras_layer: bool = False
  use_layer_norm: bool = True
  depthwise_kernel_size: int = 7


@dataclasses.dataclass
class Decoder(decoders.Decoder):
  maskconver_fpn: MaskConverFPN = dataclasses.field(
      default_factory=MaskConverFPN)
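Similarly, a hedged sketch of adjusting the decoder config; this assumes the
nested-dict `override()` method that `hyperparams.Config` exposes:

```python
decoder = Decoder(type='maskconver_fpn')
# Keys must match the MaskConverFPN fields defined above.
decoder.maskconver_fpn.override({
    'depthwise_kernel_size': 7,
    'use_layer_norm': True,
})
print(decoder.maskconver_fpn.fusion_type)  # 'sum' (default left unchanged)
```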
Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
# --experiment_type=deit_imagenet_pretrain
runtime:
  distribution_strategy: 'tpu'
  mixed_precision_dtype: 'float32'
task:
  model:
    num_classes: 1001
    input_size: [384, 384, 3]
    backbone:
      type: 'resnet_unet'
      resnet_unet:
        model_id: 50
        stochastic_depth_drop_rate: 0.1
        classification_output: true
        upsample_kernel_sizes: [7, 7, 7]
        upsample_repeats: [18, 1, 1]
        upsample_filters: [384, 384, 384]
    norm_activation:
      activation: 'gelu'
      norm_momentum: 0.0
      norm_epsilon: 0.00001
      use_sync_bn: true
    dropout_rate: 0.0
  validation_data:
    global_batch_size: 1024
trainer:
  optimizer_config:
    learning_rate:
      cosine:
        alpha: 0.0
        initial_learning_rate: 0.004
        name: CosineDecay
        offset: 0
      type: cosine
    optimizer:
      adamw:
        weight_decay_rate: 0.05
    ema:
      average_decay: 0.9999
      trainable_weights_only: false
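The trainer block above requests a cosine schedule with `alpha: 0.0` and an
initial learning rate of 0.004. A minimal sketch of that schedule with stock
Keras follows; the `decay_steps` value here is an assumption (it is set by the
experiment, not this snippet), and the Model Garden uses its own schedule class
so that the `offset` field is honored:

```python
import tensorflow as tf

schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.004,
    decay_steps=100_000,  # assumed; in practice this matches the train steps
    alpha=0.0,            # decay all the way to zero
)
print(float(schedule(0)), float(schedule(100_000)))  # 0.004 ... 0.0
```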
Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
# Train on 4x8 TPU and eval on GPU. PQ: 29.66
# http://tb/7684098868201894928
# Note: Get PQ 29.7 with official evaluation.
# Note: We pad the model output to 640x640 for better eval.
runtime:
  distribution_strategy: 'tpu'
  mixed_precision_dtype: 'float32'
task:
  init_checkpoint: 'maskconver_seg_mnv3p5rf_coco_200k/43437096'
  init_checkpoint_modules: ['backbone', 'decoder']
  losses:
    l2_weight_decay: 0.00001
    mask_weight: 5.0
  model:
    input_size: [256, 256, 3]
    level: 3
    embedding_size: 256
    padded_output_size: [640, 640]
    num_instances: 50
    norm_activation:
      activation: 'relu'
      norm_epsilon: 0.001
      norm_momentum: 0.99
      use_sync_bn: true
    backbone:
      mobilenet:
        filter_size_scale: 1.0
        model_id: MobileNetMultiAVGSeg
        stochastic_depth_drop_rate: 0.0
        output_stride: 16
      type: mobilenet
    decoder:
      aspp:
        dilation_rates: [6, 12, 18]
        dropout_rate: 0.0
        level: 4
        num_filters: 256
        spp_layer_version: v1
        use_depthwise_convolution: true
      type: 'aspp'
    class_head:
      feature_fusion: deeplabv3plus_sum_to_merge
      level: 4
      low_level: 3
      low_level_num_filters: 256
      num_filters: 256
      prediction_kernel_size: 1
      upsample_factor: 1
      use_depthwise_convolution: true
      num_convs: 2
    per_pixel_embedding_head:
      feature_fusion: deeplabv3plus_sum_to_merge
      level: 4
      low_level: 3
      low_level_num_filters: 256
      num_filters: 256
      prediction_kernel_size: 1
      upsample_factor: 1
      use_depthwise_convolution: true
      num_convs: 2
    mask_embedding_head:
      feature_fusion: deeplabv3plus_sum_to_merge
      level: 4
      low_level: 3
      low_level_num_filters: 256
      num_filters: 256
      prediction_kernel_size: 1
      upsample_factor: 1
      use_depthwise_convolution: true
      num_convs: 2
    panoptic_generator:
      object_mask_threshold: 0.01
      overlap_threshold: 0.7
      rescale_predictions: true
      small_area_threshold: 256
  train_data:
    global_batch_size: 64
    parser:
      max_num_stuff_centers: 1
      gaussaian_iou: 0.7
      aug_scale_max: 1.9
      aug_scale_min: 0.1
      aug_type: null
  validation_data:
    global_batch_size: 1
    parser:
      segmentation_resize_eval_groundtruth: false
      segmentation_groundtruth_padded_size: [640, 640]
trainer:
  optimizer_config:
    learning_rate:
      cosine:
        decay_steps: 500000
        initial_learning_rate: 0.04
      type: cosine
    optimizer:
      sgd:
        momentum: 0.9
      type: sgd
    warmup:
      linear:
        name: linear
        warmup_learning_rate: 0
        warmup_steps: 2000
      type: linear
  steps_per_loop: 100
  summary_interval: 1000
  train_steps: 500000
  validation_interval: 1000
  validation_steps: 5000
  checkpoint_interval: 1000
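The `panoptic_generator` thresholds above drive the mask-merging step at
inference time. The sketch below is a rough, framework-free illustration of
the standard merging procedure such thresholds typically control; it is not
the repository's implementation.

```python
import numpy as np


def merge_masks(mask_probs, scores,
                object_mask_threshold=0.01,
                overlap_threshold=0.7,
                small_area_threshold=256):
  """mask_probs: [N, H, W] probabilities in [0, 1]; scores: [N] confidences."""
  h, w = mask_probs.shape[1:]
  panoptic = np.zeros((h, w), dtype=np.int32)  # 0 = void / unassigned
  segment_id = 0
  for i in np.argsort(-scores):                # strongest masks claim pixels first
    if scores[i] < object_mask_threshold:      # drop low-confidence masks
      break
    binary = mask_probs[i] > 0.5
    original_area = binary.sum()
    binary &= panoptic == 0                    # earlier segments keep their pixels
    if original_area == 0 or binary.sum() < small_area_threshold:
      continue                                 # too small after resolving overlaps
    if binary.sum() / original_area < overlap_threshold:
      continue                                 # lost too much area to other segments
    segment_id += 1
    panoptic[binary] = segment_id
  return panoptic


pan = merge_masks(np.random.rand(5, 64, 64), np.random.rand(5))
print(np.unique(pan))
```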
