Commit 69421c7

No public description
PiperOrigin-RevId: 665358252
1 parent 7f239d8 commit 69421c7

43 files changed: +7305 -0 lines changed
Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
# MaskConver: Revisiting Pure Convolution Model for Panoptic Segmentation (WACV 2024)

[![Paper](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2312.06052)

[MaskConver](https://arxiv.org/abs/2312.06052) is a pure convolutional panoptic
architecture. MaskConver proposes to fully unify the representation of things
and stuff by predicting their centers. To that end, it creates a lightweight
class embedding module that can break ties when multiple centers co-exist at
the same location. Furthermore, our study shows that the decoder design is
critical to ensuring that the model has sufficient context for accurate
detection and segmentation. We introduce a powerful ConvNeXt-UNet decoder that
closes the performance gap between convolution- and transformer-based models.
With a ResNet50 backbone, our MaskConver achieves 53.6% PQ on the COCO panoptic
val set, outperforming the modern convolution-based model Panoptic FCN by 9.3%
PQ, as well as transformer-based models such as Mask2Former (+1.7% PQ) and
kMaX-DeepLab (+0.6% PQ). Additionally, MaskConver with a MobileNet backbone
reaches 37.2% PQ, improving over Panoptic-DeepLab by +6.4% PQ under the same
FLOPs/latency constraints. A further optimized version of MaskConver achieves
29.7% PQ while running in real time on mobile devices.

MaskConver meta-architecture:

<p align="center">
   <img src="./docs/maskconver_architecture.png" width="80%">
</p>

The meta-architecture of MaskConver contains four components: backbone (gray),
pixel decoder (pink), prediction heads (light blue), and mask embedding
generator (green). The backbone is any commonly deployed neural network, e.g.,
ResNet50. We propose a novel ConvNeXt-UNet for the pixel decoder, which
effectively captures long-range context and high-level semantics by stacking
many ConvNeXt blocks at the highest level of the backbone. We propose three
prediction heads: the Center Heatmap Head (predicting center point heatmaps),
the Center Embedding Head (predicting embeddings for the center points), and
the Mask Feature Head (generating mask features). The Mask Embedding Generator
first produces class embeddings via a lookup table (the Class Embedding Lookup
Table module), using the predicted semantic classes of the top-K center points.
The output mask embeddings are obtained by modulating the class embeddings with
the center embeddings (via addition and an MLP) to mitigate center point
collisions between instances of different classes. In the end, the mask
features are multiplied with the mask embeddings to generate the final binary
masks. Unlike transformer-based methods, MaskConver exploits only convolutions,
without any self- or cross-attention.
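To make the mask-embedding path concrete, the snippet below is a minimal,
self-contained sketch of the computation described above. The tensor shapes,
layer sizes, and the specific `tf.keras` layers are assumptions for
illustration, not the repository's exact implementation.

```python
import tensorflow as tf

num_classes, embed_dim, top_k = 134, 256, 50  # assumed sizes
h = w = 64                                    # assumed mask-feature resolution

# Class Embedding Lookup Table: one learned vector per semantic class.
class_embedding_table = tf.keras.layers.Embedding(num_classes, embed_dim)
# Small MLP that fuses class embeddings with center embeddings.
mlp = tf.keras.Sequential([
    tf.keras.layers.Dense(embed_dim, activation='relu'),
    tf.keras.layers.Dense(embed_dim),
])


def mask_logits(top_k_classes, center_embeddings, mask_features):
  """top_k_classes: [B, K]; center_embeddings: [B, K, D]; mask_features: [B, H, W, D]."""
  class_embeddings = class_embedding_table(top_k_classes)      # [B, K, D]
  # Modulate class embeddings with center embeddings (addition + MLP) so that
  # centers of different classes colliding at one location stay separable.
  mask_embeddings = mlp(class_embeddings + center_embeddings)  # [B, K, D]
  # Multiply mask features with mask embeddings to get per-instance mask logits.
  return tf.einsum('bhwd,bkd->bkhw', mask_features, mask_embeddings)


logits = mask_logits(
    tf.random.uniform([2, top_k], maxval=num_classes, dtype=tf.int32),
    tf.random.normal([2, top_k, embed_dim]),
    tf.random.normal([2, h, w, embed_dim]))
print(logits.shape)  # (2, 50, 64, 64); sigmoid + threshold yields binary masks
```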
<p align="center">
   <img src="./docs/maskconver_plot.png" width="80%">
</p>

<p align="center">
   <img src="./docs/Table1.png" width="80%">
</p>

## Performance Reference

| Backbone     | Image Size | Params | FLOPs | PQ   | Latency  | Config |
| ------------ | ---------- | ------ | ----- | ---- | -------- | ------ |
| MobileNet-MH | 256x256    | 3.4M   | 1.5B  | 29.7 | 17.12 ms | maskconver_mobilenetv3p5_rf_256_coco.yaml |
| MobileNet-MH | 640x640    | 3.4M   | 9.58B | 37.2 | 24.93 ms | maskconver_mobilenetv3p5_rf_640_coco.yaml |
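The Config column lists the experiment YAML file for each entry; the 256x256
MobileNet configuration appears to correspond to the last config reproduced in
this commit. As a quick illustration, such a file can be inspected directly
with PyYAML (the relative path below is an assumption about where the file
sits in your checkout):

```python
import yaml  # PyYAML

# Path is assumed; point it at the experiment YAML from this commit.
with open('maskconver_mobilenetv3p5_rf_256_coco.yaml') as f:
  params = yaml.safe_load(f)

print(params['task']['model']['input_size'])  # [256, 256, 3]
print(params['trainer']['train_steps'])       # 500000
```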
### Citation

Should you find this repository useful, please consider citing:

```
@inproceedings{rashwan2024maskconver,
  title={MaskConver: Revisiting Pure Convolution Model for Panoptic Segmentation},
  author={Abdullah Rashwan and Jiageng Zhang and Ali Taalimi and Fan Yang and Xingyi Zhou and Chaochao Yan and Liang-Chieh Chen and Yeqing Li},
  year={2024},
  booktitle={2024 IEEE Winter Conference on Applications of Computer Vision (WACV)},
  organization={IEEE}
}
```
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
# Copyright 2024 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
# Copyright 2024 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
# Copyright 2024 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Backbones configurations."""
import dataclasses

from typing import List, Optional
from official.modeling import hyperparams
from official.vision.configs.google import backbones


@dataclasses.dataclass
class ResNetUNet(hyperparams.Config):
  """ResNetUNet config."""
  model_id: int = 50
  depth_multiplier: float = 1.0
  stem_type: str = 'v0'
  se_ratio: float = 0.0
  stochastic_depth_drop_rate: float = 0.0
  scale_stem: bool = True
  resnetd_shortcut: bool = False
  replace_stem_max_pool: bool = False
  bn_trainable: bool = True
  classification_output: bool = False
  upsample_kernel_sizes: Optional[List[int]] = None
  upsample_repeats: Optional[List[int]] = None
  upsample_filters: Optional[List[int]] = None


@dataclasses.dataclass
class Backbone(backbones.Backbone):
  resnet_unet: ResNetUNet = dataclasses.field(default_factory=ResNetUNet)
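As a usage sketch (not part of the commit), the config classes above can be
constructed directly in Python. The values below mirror the `resnet_unet`
block of the ImageNet pretraining YAML further down in this commit:

```python
backbone = Backbone(
    type='resnet_unet',
    resnet_unet=ResNetUNet(
        model_id=50,
        stochastic_depth_drop_rate=0.1,
        classification_output=True,
        upsample_kernel_sizes=[7, 7, 7],
        upsample_repeats=[18, 1, 1],
        upsample_filters=[384, 384, 384],
    ),
)
print(backbone.as_dict()['resnet_unet']['upsample_repeats'])  # [18, 1, 1]
```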
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
# Copyright 2024 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Decoders configurations."""
import dataclasses

from official.modeling import hyperparams
from official.vision.configs import decoders


@dataclasses.dataclass
class MaskConverFPN(hyperparams.Config):
  """FPN config."""
  num_filters: int = 256
  fusion_type: str = 'sum'
  use_separable_conv: bool = False
  use_keras_layer: bool = False
  use_layer_norm: bool = True
  depthwise_kernel_size: int = 7


@dataclasses.dataclass
class Decoder(decoders.Decoder):
  maskconver_fpn: MaskConverFPN = dataclasses.field(
      default_factory=MaskConverFPN)
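Similarly, a hedged sketch of adjusting the decoder config; this assumes the
nested-dict `override()` method that `hyperparams.Config` exposes:

```python
decoder = Decoder(type='maskconver_fpn')
# Keys must match the MaskConverFPN fields defined above.
decoder.maskconver_fpn.override({
    'depthwise_kernel_size': 7,
    'use_layer_norm': True,
})
print(decoder.maskconver_fpn.fusion_type)  # 'sum' (default left unchanged)
```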
Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
# --experiment_type=deit_imagenet_pretrain
runtime:
  distribution_strategy: 'tpu'
  mixed_precision_dtype: 'float32'
task:
  model:
    num_classes: 1001
    input_size: [384, 384, 3]
    backbone:
      type: 'resnet_unet'
      resnet_unet:
        model_id: 50
        stochastic_depth_drop_rate: 0.1
        classification_output: true
        upsample_kernel_sizes: [7, 7, 7]
        upsample_repeats: [18, 1, 1]
        upsample_filters: [384, 384, 384]
    norm_activation:
      activation: 'gelu'
      norm_momentum: 0.0
      norm_epsilon: 0.00001
      use_sync_bn: true
    dropout_rate: 0.0
  validation_data:
    global_batch_size: 1024
trainer:
  optimizer_config:
    learning_rate:
      cosine:
        alpha: 0.0
        initial_learning_rate: 0.004
        name: CosineDecay
        offset: 0
      type: cosine
    optimizer:
      adamw:
        weight_decay_rate: 0.05
    ema:
      average_decay: 0.9999
      trainable_weights_only: false
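The trainer block above requests a cosine schedule with `alpha: 0.0` and an
initial learning rate of 0.004. A minimal sketch of that schedule with stock
Keras follows; the `decay_steps` value here is an assumption (it is set by the
experiment, not this snippet), and the Model Garden uses its own schedule class
so that the `offset` field is honored:

```python
import tensorflow as tf

schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.004,
    decay_steps=100_000,  # assumed; in practice this matches the train steps
    alpha=0.0,            # decay all the way to zero
)
print(float(schedule(0)), float(schedule(100_000)))  # 0.004 ... 0.0
```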
Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
# Train on 4x8 TPU and eval on GPU. PQ: 29.66
# http://tb/7684098868201894928
# Note: Get PQ 29.7 with official evaluation.
# Note: We pad the model output to 640x640 for better eval.
runtime:
  distribution_strategy: 'tpu'
  mixed_precision_dtype: 'float32'
task:
  init_checkpoint: 'maskconver_seg_mnv3p5rf_coco_200k/43437096'
  init_checkpoint_modules: ['backbone', 'decoder']
  losses:
    l2_weight_decay: 0.00001
    mask_weight: 5.0
  model:
    input_size: [256, 256, 3]
    level: 3
    embedding_size: 256
    padded_output_size: [640, 640]
    num_instances: 50
    norm_activation:
      activation: 'relu'
      norm_epsilon: 0.001
      norm_momentum: 0.99
      use_sync_bn: true
    backbone:
      mobilenet:
        filter_size_scale: 1.0
        model_id: MobileNetMultiAVGSeg
        stochastic_depth_drop_rate: 0.0
        output_stride: 16
      type: mobilenet
    decoder:
      aspp:
        dilation_rates: [6, 12, 18]
        dropout_rate: 0.0
        level: 4
        num_filters: 256
        spp_layer_version: v1
        use_depthwise_convolution: true
      type: 'aspp'
    class_head:
      feature_fusion: deeplabv3plus_sum_to_merge
      level: 4
      low_level: 3
      low_level_num_filters: 256
      num_filters: 256
      prediction_kernel_size: 1
      upsample_factor: 1
      use_depthwise_convolution: true
      num_convs: 2
    per_pixel_embedding_head:
      feature_fusion: deeplabv3plus_sum_to_merge
      level: 4
      low_level: 3
      low_level_num_filters: 256
      num_filters: 256
      prediction_kernel_size: 1
      upsample_factor: 1
      use_depthwise_convolution: true
      num_convs: 2
    mask_embedding_head:
      feature_fusion: deeplabv3plus_sum_to_merge
      level: 4
      low_level: 3
      low_level_num_filters: 256
      num_filters: 256
      prediction_kernel_size: 1
      upsample_factor: 1
      use_depthwise_convolution: true
      num_convs: 2
    panoptic_generator:
      object_mask_threshold: 0.01
      overlap_threshold: 0.7
      rescale_predictions: true
      small_area_threshold: 256
  train_data:
    global_batch_size: 64
    parser:
      max_num_stuff_centers: 1
      gaussaian_iou: 0.7
      aug_scale_max: 1.9
      aug_scale_min: 0.1
      aug_type: null
  validation_data:
    global_batch_size: 1
    parser:
      segmentation_resize_eval_groundtruth: false
      segmentation_groundtruth_padded_size: [640, 640]
trainer:
  optimizer_config:
    learning_rate:
      cosine:
        decay_steps: 500000
        initial_learning_rate: 0.04
      type: cosine
    optimizer:
      sgd:
        momentum: 0.9
      type: sgd
    warmup:
      linear:
        name: linear
        warmup_learning_rate: 0
        warmup_steps: 2000
      type: linear
  steps_per_loop: 100
  summary_interval: 1000
  train_steps: 500000
  validation_interval: 1000
  validation_steps: 5000
  checkpoint_interval: 1000
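The `panoptic_generator` thresholds above drive the mask-merging step at
inference time. The sketch below is a rough, framework-free illustration of
the standard merging procedure such thresholds typically control; it is not
the repository's implementation.

```python
import numpy as np


def merge_masks(mask_probs, scores,
                object_mask_threshold=0.01,
                overlap_threshold=0.7,
                small_area_threshold=256):
  """mask_probs: [N, H, W] probabilities in [0, 1]; scores: [N] confidences."""
  h, w = mask_probs.shape[1:]
  panoptic = np.zeros((h, w), dtype=np.int32)  # 0 = void / unassigned
  segment_id = 0
  for i in np.argsort(-scores):                # strongest masks claim pixels first
    if scores[i] < object_mask_threshold:      # drop low-confidence masks
      break
    binary = mask_probs[i] > 0.5
    original_area = binary.sum()
    binary &= panoptic == 0                    # earlier segments keep their pixels
    if original_area == 0 or binary.sum() < small_area_threshold:
      continue                                 # too small after resolving overlaps
    if binary.sum() / original_area < overlap_threshold:
      continue                                 # lost too much area to other segments
    segment_id += 1
    panoptic[binary] = segment_id
  return panoptic


pan = merge_masks(np.random.rand(5, 64, 64), np.random.rand(5))
print(np.unique(pan))
```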
