Commit cb99fe3

Add detection and segmentation models to doc folder (#933)
1 parent: face20b

4 files changed: +279 -31 lines changed

docs/source/models.rst

Lines changed: 154 additions & 3 deletions
@@ -1,8 +1,18 @@
 torchvision.models
-==================
+##################
+
+
+The models subpackage contains definitions of models for addressing
+different tasks, including: image classification, pixelwise semantic
+segmentation, object detection, instance segmentation and person
+keypoint detection.
+
+
+Classification
+==============
 
 The models subpackage contains definitions for the following model
-architectures:
+architectures for image classification:
 
 - `AlexNet`_
 - `VGG`_
@@ -182,8 +192,149 @@ MobileNet v2
 .. autofunction:: mobilenet_v2
 
 ResNext
--------------
+-------
 
 .. autofunction:: resnext50_32x4d
 .. autofunction:: resnext101_32x8d
 
+
+Semantic Segmentation
+=====================
+
+As with image classification models, all pre-trained models expect input images normalized in the same way.
+The images have to be loaded into a range of ``[0, 1]`` and then normalized using
+``mean = [0.485, 0.456, 0.406]`` and ``std = [0.229, 0.224, 0.225]``.
+They have been trained on images resized such that their minimum size is 520.
+
+The pre-trained models have been trained on a subset of COCO train2017, on the 20 categories that are
+present in the Pascal VOC dataset. You can see more information on how the subset has been selected in
+``references/segmentation/coco_utils.py``. The classes that the pre-trained model outputs are the following,
+in order:
+
+.. code-block:: python
+
+    ['__background__', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus',
+     'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike',
+     'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']
+
+The accuracies of the pre-trained models evaluated on COCO val2017 are as follows:
+
+================================ ============= ====================
+Network                          mean IoU      global pixelwise acc
+================================ ============= ====================
+FCN ResNet101                    63.7          91.9
+DeepLabV3 ResNet101              67.4          92.4
+================================ ============= ====================
+
+
+Fully Convolutional Networks
+----------------------------
+
+.. autofunction:: torchvision.models.segmentation.fcn_resnet50
+.. autofunction:: torchvision.models.segmentation.fcn_resnet101
+
+
+DeepLabV3
+---------
+
+.. autofunction:: torchvision.models.segmentation.deeplabv3_resnet50
+.. autofunction:: torchvision.models.segmentation.deeplabv3_resnet101
+
+
+Object Detection, Instance Segmentation and Person Keypoint Detection
+=====================================================================
+
+The pre-trained models for detection, instance segmentation and
+keypoint detection are initialized with the classification models
+in torchvision.
+
+The models expect a list of ``Tensor[C, H, W]``, in the range ``0-1``.
+The models internally resize the images so that they have a minimum size
+of ``800``. This can be changed by passing the option ``min_size``
+to the constructor of the models.
+
+For object detection and instance segmentation, the pre-trained
+models return the predictions of the following classes:
+
+.. code-block:: python
+
+    COCO_INSTANCE_CATEGORY_NAMES = [
+        '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
+        'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
+        'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
+        'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
+        'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
+        'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
+        'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
+        'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
+        'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
+        'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard',
+        'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
+        'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
+    ]
+
+Here is a summary of the accuracies for the models trained on
+the instances set of COCO train2017 and evaluated on COCO val2017:
+
+================================ ======= ======== ===========
+Network                          box AP  mask AP  keypoint AP
+================================ ======= ======== ===========
+Faster R-CNN ResNet-50 FPN       37.0    -        -
+Mask R-CNN ResNet-50 FPN         37.9    34.6     -
+================================ ======= ======== ===========
+
+For person keypoint detection, the accuracies for the pre-trained
+models are as follows:
+
+================================ ======= ======== ===========
+Network                          box AP  mask AP  keypoint AP
+================================ ======= ======== ===========
+Keypoint R-CNN ResNet-50 FPN     54.6    -        65.0
+================================ ======= ======== ===========
+
+For person keypoint detection, the pre-trained model returns the
+keypoints in the following order:
+
+.. code-block:: python
+
+    COCO_PERSON_KEYPOINT_NAMES = [
+        'nose',
+        'left_eye',
+        'right_eye',
+        'left_ear',
+        'right_ear',
+        'left_shoulder',
+        'right_shoulder',
+        'left_elbow',
+        'right_elbow',
+        'left_wrist',
+        'right_wrist',
+        'left_hip',
+        'right_hip',
+        'left_knee',
+        'right_knee',
+        'left_ankle',
+        'right_ankle'
+    ]
+
+
+Faster R-CNN
+------------
+
+.. autofunction:: torchvision.models.detection.fasterrcnn_resnet50_fpn
+
+
+Mask R-CNN
+----------
+
+.. autofunction:: torchvision.models.detection.maskrcnn_resnet50_fpn
+
+
+Keypoint R-CNN
+--------------
+
+.. autofunction:: torchvision.models.detection.keypointrcnn_resnet50_fpn
+

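As a companion to the segmentation docs added above, here is a minimal inference sketch. It assumes the ``torchvision.models.segmentation`` entry points documented in this commit, the stated normalization constants, and the dict-with-``'out'``-key return convention of these models; the input file name is hypothetical.

.. code-block:: python

    import torch
    import torchvision
    from PIL import Image
    from torchvision import transforms

    # Normalization constants and minimum size stated in the docs above.
    preprocess = transforms.Compose([
        transforms.Resize(520),  # resize so the smaller edge is 520
        transforms.ToTensor(),   # loads the image into the [0, 1] range
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    model = torchvision.models.segmentation.fcn_resnet101(pretrained=True)
    model.eval()

    img = Image.open('input.jpg').convert('RGB')  # hypothetical input file
    batch = preprocess(img).unsqueeze(0)  # add a batch dimension -> [1, 3, H, W]

    with torch.no_grad():
        out = model(batch)['out']  # [1, 21, H, W], one channel per class

    # Per-pixel indices into the 21-entry class list documented above.
    pred = out.argmax(dim=1)
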
torchvision/models/detection/faster_rcnn.py

Lines changed: 37 additions & 7 deletions
@@ -32,19 +32,20 @@ class FasterRCNN(GeneralizedRCNN):
 
     During training, the model expects both the input tensors, as well as a targets dictionary,
     containing:
-        boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
-            between 0 and H and 0 and W
-        labels (Tensor[N]): the class label for each ground-truth box
+        - boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
+          between 0 and H and 0 and W
+        - labels (Tensor[N]): the class label for each ground-truth box
+
     The model returns a Dict[Tensor] during training, containing the classification and regression
     losses for both the RPN and the R-CNN.
 
     During inference, the model requires only the input tensors, and returns the post-processed
     predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as
     follows:
-        boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
-            0 and H and 0 and W
-        labels (Tensor[N]): the predicted labels for each image
-        scores (Tensor[N]): the scores or each prediction
+        - boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
+          0 and H and 0 and W
+        - labels (Tensor[N]): the predicted labels for each image
+        - scores (Tensor[N]): the scores of each prediction
 
     Arguments:
         backbone (nn.Module): the network used to compute the features for the model.
@@ -257,6 +258,35 @@ def fasterrcnn_resnet50_fpn(pretrained=False, progress=True,
     """
     Constructs a Faster R-CNN model with a ResNet-50-FPN backbone.
 
+    The input to the model is expected to be a list of tensors, each of shape ``[C, H, W]``, one for each
+    image, and should be in ``0-1`` range. Different images can have different sizes.
+
+    The behavior of the model changes depending on whether it is in training or evaluation mode.
+
+    During training, the model expects both the input tensors, as well as a targets dictionary,
+    containing:
+        - boxes (``Tensor[N, 4]``): the ground-truth boxes in ``[x0, y0, x1, y1]`` format, with values
+          between ``0`` and ``H`` and ``0`` and ``W``
+        - labels (``Tensor[N]``): the class label for each ground-truth box
+
+    The model returns a ``Dict[Tensor]`` during training, containing the classification and regression
+    losses for both the RPN and the R-CNN.
+
+    During inference, the model requires only the input tensors, and returns the post-processed
+    predictions as a ``List[Dict[Tensor]]``, one for each input image. The fields of the ``Dict`` are as
+    follows:
+        - boxes (``Tensor[N, 4]``): the predicted boxes in ``[x0, y0, x1, y1]`` format, with values between
+          ``0`` and ``H`` and ``0`` and ``W``
+        - labels (``Tensor[N]``): the predicted labels for each image
+        - scores (``Tensor[N]``): the scores of each prediction
+
+    Example::
+
+        >>> model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
+        >>> model.eval()
+        >>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
+        >>> predictions = model(x)
+
     Arguments:
         pretrained (bool): If True, returns a model pre-trained on COCO train2017
         progress (bool): If True, displays a progress bar of the download to stderr

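The new docstring shows an inference example; to round it out, here is a hedged sketch of a training step, with fabricated boxes and labels standing in for real annotations, and the ``min_size`` option described in models.rst passed through to the constructor. The ``num_classes=5`` and ``min_size=600`` values and the targets below are illustrative only.

.. code-block:: python

    import torch
    import torchvision

    # num_classes and min_size are illustrative; min_size is the constructor
    # option documented in models.rst above.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
        pretrained=False, num_classes=5, min_size=600)
    model.train()

    images = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]

    targets = []
    for _ in images:
        # Two fabricated ground-truth boxes per image, in [x0, y0, x1, y1] format.
        boxes = torch.tensor([[10., 20., 100., 150.], [30., 40., 200., 250.]])
        labels = torch.randint(1, 5, (2,))  # label 0 is the background class
        targets.append({'boxes': boxes, 'labels': labels})

    # In training mode the model returns a Dict[Tensor] of losses.
    loss_dict = model(images, targets)
    losses = sum(loss for loss in loss_dict.values())
    losses.backward()
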
torchvision/models/detection/keypoint_rcnn.py

Lines changed: 43 additions & 10 deletions
@@ -26,22 +26,23 @@ class KeypointRCNN(FasterRCNN):
 
     During training, the model expects both the input tensors, as well as a targets dictionary,
     containing:
-        boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
-            between 0 and H and 0 and W
-        labels (Tensor[N]): the class label for each ground-truth box
-        keypoints (Tensor[N, K, 3]): the K keypoints location for each of the N instances, in the
-            format [x, y, visibility], where visibility=0 means that the keypoint is not visible.
+        - boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
+          between 0 and H and 0 and W
+        - labels (Tensor[N]): the class label for each ground-truth box
+        - keypoints (Tensor[N, K, 3]): the K keypoint locations for each of the N instances, in the
+          format [x, y, visibility], where visibility=0 means that the keypoint is not visible.
+
     The model returns a Dict[Tensor] during training, containing the classification and regression
     losses for both the RPN and the R-CNN, and the keypoint loss.
 
     During inference, the model requires only the input tensors, and returns the post-processed
     predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as
     follows:
-        boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
-            0 and H and 0 and W
-        labels (Tensor[N]): the predicted labels for each image
-        scores (Tensor[N]): the scores or each prediction
-        keypoints (Tensor[N, K, 3]): the locations of the predicted keypoints, in [x, y, v] format.
+        - boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
+          0 and H and 0 and W
+        - labels (Tensor[N]): the predicted labels for each image
+        - scores (Tensor[N]): the scores of each prediction
+        - keypoints (Tensor[N, K, 3]): the locations of the predicted keypoints, in [x, y, v] format.
 
     Arguments:
         backbone (nn.Module): the network used to compute the features for the model.
@@ -228,6 +229,38 @@ def keypointrcnn_resnet50_fpn(pretrained=False, progress=True,
     """
     Constructs a Keypoint R-CNN model with a ResNet-50-FPN backbone.
 
+    The input to the model is expected to be a list of tensors, each of shape ``[C, H, W]``, one for each
+    image, and should be in ``0-1`` range. Different images can have different sizes.
+
+    The behavior of the model changes depending on whether it is in training or evaluation mode.
+
+    During training, the model expects both the input tensors, as well as a targets dictionary,
+    containing:
+        - boxes (``Tensor[N, 4]``): the ground-truth boxes in ``[x0, y0, x1, y1]`` format, with values
+          between ``0`` and ``H`` and ``0`` and ``W``
+        - labels (``Tensor[N]``): the class label for each ground-truth box
+        - keypoints (``Tensor[N, K, 3]``): the ``K`` keypoint locations for each of the ``N`` instances, in the
+          format ``[x, y, visibility]``, where ``visibility=0`` means that the keypoint is not visible.
+
+    The model returns a ``Dict[Tensor]`` during training, containing the classification and regression
+    losses for both the RPN and the R-CNN, and the keypoint loss.
+
+    During inference, the model requires only the input tensors, and returns the post-processed
+    predictions as a ``List[Dict[Tensor]]``, one for each input image. The fields of the ``Dict`` are as
+    follows:
+        - boxes (``Tensor[N, 4]``): the predicted boxes in ``[x0, y0, x1, y1]`` format, with values between
+          ``0`` and ``H`` and ``0`` and ``W``
+        - labels (``Tensor[N]``): the predicted labels for each image
+        - scores (``Tensor[N]``): the scores of each prediction
+        - keypoints (``Tensor[N, K, 3]``): the locations of the predicted keypoints, in ``[x, y, v]`` format.
+
+    Example::
+
+        >>> model = torchvision.models.detection.keypointrcnn_resnet50_fpn(pretrained=True)
+        >>> model.eval()
+        >>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
+        >>> predictions = model(x)
+
     Arguments:
         pretrained (bool): If True, returns a model pre-trained on COCO train2017
         progress (bool): If True, displays a progress bar of the download to stderr

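To connect the predictions described above with the ``COCO_PERSON_KEYPOINT_NAMES`` list from models.rst, here is a small sketch that pairs each predicted keypoint with its name; the 0.9 score threshold is an arbitrary illustrative choice.

.. code-block:: python

    import torch
    import torchvision

    # Order taken from COCO_PERSON_KEYPOINT_NAMES in models.rst above.
    KEYPOINT_NAMES = [
        'nose', 'left_eye', 'right_eye', 'left_ear', 'right_ear',
        'left_shoulder', 'right_shoulder', 'left_elbow', 'right_elbow',
        'left_wrist', 'right_wrist', 'left_hip', 'right_hip',
        'left_knee', 'right_knee', 'left_ankle', 'right_ankle',
    ]

    model = torchvision.models.detection.keypointrcnn_resnet50_fpn(pretrained=True)
    model.eval()

    x = [torch.rand(3, 300, 400)]
    with torch.no_grad():
        prediction = model(x)[0]  # one Dict[Tensor] per input image

    # Keep only confident detections; 0.9 is an arbitrary threshold.
    keep = prediction['scores'] > 0.9
    for person in prediction['keypoints'][keep]:
        # Each person is a Tensor[17, 3] of [x, y, visibility] rows.
        for name, (px, py, v) in zip(KEYPOINT_NAMES, person.tolist()):
            print(f'{name}: ({px:.1f}, {py:.1f}), visibility={v:.0f}')
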
torchvision/models/detection/mask_rcnn.py

Lines changed: 45 additions & 11 deletions
@@ -28,23 +28,24 @@ class MaskRCNN(FasterRCNN):
 
     During training, the model expects both the input tensors, as well as a targets dictionary,
     containing:
-        boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
-            between 0 and H and 0 and W
-        labels (Tensor[N]): the class label for each ground-truth box
-        masks (Tensor[N, H, W]): the segmentation binary masks for each instance
+        - boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
+          between 0 and H and 0 and W
+        - labels (Tensor[N]): the class label for each ground-truth box
+        - masks (Tensor[N, H, W]): the segmentation binary masks for each instance
+
     The model returns a Dict[Tensor] during training, containing the classification and regression
     losses for both the RPN and the R-CNN, and the mask loss.
 
     During inference, the model requires only the input tensors, and returns the post-processed
     predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as
     follows:
-        boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
-            0 and H and 0 and W
-        labels (Tensor[N]): the predicted labels for each image
-        scores (Tensor[N]): the scores or each prediction
-        mask (Tensor[N, H, W]): the predicted masks for each instance, in 0-1 range. In order to
-            obtain the final segmentation masks, the soft masks can be thresholded, generally
-            with a value of 0.5 (mask >= 0.5)
+        - boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
+          0 and H and 0 and W
+        - labels (Tensor[N]): the predicted labels for each image
+        - scores (Tensor[N]): the scores of each prediction
+        - masks (Tensor[N, H, W]): the predicted masks for each instance, in 0-1 range. In order to
+          obtain the final segmentation masks, the soft masks can be thresholded, generally
+          with a value of 0.5 (mask >= 0.5)
 
     Arguments:
         backbone (nn.Module): the network used to compute the features for the model.
@@ -226,6 +227,39 @@ def maskrcnn_resnet50_fpn(pretrained=False, progress=True,
     """
     Constructs a Mask R-CNN model with a ResNet-50-FPN backbone.
 
+    The input to the model is expected to be a list of tensors, each of shape ``[C, H, W]``, one for each
+    image, and should be in ``0-1`` range. Different images can have different sizes.
+
+    The behavior of the model changes depending on whether it is in training or evaluation mode.
+
+    During training, the model expects both the input tensors, as well as a targets dictionary,
+    containing:
+        - boxes (``Tensor[N, 4]``): the ground-truth boxes in ``[x0, y0, x1, y1]`` format, with values
+          between ``0`` and ``H`` and ``0`` and ``W``
+        - labels (``Tensor[N]``): the class label for each ground-truth box
+        - masks (``Tensor[N, H, W]``): the segmentation binary masks for each instance
+
+    The model returns a ``Dict[Tensor]`` during training, containing the classification and regression
+    losses for both the RPN and the R-CNN, and the mask loss.
+
+    During inference, the model requires only the input tensors, and returns the post-processed
+    predictions as a ``List[Dict[Tensor]]``, one for each input image. The fields of the ``Dict`` are as
+    follows:
+        - boxes (``Tensor[N, 4]``): the predicted boxes in ``[x0, y0, x1, y1]`` format, with values between
+          ``0`` and ``H`` and ``0`` and ``W``
+        - labels (``Tensor[N]``): the predicted labels for each image
+        - scores (``Tensor[N]``): the scores of each prediction
+        - masks (``Tensor[N, H, W]``): the predicted masks for each instance, in ``0-1`` range. In order to
+          obtain the final segmentation masks, the soft masks can be thresholded, generally
+          with a value of 0.5 (``mask >= 0.5``)
+
+    Example::
+
+        >>> model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
+        >>> model.eval()
+        >>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
+        >>> predictions = model(x)
+
     Arguments:
         pretrained (bool): If True, returns a model pre-trained on COCO train2017
        progress (bool): If True, displays a progress bar of the download to stderr

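Finally, the docstring above suggests thresholding the soft masks at 0.5; here is a minimal sketch of that post-processing step. The 0.5 score cut-off for detections is an illustrative choice, and the ``squeeze`` accounts for the singleton channel dimension the returned masks carry in practice.

.. code-block:: python

    import torch
    import torchvision

    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
    model.eval()

    x = [torch.rand(3, 300, 400)]
    with torch.no_grad():
        prediction = model(x)[0]

    # Soft masks in the 0-1 range, one per detected instance.
    soft_masks = prediction['masks']

    # Threshold at 0.5, as suggested in the docstring, to get binary masks.
    binary_masks = soft_masks >= 0.5

    # Merge the confident instances into a single boolean overlay.
    keep = prediction['scores'] > 0.5
    overlay = binary_masks[keep].any(dim=0).squeeze()  # Tensor[H, W]
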