Releases: pytorch/vision
Compat with PyTorch 1.3 and bugfix
This minor release provides binaries compatible with PyTorch 1.3.
Compared to version 0.4.0, it contains a single bugfix for the HMDB51 and UCF101 datasets, fixed in #1240
Video support, new datasets and models
This release adds support for video models and datasets, and brings several improvements.
Note: torchvision 0.4 requires PyTorch 1.2 or newer
Highlights
Video and IO
Video is now a first-class citizen in torchvision. The 0.4 release includes:
- efficient IO primitives for reading and writing video files
- Kinetics-400, HMDB51 and UCF101 datasets for action recognition, which are compatible with `torch.utils.data.DataLoader`
- Pre-trained models for action recognition, trained on Kinetics-400
- Training and evaluation scripts for reproducing the training results.
Writing your own video dataset is easy. We provide a utility class `VideoClips` that simplifies the task of enumerating all possible clips of fixed size in a list of video files, by creating an index of all clips in a set of videos. It additionally allows specifying a fixed frame rate for the videos.
from torchvision.datasets.video_utils import VideoClips

class MyVideoDataset(object):
    def __init__(self, video_paths):
        self.video_clips = VideoClips(video_paths,
                                      clip_length_in_frames=16,
                                      frames_between_clips=1,
                                      frame_rate=15)

    def __getitem__(self, idx):
        video, audio, info, video_idx = self.video_clips.get_clip(idx)
        return video, audio

    def __len__(self):
        return self.video_clips.num_clips()

We provide pre-trained models for action recognition, trained on Kinetics-400, which reproduce the results from the original papers where they were first introduced, as well as the corresponding training scripts.
| model | clip acc@1 |
|---|---|
| r3d_18 | 52.748 |
| mc3_18 | 53.898 |
| r2plus1d_18 | 57.498 |
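Loading one of these pre-trained models takes only a couple of lines. A minimal sketch, where the random clip merely illustrates the expected (batch, channels, frames, height, width) input layout:

```python
import torch
import torchvision

# action-recognition model pre-trained on Kinetics-400
model = torchvision.models.video.r3d_18(pretrained=True)
model.eval()

# dummy clip: batch of 1, 3 channels, 16 frames of 112x112 pixels
clip = torch.rand(1, 3, 16, 112, 112)
with torch.no_grad():
    scores = model(clip)  # one score per Kinetics-400 class, shape [1, 400]
```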
Bugfixes
- change aspect ratio calculation formula in `references/detection` (#1194)
- bug fixes in ImageNet (#1149)
- fix save_image when height or width equals 1 (#1059)
- Fix STL10 `__repr__` (#969)
- Fix wrong behavior of `GeneralizedRCNNTransform` in Python2 (#960)
Datasets
New
- Add USPS dataset (#961)(#1117)
- Added support for the QMNIST dataset (#995)
- Add HMDB51 and UCF101 datasets (#1156)
- Add Kinetics400 dataset (#1077)
Improvements
- Miscellaneous dataset fixes (#1174)
- Standardize str argument verification in datasets (#1167)
- Always pass `transform` and `target_transform` to abstract dataset (#1126)
- Remove duplicate transform assignment in FakeDataset (#1125)
- Automatic extraction for Cityscapes Dataset (#1066) (#1068)
- Use joint transform in Cityscapes (#1024)(#1045)
- CelebA: track attr names, support split="all", code cleanup (#1008)
- Add folds option to STL10 (#914)
Models
New
- Add pretrained Wide ResNet (#912)
- Memory efficient densenet (#1003) (#1090)
- Implementation of the MNASNet family of models (#829)(#1043)(#1092)
- Add VideoModelZoo models (#1130)
Improvements
- Fix resnet fpn backbone for resnet18 and resnet34 (#1147)
- Add checks to `roi_heads` in detection module (#1091)
- Make shallow copy of input list in `GeneralizedRCNNTransform` (#1085) (#1111) (#1084)
- Make MobileNetV2 number of channels divisible by 8 (#1005)
- typo fix: ouput -> output in Inception and GoogleNet (#1034)
- Remove empty proposals from the RPN (#1026)
- Remove empty boxes before NMS (#1019)
- Reduce code duplication in segmentation models (#1009)
- allow user to define residual settings in MobileNetV2 (#965)
- Use `flatten` instead of `view` (#1134)
Documentation
- Consistency in detection box format (#1110)
- Fix Mask R-CNN docs (#1089)
- Add paper references to VGG and Resnet variants (#1088)
- Doc, Test Fixes in `Normalize` (#1063)
- Add transforms doc to more datasets (#1038)
- Corrected typo: 5 to 0.5 (#1041)
- Update doc for `torchvision.transforms.functional.perspective` (#1017)
- Improve documentation for `fillcolor` option in `RandomAffine` (#994)
- Fix `COCO_INSTANCE_CATEGORY_NAMES` (#991)
- Added models information to documentation (#985)
- Add missing import in `faster_rcnn.py` documentation (#979)
- Improve `make_grid` docs (#964)
Tests
- Add test for SVHN (#1086)
- Add tests for Cityscapes Dataset (#1079)
- Update CI to Python 3.6 (#1044)
- Make `test_save_image` more robust (#1037)
- Add a generic test for the datasets (#1015)
- moved fakedata generation to separate module (#1014)
- Create imagenet fakedata on-the-fly (#1012)
- Minor test refactorings (#1011)
- Add test for CIFAR10(0) (#1010)
- Mock MNIST download for less flaky tests (#1004)
- Add test for ImageNet (#976)(#1006)
- Add tests for datasets (#966)
Transforms
New
Improvements
- Allowing 'F' mode for 1 channel FloatTensor in `ToPILImage` (#1100)
- Add shear parallel to y-axis (#1070)
- fix error message in `to_tensor` (#1000)
- Fix TypeError in `RandomResizedCrop.get_params` (#1036)
- Fix `normalize` for different `dtype` than `float32` (#1021)
Ops
- Renamed `vision.h` files to `vision_cpu.h` and `vision_cuda.h` (#1051) (#1052)
- Optimize `nms_cuda` by avoiding extra `torch.cat` call (#945)
Reference scripts
- Expose data-path in the detection reference scripts (#1109)
- Make `utils.py` work with pytorch-cpu (#1023)
- Add mixed precision training with Apex (#972) (#1124)
- Add reference code for similarity learning (#1101)
Build
- Add windows build steps and wheel build scripts (#998)
- add packaging scripts (#996)
- Allow forcing GPU build with `FORCE_CUDA=1` (#927)
Misc
Training scripts, detection/segmentation models and more
This release brings several new features to torchvision, including models for semantic segmentation, object detection, instance segmentation and person keypoint detection, and custom C++ / CUDA ops specific to computer vision.
Note: torchvision 0.3 requires PyTorch 1.1 or newer
Highlights
Reference training / evaluation scripts
We now provide, under the references/ folder, scripts for training and evaluation of the following tasks: classification, semantic segmentation, object detection, instance segmentation and person keypoint detection.
Their purpose is twofold:
- serve as a log of how to train a specific model.
- provide baseline training and evaluation scripts to bootstrap research
They all have an entry-point train.py which performs both training and evaluation for a particular task. Other helper files, specific to each training script, are also present in the folder, and they might get integrated into the torchvision library in the future.
We expect users to copy-paste and modify those reference scripts for their own needs.
TorchVision Ops
TorchVision now contains custom C++ / CUDA operators in torchvision.ops. Those operators are specific to computer vision, and make it easier to build object detection models.
Those operators currently do not support PyTorch script mode, but support for it is planned for future releases.
List of supported ops
- `roi_pool` (and the module version `RoIPool`)
- `roi_align` (and the module version `RoIAlign`)
- `nms`, for non-maximum suppression of bounding boxes
- `box_iou`, for computing the intersection over union metric between two sets of bounding boxes
All the other ops present in torchvision.ops and its subfolders are experimental, in particular:
- `FeaturePyramidNetwork` is a module that adds an FPN on top of a module that returns a set of feature maps.
- `MultiScaleRoIAlign` is a wrapper around `roi_align` that works with multiple feature map scales
Here are a few examples on using torchvision ops:
import torch
import torchvision
# create 10 random boxes
boxes = torch.rand(10, 4) * 100
# they need to be in [x0, y0, x1, y1] format
boxes[:, 2:] += boxes[:, :2]
# create a random image
image = torch.rand(1, 3, 200, 200)
# extract regions in `image` defined in `boxes`, rescaling
# them to have a size of 3x3
pooled_regions = torchvision.ops.roi_align(image, [boxes], output_size=(3, 3))
# check the size
print(pooled_regions.shape)
# torch.Size([10, 3, 3, 3])
# or compute the intersection over union between
# all pairs of boxes
print(torchvision.ops.box_iou(boxes, boxes).shape)
# torch.Size([10, 10])
Models for more tasks
The 0.3 release of torchvision includes pre-trained models for other tasks than image classification on ImageNet.
We include two new categories of models: region-based models, like Faster R-CNN, and dense pixelwise prediction models, like DeepLabV3.
Object Detection, Instance Segmentation and Person Keypoint Detection models
Warning: The API is currently experimental and might change in future versions of torchvision
The 0.3 release contains pre-trained models for Faster R-CNN, Mask R-CNN and Keypoint R-CNN, all of them using ResNet-50 backbone with FPN.
They have been trained on COCO train2017 following the reference scripts in references/, and give the following results on COCO val2017
| Network | box AP | mask AP | keypoint AP |
|---|---|---|---|
| Faster R-CNN ResNet-50 FPN | 37.0 | | |
| Mask R-CNN ResNet-50 FPN | 37.9 | 34.6 | |
| Keypoint R-CNN ResNet-50 FPN | 54.6 | | 65.0 |
The implementations of the models for object detection, instance segmentation and keypoint detection are fast, especially during training.
In the following table, we use 8 V100 GPUs, with CUDA 10.0 and CUDNN 7.4 to report the results. During training, we use a batch size of 2 per GPU, and during testing a batch size of 1 is used.
For test time, we report the time for the model evaluation and post-processing (including mask pasting in image), but not the time for computing the precision-recall.
| Network | train time (s / it) | test time (s / it) | memory (GB) |
|---|---|---|---|
| Faster R-CNN ResNet-50 FPN | 0.2288 | 0.0590 | 5.2 |
| Mask R-CNN ResNet-50 FPN | 0.2728 | 0.0903 | 5.4 |
| Keypoint R-CNN ResNet-50 FPN | 0.3789 | 0.1242 | 6.8 |
You can load and use pre-trained detection and segmentation models with a few lines of code
import PIL.Image
import torchvision
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
# set it to evaluation mode, as the model behaves differently
# during training and during evaluation
model.eval()
image = PIL.Image.open('/path/to/an/image.jpg')
image_tensor = torchvision.transforms.functional.to_tensor(image)
# pass a list of (potentially different sized) tensors
# to the model, in 0-1 range. The model will take care of
# batching them together and normalizing
output = model([image_tensor])
# output is a list of dict, containing the postprocessed predictions
Pixelwise Semantic Segmentation models
Warning: The API is currently experimental and might change in future versions of torchvision
The 0.3 release also contains models for dense pixelwise prediction on images.
It adds FCN and DeepLabV3 segmentation models, using ResNet50 and ResNet101 backbones.
Pre-trained weights for ResNet101 backbone are available, and have been trained on a subset of COCO train2017, which contains the same 20 categories as those from Pascal VOC.
The pre-trained models give the following results on the subset of COCO val2017 which contain the same 20 categories as those present in Pascal VOC:
| Network | mean IoU | global pixelwise acc |
|---|---|---|
| FCN ResNet101 | 63.7 | 91.9 |
| DeepLabV3 ResNet101 | 67.4 | 92.4 |
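Loading a pre-trained segmentation model looks much like the detection example above. A minimal sketch (the image path is a placeholder; for best results the input should also be normalized with the ImageNet mean and std):

```python
import PIL.Image
import torch
import torchvision

# FCN with a ResNet101 backbone, pre-trained on the COCO subset described above
model = torchvision.models.segmentation.fcn_resnet101(pretrained=True)
model.eval()

image = PIL.Image.open('/path/to/an/image.jpg').convert('RGB')
image_tensor = torchvision.transforms.functional.to_tensor(image)

with torch.no_grad():
    # the model returns a dict; 'out' holds the per-pixel class scores
    output = model(image_tensor.unsqueeze(0))['out']
# per-pixel predicted classes, shape [1, H, W]
predictions = output.argmax(1)
```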
New Datasets
- Add Caltech101, Caltech256, and CelebA (#775)
- ImageNet dataset (#764) (#858) (#870)
- Added Semantic Boundaries Dataset (#808) (#865)
- Add VisionDataset as a base class for all datasets (#749) (#859) (#838) (#876) (#878)
New Models
Classification
- Add GoogLeNet (Inception v1) (#678) (#821) (#828) (#816)
- Add MobileNet V2 (#818) (#917)
- Add ShuffleNet v2 (#849) (#886) (#889) (#892) (#916)
- Add ResNeXt-50 32x4d and ResNeXt-101 32x8d (#822) (#852) (#917)
Segmentation
- Fully-Convolutional Network (FCN) with ResNet 101 backbone
- DeepLabV3 with ResNet 101 backbone
Detection
- Faster R-CNN R-50 FPN trained on COCO train2017 (#898) (#921)
- Mask R-CNN R-50 FPN trained on COCO train2017 (#898) (#921)
- Keypoint R-CNN R-50 FPN trained on COCO train2017 (#898) (#921) (#922)
Breaking changes
- Make `CocoDataset` ids deterministically ordered (#868)
New Transforms
- Add bias vector to `LinearTransformation` (#793) (#843) (#881)
- Add Random Perspective transform (#781) (#879)
Bugfixes
Improvements
- Fixing mutation of 2d tensors in `to_pil_image` (#762)
- Replace `tensor.view` with `tensor.unsqueeze(0)` in `make_grid` (#765)
- Change usage of `view` to `reshape` in `resnet` to enable running with mkldnn (#890)
- Improve `normalize` to work with tensors located on any device (#787)
- Raise an `IndexError` for `FakeData.__getitem__()` if the index would be out of range (#780)
- Aspect ratio is now sampled from a logarithmic distribution in `RandomResizedCrop` (#799)
- Modernize inception v3 weight initialization code (#824)
- Remove duplicate code from densenet load_state_dict (#827)
- Replace `endswith` calls in a loop with a single `endswith` call in `DatasetFolder` (#832)
- Added missing dot in webp image extensions (#836)
- fix inconsistent behavior for `~` expression (#850)
- Minor Compressions in statements in `folder.py` (#874)
- Minor fix to evaluation formula of `PILLOW_VERSION` in `transforms.functional.affine` (#895)
- added `is_valid_file` parameter to `DatasetFolder` (#867)
- Add support for joint transformations in `VisionDataset` (#872)
- Auto calculating return dimension of `squeezenet` forward method (#884)
- Added `progress` flag to model getters (#875) (#910)
- Add support for other normalizations (i.e., `GroupNorm`) in `ResNet` (#813)
- Add dilation option to `ResNet` (#866)
Testing
- Add basic model testing. (#811)
- Add test for `num_class` in `test_model.py` (#815)
- Added test for `normalize` functionality in `make_grid` function (#840)
- Added downloaded directory not empty check in `test_datasets_utils` (#844)
- Added test for `save_image` in utils (#847)
- Added tests for `check_md5` and `check_integrity` (#873)
Misc
- Remove shebang in `setup.py` (#773)
- configurable version and package names (#842)
- More hub models (#851)
- Update travis to use more recent GCC (#891)
Documentation
- Add comments regarding downsampling layers of resnet (#794)
- Remove unnecessary bullet point in InceptionV3 doc (#814)
- Fix `crop` and `resized_crop` docs in `functional.py` (#817)
- Added dimensions in the comments of googlenet (#788)
- Update transform doc with random offset of padding due to `pad_if_needed` (#791)
- Added the argument `transform_input` in docs of InceptionV3 (#789)
- Update documentation for MNIST datasets (#778)
- Fixed typo in `normalize()` function (#823)
- Fix typo in squeezenet (#841)
- Fix typo in DenseNet comment (#857)
- Typo and syntax fixes to transform docstrings (#887)
More datasets, transforms and bugfixes
This version introduces several improvements and fixes.
Support for arbitrary input sizes for models
It is now possible to feed larger images than 224x224 into the models in torchvision.
We added an adaptive pooling layer just before the classifier, which adapts the size of the feature maps before the last layer, allowing for larger input images.
Relevant PRs: #744 #747 #746 #672 #643
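As a minimal sketch of what this enables, a classification model such as ResNet-18 now accepts inputs larger than 224x224 (the 320x320 size below is arbitrary):

```python
import torch
import torchvision

model = torchvision.models.resnet18()
model.eval()

# inputs larger than 224x224 now work: the adaptive pooling before the
# classifier reduces the feature maps to the size the final layer expects
with torch.no_grad():
    out = model(torch.rand(1, 3, 320, 320))
print(out.shape)  # torch.Size([1, 1000])
```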
Bugfixes
- Fix invalid argument error when using lsun method in windows (#508)
- Fix FashionMNIST loading MNIST (#640)
- Fix inception v3 input transform for trace & onnx (#621)
Datasets
- Add support for webp and tiff images in ImageFolder #736 #724
- Add K-MNIST dataset #687
- Add Cityscapes dataset #695 #725 #739 #700
- Add Flickr 8k and 30k datasets #674
- Add VOCDetection and VOCSegmentation datasets #663
- Add SBU Captioned Photo Dataset (#665)
- Updated URLs for EMNIST #726
- MNIST and FashionMNIST now have their own 'raw' and 'processed' folder #601
- Add metadata to some datasets (#501)
Improvements
- Allow RandomCrop to crop in the padded region #564
- ColorJitter now supports min/max values #548
- Generalize resnet to use block.expansion #487
- Move area calculation out of for loop in RandomResizedCrop #641
- Add option to zero-init the residual branch in resnet (#498)
- Improve error messages in to_pil_image #673
- Added the option of converting to tensor for numpy arrays having only two dimensions in to_tensor (#686)
- Optimize _find_classes in DatasetFolder via scandir in Python3 (#559)
- Add padding_mode to RandomCrop (#489 #512)
- Make DatasetFolder more generic (#527)
- Add in-place option to normalize (#699)
- Add Hamming and Box interpolations to transforms.py (#693)
- Added support for 2-channel Image modes such as 'LA', and added a mode to the 4-channel modes (#688)
- Improve support for 'P' image mode in pad (#683)
- Make torchvision depend on pillow-simd if already installed (#522)
- Make tests run faster (#745)
- Add support for non-square crops in RandomResizedCrop (#715)
Breaking changes
- save_image now rounds to nearest integer #754
Misc
- Added code coverage to travis #703
- Add downloads and docs badge to README (#702)
- Add progress to download_url #497 #524 #535
- Replace 'residual' with 'identity' in resnet.py (#679)
- Consistency changes in the models
- Refactored MNIST and CIFAR to have data and target fields #578 #594
- Update torchvision to newer versions of PyTorch
- Relax assertion in `transforms.Lambda.__init__` (#637)
- Cast MNIST target to int (#605)
- Change default target type of FakedDataset to long (#581)
- Improve docs of functional transforms (#602)
- Docstring improvements
- Add is_image_file to folder_dataset (#507)
- Add deprecation warning in MNIST train[test]_labels[data] (#742)
- Mention TORCH_MODEL_ZOO in models documentation. (#624)
- Add scipy as a dependency to setup.py (#675)
- Added size information for inception v3 (#719)
New datasets, transforms and fixes
This version introduces several fixes and improvements to the previous version.
Better printing of Datasets and Transforms
- Add descriptions to Transform objects.
# Now T.Compose([T.RandomHorizontalFlip(), T.RandomCrop(224), T.ToTensor()]) prints
Compose(
RandomHorizontalFlip(p=0.5)
RandomCrop(size=(224, 224), padding=0)
ToTensor()
)
- Add descriptions to Datasets
# now torchvision.datasets.MNIST('~') prints
Dataset MNIST
Number of datapoints: 60000
Split: train
Root Location: /private/home/fmassa
Transforms (if any): None
Target Transforms (if any): None
New transforms
- Add `RandomApply`, `RandomChoice`, `RandomOrder` transformations #402 (see the sketch after this list)
  - `RandomApply`: applies a list of transformations with a probability
  - `RandomChoice`: randomly chooses a single transformation from a list
  - `RandomOrder`: applies the transformations in a random order
- Add random affine transformation #411
- Add reflect, symmetric and edge padding to `transforms.pad` #460
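A minimal sketch of how the three meta-transforms compose; the inner transforms and probabilities are only illustrative:

```python
import torchvision.transforms as T

# apply the color jitter with probability 0.5
random_apply = T.RandomApply([T.ColorJitter(brightness=0.2)], p=0.5)

# pick exactly one of the two flips at random
random_choice = T.RandomChoice([T.RandomHorizontalFlip(), T.RandomVerticalFlip()])

# apply both transforms, but in a random order
random_order = T.RandomOrder([T.RandomCrop(224), T.RandomHorizontalFlip()])
```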
Performance improvements
- Speedup MNIST preprocessing by a factor of 1000x
- make weight initialization optional to speed VGG construction. This makes loading pre-trained VGG models much faster
- Accelerate `transforms.adjust_gamma` by using PIL's point function instead of custom numpy-based implementation
New Datasets
- EMNIST - an extension of MNIST for hand-written letters
- OMNIGLOT - a dataset for one-shot learning, with 1623 different handwritten characters from 50 different alphabets
- Add a DatasetFolder class - generalization of ImageFolder
Miscellaneous improvements
- FakeData accepts a seed argument, so having multiple different FakeData instances is now possible
- Use consistent datatypes in Dataset targets. Now all datasets that return labels will have them as int
- Add probability parameter in `RandomHorizontalFlip` and `RandomVerticalFlip`
- Replace `np.random` by `random` in transforms - improves reproducibility in multi-threaded environments with default arguments
- Detect tif images in ImageFolder
- Add `pad_if_needed` to `RandomCrop`, so that if the crop size is larger than the image, the image is automatically padded
- Add support in `transforms.ToTensor` for PIL Images with mode '1'
Bugfixes
- Fix passing list of tensors to `utils.save_image`
- single images passed to `make_grid` are now also normalized
- Fix PIL img close warnings
- Added missing weight initializations to densenet
- Avoid division by zero in `make_grid` when the image is constant
- Fix `ToTensor` when PIL Image has mode F
- Fix bug with `to_tensor` when the input is a numpy array of type np.float32
v0.2.0: New transforms + a new functional interface
This version introduced a functional interface to the transforms, allowing for joint random transformation of inputs and targets. We also introduced a few breaking changes to some datasets and transforms (see below for more details).
Transforms
We have introduced a functional interface for the torchvision transforms, available under torchvision.transforms.functional. This now makes it possible to do joint random transformations on inputs and targets, which is especially useful in tasks like object detection, segmentation and super resolution. For example, you can now do the following:
from torchvision import transforms
import torchvision.transforms.functional as F
import random
def my_segmentation_transform(input, target):
    i, j, h, w = transforms.RandomCrop.get_params(input, (100, 100))
    input = F.crop(input, i, j, h, w)
    target = F.crop(target, i, j, h, w)
    if random.random() > 0.5:
        input = F.hflip(input)
        target = F.hflip(target)
    return F.to_tensor(input), F.to_tensor(target)

The following transforms have also been added:
- `F.vflip` and `RandomVerticalFlip`
- FiveCrop and TenCrop
- Various color transformations: `ColorJitter`, `F.adjust_brightness`, `F.adjust_contrast`, `F.adjust_saturation`, `F.adjust_hue`
- `LinearTransformation` for applications such as whitening
- `Grayscale` and `RandomGrayscale`
- `Rotate` and `RandomRotation`
- `ToPILImage` now supports `RGBA` images
- `ToPILImage` now accepts a `mode` argument so you can specify which colorspace the image should be
- `RandomResizedCrop` now accepts `scale` and `ratio` ranges as input parameters (see the brief sketch after this list)
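A brief sketch combining several of the new transforms (parameter values are illustrative, not recommendations):

```python
import torchvision.transforms as T

transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.5, 1.0), ratio=(0.75, 1.33)),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    T.RandomGrayscale(p=0.1),
    T.RandomVerticalFlip(),
    T.ToTensor(),
])
```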
Documentation
Documentation is now auto-generated and published to pytorch.org
Datasets:
- SEMEION Dataset of handwritten digits added
- Phototour dataset patches computed via multi-scale Harris corners now available by setting name equal to `notredame_harris`, `yosemite_harris` or `liberty_harris` in the Phototour dataset
Bug fixes:
- Pre-trained densenet models are now CPU compatible #251
Breaking changes:
This version also introduced some breaking changes:
- The `SVHN` dataset has now been made consistent with other datasets by making the label for the digit 0 be 0, instead of 10 (as it was previously) (see #194 for more details)
- The `labels` for the unlabelled `STL10` dataset are now an array filled with -1
- The order of the input args to the deprecated `Scale` transform has changed from `(width, height)` to `(height, width)` to be consistent with other transforms
More models and some bug fixes
- Ability to switch image backends between PIL and accimage
- Added more tests
- Various bug fixes and doc improvements
Models
- Fix for inception v3 input transform bug #144
- Added pretrained VGG models with batch norm
Datasets
- Fix indexing bug in LSUN dataset (#177)
- enable `~` to be used in dataset paths
- `ImageFolder` now returns the same (sorted) file order on different machines (#193)
Transforms
- transforms.Scale now accepts a tuple as new size or single integer
Utils
- can now pass a pad value to make_grid and save_image
More models and datasets. Some bugfixes
New Features
Models
- SqueezeNet 1.0 and 1.1 models added, along with pre-trained weights
- Add pre-trained weights for VGG models
- Fix location of dropout in VGG
- `torchvision.models` now expose `num_classes` as a constructor argument
- Add InceptionV3 model and pre-trained weights
- Add DenseNet models and pre-trained weights
Datasets
- Add STL10 dataset
- Add SVHN dataset
- Add PhotoTour dataset
Transforms and Utilities
- `transforms.Pad` now allows fill colors of either number tuples, or named colors like `"white"`
- add normalization options to `make_grid` and `save_image`
- `ToTensor` now supports more input types
Performance Improvements
Bug Fixes
- ToPILImage now supports a single image
- Python3 compatibility bug fixes
- `ToTensor` now copes with all PIL Image types, not just RGB images
- ImageFolder now only scans subdirectories.
  - Having files like `.DS_Store` is now no longer a blocking hindrance
  - Check for non-zero number of images in ImageFolder
  - Subdirectories of classes have recursive scans for images
- LSUN test set loads now
Just a version bump
A small release, just needed a version bump because of PyPI.
Add models and modelzoo, some bugfixes
New Features
- Add `torchvision.models`: Definitions and pre-trained models for common vision models
  - ResNet, AlexNet, VGG models added with downloadable pre-trained weights
- adding padding to RandomCrop. Also add `transforms.Pad`
- Add MNIST dataset
Performance Fixes
- Fixing performance of LSUN Dataset
Bug Fixes
- Some Python3 fixes
- Bug fixes in save_image, add single channel support