Merge pull request #15 from JuliaML/svhn

Evizero · web-flow · commit d98ed58a5e8c · 2018-03-02T23:58:25.000+01:00
add SVHN dataset (Format 2)
diff --git a/.travis.yml b/.travis.yml
@@ -1,31 +1,28 @@
 language: julia
-
 os:
-    - linux
-    - osx
-
+  - linux
+  - osx
 julia:
-    - 0.6
-    - nightly
-matrix:
-    allow_failures:
-        - julia: nightly
+  - 0.6
+  - nightly
+notifications:
+  email: false
 git:
-    depth: 5000
+  depth: 99999999
 
-notifications:
-    email: false
+matrix:
+  allow_failures:
+    - julia: nightly
 
-before_script:
-  - export PATH=$HOME/.local/bin:$PATH
+addons:
+  apt: # apt-get for linux
+    packages:
+      - hdf5-tools
 
 install:
   #- sudo pip install pymdown-extensions
 
 after_success:
+  - julia -e 'cd(Pkg.dir("MLDatasets")); Pkg.add("Coverage"); using Coverage; Coveralls.submit(Coveralls.process_folder())'
   - julia -e 'Pkg.add("Documenter")'
   - julia -e 'cd(Pkg.dir("MLDatasets")); include(joinpath("docs", "make.jl"))'
-
-script:
-  - if [[ -a .git/shallow ]]; then git fetch --unshallow; fi
-  - julia -e 'Pkg.clone(pwd()); Pkg.build("MLDatasets"); Pkg.test("MLDatasets"; coverage=true)'
diff --git a/README.md b/README.md
@@ -70,6 +70,9 @@ Dataset | Classes | `traintensor` | `trainlabels` | `testtensor` | `testlabels`
 [**FashionMNIST**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/FashionMNIST/) | 10 | 28x28x60000 | 60000 | 28x28x10000 | 10000
 [**CIFAR-10**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CIFAR10/) | 10 | 32x32x3x50000 | 50000 | 32x32x3x10000 | 10000
 [**CIFAR-100**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CIFAR100/) | 100 (20) | 32x32x3x50000 | 50000 (x2) | 32x32x3x10000 | 10000 (x2)
+[**SVHN-2**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/SVHN2/) (*) | 10 | 32x32x3x73257 | 73257 | 32x32x3x26032 | 26032
+
+(*) Note that the SVHN-2 dataset provides an additional 531131 observations besides the training and test sets.
 
 ### Language Modeling
 
diff --git a/REQUIRE b/REQUIRE
@@ -5,3 +5,4 @@ ColorTypes 0.4
 DataDeps
 GZip
 BinDeps
+MAT
diff --git a/docs/make.jl b/docs/make.jl
@@ -18,6 +18,7 @@ makedocs(
                 "Fashion MNIST" => "datasets/FashionMNIST.md",
                 "CIFAR-10" => "datasets/CIFAR10.md",
                 "CIFAR-100" => "datasets/CIFAR100.md",
+                "SVHN format 2" => "datasets/SVHN2.md",
             ],
         ],
         hide("Indices" => "indices.md"),
diff --git a/docs/src/datasets/SVHN2.md b/docs/src/datasets/SVHN2.md
@@ -0,0 +1,152 @@
+# [The Street View House Numbers (SVHN) Dataset](@id SVHN2)
+
+Description from the [official
+website](http://ufldl.stanford.edu/housenumbers/):
+
+> SVHN is a real-world image dataset for developing machine
+> learning and object recognition algorithms with minimal
+> requirement on data preprocessing and formatting. It can be
+> seen as similar in flavor to MNIST (e.g., the images are of
+> small cropped digits), but incorporates an order of magnitude
+> more labeled data (over 600,000 digit images) and comes from a
+> significantly harder, unsolved, real world problem (recognizing
+> digits and numbers in natural scene images). SVHN is obtained
+> from house numbers in Google Street View images.
+
+About Format 2 (Cropped Digits):
+
+> All digits have been resized to a fixed resolution of 32-by-32
+> pixels. The original character bounding boxes are extended in
+> the appropriate dimension to become square windows, so that
+> resizing them to 32-by-32 pixels does not introduce aspect
+> ratio distortions. Nevertheless this preprocessing introduces
+> some distracting digits to the sides of the digit of interest.
+
+!!! note
+
+    For non-commercial use only
+
+## Contents
+
+```@contents
+Pages = ["SVHN2.md"]
+Depth = 3
+```
+
+## Overview
+
+The `MLDatasets.SVHN2` sub-module provides a programmatic
+interface to download, load, and work with the SVHN2 dataset of
+house-number digits cropped from Street View images.
+
+```julia
+using MLDatasets
+
+# load full training set
+train_x, train_y = SVHN2.traindata()
+
+# load full test set
+test_x,  test_y  = SVHN2.testdata()
+
+# load extra training set
+extra_x, extra_y = SVHN2.extradata()
+```
+
+The provided functions also allow for optional arguments, such as
+the directory `dir` where the dataset is located, or the specific
+observation `indices` that one wants to work with. For more
+information on the interface take a look at the documentation
+(e.g. `?SVHN2.traindata`).
+
+Function | Description
+---------|-------------
+[`download([dir])`](@ref SVHN2.download) | Trigger interactive download of the dataset
+[`classnames()`](@ref SVHN2.classnames) | Return the class names as a vector of strings
+[`traintensor([T], [indices]; [dir])`](@ref SVHN2.traintensor) | Load the training images as an array of eltype `T`
+[`trainlabels([indices]; [dir])`](@ref SVHN2.trainlabels) | Load the labels for the training images
+[`traindata([T], [indices]; [dir])`](@ref SVHN2.traindata) | Load images and labels of the training data
+[`testtensor([T], [indices]; [dir])`](@ref SVHN2.testtensor) | Load the test images as an array of eltype `T`
+[`testlabels([indices]; [dir])`](@ref SVHN2.testlabels) | Load the labels for the test images
+[`testdata([T], [indices]; [dir])`](@ref SVHN2.testdata) | Load images and labels of the test data
+[`extratensor([T], [indices]; [dir])`](@ref SVHN2.extratensor) | Load the extra images as an array of eltype `T`
+[`extralabels([indices]; [dir])`](@ref SVHN2.extralabels) | Load the labels for the extra training images
+[`extradata([T], [indices]; [dir])`](@ref SVHN2.extradata) | Load images and labels of the extra training data
+
+This module also provides utility functions to make working with
+the SVHN (format 2) dataset in Julia more convenient.
+
+Function | Description
+---------|-------------
+[`convert2features(array)`](@ref SVHN2.convert2features) | Convert the SVHN tensor to a flat feature matrix
+[`convert2image(array)`](@ref SVHN2.convert2image) | Convert the SVHN tensor/matrix to a colorant array
+
+You can use the function
+[`convert2features`](@ref SVHN2.convert2features) to convert
+the given SVHN tensor to a feature matrix (or feature vector
+in the case of a single image). The purpose of this function is
+to drop the spatial dimensions such that traditional ML
+algorithms can process the dataset.
+
+```julia
+julia> SVHN2.convert2features(SVHN2.traindata()[1]) # full training data
+3072×73257 Array{N0f8,2}:
+[...]
+```
+
+To visualize an image or a prediction we provide the function
+[`convert2image`](@ref SVHN2.convert2image) to convert the
+given SVHN2 horizontal-major tensor (or feature matrix) to a
+vertical-major `Colorant` array.
+
+```julia
+julia> SVHN2.convert2image(SVHN2.traindata(1)[1]) # first training image
+32×32 Array{RGB{N0f8},2}:
+[...]
+```
+
+## API Documentation
+
+```@docs
+SVHN2
+```
+
+### Trainingset
+
+```@docs
+SVHN2.traintensor
+SVHN2.trainlabels
+SVHN2.traindata
+```
+
+### Testset
+
+```@docs
+SVHN2.testtensor
+SVHN2.testlabels
+SVHN2.testdata
+```
+
+### Extraset
+
+```@docs
+SVHN2.extratensor
+SVHN2.extralabels
+SVHN2.extradata
+```
+
+### Utilities
+
+```@docs
+SVHN2.download
+SVHN2.classnames
+SVHN2.convert2features
+SVHN2.convert2image
+```
+
+## References
+
+- **Authors**: Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, Andrew Y. Ng
+
+- **Website**: http://ufldl.stanford.edu/housenumbers
+
+- **[Netzer et al., 2011]** Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, Andrew Y. Ng. "Reading Digits in Natural Images with Unsupervised Feature Learning" NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011
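Review note on the `convert2features` description in the docs above: the contents of `utils.jl` are not shown in this diff, so the following is only a minimal sketch of what "dropping the spatial dimensions" could look like. The name `flatten_features` is hypothetical and is not the package's implementation; the input is synthetic data with the SVHN format 2 image shape (32x32x3), so nothing is downloaded.

```julia
# Hypothetical stand-in for SVHN2.convert2features (assumption, not the package code).
# A 4-D array (width x height x channels x N) becomes a feature matrix with one
# column per observation; a single 3-D image becomes a feature vector.
flatten_features(x::AbstractArray{T,4}) where {T} = reshape(x, :, size(x, 4))
flatten_features(x::AbstractArray{T,3}) where {T} = vec(x)

# Synthetic batch: 5 observations shaped like SVHN format 2 images.
batch = rand(Float32, 32, 32, 3, 5)

features = flatten_features(batch)
println(size(features))                              # (3072, 5)

single = flatten_features(batch[:, :, :, 1])
println(size(single))                                # (3072,)
```

The row count 3072 = 32 * 32 * 3 matches the `3072×73257` matrix shown in the documentation example for the full training set.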
diff --git a/docs/src/index.md b/docs/src/index.md
@@ -73,6 +73,9 @@ Dataset | Classes | `traintensor` | `trainlabels` | `testtensor` | `testlabels`
 [**FashionMNIST**](@ref FashionMNIST) | 10 | 28x28x60000 | 60000 | 28x28x10000 | 10000
 [**CIFAR-10**](@ref CIFAR10) | 10 | 32x32x3x50000 | 50000 | 32x32x3x10000 | 10000
 [**CIFAR-100**](@ref CIFAR100) | 100 (20) | 32x32x3x50000 | 50000 (x2) | 32x32x3x10000 | 10000 (x2)
+[**SVHN-2**](@ref SVHN2) (*) | 10 | 32x32x3x73257 | 73257 | 32x32x3x26032 | 26032
+
+(*) Note that the SVHN-2 dataset provides an additional 531131 observations besides the training and test sets.
 
 ### Language Modeling
 
diff --git a/src/MLDatasets.jl b/src/MLDatasets.jl
@@ -16,6 +16,7 @@ include("CIFAR10/CIFAR10.jl")
 include("CIFAR100/CIFAR100.jl")
 include("MNIST/MNIST.jl")
 include("FashionMNIST/FashionMNIST.jl")
+include("SVHN2/SVHN2.jl")
 include("PTBLM/PTBLM.jl")
 include("UD_English/UD_English.jl")
 
diff --git a/src/SVHN2/SVHN2.jl b/src/SVHN2/SVHN2.jl
@@ -0,0 +1,116 @@
+export SVHN2
+
+"""
+The Street View House Numbers (SVHN) Dataset
+
+- Authors: Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, Andrew Y. Ng
+- Website: http://ufldl.stanford.edu/housenumbers
+
+SVHN was obtained from house numbers in Google Street View
+images. As such, the images are quite diverse in orientation
+and background. Similar to MNIST, SVHN has 10 classes (the
+digits 0-9), but unlike MNIST there is more data and the images
+are a little bigger (32x32 instead of 28x28) with three RGB
+color channels. The dataset is split up into three subsets:
+73257 digits for training, 26032 digits for testing, and 531131
+additional digits to use as extra training data.
+
+## Interface
+
+- [`SVHN2.traintensor`](@ref), [`SVHN2.trainlabels`](@ref), [`SVHN2.traindata`](@ref)
+- [`SVHN2.testtensor`](@ref), [`SVHN2.testlabels`](@ref), [`SVHN2.testdata`](@ref)
+- [`SVHN2.extratensor`](@ref), [`SVHN2.extralabels`](@ref), [`SVHN2.extradata`](@ref)
+
+## Utilities
+
+- [`SVHN2.download`](@ref)
+- [`SVHN2.classnames`](@ref)
+- [`SVHN2.convert2features`](@ref)
+- [`SVHN2.convert2image`](@ref)
+"""
+module SVHN2
+    using DataDeps
+    using MAT
+    using ImageCore
+    using ColorTypes
+    using FixedPointNumbers
+    using ..bytes_to_type
+    using ..datafile
+    using ..download_dep
+    using ..download_docstring
+
+    export
+
+        traintensor,
+        testtensor,
+        extratensor,
+
+        trainlabels,
+        testlabels,
+        extralabels,
+
+        traindata,
+        testdata,
+        extradata,
+
+        convert2image,
+        convert2features,
+
+        download
+
+    const DEPNAME = "SVHN2"
+    const TRAINDATA = "train_32x32.mat"
+    const TESTDATA  = "test_32x32.mat"
+    const EXTRADATA = "extra_32x32.mat"
+    const CLASSES = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]
+
+    """
+        download([dir]; [i_accept_the_terms_of_use])
+
+    Trigger the (interactive) download of the full dataset into
+    "`dir`". If no `dir` is provided the dataset will be
+    downloaded into "~/.julia/datadeps/$DEPNAME".
+
+    This function will display an interactive dialog unless
+    either the keyword parameter `i_accept_the_terms_of_use` or
+    the environment variable `DATADEPS_ALWAY_ACCEPT` is set to
+    `true`. Note that using the data responsibly and respecting
+    copyright/terms-of-use remains your responsibility.
+    """
+    download(args...; kw...) = download_dep(DEPNAME, args...; kw...)
+
+    include("interface.jl")
+    include("utils.jl")
+
+    function __init__()
+        RegisterDataDep(
+            DEPNAME,
+            """
+            Dataset: The Street View House Numbers (SVHN) Dataset
+            Authors: Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, Andrew Y. Ng
+            Website: http://ufldl.stanford.edu/housenumbers
+            Format: Cropped Digits (Format 2 on the website)
+            Note: for non-commercial use only
+
+            [Netzer et al., 2011]
+                Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, Andrew Y. Ng
+                "Reading Digits in Natural Images with Unsupervised Feature Learning"
+                NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011
+
+            The dataset is split up into three subsets: 73257
+            digits for training, 26032 digits for testing, and
+            531131 additional digits to use as extra training data.
+
+            The files are available for download at the official
+            website linked above. Note that using the data
+            responsibly and respecting copyright remains your
+            responsibility. For example the website mentions that
+            the data is for non-commercial use only. Please read
+            the website to make sure you want to download the
+            dataset.
+            """,
+            "http://ufldl.stanford.edu/housenumbers/" .* [TRAINDATA, TESTDATA, EXTRADATA],
+            "2fa3b0b79baf39de36ed7579e6947760e6241f4c52b6b406cabc44d654c13a50"
+        )
+    end
+end
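Review note on two details of `SVHN2.jl` above. The ordering of `CLASSES` suggests that raw labels index into it, which would fit the official website's statement that digit 0 is stored with label 10. How `interface.jl` actually uses `CLASSES` is not shown in this diff, so the helper `raw_to_digit` below is a hypothetical illustration under that assumption. The second part only shows what the broadcasted `.*` in the `RegisterDataDep` call produces.

```julia
# Same constant as in the diff above.
const CLASSES = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]

# Assumption: raw labels are integers 1..10, where 10 stands for the digit 0.
# `raw_to_digit` is a hypothetical helper, not part of MLDatasets.
raw_to_digit(raw::Integer) = CLASSES[raw]

raw_labels = [1, 10, 5, 9]
decoded = map(raw_to_digit, raw_labels)
println(decoded)                     # [1, 0, 5, 9]

# Broadcasting `.*` over a vector of file names concatenates the base URL
# with each name, yielding one download URL per file.
base  = "http://ufldl.stanford.edu/housenumbers/"
files = ["train_32x32.mat", "test_32x32.mat", "extra_32x32.mat"]
urls  = base .* files
foreach(println, urls)
```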
diff --git a/src/SVHN2/interface.jl b/src/SVHN2/interface.jl
diff --git a/src/SVHN2/utils.jl b/src/SVHN2/utils.jl
diff --git a/test/runtests.jl b/test/runtests.jl
diff --git a/test/tst_svhn2.jl b/test/tst_svhn2.jl
