Commit e16cf29

finish FashionMNIST integration
1 parent 87e1984 commit e16cf29

File tree

16 files changed

+366
-252
lines changed

README.md

Lines changed: 45 additions & 10 deletions
Original file line number | Diff line number | Diff line change
@@ -1,34 +1,46 @@
11
# MLDatasets.jl
2+
23
[![Build Status](https://travis-ci.org/JuliaML/MLDatasets.jl.svg?branch=master)](https://travis-ci.org/JuliaML/MLDatasets.jl)
34

4-
`MLDatasets` provides an access to common machine learning datasets for [Julia](http://julialang.org/).
5-
Currently, julia 0.5 is supported.
5+
`MLDatasets` provides access to common machine learning
6+
datasets for [Julia](http://julialang.org/). Currently, Julia 0.5
7+
is supported.
68

7-
The datasets are automatically downloaded to the specified directory.
8-
The default directory is `MLDatasets/datasets`.
9+
The datasets are automatically downloaded to the specified
10+
directory. The default directory is `MLDatasets/datasets`.
911

1012
## Installation
13+
1114
```julia
1215
julia> Pkg.clone("https://github.com/JuliaML/MLDatasets.jl.git")
1316
```
1417

1518
## Basic Usage
19+
1620
```julia
1721
using MLDatasets
1822

1923
train_x, train_y = MNIST.traindata()
2024
test_x, test_y = MNIST.testdata()
2125
```
26+
2227
Use `traindata(<directory>)` and `testdata(<directory>)` to change the default directory.
2328

2429
## Available Datasets
30+
2531
### Image Classification
32+
2633
#### CIFAR-10
27-
The [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset consists of 60000 32x32 color images in 10 classes.
34+
35+
The [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html)
36+
dataset consists of 60000 32x32 color images in 10 classes.
2837

2938
#### CIFAR-100
30-
The [CIFAR-100](https://www.cs.toronto.edu/~kriz/cifar.html) dataset consists of 600 32x32 color images in 100 classes.
31-
The 100 classes are grouped into 20 superclasses (fine and coarse labels).
39+
40+
The [CIFAR-100](https://www.cs.toronto.edu/~kriz/cifar.html)
41+
dataset consists of 60000 32x32 color images in 100 classes. The
42+
100 classes are grouped into 20 superclasses (fine and coarse
43+
labels).
3244

3345
#### MNIST
3446

@@ -38,12 +50,27 @@ of 60000 28x28 images of handwritten digits.
3850
Take a look at the [sub-module](src/MNIST/README.md) for more
3951
information
4052

53+
#### Fashion-MNIST
54+
55+
The [Fashion-MNIST](https://github.com/zalandoresearch/fashion-mnist)
56+
dataset consists of 60000 28x28 images of fashion products. It
57+
was designed to be a drop-in replacement for the MNIST dataset.
58+
59+
Take a look at the [sub-module](src/FashionMNIST/README.md) for more
60+
information.
61+
4162
### Language Modeling
63+
4264
#### PTBLM
43-
The `PTBLM` dataset consists of Penn Treebank sentences for language modeling, available from [tomsercu/lstm](https://github.com/tomsercu/lstm).
44-
The unknown words are replaced with `<unk>` so that the total vocaburary size becomes 10000.
65+
66+
The `PTBLM` dataset consists of Penn Treebank sentences for
67+
language modeling, available from
68+
[tomsercu/lstm](https://github.com/tomsercu/lstm). The unknown
69+
words are replaced with `<unk>` so that the total vocabulary size
70+
becomes 10000.
4571

4672
This is the first sentence of the PTBLM dataset.
73+
4774
```julia
4875
x, y = PTBLM.traindata()
4976

@@ -52,11 +79,18 @@ x[1]
5279
y[1]
5380
> ["it", "was", "n't", "black", "monday", "<eos>"]
5481
```
82+
5583
where `MLDatasets` adds the special word `<eos>` to the end of `y`.
5684

5785
### Text Analysis (POS-Tagging, Parsing)
86+
5887
#### UD English
59-
The [UD_English](https://github.com/UniversalDependencies/UD_English) dataset is an annotated corpus of morphological features, POS-tags and syntactic trees. The dataset follows CoNLL-style format.
88+
89+
The [UD_English](https://github.com/UniversalDependencies/UD_English)
90+
dataset is an annotated corpus of morphological features,
91+
POS-tags and syntactic trees. The dataset follows CoNLL-style
92+
format.
93+
6094
```julia
6195
traindata = UD_English.traindata()
6296
devdata = UD_English.devdata()
@@ -69,5 +103,6 @@ testdata = UD_English.devdata()
69103
| **CIFAR-10** | image | 32x32x3x50000 | 50000 | 32x32x3x10000 | 10000 |
70104
| **CIFAR-100** | image | 32x32x3x500 | 2x500 | 32x32x3x100 | 2x100 |
71105
| **MNIST** | image | 28x28x60000 | 60000 | 28x28x10000 | 10000 |
106+
| **FashionMNIST** | image | 28x28x60000 | 60000 | 28x28x10000 | 10000 |
72107
| **PTBLM** | text | 42068 | 42068 | 3761 | 3761 |
73108
| **UD_English** | text | 12543 | - | 2077 | - |
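
The image shapes in the table above read as features first and observation index last (for example `32x32x3x50000`). The following is a minimal sketch of what that layout implies for indexing. It uses a small zero-filled stand-in array rather than the real dataset, and the variable names are illustrative only.

```julia
# Stand-in array shaped like a tiny CIFAR-style batch:
# 32x32 pixels, 3 color channels, 8 observations (not real data).
x = zeros(Float64, 32, 32, 3, 8)
y = collect(0:7)                  # hypothetical labels, one per observation

nobs = size(x, ndims(x))          # observations live in the last dimension
first_image = x[:, :, :, 1]       # 32x32x3 array for observation 1
batch = x[:, :, :, 1:4]           # 32x32x3x4 array with the first four observations
```

Keeping the observation index last means a single image occupies a contiguous block of memory in Julia's column-major arrays, which is presumably why the loaders return this layout.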

src/FashionMNIST/FashionMNIST.jl

Lines changed: 51 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,12 @@ export FashionMNIST
22
module FashionMNIST
33
using ImageCore
44
using ColorTypes
5+
import ..downloaded_file
6+
import ..download_helper
7+
import ..DownloadSettings
8+
import ..MNIST.convert2image
9+
import ..MNIST.convert2features
10+
import ..MNIST.Reader
511

612
export
713

@@ -17,11 +23,52 @@ module FashionMNIST
1723
convert2image,
1824
convert2features,
1925

20-
download_helper
26+
download
2127

22-
const DEFAULT_DIR = abspath(joinpath(dirname(@__FILE__), "..", "..", "datasets", "fashion_mnist"))
28+
const DEFAULT_DIR = abspath(joinpath(@__DIR__, "..", "..", "datasets", "fashion_mnist"))
29+
30+
const TRAINIMAGES = "train-images-idx3-ubyte.gz"
31+
const TRAINLABELS = "train-labels-idx1-ubyte.gz"
32+
const TESTIMAGES = "t10k-images-idx3-ubyte.gz"
33+
const TESTLABELS = "t10k-labels-idx1-ubyte.gz"
34+
35+
const CLASSES = [
36+
"T-Shirt",
37+
"Trouser",
38+
"Pullover",
39+
"Dress",
40+
"Coat",
41+
"Sandal",
42+
"Shirt",
43+
"Sneaker",
44+
"Bag",
45+
"Ankle boot"
46+
]
47+
48+
const SETTINGS = DownloadSettings(
49+
"http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/",
50+
"""
51+
Dataset: Fashion-MNIST
52+
Authors: Han Xiao, Kashif Rasul, Roland Vollgraf
53+
Website: https://github.com/zalandoresearch/fashion-mnist
54+
License: MIT
55+
56+
[Han Xiao et al. 2017]
57+
Han Xiao, Kashif Rasul, and Roland Vollgraf.
58+
"Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms."
59+
arXiv:1708.07747
60+
61+
The files are available for download at the official
62+
website linked above. We can download these files for you
63+
if you wish, but that doesn't free you from the burden of
64+
using the data responsibly and respecting license and
65+
authorship.
66+
""",
67+
[TRAINIMAGES, TRAINLABELS, TESTIMAGES, TESTLABELS]
68+
)
69+
70+
download(dir = DEFAULT_DIR; kw...) =
71+
download_helper(SETTINGS, dir; kw...)
2372

24-
include("reader.jl")
2573
include("interface.jl")
26-
include(joinpath("..", "MNIST", "utils.jl"))
2774
end
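
The `CLASSES` constant added in this file lists the ten category names in label order. A minimal sketch of how it could be used to turn numeric labels into names follows. It assumes labels are integers from 0 to 9, as in the upstream dataset description; `classname` and the sample labels are hypothetical and not part of the commit.

```julia
# Class names copied from the CLASSES constant in the diff above.
const CLASSES = ["T-Shirt", "Trouser", "Pullover", "Dress", "Coat",
                 "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

# Hypothetical helper: labels are assumed to run from 0 to 9,
# while Julia arrays are 1-based, hence the +1.
classname(label::Integer) = CLASSES[label + 1]

labels = [9, 0, 3]                # made-up labels, not real data
labelnames = classname.(labels)   # broadcast the lookup over all labels
```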

src/FashionMNIST/README.md

Lines changed: 82 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,82 @@
1+
# Fashion-MNIST
2+
3+
Description from the [official website](https://github.com/zalandoresearch/fashion-mnist)
4+
5+
> Fashion-MNIST is a dataset of Zalando's article
6+
> images—consisting of a training set of 60,000 examples and a
7+
> test set of 10,000 examples. Each example is a 28x28 grayscale
8+
> image, associated with a label from 10 classes. We intend
9+
> Fashion-MNIST to serve as a direct drop-in replacement for the
10+
> original MNIST dataset for benchmarking machine learning
11+
> algorithms. It shares the same image size and structure of
12+
> training and testing splits.
13+
14+
## Usage
15+
16+
This sub-module provides a programmatic interface to download,
17+
load, and work with the Fashion-MNIST dataset of fashion product images.
18+
19+
```julia
20+
using MLDatasets
21+
22+
# download dataset
23+
FashionMNIST.download()
24+
25+
# load full training set
26+
train_x, train_y = FashionMNIST.traindata()
27+
28+
# load full test set
29+
test_x, test_y = FashionMNIST.testdata()
30+
```
31+
32+
The provided functions also allow for optional arguments, such as
33+
the directory `dir` where the dataset is located, or the specific
34+
observation `indices` that one wants to work with. For more
35+
information on the interface take a look at the documentation
36+
(e.g. `?FashionMNIST.traindata`).
37+
38+
Function | Description
39+
---------|-------------
40+
`download([dir])` | Trigger interactive download of the dataset
41+
`traintensor([indices]; [dir], [decimal=true])` | Load the training images as an array
42+
`trainlabels([indices]; [dir])` | Load the labels for the training images
43+
`testtensor([indices]; [dir], [decimal=true])` | Load the test images as an array
44+
`testlabels([indices]; [dir])` | Load the labels for the test images
45+
`traindata([indices]; [dir], [decimal=true])` | Load images and labels of the training data
46+
`testdata([indices]; [dir], [decimal=true])` | Load images and labels of the test data
47+
48+
This module also provides utility functions to make working with
49+
the FashionMNIST dataset in Julia more convenient.
50+
51+
You can use the function `convert2features` to convert the given
52+
FashionMNIST tensor to a feature matrix (or feature vector in the case
53+
of a single image). The purpose of this function is to drop the
54+
spatial dimensions such that traditional ML algorithms can
55+
process the dataset.
56+
57+
```julia
58+
julia> FashionMNIST.convert2features(FashionMNIST.traintensor()) # full training data
59+
784×60000 Array{Float64,2}:
60+
[...]
61+
```
62+
63+
To visualize an image or a prediction we provide the function
64+
`convert2image` to convert the given FashionMNIST horizontal-major
65+
tensor (or feature matrix) to a vertical-major `Colorant` array.
66+
The values are also color corrected according to the website's
67+
description, which means that the articles are black on a white
68+
background.
69+
70+
```julia
71+
julia> FashionMNIST.convert2image(FashionMNIST.traintensor(1)) # first training image
72+
28×28 Array{Gray{Float64},2}:
73+
[...]
74+
```
75+
76+
## References
77+
78+
- **Authors**: Han Xiao, Kashif Rasul, Roland Vollgraf
79+
80+
- **Website**: https://github.com/zalandoresearch/fashion-mnist
81+
82+
- **[Han Xiao et al. 2017]** Han Xiao, Kashif Rasul, and Roland Vollgraf. "Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms." arXiv:1708.07747
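
The two utility functions described in this README can be illustrated without downloading anything. The sketch below works on a random stand-in tensor and uses only Base functions. It assumes that `convert2features` amounts to flattening the spatial dimensions and that `convert2image` swaps the spatial axes and inverts intensities; the real functions also wrap the result in `Gray` values, which is omitted here.

```julia
# Stand-in for a small image tensor: 28x28 pixels, 5 observations,
# values in [0, 1]. This is random data, not Fashion-MNIST.
tensor = rand(28, 28, 5)

# Feature matrix: flatten each 28x28 image into one 784-element column.
features = reshape(tensor, 28 * 28, size(tensor, 3))

# Image view of the first observation: swap the two spatial axes
# (horizontal-major to vertical-major) and invert the intensities.
# Assumption: this mirrors the intent of convert2image, minus the Gray wrapper.
img = 1 .- permutedims(tensor[:, :, 1], (2, 1))
```

Because `reshape` shares memory with its argument, the feature matrix is a view of the same data rather than a copy, so writing to one is visible through the other.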

src/FashionMNIST/interface.jl

Lines changed: 22 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -60,11 +60,8 @@ julia> FashionMNIST.convert2image(FashionMNIST.traintensor(1)) # convert to colu
6060
```
6161
"""
6262
function traintensor(args...; dir=DEFAULT_DIR, decimal=true)
63-
if decimal
64-
Reader.readtrainimages(dir, args...) ./ 255
65-
else
66-
convert(Array{Float64}, Reader.readtrainimages(dir, args...))
67-
end
63+
rawimages = Reader.readimages(downloaded_file(SETTINGS, dir, TRAINIMAGES), args...)
64+
decimal ? rawimages ./ 255 : convert(Array{Float64}, rawimages)
6865
end
6966

7067
"""
@@ -129,11 +126,8 @@ julia> FashionMNIST.convert2image(FashionMNIST.testtensor(1)) # convert to colum
129126
```
130127
"""
131128
function testtensor(args...; dir=DEFAULT_DIR, decimal=true)
132-
if decimal
133-
Reader.readtestimages(dir, args...) ./ 255
134-
else
135-
convert(Array{Float64}, Reader.readtestimages(dir, args...))
136-
end
129+
rawimages = Reader.readimages(downloaded_file(SETTINGS, dir, TESTIMAGES), args...)
130+
decimal ? rawimages ./ 255 : convert(Array{Float64}, rawimages)
137131
end
138132

139133
"""
@@ -174,8 +168,15 @@ julia> FashionMNIST.trainlabels(dir="/home/user/fashion_mnist")
174168
WARNING: The FashionMNIST file "train-labels-idx1-ubyte.gz" was not found in "/home/user/fashion_mnist". You can download [...]
175169
```
176170
"""
177-
trainlabels(args...; dir=DEFAULT_DIR) = Vector{Int}(Reader.readtrainlabels(dir, args...))
178-
trainlabels(index::Integer; dir=DEFAULT_DIR) = Int(Reader.readtrainlabels(dir, index))
171+
function trainlabels(args...; dir=DEFAULT_DIR)
172+
path = downloaded_file(SETTINGS, dir, TRAINLABELS)
173+
Vector{Int}(Reader.readlabels(path, args...))
174+
end
175+
176+
function trainlabels(index::Integer; dir=DEFAULT_DIR)
177+
path = downloaded_file(SETTINGS, dir, TRAINLABELS)
178+
Int(Reader.readlabels(path, index))
179+
end
179180

180181
"""
181182
testlabels([indices]; [dir])
@@ -215,8 +216,15 @@ julia> FashionMNIST.testlabels(dir="/home/user/fashion_mnist")
215216
WARNING: The FashionMNIST file "t10k-labels-idx1-ubyte.gz" was not found in "/home/user/fashion_mnist". You can download [...]
216217
```
217218
"""
218-
testlabels(args...; dir=DEFAULT_DIR) = Vector{Int}(Reader.readtestlabels(dir, args...))
219-
testlabels(index::Integer; dir=DEFAULT_DIR) = Int(Reader.readtestlabels(dir, index))
219+
function testlabels(args...; dir=DEFAULT_DIR)
220+
path = downloaded_file(SETTINGS, dir, TESTLABELS)
221+
Vector{Int}(Reader.readlabels(path, args...))
222+
end
223+
224+
function testlabels(index::Integer; dir=DEFAULT_DIR)
225+
path = downloaded_file(SETTINGS, dir, TESTLABELS)
226+
Int(Reader.readlabels(path, index))
227+
end
220228

221229
"""
222230
traindata([indices]; [dir], [decimal=true]) -> Tuple
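
The refactored `traintensor` and `testtensor` both end with the same `decimal` branch. A minimal sketch of what that branch does to raw pixel bytes follows, using a tiny hand-written `UInt8` matrix in place of the output of `Reader.readimages`. The helper name `scalepixels` is hypothetical.

```julia
# Stand-in for raw pixel data as read from an IDX file (values 0 to 255).
raw = UInt8[0 51; 102 255]

# Same expression as in the diff: divide by 255 to map into [0, 1] when
# decimal is true, otherwise keep the 0..255 range as Float64.
scalepixels(rawimages, decimal) =
    decimal ? rawimages ./ 255 : convert(Array{Float64}, rawimages)

scaled = scalepixels(raw, true)    # Float64 values between 0 and 1
plain  = scalepixels(raw, false)   # Float64 values between 0 and 255
```

Either way the result is an `Array{Float64}`, so callers see the same element type regardless of the `decimal` setting.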

src/FashionMNIST/reader.jl

Lines changed: 0 additions & 13 deletions
This file was deleted.

src/MNIST/MNIST.jl

Lines changed: 37 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,9 @@ export MNIST
22
module MNIST
33
using ImageCore
44
using ColorTypes
5+
import ..downloaded_file
6+
import ..download_helper
7+
import ..DownloadSettings
58

69
export
710

@@ -17,9 +20,40 @@ module MNIST
1720
convert2image,
1821
convert2features,
1922

20-
download_helper
21-
22-
const DEFAULT_DIR = abspath(joinpath(dirname(@__FILE__), "..", "..", "datasets", "mnist"))
23+
download
24+
25+
const DEFAULT_DIR = abspath(joinpath(@__DIR__, "..", "..", "datasets", "mnist"))
26+
27+
const TRAINIMAGES = "train-images-idx3-ubyte.gz"
28+
const TRAINLABELS = "train-labels-idx1-ubyte.gz"
29+
const TESTIMAGES = "t10k-images-idx3-ubyte.gz"
30+
const TESTLABELS = "t10k-labels-idx1-ubyte.gz"
31+
32+
const SETTINGS = DownloadSettings(
33+
"http://yann.lecun.com/exdb/mnist/",
34+
"""
35+
Dataset: THE MNIST DATABASE of handwritten digits
36+
Authors: Yann LeCun, Corinna Cortes, Christopher J.C. Burges
37+
Website: http://yann.lecun.com/exdb/mnist/
38+
39+
[LeCun et al., 1998a]
40+
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner.
41+
"Gradient-based learning applied to document recognition."
42+
Proceedings of the IEEE, 86(11):2278-2324, November 1998
43+
44+
The files are available for download at the official
45+
website linked above. We can download these files for you
46+
if you wish, but that doesn't free you from the burden of
47+
using the data responsibly and respecting copyright. The
48+
authors of MNIST aren't really explicit about any terms
49+
of use, so please read the website to make sure you want
50+
to download the dataset.
51+
""",
52+
[TRAINIMAGES, TRAINLABELS, TESTIMAGES, TESTLABELS]
53+
)
54+
55+
download(dir = DEFAULT_DIR; kw...) =
56+
download_helper(SETTINGS, dir; kw...)
2357

2458
include(joinpath("Reader","Reader.jl"))
2559
include("interface.jl")
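
Both modules now describe their downloads through a `DownloadSettings` value and resolve files with `downloaded_file`. Neither definition is part of the files shown here, so the following is only a sketch of how such a settings-driven design could work. `SketchSettings`, its field names, `remote_urls`, and `local_file` are all hypothetical stand-ins, written in Julia 1.x syntax, and nothing is actually downloaded.

```julia
# Hypothetical stand-in for the package's DownloadSettings;
# the field names are assumptions, not the real definition.
struct SketchSettings
    baseurl::String          # URL prefix shared by all files
    message::String          # text shown to the user before downloading
    files::Vector{String}    # file names that make up the dataset
end

# Build the URL each file would be fetched from.
remote_urls(s::SketchSettings) = [s.baseurl * f for f in s.files]

# Mimic a lookup done before reading: return the local path,
# or fail with a hint if the file is unknown or not present.
function local_file(s::SketchSettings, dir::AbstractString, filename::AbstractString)
    filename in s.files || throw(ArgumentError("unknown file: $filename"))
    path = joinpath(dir, filename)
    isfile(path) || error("\"$filename\" was not found in \"$dir\"; download it first")
    return path
end

# Example values taken from the MNIST settings in the diff above.
settings = SketchSettings(
    "http://yann.lecun.com/exdb/mnist/",
    "Dataset: THE MNIST DATABASE of handwritten digits",
    ["train-images-idx3-ubyte.gz", "train-labels-idx1-ubyte.gz"],
)
urls = remote_urls(settings)
```

Collecting the URL, the message, and the file list in one value lets MNIST and FashionMNIST share the same helper code and differ only in their settings.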
