Commit 832de90

Merge pull request #9 from JuliaML/fashion
finish FashionMNIST integration
2 parents 3bbc0ae + 94c855e commit 832de90

22 files changed: 911 additions and 206 deletions

.travis.yml

Lines changed: 7 additions & 1 deletion
@@ -5,7 +5,13 @@ os:
   - osx
 
 julia:
-  - 0.5
+  - 0.6
+  - nightly
+matrix:
+  allow_failures:
+    - julia: nightly
+git:
+  depth: 5000
 
 notifications:
   email: false

README.md

Lines changed: 45 additions & 10 deletions
@@ -1,34 +1,46 @@
 # MLDatasets.jl
+
 [![Build Status](https://travis-ci.org/JuliaML/MLDatasets.jl.svg?branch=master)](https://travis-ci.org/JuliaML/MLDatasets.jl)
 
-`MLDatasets` provides an access to common machine learning datasets for [Julia](http://julialang.org/).
-Currently, julia 0.5 is supported.
+`MLDatasets` provides access to common machine learning
+datasets for [Julia](http://julialang.org/). Currently, Julia 0.6
+is supported.
 
-The datasets are automatically downloaded to the specified directory.
-The default directory is `MLDatasets/datasets`.
+The datasets are automatically downloaded to the specified
+directory. The default directory is `MLDatasets/datasets`.
 
 ## Installation
+
 ```julia
 julia> Pkg.clone("https://github.com/JuliaML/MLDatasets.jl.git")
 ```
 
 ## Basic Usage
+
 ```julia
 using MLDatasets
 
 train_x, train_y = MNIST.traindata()
 test_x, test_y = MNIST.testdata()
 ```
+
 Use `traindata(<directory>)` and `testdata(<directory>)` to change the default directory.
 
 ## Available Datasets
+
 ### Image Classification
+
 #### CIFAR-10
-The [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset consists of 60000 32x32 color images in 10 classes.
+
+The [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html)
+dataset consists of 60000 32x32 color images in 10 classes.
 
 #### CIFAR-100
-The [CIFAR-100](https://www.cs.toronto.edu/~kriz/cifar.html) dataset consists of 600 32x32 color images in 100 classes.
-The 100 classes are grouped into 20 superclasses (fine and coarse labels).
+
+The [CIFAR-100](https://www.cs.toronto.edu/~kriz/cifar.html)
+dataset consists of 60000 32x32 color images in 100 classes. The
+100 classes are grouped into 20 superclasses (fine and coarse
+labels).
 
 #### MNIST
 
@@ -38,12 +50,27 @@ of 60000 28x28 images of handwritten digits.
 Take a look at the [sub-module](src/MNIST/README.md) for more
 information
 
+#### Fashion-MNIST
+
+The [Fashion-MNIST](https://github.com/zalandoresearch/fashion-mnist)
+dataset consists of 60000 28x28 images of fashion products. It
+was designed to be a drop-in replacement for the MNIST dataset.
+
+Take a look at the [sub-module](src/FashionMNIST/README.md) for more
+information
+
 ### Language Modeling
+
 #### PTBLM
-The `PTBLM` dataset consists of Penn Treebank sentences for language modeling, available from [tomsercu/lstm](https://github.com/tomsercu/lstm).
-The unknown words are replaced with `<unk>` so that the total vocaburary size becomes 10000.
+
+The `PTBLM` dataset consists of Penn Treebank sentences for
+language modeling, available from
+[tomsercu/lstm](https://github.com/tomsercu/lstm). The unknown
+words are replaced with `<unk>` so that the total vocabulary size
+becomes 10000.
 
 This is the first sentence of the PTBLM dataset.
+
 ```julia
 x, y = PTBLM.traindata()
 
@@ -52,11 +79,18 @@ x[1]
 y[1]
 > ["it", "was", "n't", "black", "monday", "<eos>"]
 ```
+
 where `MLDataset` adds the special word: `<eos>` to the end of `y`.
 
 ### Text Analysis (POS-Tagging, Parsing)
+
 #### UD English
-The [UD_English](https://github.com/UniversalDependencies/UD_English) dataset is an annotated corpus of morphological features, POS-tags and syntactic trees. The dataset follows CoNLL-style format.
+
+The [UD_English](https://github.com/UniversalDependencies/UD_English)
+dataset is an annotated corpus of morphological features,
+POS-tags and syntactic trees. The dataset follows CoNLL-style
+format.
+
 ```julia
 traindata = UD_English.traindata()
 devdata = UD_English.devdata()
@@ -69,5 +103,6 @@ testdata = UD_English.devdata()
 | **CIFAR-10** | image | 32x32x3x50000 | 50000 | 32x32x3x10000 | 10000 |
 | **CIFAR-100** | image | 32x32x3x500 | 2x500 | 32x32x3x100 | 2x100 |
 | **MNIST** | image | 28x28x60000 | 60000 | 28x28x10000 | 10000 |
+| **FashionMNIST** | image | 28x28x60000 | 60000 | 28x28x10000 | 10000 |
 | **PTBLM** | text | 42068 | 42068 | 3761 | 3761 |
 | **UD_English** | text | 12543 | - | 2077 | - |

REQUIRE

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-julia 0.5
+julia 0.6
 ImageCore 0.1.2
 ColorTypes 0.4
 GZip

src/CIFAR10.jl

Lines changed: 3 additions & 3 deletions
@@ -3,7 +3,7 @@ module CIFAR10
 
 using BinDeps
 
-const defdir = joinpath(Pkg.dir("MLDatasets"), "datasets/cifar10")
+const defdir = joinpath(Pkg.dir("MLDatasets"), "datasets", "cifar10")
 
 function getdata(dir)
     mkpath(dir)
@@ -25,7 +25,7 @@ function readdata(data::Vector{UInt8})
 end
 
 function traindata(dir=defdir)
-    files = ["$(dir)/cifar-10-batches-bin/data_batch_$(i).bin" for i=1:5]
+    files = [joinpath(dir,"cifar-10-batches-bin","data_batch_$i.bin") for i=1:5]
     all(isfile, files) || getdata(dir)
     data = UInt8[]
     for file in files
@@ -35,7 +35,7 @@ function traindata(dir=defdir)
 end
 
 function testdata(dir=defdir)
-    file = "$(dir)/cifar-10-batches-bin/test_batch.bin"
+    file = joinpath(dir,"cifar-10-batches-bin","test_batch.bin")
     isfile(file) || getdata(dir)
     readdata(open(read,file))
 end
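
The path changes in this file swap hard-coded `/` separators for `joinpath` with one argument per path component. A minimal sketch of that idea in isolation, assuming a made-up relative directory (the real default comes from `Pkg.dir("MLDatasets")`), might look like this. It only builds strings and touches no files.

```julia
# Hypothetical download directory, used for illustration only;
# nothing is read from or written to disk.
dir = joinpath("MLDatasets", "datasets", "cifar10")

# Build the five training batch paths component by component,
# so the platform's own path separator is used throughout.
files = [joinpath(dir, "cifar-10-batches-bin", "data_batch_$i.bin") for i in 1:5]

# The file name and its parent directory can be recovered
# without assuming that "/" is the separator.
println(basename(files[1]))
println(basename(dirname(files[1])))
```

Passing each component separately keeps the code free of any assumption about the separator, which is presumably why the commit also splits `"datasets/cifar10"` into two arguments.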

src/CIFAR100.jl

Lines changed: 3 additions & 3 deletions
@@ -3,7 +3,7 @@ module CIFAR100
 
 using BinDeps
 
-const defdir = joinpath(Pkg.dir("MLDatasets"), "datasets/cifar100")
+const defdir = joinpath(Pkg.dir("MLDatasets"), "datasets","cifar100")
 
 function getdata(dir)
     mkpath(dir)
@@ -25,13 +25,13 @@ function readdata(data::Vector{UInt8})
 end
 
 function traindata(dir=defdir)
-    file = joinpath(dir, "cifar-100-binary/train.bin")
+    file = joinpath(dir, "cifar-100-binary","train.bin")
     isfile(file) || getdata(dir)
     readdata(open(read,file))
 end
 
 function testdata(dir=defdir)
-    file = joinpath(dir, "cifar-100-binary/test.bin")
+    file = joinpath(dir, "cifar-100-binary","test.bin")
     isfile(file) || getdata(dir)
     readdata(open(read,file))
 end

src/FashionMNIST/FashionMNIST.jl

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
+export FashionMNIST
+module FashionMNIST
+using ImageCore
+using ColorTypes
+import ..downloaded_file
+import ..download_helper
+import ..DownloadSettings
+import ..MNIST.convert2image
+import ..MNIST.convert2features
+import ..MNIST.Reader
+
+export
+
+    traintensor,
+    testtensor,
+
+    trainlabels,
+    testlabels,
+
+    traindata,
+    testdata,
+
+    convert2image,
+    convert2features,
+
+    download
+
+const DEFAULT_DIR = abspath(joinpath(@__DIR__, "..", "..", "datasets", "fashion_mnist"))
+
+const TRAINIMAGES = "train-images-idx3-ubyte.gz"
+const TRAINLABELS = "train-labels-idx1-ubyte.gz"
+const TESTIMAGES = "t10k-images-idx3-ubyte.gz"
+const TESTLABELS = "t10k-labels-idx1-ubyte.gz"
+
+const CLASSES = [
+    "T-Shirt",
+    "Trouser",
+    "Pullover",
+    "Dress",
+    "Coat",
+    "Sandal",
+    "Shirt",
+    "Sneaker",
+    "Bag",
+    "Ankle boot"
+]
+
+const SETTINGS = DownloadSettings(
+    "http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/",
+    """
+    Dataset: Fashion-MNIST
+    Authors: Han Xiao, Kashif Rasul, Roland Vollgraf
+    Website: https://github.com/zalandoresearch/fashion-mnist
+    License: MIT
+
+    [Han Xiao et al. 2017]
+    Han Xiao, Kashif Rasul, and Roland Vollgraf.
+    "Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms."
+    arXiv:1708.07747
+
+    The files are available for download at the official
+    website linked above. We can download these files for you
+    if you wish, but that doesn't free you from the burden of
+    using the data responsibly and respecting license and
+    authorship.
+    """,
+    [TRAINIMAGES, TRAINLABELS, TESTIMAGES, TESTLABELS]
+)
+
+download(dir = DEFAULT_DIR; kw...) =
+    download_helper(SETTINGS, dir; kw...)
+
+include("interface.jl")
+end
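
The `CLASSES` constant above lists the ten class names in label order. As a rough sketch of how such a list could be used, the snippet below maps integer labels to names. The helper `classname` is hypothetical and not part of this commit, and the snippet assumes the labels are 0-based integers (0 through 9), as in the original IDX label files.

```julia
# Class names in the order listed in the `CLASSES` constant of this commit.
const CLASSES = ["T-Shirt", "Trouser", "Pullover", "Dress", "Coat",
                 "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

# Hypothetical helper (not part of the module): map a label to its name.
# Assumes labels are 0-based, so 1 is added for Julia's 1-based indexing.
classname(label::Integer) = CLASSES[label + 1]

# Made-up labels for illustration; real ones would come from the label files.
labels = [9, 0, 3]
println(classname.(labels))
```

If the loaded labels turn out to be 1-based instead, the `+ 1` offset would have to be dropped.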

src/FashionMNIST/README.md

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+# Fashion-MNIST
+
+Description from the [official website](https://github.com/zalandoresearch/fashion-mnist)
+
+> Fashion-MNIST is a dataset of Zalando's article
+> images—consisting of a training set of 60,000 examples and a
+> test set of 10,000 examples. Each example is a 28x28 grayscale
+> image, associated with a label from 10 classes. We intend
+> Fashion-MNIST to serve as a direct drop-in replacement for the
+> original MNIST dataset for benchmarking machine learning
+> algorithms. It shares the same image size and structure of
+> training and testing splits.
+
+## Usage
+
+This sub-module provides a programmatic interface to download,
+load, and work with the Fashion-MNIST dataset of fashion product images.
+
+```julia
+using MLDatasets
+
+# download dataset
+FashionMNIST.download()
+
+# load full training set
+train_x, train_y = FashionMNIST.traindata()
+
+# load full test set
+test_x, test_y = FashionMNIST.testdata()
+```
+
+The provided functions also allow for optional arguments, such as
+the directory `dir` where the dataset is located, or the specific
+observation `indices` that one wants to work with. For more
+information on the interface take a look at the documentation
+(e.g. `?FashionMNIST.traindata`).
+
+Function | Description
+---------|-------------
+`download([dir])` | Trigger interactive download of the dataset
+`traintensor([indices]; [dir], [decimal=true])` | Load the training images as an array
+`trainlabels([indices]; [dir])` | Load the labels for the training images
+`testtensor([indices]; [dir], [decimal=true])` | Load the test images as an array
+`testlabels([indices]; [dir])` | Load the labels for the test images
+`traindata([indices]; [dir], [decimal=true])` | Load images and labels of the training data
+`testdata([indices]; [dir], [decimal=true])` | Load images and labels of the test data
+
+This module also provides utility functions to make working with
+the FashionMNIST dataset in Julia more convenient.
+
+You can use the function `convert2features` to convert the given
+FashionMNIST tensor to a feature matrix (or feature vector in the case
+of a single image). The purpose of this function is to drop the
+spatial dimensions such that traditional ML algorithms can
+process the dataset.
+
+```julia
+julia> FashionMNIST.convert2features(FashionMNIST.traintensor()) # full training data
+784×60000 Array{Float64,2}:
+[...]
+```
+
+To visualize an image or a prediction we provide the function
+`convert2image` to convert the given FashionMNIST horizontal-major
+tensor (or feature matrix) to a vertical-major `Colorant` array.
+The values are also color corrected according to the website's
+description, which means that the clothing items are black on a white
+background.
+
+```julia
+julia> FashionMNIST.convert2image(FashionMNIST.traintensor(1)) # first training image
+28×28 Array{Gray{Float64},2}:
+[...]
+```
+
+## References
+
+- **Authors**: Han Xiao, Kashif Rasul, Roland Vollgraf
+
+- **Website**: https://github.com/zalandoresearch/fashion-mnist
+
+- **[Han Xiao et al. 2017]** Han Xiao, Kashif Rasul, and Roland Vollgraf. "Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms." arXiv:1708.07747
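
The README above describes `convert2features` as dropping the spatial dimensions so that each image becomes one column of a feature matrix. The sketch below shows that idea with plain `reshape` on random stand-in data. It is not the package's implementation, which is imported from the `MNIST` sub-module and does not appear in this diff.

```julia
# Stand-in for a small image tensor: 28×28 pixels, 5 observations (random values).
tensor = rand(28, 28, 5)

# Hypothetical re-implementation of the idea: flatten the two spatial
# dimensions into one, keeping one column per observation.
features = reshape(tensor, 28 * 28, size(tensor, 3))

println(size(features))
```

Because `reshape` shares memory with the original array, a conversion done this way would not copy the pixel data.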
