
Commit 5106caf

Merge pull request #4 from JuliaML/cs/mnist
Rewrite of MNIST submodule
2 parents c184466 + a568d66 commit 5106caf

File tree

16 files changed

+1127
-56
lines changed

.gitignore

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
1-
/datasets
1+
datasets/
22
sandbox.jl

README.md

Lines changed: 10 additions & 5 deletions
@@ -1,10 +1,10 @@
11
# MLDatasets.jl
22
[![Build Status](https://travis-ci.org/JuliaML/MLDatasets.jl.svg?branch=master)](https://travis-ci.org/JuliaML/MLDatasets.jl)
33

4-
`MLDatasets` provides an access to common machine learning datasets for [Julia](http://julialang.org/).
4+
`MLDatasets` provides access to common machine learning datasets for [Julia](http://julialang.org/).
55
Currently, julia 0.5 is supported.
66

7-
The datasets are automatically downloaded to the specified directory.
7+
The datasets are automatically downloaded to the specified directory.
88
The default directory is `MLDatasets/datasets`.
99

1010
## Installation
@@ -27,15 +27,20 @@ Use `traindata(<directory>)` and `testdata(<directory>)` to change the default d
2727
The [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset consists of 60000 32x32 color images in 10 classes.
2828

2929
#### CIFAR-100
30-
The [CIFAR-100](https://www.cs.toronto.edu/~kriz/cifar.html) dataset consists of 600 32x32 color images in 100 classes.
30+
The [CIFAR-100](https://www.cs.toronto.edu/~kriz/cifar.html) dataset consists of 60000 32x32 color images in 100 classes, with 600 images per class.
3131
The 100 classes are grouped into 20 superclasses (fine and coarse labels).
3232

3333
#### MNIST
34-
The [MNIST](http://yann.lecun.com/exdb/mnist/) dataset consists of 60000 28x28 images of handwritten digits.
34+
35+
The [MNIST](http://yann.lecun.com/exdb/mnist/) dataset consists
36+
of 60000 28x28 images of handwritten digits.
37+
38+
Take a look at the [sub-module](src/MNIST/README.md) for more
39+
information.
3540

3641
### Language Modeling
3742
#### PTBLM
38-
The `PTBLM` dataset consists of Penn Treebank sentences for language modeling, available from [tomsercu/lstm](https://github.com/tomsercu/lstm).
43+
The `PTBLM` dataset consists of Penn Treebank sentences for language modeling, available from [tomsercu/lstm](https://github.com/tomsercu/lstm).
3944
The unknown words are replaced with `<unk>` so that the total vocabulary size becomes 10000.
4045

4146
This is the first sentence of the PTBLM dataset.

REQUIRE

Lines changed: 3 additions & 1 deletion
@@ -1,3 +1,5 @@
1-
julia 0.5-
1+
julia 0.5
2+
ImageCore 0.1.2
3+
ColorTypes 0.4
24
GZip
35
BinDeps

src/MLDatasets.jl

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ include("io/CoNLL.jl")
44

55
include("CIFAR10.jl")
66
include("CIFAR100.jl")
7-
include("MNIST.jl")
7+
include("MNIST/MNIST.jl")
88
include("PTBLM.jl")
99
include("UD_English.jl")
1010

src/MNIST.jl

Lines changed: 0 additions & 39 deletions
This file was deleted.

src/MNIST/MNIST.jl

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
1+
export MNIST
2+
module MNIST
3+
using ImageCore
4+
using ColorTypes
5+
6+
export
7+
8+
traintensor,
9+
testtensor,
10+
11+
trainlabels,
12+
testlabels,
13+
14+
traindata,
15+
testdata,
16+
17+
convert2image,
18+
convert2features,
19+
20+
download_helper
21+
22+
const DEFAULT_DIR = abspath(joinpath(dirname(@__FILE__), "..", "..", "datasets", "mnist"))
23+
24+
include("Reader/Reader.jl")
25+
import .Reader.download_helper
26+
include("interface.jl")
27+
include("utils.jl")
28+
29+
Reader.download_helper(; nargs...) = Reader.download_helper(DEFAULT_DIR; nargs...)
30+
end

src/MNIST/README.md

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
1+
# THE MNIST DATABASE of handwritten digits
2+
3+
Description from the [official website](http://yann.lecun.com/exdb/mnist/):
4+
5+
> The MNIST database of handwritten digits, available from this
6+
> page, has a training set of 60,000 examples, and a test set of
7+
> 10,000 examples. It is a subset of a larger set available from
8+
> NIST. The digits have been size-normalized and centered in a
9+
> fixed-size image.
10+
>
11+
> It is a good database for people who want to try learning
12+
> techniques and pattern recognition methods on real-world data
13+
> while spending minimal efforts on preprocessing and formatting.
14+
15+
## Usage
16+
17+
This sub-module provides a programmatic interface to download,
18+
load, and work with the MNIST dataset of handwritten digits.
19+
20+
```julia
21+
using MLDatasets
22+
23+
# download dataset
24+
MNIST.download_helper()
25+
26+
# load full training set
27+
train_x, train_y = MNIST.traindata()
28+
29+
# load full test set
30+
test_x, test_y = MNIST.testdata()
31+
```
32+
33+
The provided functions also allow for optional arguments, such as
34+
the directory `dir` where the dataset is located, or the specific
35+
observation `indices` that one wants to work with. For more
36+
information on the interface take a look at the documentation
37+
(e.g. `?MNIST.traindata`).
38+
39+
Function | Description
40+
---------|-------------
41+
`download_helper([dir])` | Trigger interactive download of the dataset
42+
`traintensor([indices]; [dir], [decimal=true])` | Load the training images as an array
43+
`trainlabels([indices]; [dir])` | Load the labels for the training images
44+
`testtensor([indices]; [dir], [decimal=true])` | Load the test images as an array
45+
`testlabels([indices]; [dir])` | Load the labels for the test images
46+
`traindata([indices]; [dir], [decimal=true])` | Load images and labels of the training data
47+
`testdata([indices]; [dir], [decimal=true])` | Load images and labels of the test data
48+
49+
This module also provides utility functions to make working with
50+
the MNIST dataset in Julia more convenient.
51+
52+
You can use the function `convert2features` to convert the given
53+
MNIST tensor to a feature matrix (or feature vector in the case
54+
of a single image). The purpose of this function is to drop the
55+
spatial dimensions such that traditional ML algorithms can
56+
process the dataset.
57+
58+
```julia
59+
julia> MNIST.convert2features(MNIST.traintensor()) # full training data
60+
784×60000 Array{Float64,2}:
61+
[...]
62+
```
63+
64+
To visualize an image or a prediction we provide the function
65+
`convert2image` to convert the given MNIST horizontal-major
66+
tensor (or feature matrix) to a vertical-major `Colorant` array.
67+
The values are also color corrected according to the website's
68+
description, which means that the digits are black on a white
69+
background.
70+
71+
```julia
72+
julia> MNIST.convert2image(MNIST.traintensor(1)) # first training image
73+
28×28 Array{Gray{Float64},2}:
74+
[...]
75+
```
76+
77+
## References
78+
79+
- **Authors**: Yann LeCun, Corinna Cortes, Christopher J.C. Burges
80+
81+
- **Website**: http://yann.lecun.com/exdb/mnist/
82+
83+
- **[LeCun et al., 1998a]** Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998
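
The README above describes `convert2features` as dropping the spatial dimensions and `convert2image` as switching from a horizontal-major to a vertical-major layout with inverted colors. The sketch below illustrates those two ideas on plain arrays. It is not the package's `utils.jl`, which this excerpt does not show. The names `to_features` and `to_display`, the random stand-in tensor, and the use of Julia 1.x syntax are all assumptions made for illustration.

```julia
# Hypothetical helpers for illustration; NOT the MLDatasets implementation.

# Flatten a width × height × N tensor into a (width*height) × N feature matrix.
to_features(t::AbstractArray{<:Any,3}) = reshape(t, size(t, 1) * size(t, 2), size(t, 3))

# A single image becomes a feature vector.
to_features(img::AbstractMatrix) = vec(img)

# Swap the two spatial axes (horizontal-major -> vertical-major) and invert
# the intensities so that high values (ink) appear dark on a light background.
to_display(img::AbstractMatrix) = 1 .- permutedims(img)

tensor = rand(28, 28, 5)                    # stand-in for five images with values in [0, 1]
features = to_features(tensor)              # one column per image
display_img = to_display(tensor[:, :, 1])   # first image, ready for viewing as a matrix

println(size(features))
println(size(display_img))
```

Because `reshape` only reinterprets the existing memory, the flattening step does not copy the data; the axis swap in `to_display` does allocate a new array.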

src/MNIST/Reader/Reader.jl

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
1+
module Reader
2+
using GZip
3+
using BinDeps
4+
5+
export
6+
7+
readtrainimages,
8+
readtestimages,
9+
readtrainlabels,
10+
readtestlabels,
11+
12+
download_helper
13+
14+
# Constants
15+
16+
const IMAGEOFFSET = 16
17+
const LABELOFFSET = 8
18+
19+
const TRAINIMAGES = "train-images-idx3-ubyte.gz"
20+
const TRAINLABELS = "train-labels-idx1-ubyte.gz"
21+
const TESTIMAGES = "t10k-images-idx3-ubyte.gz"
22+
const TESTLABELS = "t10k-labels-idx1-ubyte.gz"
23+
24+
# Includes
25+
26+
include("readheader.jl")
27+
include("readimages.jl")
28+
include("readlabels.jl")
29+
include("download.jl")
30+
end

src/MNIST/Reader/download.jl

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
1+
msg_notfound(dir, filename) = "The MNIST file \"$filename\" was not found in \"$dir\". You can download the dataset at http://yann.lecun.com/exdb/mnist/, or alternatively use MNIST.download_helper(directory) to do it for you."
2+
3+
msg_prompt(dir, files) = """
4+
Interactive session detected. MNIST.download_helper initiated.
5+
6+
Dataset: THE MNIST DATABASE of handwritten digits
7+
Authors: Yann LeCun, Corinna Cortes, Christopher J.C. Burges
8+
Website: http://yann.lecun.com/exdb/mnist/
9+
10+
[LeCun et al., 1998a]
11+
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998
12+
13+
The specified directory \"$dir\" is missing the files $(join(map(f->"\"$f\"", files), ", ", " and ")) of the full data set.
14+
15+
The files are available for download at the official website linked above.
16+
We can download these files for you if you wish, but that doesn't free
17+
you from the burden of using the data responsibly and respecting copyright.
18+
The authors of MNIST aren't really explicit about any terms of use,
19+
so please read the website to make sure you want to download the dataset.
20+
21+
http://yann.lecun.com/exdb/mnist/
22+
23+
Did you visit the website and want to download the dataset to \"$dir\"? [y/n] """
24+
25+
function downloaded_file(dir, filename)
26+
path = joinpath(dir, filename)
27+
if !isfile(path)
28+
if isinteractive()
29+
warn(msg_notfound(dir, filename))
30+
download_helper(dir)
31+
else
32+
error(msg_notfound(dir, filename))
33+
end
34+
end
35+
path
36+
end
37+
38+
"""
39+
download_helper([dir]; [i_accept_the_terms_of_use = false])
40+
41+
Check if the MNIST dataset is contained in the specified `dir`,
42+
or if any of the four files are missing. If `dir` is omitted it
43+
will default to `MLDatasets/datasets/mnist`.
44+
45+
In the case that any of the four files are missing and
46+
`i_accept_the_terms_of_use=false` the function will raise a
47+
warning or an error depending on whether Julia is run in an
48+
interactive session. If an interactive session is detected the
49+
user will be presented with information and the option to
50+
download the dataset to the specified `dir`.
51+
52+
If the download should happen automatically, please first visit
53+
the website at http://yann.lecun.com/exdb/mnist, before setting
54+
`i_accept_the_terms_of_use=true`.
55+
"""
56+
function download_helper(dir; i_accept_the_terms_of_use = false)
57+
files = filter(file->!isfile(joinpath(dir, file)),
58+
[TRAINIMAGES, TRAINLABELS, TESTIMAGES, TESTLABELS])
59+
if !isempty(files)
60+
if !i_accept_the_terms_of_use && isinteractive()
61+
print(msg_prompt(dir, files))
62+
answer = first(readline())
63+
if answer == 'y'
64+
i_accept_the_terms_of_use = true
65+
end
66+
end
67+
if i_accept_the_terms_of_use
68+
mkpath(dir)
69+
for file in files
70+
url = "http://yann.lecun.com/exdb/mnist/$file"
71+
path = joinpath(dir, file)
72+
info("downloading $file from $url to $dir")
73+
run(download_cmd(url, path))
74+
end
75+
else
76+
error("Unable to download the dataset. Please visit the website at http://yann.lecun.com/exdb/mnist and download the files manually.")
77+
end
78+
else
79+
info("Nothing to do.")
80+
end
81+
nothing
82+
end
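
The first step of `download_helper` above, finding which of the four archives are absent and naming them in the prompt, can be tried in isolation. The following is a minimal sketch under stated assumptions: `absent_files` and `quoted_list` are hypothetical names, the directory is a temporary scratch directory, and nothing is downloaded.

```julia
# File names as listed in Reader.jl; the helper names below are hypothetical.
mnist_files = ["train-images-idx3-ubyte.gz", "train-labels-idx1-ubyte.gz",
               "t10k-images-idx3-ubyte.gz", "t10k-labels-idx1-ubyte.gz"]

# Return the subset of `files` that does not exist in `dir`.
absent_files(dir, files) = filter(f -> !isfile(joinpath(dir, f)), files)

# Quote each name and join with commas, using " and " before the last entry.
quoted_list(files) = join(map(f -> "\"$f\"", files), ", ", " and ")

dir = mktempdir()                        # empty scratch directory
touch(joinpath(dir, mnist_files[1]))     # pretend one archive is already present

absent = absent_files(dir, mnist_files)
println("Missing: ", quoted_list(absent))
```

Checking per file rather than per directory means a partially completed download only fetches what is still missing.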

src/MNIST/Reader/readheader.jl

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
1+
"""
2+
readimageheader(io::IO)
3+
4+
Reads four 32-bit integers at the current position of `io` and
5+
interprets them as an MNIST-image-file header, which is described
6+
in detail in the table below.
7+
8+
        ║     First    │  Second  │  Third  │   Fourth
9+
════════╬══════════════╪══════════╪═════════╪════════════
10+
offset  ║     0000     │   0004   │  0008   │    0012
11+
descr   ║ magic number │ # images │ # rows  │ # columns
12+
13+
These four numbers are returned as a Tuple in the same storage order.
14+
"""
15+
function readimageheader(io::IO)
16+
magic_number = bswap(read(io, UInt32))
17+
total_items = bswap(read(io, UInt32))
18+
nrows = bswap(read(io, UInt32))
19+
ncols = bswap(read(io, UInt32))
20+
UInt32(magic_number), Int(total_items), Int(nrows), Int(ncols)
21+
end
22+
23+
"""
24+
readimageheader(file::AbstractString)
25+
26+
Opens `file` and reads its first four 32-bit values, which are
27+
returned interpreted as an MNIST-image-file header.
28+
"""
29+
function readimageheader(file::AbstractString)
30+
gzopen(readimageheader, file, "r")::Tuple{UInt32,Int,Int,Int}
31+
end
32+
33+
"""
34+
readlabelheader(io::IO)
35+
36+
Reads two 32-bit integers at the current position of `io` and
37+
interprets them as an MNIST-label-file header, which consists of a
38+
*magic number* and the *total number of labels* stored in the
39+
file. These two numbers are returned as a Tuple in the same
40+
storage order.
41+
"""
42+
function readlabelheader(io::IO)
43+
magic_number = bswap(read(io, UInt32))
44+
total_items = bswap(read(io, UInt32))
45+
UInt32(magic_number), Int(total_items)
46+
end
47+
48+
"""
49+
readlabelheader(file::AbstractString)
50+
51+
Opens `file` and reads its first two 32-bit values, which are
52+
returned interpreted as an MNIST-label-file header.
53+
"""
54+
function readlabelheader(file::AbstractString)
55+
gzopen(readlabelheader, file, "r")::Tuple{UInt32,Int}
56+
end
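
`readimageheader` above uses `bswap`, which gives the right values on little-endian hosts because the IDX files store their integers in big-endian order. A minimal sketch of the same parsing step using `ntoh`, which is a no-op on big-endian machines, is shown below. It runs against a synthetic in-memory header rather than a real MNIST file. `parse_image_header` is a hypothetical name, and 2051 (`0x00000803`) is the magic number documented for IDX image files.

```julia
# Hypothetical, portable variant of the header parsing; not the code from this commit.
function parse_image_header(io::IO)
    magic  = ntoh(read(io, UInt32))   # network (big-endian) order -> host order
    nitems = ntoh(read(io, UInt32))
    nrows  = ntoh(read(io, UInt32))
    ncols  = ntoh(read(io, UInt32))
    return magic, Int(nitems), Int(nrows), Int(ncols)
end

# Build a 16-byte header in memory: magic number, 60000 images of 28 × 28 pixels.
io = IOBuffer()
for value in (UInt32(2051), UInt32(60000), UInt32(28), UInt32(28))
    write(io, hton(value))            # host order -> big-endian, as stored on disk
end
seekstart(io)

header = parse_image_header(io)
println(header)
println(position(io))                 # bytes consumed by the header
```

After the call the stream sits at byte 16, which matches the `IMAGEOFFSET` constant defined in `Reader.jl`.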
