Commit da94645

Merge pull request #12 from JuliaML/datadeps
Switch download backend to DataDeps
2 parents d41650d + 8498c72 commit da94645

33 files changed: +2872 -1012 lines changed

README.md

Lines changed: 12 additions & 9 deletions
@@ -2,12 +2,9 @@
 
 [![Build Status](https://travis-ci.org/JuliaML/MLDatasets.jl.svg?branch=master)](https://travis-ci.org/JuliaML/MLDatasets.jl)
 
-`MLDatasets` provides an access to common machine learning
-datasets for [Julia](http://julialang.org/). Currently, julia 0.5
-is supported.
-
-The datasets are automatically downloaded to the specified
-directory. The default directory is `MLDatasets/datasets`.
+`MLDatasets` provides access to common machine learning datasets
+for [Julia](http://julialang.org/). Currently, julia 0.6 is
+supported.
 
 ## Installation
 
@@ -33,15 +30,21 @@ Use `traindata(<directory>)` and `testdata(<directory>)` to change the default d
 #### CIFAR-10
 
 The [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html)
-dataset consists of 60000 32x32 color images in 10 classes.
+dataset consists of 60000 32x32 RGB images in 10 classes.
+
+Take a look at the [sub-module](src/CIFAR10/README.md) for more
+information
 
 #### CIFAR-100
 
 The [CIFAR-100](https://www.cs.toronto.edu/~kriz/cifar.html)
-dataset consists of 600 32x32 color images in 100 classes. The
+dataset consists of 60000 32x32 color images in 100 classes. The
 100 classes are grouped into 20 superclasses (fine and coarse
 labels).
 
+Take a look at the [sub-module](src/CIFAR100/README.md) for more
+information
+
 #### MNIST
 
 The [MNIST](http://yann.lecun.com/exdb/mnist/) dataset consists
@@ -101,7 +104,7 @@ testdata = UD_English.devdata()
 | | Type | Train x | Train y | Test x | Test y |
 |:---:|:---:|:---:|:---:|:---:|:---:|
 | **CIFAR-10** | image | 32x32x3x50000 | 50000 | 32x32x3x10000 | 10000 |
-| **CIFAR-100** | image | 32x32x3x500 | 2x500 | 32x32x3x100 | 2x100 |
+| **CIFAR-100** | image | 32x32x3x50000 | 50000 (x2) | 32x32x3x10000 | 10000 (x2) |
 | **MNIST** | image | 28x28x60000 | 60000 | 28x28x10000 | 10000 |
 | **FashionMNIST** | image | 28x28x60000 | 60000 | 28x28x10000 | 10000 |
 | **PTBLM** | text | 42068 | 42068 | 3761 | 3761 |

REQUIRE

Lines changed: 2 additions & 0 deletions
@@ -1,5 +1,7 @@
 julia 0.6
 ImageCore 0.1.2
+FixedPointNumbers 0.3
 ColorTypes 0.4
+DataDeps
 GZip
 BinDeps

src/CIFAR10.jl

Lines changed: 0 additions & 43 deletions
This file was deleted.

src/CIFAR10/CIFAR10.jl

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
+export CIFAR10
+module CIFAR10
+using DataDeps
+using BinDeps
+using ImageCore
+using ColorTypes
+using FixedPointNumbers
+using ..bytes_to_type
+using ..datafile
+using ..download_dep
+using ..download_docstring
+
+export
+
+    classnames,
+
+    traintensor,
+    testtensor,
+
+    trainlabels,
+    testlabels,
+
+    traindata,
+    testdata,
+
+    convert2image,
+    convert2features,
+
+    download
+
+const DEPNAME = "CIFAR10"
+const NCHUNKS = 5
+
+filename_for_chunk(file_index::Int) =
+    joinpath("cifar-10-batches-bin", "data_batch_$(file_index).bin")
+
+const TESTSET_FILENAME =
+    joinpath("cifar-10-batches-bin", "test_batch.bin")
+
+const CLASSES = [
+    "airplane",
+    "automobile",
+    "bird",
+    "cat",
+    "deer",
+    "dog",
+    "frog",
+    "horse",
+    "ship",
+    "truck",
+]
+
+download(args...; kw...) = download_dep(DEPNAME, args...; kw...)
+
+include(joinpath("Reader","Reader.jl"))
+include("interface.jl")
+include("utils.jl")
+
+function __init__()
+    RegisterDataDep(
+        DEPNAME,
+        """
+        Dataset: The CIFAR-10 dataset
+        Authors: Alex Krizhevsky, Vinod Nair, Geoffrey Hinton
+        Website: https://www.cs.toronto.edu/~kriz/cifar.html
+        Reference: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
+
+        [Krizhevsky, 2009]
+            Alex Krizhevsky.
+            "Learning Multiple Layers of Features from Tiny Images",
+            Tech Report, 2009.
+
+        The CIFAR-10 dataset is a labeled subset of the 80
+        million tiny images dataset. It consists of 60000
+        32x32 colour images in 10 classes, with 6000 images
+        per class.
+
+        The compressed archive file that contains the
+        complete dataset is available for download at the
+        official website linked above; specifically the binary
+        version for C programs. Note that using the data
+        responsibly and respecting copyright remains your
+        responsibility. The authors of CIFAR-10 aren't really
+        explicit about any terms of use, so please read the
+        website to make sure you want to download the
+        dataset.
+        """,
+        "https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz",
+        "c4a38c50a1bc5f3a1c5537f2155ab9d68f9f25eb1ed8d9ddda3db29a59bca1dd",
+        post_fetch_method = file -> (run(BinDeps.unpack_cmd(file,dirname(file), ".gz", ".tar")); rm(file))
+    )
+end
+end
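
Reviewer note: the training set is spread over `NCHUNKS = 5` batch files, and `Reader.jl` in this commit reads `CHUNK_SIZE = 10_000` records per file. The commit's `interface.jl` is not shown here, so the following is only a sketch of how a global training index could be mapped to a batch file and a position inside it. `locate` is a hypothetical helper name, and the code is written for Julia 1.x rather than the 0.6 this commit targets.

```julia
const NCHUNKS = 5          # from CIFAR10.jl in this commit
const CHUNK_SIZE = 10_000  # from Reader.jl in this commit

filename_for_chunk(file_index::Int) =
    joinpath("cifar-10-batches-bin", "data_batch_$(file_index).bin")

# Hypothetical helper: map a global training index (1..50000) to the batch
# file that holds it and the 1-based position inside that file.
function locate(index::Integer)
    total = NCHUNKS * CHUNK_SIZE
    1 <= index <= total || throw(BoundsError(1:total, index))
    chunk = div(index - 1, CHUNK_SIZE) + 1   # which data_batch_N.bin
    local_index = mod1(index, CHUNK_SIZE)    # position within that file
    return filename_for_chunk(chunk), local_index
end

file, pos = locate(10_001)
```

Under these assumptions, index 10001 is the first record of the second batch file.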

src/CIFAR10/README.md

Lines changed: 81 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,81 @@
1+
# CIFAR-10
2+
3+
Description from the [original
4+
website](https://www.cs.toronto.edu/~kriz/cifar.html)
5+
6+
> The CIFAR-10 and CIFAR-100 are labeled subsets of the
7+
> [80 million tiny images](http://people.csail.mit.edu/torralba/tinyimages/)
8+
> dataset. They were collected by Alex Krizhevsky, Vinod Nair,
9+
> and Geoffrey Hinton.
10+
>
11+
> The CIFAR-10 dataset consists of 60000 32x32 colour images in
12+
> 10 classes, with 6000 images per class. There are 50000
13+
> training images and 10000 test images.
14+
15+
## Usage
16+
17+
This sub-module provides a programmatic interface to download,
18+
load, and work with the CIFAR-10 dataset.
19+
20+
```julia
21+
using MLDatasets
22+
23+
# download dataset
24+
CIFAR10.download()
25+
26+
# load full training set
27+
train_x, train_y = CIFAR10.traindata()
28+
29+
# load full test set
30+
test_x, test_y = CIFAR10.testdata()
31+
```
32+
33+
The provided functions also allow for optional arguments, such as
34+
the directory `dir` where the dataset is located, or the specific
35+
observation `indices` that one wants to work with. For more
36+
information on the interface take a look at the documentation
37+
(e.g. `?CIFAR10.traindata`).
38+
39+
Function | Description
40+
---------|-------------
41+
`download([dir])` | Trigger interactive download of the dataset
42+
`classnames()` | Return the class names as a vector of strings
43+
`traintensor([T], [indices]; [dir])` | Load the training images as an array of eltype `T`
44+
`trainlabels([indices]; [dir])` | Load the labels for the training images
45+
`testtensor([T], [indices]; [dir])` | Load the test images as an array of eltype `T`
46+
`testlabels([indices]; [dir])` | Load the labels for the test images
47+
`traindata([T], [indices]; [dir])` | Load images and labels of the training data
48+
`testdata([T], [indices]; [dir])` | Load images and labels of the test data
49+
50+
This module also provides utility functions to make working with
51+
the CIFAR10 dataset in Julia more convenient.
52+
53+
You can use the function `convert2features` to convert the given
54+
CIFAR10 tensor to a feature matrix (or feature vector in the case
55+
of a single image). The purpose of this function is to drop the
56+
spatial dimensions such that traditional ML algorithms can
57+
process the dataset.
58+
59+
```julia
60+
julia> CIFAR10.convert2features(CIFAR10.traintensor()) # full training data
61+
3072×50000 Array{N0f8,2}:
62+
[...]
63+
```
64+
65+
To visualize an image or a prediction we provide the function
66+
`convert2image` to convert the given CIFAR10 horizontal-major
67+
tensor (or feature matrix) to a vertical-major `Colorant` array.
68+
69+
```julia
70+
julia> CIFAR10.convert2image(CIFAR10.traintensor(1)) # first training image
71+
32×32 Array{RGB{N0f8},2}:
72+
[...]
73+
```
74+
75+
## References
76+
77+
- **Authors**: Alex Krizhevsky, Vinod Nair, Geoffrey Hinton
78+
79+
- **Website**: https://www.cs.toronto.edu/~kriz/cifar.html
80+
81+
- **[Krizhevsky, 2009]** Alex Krizhevsky. ["Learning Multiple Layers of Features from Tiny Images"](https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf), Tech Report, 2009.
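
Reviewer note: the README says `convert2features` drops the spatial dimensions and shows a 3072×50000 result for the full training tensor. The commit's `utils.jl` is not shown here, so the snippet below is only a sketch of one plausible way to get that shape with a plain `reshape`. `flatten_features` is a hypothetical name, not the package function, and the code assumes Julia 1.x.

```julia
# Sketch only: flatten a 32x32x3xN tensor into a 3072xN feature matrix.
# `flatten_features` is a hypothetical stand-in, not `convert2features`.
flatten_features(x::AbstractArray{T,4}) where {T} =
    reshape(x, size(x, 1) * size(x, 2) * size(x, 3), size(x, 4))

# A single image (three dimensions) becomes a feature vector.
flatten_features(x::AbstractArray{T,3}) where {T} = vec(x)

batch = zeros(UInt8, 32, 32, 3, 5)   # synthetic stand-in for five images
features = flatten_features(batch)   # one column per observation
single = flatten_features(zeros(UInt8, 32, 32, 3))
```

Because `reshape` and `vec` share memory with their input, a flattening of this kind does not copy the image data.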

src/CIFAR10/Reader/Reader.jl

Lines changed: 54 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,54 @@
1+
module Reader
2+
3+
export
4+
5+
readdata!,
6+
readdata
7+
8+
const NROW = 32
9+
const NCOL = 32
10+
const NCHAN = 3
11+
const NBYTE = NROW * NCOL * NCHAN + 1 # "+ 1" for label
12+
const CHUNK_SIZE = 10_000
13+
14+
function readnext!(buffer::Array{UInt8}, io::IO)
15+
y = Int(read(io, UInt8))
16+
read!(io, buffer)
17+
buffer, y
18+
end
19+
20+
function readdata!(buffer::Array{UInt8}, io::IO, index::Integer)
21+
seek(io, (index - 1) * NBYTE)
22+
readnext!(buffer, io)
23+
end
24+
25+
function readdata(io::IO, index::Integer)
26+
buffer = Array{UInt8}(NROW, NCOL, NCHAN)
27+
readdata!(buffer, io, index)
28+
end
29+
30+
function readdata(io::IO)
31+
X = Array{UInt8}(NROW, NCOL, NCHAN, CHUNK_SIZE)
32+
Y = Array{Int}(CHUNK_SIZE)
33+
buffer = Array{UInt8}(NROW, NCOL, NCHAN)
34+
@inbounds for index in 1:CHUNK_SIZE
35+
_, ty = readnext!(buffer, io)
36+
copy!(view(X,:,:,:,index), buffer)
37+
Y[index] = ty
38+
end
39+
X, Y
40+
end
41+
42+
function readdata(file::AbstractString, index::Integer)
43+
open(file, "r") do io
44+
readdata(io, index)
45+
end::Tuple{Array{UInt8,3},Int}
46+
end
47+
48+
function readdata(file::AbstractString)
49+
open(file, "r") do io
50+
readdata(io)
51+
end::Tuple{Array{UInt8,4},Vector{Int}}
52+
end
53+
54+
end
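
Reviewer note: the reader relies on fixed-size records, one label byte followed by 32 × 32 × 3 = 3072 pixel bytes, so record `index` starts at byte offset `(index - 1) * NBYTE`. The sketch below restates that arithmetic against an in-memory buffer instead of a real batch file. It targets Julia 1.x (hence `undef` in the array constructor, which the Julia 0.6 code in the diff does not need), and `record_offset` and `read_record` are hypothetical names introduced for illustration.

```julia
const NROW = 32
const NCOL = 32
const NCHAN = 3
const NBYTE = NROW * NCOL * NCHAN + 1   # 1 label byte + 3072 pixel bytes

# Byte offset of the record for a 1-based observation index.
record_offset(index::Integer) = (index - 1) * NBYTE

# Hypothetical helper: read one (pixels, label) record from a seekable IO.
function read_record(io::IO, index::Integer)
    seek(io, record_offset(index))
    label = Int(read(io, UInt8))                     # first byte is the label
    pixels = Array{UInt8}(undef, NROW, NCOL, NCHAN)
    read!(io, pixels)                                # next 3072 bytes
    return pixels, label
end

# Two synthetic records stand in for a real data_batch file.
fake = IOBuffer(vcat(UInt8[3], fill(0x01, NBYTE - 1),
                     UInt8[7], fill(0x02, NBYTE - 1)))
pixels, label = read_record(fake, 2)
```

Seeking by a fixed stride is what lets a single observation be loaded without reading the records that precede it.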
