Commit 3d6cddf

Merge pull request #31 from JuliaML/iris
add Iris Dataset
2 parents 8142bba + b945e22 commit 3d6cddf

12 files changed, +178 -19 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1 +1,2 @@
 sandbox.jl
+Manifest.toml

.travis.yml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ os:
   - osx
 julia:
   - 1.0
-  - 1.2
+  - 1.3
   - nightly
 notifications:
   email: false

Project.toml

Lines changed: 2 additions & 1 deletion
@@ -1,11 +1,12 @@
 name = "MLDatasets"
 uuid = "eb30cadb-4394-5ae3-aed4-317e484a6458"
-version = "0.4.0"
+version = "0.4.1"

 [deps]
 BinDeps = "9e28174c-4ba2-5203-b857-d8d62c4213ee"
 ColorTypes = "3da002f7-5984-5a60-b8a6-cbb66c0b333f"
 DataDeps = "124859b0-ceae-595e-8997-d05f6a7a8dfe"
+DelimitedFiles = "8bb1440f-4735-579b-a4ab-409b98df4dab"
 FixedPointNumbers = "53c48c17-4a7d-5ca2-90c5-79b7896eea93"
 GZip = "92fee26a-97fe-5a0c-ad85-20a5f3185b63"
 Requires = "ae029012-a4dd-5104-9daa-d747884805df"

README.md

Lines changed: 6 additions & 0 deletions
@@ -75,6 +75,12 @@ Dataset | Classes | `traintensor` | `trainlabels` | `testtensor` | `testlabels`

 (*) Note that the SVHN-2 dataset provides an additional 531131 observations aside from the training- and testset

+
+### Misc. Datasets
+Dataset | Classes | `traintensor` | `trainlabels` | `testtensor` | `testlabels`
+:------:|:-------:|:-------------:|:-------------:|:------------:|:------------:
+**Iris** | 3 | 4x150 | 150 | - | -
+
 ### Language Modeling

 #### PTBLM

REQUIRE

Lines changed: 0 additions & 7 deletions
This file was deleted.

docs/src/index.md

Lines changed: 53 additions & 6 deletions
@@ -69,17 +69,64 @@ the purpose of image classification.

 Dataset | Classes | `traintensor` | `trainlabels` | `testtensor` | `testlabels`
 :------:|:-------:|:-------------:|:-------------:|:------------:|:------------:
-[**MNIST**](@ref MNIST) | 10 | 28x28x60000 | 60000 | 28x28x10000 | 10000
-[**FashionMNIST**](@ref FashionMNIST) | 10 | 28x28x60000 | 60000 | 28x28x10000 | 10000
-[**CIFAR-10**](@ref CIFAR10) | 10 | 32x32x3x50000 | 50000 | 32x32x3x10000 | 10000
-[**CIFAR-100**](@ref CIFAR100) | 100 (20) | 32x32x3x50000 | 50000 (x2) | 32x32x3x10000 | 10000 (x2)
-[**SVHN-2**](@ref SVHN2) (*) | 10 | 32x32x3x73257 | 73257 | 32x32x3x26032 | 26032
+[**MNIST**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/MNIST/) | 10 | 28x28x60000 | 60000 | 28x28x10000 | 10000
+[**FashionMNIST**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/FashionMNIST/) | 10 | 28x28x60000 | 60000 | 28x28x10000 | 10000
+[**CIFAR-10**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CIFAR10/) | 10 | 32x32x3x50000 | 50000 | 32x32x3x10000 | 10000
+[**CIFAR-100**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CIFAR100/) | 100 (20) | 32x32x3x50000 | 50000 (x2) | 32x32x3x10000 | 10000 (x2)
+[**SVHN-2**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/SVHN2/) (*) | 10 | 32x32x3x73257 | 73257 | 32x32x3x26032 | 26032

 (*) Note that the SVHN-2 dataset provides an additional 531131 observations aside from the training- and testset

+
+### Misc. Datasets
+Dataset | Classes | `traintensor` | `trainlabels` | `testtensor` | `testlabels`
+:------:|:-------:|:-------------:|:-------------:|:------------:|:------------:
+**Iris** | 3 | 4x150 | 150 | - | -
+
 ### Language Modeling

-Work in progress
+#### PTBLM
+
+The `PTBLM` dataset consists of Penn Treebank sentences for
+language modeling, available from
+[tomsercu/lstm](https://github.com/tomsercu/lstm). The unknown
+words are replaced with `<unk>` so that the total vocabulary size
+becomes 10000.
+
+This is the first sentence of the PTBLM dataset.
+
+```julia
+x, y = PTBLM.traindata()
+
+x[1]
+> ["no", "it", "was", "n't", "black", "monday"]
+y[1]
+> ["it", "was", "n't", "black", "monday", "<eos>"]
+```
+
+where `MLDatasets` adds the special word `<eos>` to the end of `y`.
+
+### Text Analysis (POS-Tagging, Parsing)
+
+#### UD English
+
+The [UD_English](https://github.com/UniversalDependencies/UD_English-EWT)
+Universal Dependencies English Web Treebank dataset is an annotated corpus of morphological features,
+POS-tags and syntactic trees. The dataset follows CoNLL-style
+format.
+
+```julia
+traindata = UD_English.traindata()
+devdata = UD_English.devdata()
+testdata = UD_English.testdata()
+```
+
+## Data Size
+| | Train x | Train y | Test x | Test y |
+|:--:|:-------:|:-------:|:------:|:------:|
+| **PTBLM** | 42068 | 42068 | 3761 | 3761 |
+| **UD_English** | 12543 | - | 2077 | - |
+

 ## Index

src/Iris/Iris.jl

Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
+export Iris
+
+"""
+Fisher's classic iris dataset.
+
+Measurements from 3 different species of iris: setosa, versicolor and
+virginica. There are 50 examples of each species.
+
+There are 4 measurements for each example: sepal length, sepal width, petal
+length and petal width. The measurements are in centimeters.
+
+The module retrieves the data from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/iris).
+
+NOTE: no pre-defined train-test split for this dataset, `features` and `labels` return the whole dataset.
+
+## Interface
+
+- [`Iris.features`](@ref)
+- [`Iris.labels`](@ref)
+
+## Utilities
+
+- [`Iris.download`](@ref)
+"""
+module Iris
+
+using DataDeps
+using ..MLDatasets: bytes_to_type, datafile, download_dep, download_docstring
+using DelimitedFiles
+
+export features, labels, download
+
+const DEPNAME = "Iris"
+const LINK = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/"
+const DOCS = "https://archive.ics.uci.edu/ml/datasets/Iris"
+const DATA = "iris.data"
+
+"""
+    download([dir]; [i_accept_the_terms_of_use])
+
+Trigger the (interactive) download of the full dataset into
+"`dir`". If no `dir` is provided the dataset will be
+downloaded into "~/.julia/datadeps/$DEPNAME".
+
+This function will display an interactive dialog unless
+either the keyword parameter `i_accept_the_terms_of_use` or
+the environment variable `DATADEPS_ALWAYS_ACCEPT` is set to
+`true`. Note that using the data responsibly and respecting
+copyright/terms-of-use remains your responsibility.
+"""
+download(args...; kw...) = download_dep(DEPNAME, args...; kw...)
+
+function __init__()
+    register(DataDep(
+        DEPNAME,
+        """
+        Dataset: The Iris dataset
+        Website: $DOCS
+        """,
+        LINK .* [DATA],
+        "1ec014c249120402fc228dbab231129b87a7359699675059035af0f4adc3b863" # if checksum omitted, will be generated by DataDeps
+    ))
+end
+
+"""
+    labels(; dir = nothing)
+
+Return a string vector of length 150 containing observations' labels.
+"""
+
+function labels(; dir = nothing)
+    path = datafile(DEPNAME, DATA, dir)
+    iris = readdlm(path, ',')
+    Vector{String}(iris[:, end])
+end
+
+"""
+    features(; dir = nothing)
+
+Return a 4x150 matrix containing the 4-dimensional features of each observation.
+"""
+function features(; dir = nothing)
+    path = datafile(DEPNAME, DATA, dir)
+    iris = readdlm(path, ',')
+    Matrix{Float64}(iris[:, 1:4])' |> collect
+end
+
+end # module
+
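
As a rough illustration of what `features` and `labels` in the file above do with `iris.data`, the following sketch applies the same `readdlm` parsing to an in-memory sample instead of the downloaded file. The three sample rows, the `IOBuffer` source, and the names `sample`, `raw`, `X`, and `y` are assumptions made for this example; they are not part of the commit.

```julia
using DelimitedFiles  # standard library; provides readdlm, as used by the Iris module

# Hypothetical sample: three rows in the comma-separated layout of `iris.data`
# (four numeric measurements followed by the species name).
sample = """
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# readdlm accepts any IO source; mixed numeric/text columns yield a Matrix{Any}.
raw = readdlm(IOBuffer(sample), ',')

# Same conversions as in `features` and `labels`: the first four columns become
# a Float64 matrix with one column per observation, the last column a String vector.
X = Matrix{Float64}(raw[:, 1:4])' |> collect
y = Vector{String}(raw[:, end])

@show size(X)
@show y
```

With the package installed and the data dependency downloaded, `Iris.features()` and `Iris.labels()` should return the corresponding 4x150 matrix and 150-element vector for the full dataset, as stated in their docstrings.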

src/MLDatasets.jl

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@ end

 include("download.jl")
 include("CoNLL.jl")
-
+include("Iris/Iris.jl")
 include("CIFAR10/CIFAR10.jl")
 include("CIFAR100/CIFAR100.jl")
 include("MNIST/MNIST.jl")

test/REQUIRE

Lines changed: 0 additions & 2 deletions
This file was deleted.

test/runtests.jl

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@ using Test
 using MLDatasets

 tests = [
+    "tst_iris.jl",
     "tst_cifar10.jl",
     "tst_cifar100.jl",
     "tst_mnist.jl",
