Commit 6d8200a

slimmer docs
1 parent f12c03e commit 6d8200a

14 files changed

+428
-538
lines changed

.github/workflows/Documenter.yml

Lines changed: 2 additions & 3 deletions
@@ -14,9 +14,8 @@ jobs:
 build:
 runs-on: ${{ matrix.os }}
 strategy:
-matrix:
-julia-version: [1]
-os: [ubuntu-latest]
+julia-version: 1.6
+os: ubuntu-latest
 steps:
 - uses: actions/checkout@v2
 - uses: julia-actions/setup-julia@latest

README.md

Lines changed: 34 additions & 155 deletions
@@ -1,179 +1,58 @@
 # MLDatasets.jl
 
-_This package represents a community effort to provide a common
-interface for accessing common Machine Learning (ML) datasets. In
-contrast to other data-related Julia packages, the focus of
-`MLDatasets.jl` is specifically on downloading, unpacking, and
-accessing benchmark dataset. Functionality for the purpose of
-data processing or visualization is only provided to a degree
-that is special to some dataset._
-
-| **Package Status** | **Build Status** |
+| **Documentation** | **Build Status** |
 |:------------------:|:-----------------:|
-| [![License](http://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](LICENSE.md) [![Docs](https://img.shields.io/badge/docs-stable-blue.svg)](https://JuliaML.github.io/MLDatasets.jl/stable) | [![Build Status](https://github.com/JuliaML/MLDatasets.jl/workflows/Unit%20test/badge.svg)](https://github.com/JuliaML/MLDatasets.jl/actions)|
+| ![Docs][docs-stable-img](docs-stable-url) [![Docs][docs-latest-img](docs-latest-url) | [![Build Status](https://github.com/JuliaML/MLDatasets.jl/workflows/Unit%20test/badge.svg)](https://github.com/JuliaML/MLDatasets.jl/actions)|
 
-This package is a part of the
-[`JuliaML`](https://github.com/JuliaML) ecosystem. Its
-functionality is build on top of the package
-[`DataDeps.jl`](https://github.com/oxinabox/DataDeps.jl).
-
-## Introduction
+[docs-stable-img]: https://img.shields.io/badge/docs-stable-blue.svg
+[docs-latest-img]: https://img.shields.io/badge/docs-latest-blue.svg
+[docs-stable-url]: https://JuliaML.github.io/MLDatasets.jl/stable
+[docs-latest-url]: https://JuliaML.github.io/MLDatasets.jl/latest
 
-The way `MLDatasets.jl` is organized is that each dataset has its
-own dedicated sub-module. Where possible, those sub-module share
-a common interface for interacting with the datasets. For example
-you can load the training set and the test set of the MNIST
-database of handwritten digits using the following commands:
+This package represents a community effort to provide a common interface for accessing common Machine Learning (ML) datasets.
+In contrast to other data-related Julia packages, the focus of `MLDatasets.jl` is specifically on downloading, unpacking, and accessing benchmark datasets.
+Functionality for the purpose of data processing or visualization is only provided to a degree that is special to some dataset.
 
-```julia
-using MLDatasets
-
-train_x, train_y = MNIST.traindata()
-test_x, test_y = MNIST.testdata()
-```
+This package is a part of the
+[`JuliaML`](https://github.com/JuliaML) ecosystem.
+Its functionality is built on top of the package
+[`DataDeps.jl`](https://github.com/oxinabox/DataDeps.jl).
 
-To load the data the package looks for the necessary files in
-various locations (see
-[`DataDeps.jl`](https://github.com/oxinabox/DataDeps.jl#configuration)
-for more information on how to configure such defaults). If the
-data can't be found in any of those locations, then the package
-will trigger a download dialog to `~/.julia/datadeps/MNIST`. To
-overwrite this on a case by case basis, it is possible to specify
-a data directory directly in `traindata(dir = <directory>)` and
-`testdata(dir = <directory>)`.
 
 ## Available Datasets
 
-Check out the **[latest
-documentation](https://juliaml.github.io/MLDatasets.jl/latest)**
-
-Additionally, you can make use of Julia's native docsystem.
-The following example shows how to get additional information
-on `MNIST.traintensor` within Julia's REPL:
-
-```julia
-?MNIST.traintensor
-```
-
-Each dataset has its own dedicated sub-module. As such, it makes
-sense to document their functionality similarly distributed. Find
-below a list of available datasets and links to their their
-documentation.
-
-### Image Classification
-
-This package provides a variety of common benchmark datasets for
-the purpose of image classification.
-
-Dataset | Classes | `traintensor` | `trainlabels` | `testtensor` | `testlabels`
-:------:|:-------:|:-------------:|:-------------:|:------------:|:------------:
-[**MNIST**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/MNIST/) | 10 | 28x28x60000 | 60000 | 28x28x10000 | 10000
-[**FashionMNIST**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/FashionMNIST/) | 10 | 28x28x60000 | 60000 | 28x28x10000 | 10000
-[**CIFAR-10**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CIFAR10/) | 10 | 32x32x3x50000 | 50000 | 32x32x3x10000 | 10000
-[**CIFAR-100**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CIFAR100/) | 100 (20) | 32x32x3x50000 | 50000 (x2) | 32x32x3x10000 | 10000 (x2)
-[**SVHN-2**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/SVHN2/) (*) | 10 | 32x32x3x73257 | 73257 | 32x32x3x26032 | 26032
-
-(*) Note that the SVHN-2 dataset provides an additional 531131 observations aside from the training- and testset
-
-[**EMNIST**](https://www.nist.gov/itl/products-and-services/emnist-dataset) packages 6 different extensions of the MNIST dataset involving letters and digits and variety of test train split options. Each extension has the standard test/train data/labels nested under it as shown below.
-
-```julia
-traindata = EMNIST.Balanced.traindata()
-testdata = EMNIST.Balanced.testdata()
-trainlabels = EMNIST.Balanced.trainlabels()
-testlabels = EMNIST.Balanced.testlabels()
-```
-
-Dataset | Classes | `traindata` | `trainlabels` | `testdata` | `testlabels` | `balanced classes`
-:------:|:-------:|:-------------:|:-------------:|:------------:|:------------:|:------------:
-**ByClass** | 62 | 697932x28x28 | 697932x1 | 116323x28x28 | 116323x1 | no
-**ByMerge** | 47 | 697932x28x28 | 697932x1 | 116323x28x28 | 116323x1 | no
-**Balanced** | 47 | 112800x28x28 | 112800x1 | 18800x28x28 | 18800x1 | yes
-**Letters** | 26 | 124800x28x28 | 124800x1 | 20800x28x28 | 208000x1 | yes
-**Digits** | 10 | 240000x28x28 | 240000x1 | 40000x28x28 | 40000x1 | yes
-**MNIST** | 10 | 60000x28x28 | 60000x1 | 10000x28x28 | 10000x1 | yes
+Each dataset has its own dedicated sub-module.
+Find below a list of available datasets and links to their documentation.
 
-### Misc. Datasets
+#### Vision
+- [CIFAR10](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CIFAR100/)
+- [CIFAR100](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CIFAR100/)
+- [EMNIST](https://juliaml.github.io/MLDatasets.jl/latest/datasets/EMNIST/)
+- [FashionMNIST](https://juliaml.github.io/MLDatasets.jl/latest/datasets/FashionMNIST/)
+- [MNIST](https://juliaml.github.io/MLDatasets.jl/latest/datasets/MNIST/)
+- [SVHN2](https://juliaml.github.io/MLDatasets.jl/latest/datasets/SVHN2/)
 
-Dataset | Classes | `traintensor` | `trainlabels` | `testtensor` | `testlabels`
-:------:|:-------:|:-------------:|:-------------:|:------------:|:------------:
-**Iris** | 3 | 4x150 | 150 | - | -
-**BostongHousing** | - | 13x506 | 1x506 | - | -
 
-### Language Modeling
+#### Miscellaneous
+- [BostonHousing](https://juliaml.github.io/MLDatasets.jl/latest/datasets/BostonHousing/)
+- [Iris](https://juliaml.github.io/MLDatasets.jl/latest/datasets/Iris/)
 
-| | Train x | Train y | Test x | Test y |
-|:--:|:-------:|:-------:|:------:|:------:|
-| **PTBLM** | 42068 | 42068 | 3761 | 3761 |
-| **UD_English** | 12543 | - | 2077 | - |
 
-#### PTBLM
+#### Text
+- [PTBLM](https://juliaml.github.io/MLDatasets.jl/latest/datasets/PTBLM/)
+- [UD_English](https://juliaml.github.io/MLDatasets.jl/latest/datasets/UD_English/)
 
-The `PTBLM` dataset consists of Penn Treebank sentences for
-language modeling, available from
-[tomsercu/lstm](https://github.com/tomsercu/lstm). The unknown
-words are replaced with `<unk>` so that the total vocabulary size
-becomes 10000.
+#### Graphs
+- To be added.
 
-This is the first sentence of the PTBLM dataset.
+#### Audio
+- To be added.
 
-```julia
-x, y = PTBLM.traindata()
-
-x[1]
-> ["no", "it", "was", "n't", "black", "monday"]
-y[1]
-> ["it", "was", "n't", "black", "monday", "<eos>"]
-```
-
-where `MLDataset` adds the special word: `<eos>` to the end of `y`.
-
-### Text Analysis (POS-Tagging, Parsing)
-
-#### UD English
-
-The [UD_English](https://github.com/UniversalDependencies/UD_English-EWT)
-Universal Dependencies English Web Treebank dataset is an annotated corpus of morphological features,
-POS-tags and syntactic trees. The dataset follows CoNLL-style
-format.
-
-```julia
-traindata = UD_English.traindata()
-devdata = UD_English.devdata()
-testdata = UD_English.devdata()
-```
-
-## Documentation
-
-Check out the **[latest
-documentation](https://JuliaML.github.io/MLDatasets.jl/stable)**
-
-Additionally, you can make use of Julia's native docsystem.
-The following example shows how to get additional information
-on `MNIST.convert2image` within Julia's REPL:
-
-```julia
-?MNIST.convert2image
-```
-```
-convert2image(array) -> Array{Gray}
-
-Convert the given MNIST horizontal-major tensor (or feature matrix) to a vertical-major Colorant array. The values are also color corrected according to
-the website's description, which means that the digits are black on a white background.
-
-julia> MNIST.convert2image(MNIST.traintensor()) # full training dataset
-28×28×60000 Array{Gray{N0f8},3}:
-[...]
-
-julia> MNIST.convert2image(MNIST.traintensor(1)) # first training image
-28×28 Array{Gray{N0f8},2}:
-[...]
-```
 
 ## Installation
 
-To install `MLDatasets.jl`, start up Julia and type the following
-code snippet into the REPL. It makes use of the native Julia
+To install `MLDatasets.jl`, start up Julia and type the following code snippet into the REPL.
+It makes use of the native Julia
 package manger.
 
 ```julia

docs/Project.toml

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
 [deps]
-Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
+Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"

docs/make.jl

Lines changed: 26 additions & 9 deletions
@@ -1,29 +1,46 @@
 using Documenter, MLDatasets
 
+DocMeta.setdocmeta!(MLDatasets, :DocTestSetup, :(using MLDatasets); recursive=true)
+
+# Build documentation.
+# ====================
+
 makedocs(
 modules = [MLDatasets],
+doctest = true,
 clean = false,
-format = Documenter.HTML(
-prettyurls = haskey(ENV, "CI"),
-assets = [joinpath("assets", "favicon.ico")]
-),
 sitename = "MLDatasets.jl",
+format = Documenter.HTML(
+canonical = "https://juliadata.github.io/MLDatasets.jl/stable/",
+assets = ["assets/favicon.ico"],
+prettyurls = get(ENV, "CI", nothing) == "true"
+),
+
 authors = "Hiroyuki Shindo, Christof Stocker",
-linkcheck = !("skiplinks" in ARGS),
 pages = Any[
 "Home" => "index.md",
 "Available Datasets" => Any[
-"Image Classification" => Any[
-"MNIST handwritten digits" => "datasets/MNIST.md",
-"Fashion MNIST" => "datasets/FashionMNIST.md",
+"Vision" => Any[
+"MNIST" => "datasets/MNIST.md",
+"FashionMNIST" => "datasets/FashionMNIST.md",
 "CIFAR-10" => "datasets/CIFAR10.md",
 "CIFAR-100" => "datasets/CIFAR100.md",
 "SVHN format 2" => "datasets/SVHN2.md",
 ],
+"Misc." => Any[
+"Iris" => "datasets/Iris.md",
+"Boston Housing" => "datasets/BostonHousing.md",
+],
+
+"Text" => Any[
+"PTBLM" => "datasets/PTBLM.md",
+"UD_English" => "datasets/UD_English.md",
+],
+
 ],
-hide("Indices" => "indices.md"),
 "LICENSE.md",
 ]
 )
 
+
 deploydocs(repo = "github.com/JuliaML/MLDatasets.jl.git")
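
Note on the `prettyurls` change in this file: `haskey(ENV, "CI")` is true whenever the variable is set to anything, whereas `get(ENV, "CI", nothing) == "true"` is true only when it holds the exact string `"true"`. A minimal sketch of the difference, using only Julia's `Base`; the helper names `old_rule`, `new_rule`, and `compare` are hypothetical and do not appear in `docs/make.jl`:

```julia
# Hypothetical helpers mirroring the two `prettyurls` expressions in the diff.
old_rule() = haskey(ENV, "CI")                  # expression on the removed line
new_rule() = get(ENV, "CI", nothing) == "true"  # expression on the added line

# Evaluate both rules with CI temporarily set to `value`.
# Passing `nothing` unsets the variable for the duration of the call.
compare(value) = withenv("CI" => value) do
    (before = old_rule(), after = new_rule())
end

compare("true")   # both rules enable pretty URLs
compare("false")  # only the old rule enables them
compare(nothing)  # neither rule enables them
```

Under this reading, the new expression is the stricter of the two: a CI system that sets `CI` to some value other than `"true"` would no longer get pretty URLs.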

docs/src/datasets/BostonHousing.md

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# Boston Housing
+
+```@docs
+BostonHousing
+```
+
+## API reference
+
+```@docs
+BostonHousing.feature_names
+BostonHousing.features
+BostonHousing.targets
+```

docs/src/datasets/EMNIST.md

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+# EMNIST
+
+[**EMNIST**](https://www.nist.gov/itl/products-and-services/emnist-dataset) packages 6 different extensions of the MNIST dataset involving letters and digits and variety of test train split options. Each extension has the standard test/train data/labels nested under it as shown below.
+
+```julia
+using MLDatasets: EMNIST
+
+traindata = EMNIST.Balanced.traindata()
+testdata = EMNIST.Balanced.testdata()
+trainlabels = EMNIST.Balanced.trainlabels()
+testlabels = EMNIST.Balanced.testlabels()
+```
+
+Dataset | Classes | `traindata` | `trainlabels` | `testdata` | `testlabels` | `balanced classes`
+:------:|:-------:|:-------------:|:-------------:|:------------:|:------------:|:------------:
+**ByClass** | 62 | 697932x28x28 | 697932x1 | 116323x28x28 | 116323x1 | no
+**ByMerge** | 47 | 697932x28x28 | 697932x1 | 116323x28x28 | 116323x1 | no
+**Balanced** | 47 | 112800x28x28 | 112800x1 | 18800x28x28 | 18800x1 | yes
+**Letters** | 26 | 124800x28x28 | 124800x1 | 20800x28x28 | 208000x1 | yes
+**Digits** | 10 | 240000x28x28 | 240000x1 | 40000x28x28 | 40000x1 | yes
+**MNIST** | 10 | 60000x28x28 | 60000x1 | 10000x28x28 | 10000x1 | yes

docs/src/datasets/Iris.md

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# Iris
+
+```@docs
+Iris
+```
+
+## API reference
+
+```@docs
+Iris.features
+Iris.labels
+Iris.download
+```

docs/src/datasets/PTBLM.md

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+# PTBLM
+
+The `PTBLM` dataset consists of Penn Treebank sentences for
+language modeling, available from
+[tomsercu/lstm](https://github.com/tomsercu/lstm). The unknown
+words are replaced with `<unk>` so that the total vocabulary size
+becomes 10000.
+
+This is the first sentence of the PTBLM dataset.
+
+```julia
+x, y = PTBLM.traindata()
+
+x[1]
+> ["no", "it", "was", "n't", "black", "monday"]
+y[1]
+> ["it", "was", "n't", "black", "monday", "<eos>"]
+```
+
+where `MLDataset` adds the special word: `<eos>` to the end of `y`.
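
Note on the `x[1]` / `y[1]` pair in this file: in the sample shown, the target sentence is the input sentence shifted left by one token, with `<eos>` appended. A minimal sketch in plain Julia, assuming that shift-by-one rule holds for every sentence; the helper `shift_target` is hypothetical and is not a `PTBLM` function:

```julia
# Hypothetical helper: drop the first token and append the end-of-sentence marker.
shift_target(x::Vector{String}) = vcat(x[2:end], "<eos>")

# First training sentence, copied from the PTBLM.md example.
x1 = ["no", "it", "was", "n't", "black", "monday"]
y1 = shift_target(x1)
```

Here `y1` has the same length as `x1` and matches the `y[1]` listed in the documentation example.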

docs/src/datasets/UD_English.md

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+
+# UD English
+
+The [UD_English](https://github.com/UniversalDependencies/UD_English-EWT)
+Universal Dependencies English Web Treebank dataset is an annotated corpus of morphological features,
+POS-tags and syntactic trees. The dataset follows CoNLL-style format.
+
+```julia
+traindata = UD_English.traindata()
+devdata = UD_English.devdata()
+testdata = UD_English.devdata()
+```
+
+## Data Size
+
+| | Train x | Train y | Test x | Test y |
+|:--:|:-------:|:-------:|:------:|:------:|
+| **UD_English** | 12543 | - | 2077 | - |
