|
1 | 1 | # MLDatasets.jl
|
2 | 2 |
|
3 |
| -_This package represents a community effort to provide a common |
4 |
| -interface for accessing common Machine Learning (ML) datasets. In |
5 |
| -contrast to other data-related Julia packages, the focus of |
6 |
| -`MLDatasets.jl` is specifically on downloading, unpacking, and |
7 |
| -accessing benchmark dataset. Functionality for the purpose of |
8 |
| -data processing or visualization is only provided to a degree |
9 |
| -that is special to some dataset._ |
10 |
| - |
11 |
| -| **Package Status** | **Build Status** | |
| 3 | +| **Documentation** | **Build Status** | |
12 | 4 | |:------------------:|:-----------------:|
|
13 |
| -| [](LICENSE.md) [](https://JuliaML.github.io/MLDatasets.jl/stable) | [](https://github.com/JuliaML/MLDatasets.jl/actions)| |
| 5 | +| [![Docs][docs-stable-img]][docs-stable-url] [![Docs][docs-latest-img]][docs-latest-url] | [](https://github.com/JuliaML/MLDatasets.jl/actions)| |
14 | 6 |
|
15 |
| -This package is a part of the |
16 |
| -[`JuliaML`](https://github.com/JuliaML) ecosystem. Its |
17 |
| -functionality is build on top of the package |
18 |
| -[`DataDeps.jl`](https://github.com/oxinabox/DataDeps.jl). |
19 |
| - |
20 |
| -## Introduction |
| 7 | +[docs-stable-img]: https://img.shields.io/badge/docs-stable-blue.svg |
| 8 | +[docs-latest-img]: https://img.shields.io/badge/docs-latest-blue.svg |
| 9 | +[docs-stable-url]: https://JuliaML.github.io/MLDatasets.jl/stable |
| 10 | +[docs-latest-url]: https://JuliaML.github.io/MLDatasets.jl/latest |
21 | 11 |
|
22 |
| -The way `MLDatasets.jl` is organized is that each dataset has its |
23 |
| -own dedicated sub-module. Where possible, those sub-module share |
24 |
| -a common interface for interacting with the datasets. For example |
25 |
| -you can load the training set and the test set of the MNIST |
26 |
| -database of handwritten digits using the following commands: |
| 12 | +This package represents a community effort to provide a common interface for accessing common Machine Learning (ML) datasets. |
| 13 | +In contrast to other data-related Julia packages, the focus of `MLDatasets.jl` is specifically on downloading, unpacking, and accessing benchmark datasets. |
| 14 | +Functionality for data processing or visualization is provided only to the extent that it is specific to a particular dataset. |
27 | 15 |
|
28 |
| -```julia |
29 |
| -using MLDatasets |
30 |
| - |
31 |
| -train_x, train_y = MNIST.traindata() |
32 |
| -test_x, test_y = MNIST.testdata() |
33 |
| -``` |
| 16 | +This package is a part of the |
| 17 | +[`JuliaML`](https://github.com/JuliaML) ecosystem. |
| 18 | +Its functionality is built on top of the package |
| 19 | +[`DataDeps.jl`](https://github.com/oxinabox/DataDeps.jl). |
34 | 20 |
|
35 |
| -To load the data the package looks for the necessary files in |
36 |
| -various locations (see |
37 |
| -[`DataDeps.jl`](https://github.com/oxinabox/DataDeps.jl#configuration) |
38 |
| -for more information on how to configure such defaults). If the |
39 |
| -data can't be found in any of those locations, then the package |
40 |
| -will trigger a download dialog to `~/.julia/datadeps/MNIST`. To |
41 |
| -overwrite this on a case by case basis, it is possible to specify |
42 |
| -a data directory directly in `traindata(dir = <directory>)` and |
43 |
| -`testdata(dir = <directory>)`. |
44 | 21 |
|
45 | 22 | ## Available Datasets
|
46 | 23 |
|
47 |
| -Check out the **[latest |
48 |
| -documentation](https://juliaml.github.io/MLDatasets.jl/latest)** |
49 |
| - |
50 |
| -Additionally, you can make use of Julia's native docsystem. |
51 |
| -The following example shows how to get additional information |
52 |
| -on `MNIST.traintensor` within Julia's REPL: |
53 |
| - |
54 |
| -```julia |
55 |
| -?MNIST.traintensor |
56 |
| -``` |
57 |
| - |
58 |
| -Each dataset has its own dedicated sub-module. As such, it makes |
59 |
| -sense to document their functionality similarly distributed. Find |
60 |
| -below a list of available datasets and links to their their |
61 |
| -documentation. |
62 |
| - |
63 |
| -### Image Classification |
64 |
| - |
65 |
| -This package provides a variety of common benchmark datasets for |
66 |
| -the purpose of image classification. |
67 |
| - |
68 |
| -Dataset | Classes | `traintensor` | `trainlabels` | `testtensor` | `testlabels` |
69 |
| -:------:|:-------:|:-------------:|:-------------:|:------------:|:------------: |
70 |
| -[**MNIST**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/MNIST/) | 10 | 28x28x60000 | 60000 | 28x28x10000 | 10000 |
71 |
| -[**FashionMNIST**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/FashionMNIST/) | 10 | 28x28x60000 | 60000 | 28x28x10000 | 10000 |
72 |
| -[**CIFAR-10**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CIFAR10/) | 10 | 32x32x3x50000 | 50000 | 32x32x3x10000 | 10000 |
73 |
| -[**CIFAR-100**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CIFAR100/) | 100 (20) | 32x32x3x50000 | 50000 (x2) | 32x32x3x10000 | 10000 (x2) |
74 |
| -[**SVHN-2**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/SVHN2/) (*) | 10 | 32x32x3x73257 | 73257 | 32x32x3x26032 | 26032 |
75 |
| - |
76 |
| -(*) Note that the SVHN-2 dataset provides an additional 531131 observations aside from the training- and testset |
77 |
| - |
78 |
| -[**EMNIST**](https://www.nist.gov/itl/products-and-services/emnist-dataset) packages 6 different extensions of the MNIST dataset involving letters and digits and variety of test train split options. Each extension has the standard test/train data/labels nested under it as shown below. |
79 |
| - |
80 |
| -```julia |
81 |
| -traindata = EMNIST.Balanced.traindata() |
82 |
| -testdata = EMNIST.Balanced.testdata() |
83 |
| -trainlabels = EMNIST.Balanced.trainlabels() |
84 |
| -testlabels = EMNIST.Balanced.testlabels() |
85 |
| -``` |
86 |
| - |
87 |
| -Dataset | Classes | `traindata` | `trainlabels` | `testdata` | `testlabels` | `balanced classes` |
88 |
| -:------:|:-------:|:-------------:|:-------------:|:------------:|:------------:|:------------: |
89 |
| -**ByClass** | 62 | 697932x28x28 | 697932x1 | 116323x28x28 | 116323x1 | no |
90 |
| -**ByMerge** | 47 | 697932x28x28 | 697932x1 | 116323x28x28 | 116323x1 | no |
91 |
| -**Balanced** | 47 | 112800x28x28 | 112800x1 | 18800x28x28 | 18800x1 | yes |
92 |
| -**Letters** | 26 | 124800x28x28 | 124800x1 | 20800x28x28 | 208000x1 | yes |
93 |
| -**Digits** | 10 | 240000x28x28 | 240000x1 | 40000x28x28 | 40000x1 | yes |
94 |
| -**MNIST** | 10 | 60000x28x28 | 60000x1 | 10000x28x28 | 10000x1 | yes |
| 24 | +Each dataset has its own dedicated sub-module. |
| 25 | +Find below a list of available datasets and links to their documentation. |
95 | 26 |
|
96 |
| -### Misc. Datasets |
| 27 | +#### Vision |
| 28 | + - [CIFAR10](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CIFAR10/) |
| 29 | + - [CIFAR100](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CIFAR100/) |
| 30 | + - [EMNIST](https://juliaml.github.io/MLDatasets.jl/latest/datasets/EMNIST/) |
| 31 | + - [FashionMNIST](https://juliaml.github.io/MLDatasets.jl/latest/datasets/FashionMNIST/) |
| 32 | + - [MNIST](https://juliaml.github.io/MLDatasets.jl/latest/datasets/MNIST/) |
| 33 | + - [SVHN2](https://juliaml.github.io/MLDatasets.jl/latest/datasets/SVHN2/) |
97 | 34 |
|
98 |
| -Dataset | Classes | `traintensor` | `trainlabels` | `testtensor` | `testlabels` |
99 |
| -:------:|:-------:|:-------------:|:-------------:|:------------:|:------------: |
100 |
| -**Iris** | 3 | 4x150 | 150 | - | - |
101 |
| -**BostongHousing** | - | 13x506 | 1x506 | - | - |
102 | 35 |
|
103 |
| -### Language Modeling |
| 36 | +#### Miscellaneous |
| 37 | + - [BostonHousing](https://juliaml.github.io/MLDatasets.jl/latest/datasets/BostonHousing/) |
| 38 | + - [Iris](https://juliaml.github.io/MLDatasets.jl/latest/datasets/Iris/) |
104 | 39 |
|
105 |
| -| | Train x | Train y | Test x | Test y | |
106 |
| -|:--:|:-------:|:-------:|:------:|:------:| |
107 |
| -| **PTBLM** | 42068 | 42068 | 3761 | 3761 | |
108 |
| -| **UD_English** | 12543 | - | 2077 | - | |
109 | 40 |
|
110 |
| -#### PTBLM |
| 41 | +#### Text |
| 42 | + - [PTBLM](https://juliaml.github.io/MLDatasets.jl/latest/datasets/PTBLM/) |
| 43 | + - [UD_English](https://juliaml.github.io/MLDatasets.jl/latest/datasets/UD_English/) |
111 | 44 |
|
112 |
| -The `PTBLM` dataset consists of Penn Treebank sentences for |
113 |
| -language modeling, available from |
114 |
| -[tomsercu/lstm](https://github.com/tomsercu/lstm). The unknown |
115 |
| -words are replaced with `<unk>` so that the total vocabulary size |
116 |
| -becomes 10000. |
| 45 | +#### Graphs |
| 46 | + - To be added. |
117 | 47 |
|
118 |
| -This is the first sentence of the PTBLM dataset. |
| 48 | +#### Audio |
| 49 | + - To be added. |
119 | 50 |
|
120 |
| -```julia |
121 |
| -x, y = PTBLM.traindata() |
122 |
| - |
123 |
| -x[1] |
124 |
| -> ["no", "it", "was", "n't", "black", "monday"] |
125 |
| -y[1] |
126 |
| -> ["it", "was", "n't", "black", "monday", "<eos>"] |
127 |
| -``` |
128 |
| - |
129 |
| -where `MLDataset` adds the special word: `<eos>` to the end of `y`. |
130 |
| - |
131 |
| -### Text Analysis (POS-Tagging, Parsing) |
132 |
| - |
133 |
| -#### UD English |
134 |
| - |
135 |
| -The [UD_English](https://github.com/UniversalDependencies/UD_English-EWT) |
136 |
| -Universal Dependencies English Web Treebank dataset is an annotated corpus of morphological features, |
137 |
| -POS-tags and syntactic trees. The dataset follows CoNLL-style |
138 |
| -format. |
139 |
| - |
140 |
| -```julia |
141 |
| -traindata = UD_English.traindata() |
142 |
| -devdata = UD_English.devdata() |
143 |
| -testdata = UD_English.devdata() |
144 |
| -``` |
145 |
| - |
146 |
| -## Documentation |
147 |
| - |
148 |
| -Check out the **[latest |
149 |
| -documentation](https://JuliaML.github.io/MLDatasets.jl/stable)** |
150 |
| - |
151 |
| -Additionally, you can make use of Julia's native docsystem. |
152 |
| -The following example shows how to get additional information |
153 |
| -on `MNIST.convert2image` within Julia's REPL: |
154 |
| - |
155 |
| -```julia |
156 |
| -?MNIST.convert2image |
157 |
| -``` |
158 |
| -``` |
159 |
| - convert2image(array) -> Array{Gray} |
160 |
| -
|
161 |
| - Convert the given MNIST horizontal-major tensor (or feature matrix) to a vertical-major Colorant array. The values are also color corrected according to |
162 |
| - the website's description, which means that the digits are black on a white background. |
163 |
| -
|
164 |
| - julia> MNIST.convert2image(MNIST.traintensor()) # full training dataset |
165 |
| - 28×28×60000 Array{Gray{N0f8},3}: |
166 |
| - [...] |
167 |
| -
|
168 |
| - julia> MNIST.convert2image(MNIST.traintensor(1)) # first training image |
169 |
| - 28×28 Array{Gray{N0f8},2}: |
170 |
| - [...] |
171 |
| -``` |
172 | 51 |
|
173 | 52 | ## Installation
|
174 | 53 |
|
175 |
| -To install `MLDatasets.jl`, start up Julia and type the following |
176 |
| -code snippet into the REPL. It makes use of the native Julia |
| 54 | +To install `MLDatasets.jl`, start up Julia and type the following code snippet into the REPL. |
| 55 | +It makes use of the native Julia |
177 | 56 | package manager.
|
178 | 57 |
|
179 | 58 | ```julia
|
|