|
1 | 1 | # MLDatasets.jl
|
2 | 2 |
|
| 3 | +[](https://JuliaML.github.io/MLDatasets.jl/stable) |
3 | 4 | [](https://travis-ci.org/JuliaML/MLDatasets.jl)
|
4 | 5 |
|
5 |
| -`MLDatasets` provides access to common machine learning datasets |
6 |
| -for [Julia](http://julialang.org/). Currently, julia 0.6 is |
7 |
| -supported. |
| 6 | +This package represents a community effort to provide a common |
| 7 | +interface for accessing common Machine Learning (ML) datasets. In |
| 8 | +contrast to other data-related Julia packages, the focus of |
| 9 | +`MLDatasets.jl` is specifically on downloading, unpacking, and |
| 10 | +accessing benchmark datasets. Functionality for the purpose of |
| 11 | +data processing or visualization is only provided to a degree |
| 12 | +that is specific to a given dataset. |
8 | 13 |
|
9 |
| -## Installation |
10 |
| - |
11 |
| -```julia |
12 |
| -julia> Pkg.clone("https://github.com/JuliaML/MLDatasets.jl.git") |
13 |
| -``` |
| 14 | +This package is a part of the |
| 15 | +[`JuliaML`](https://github.com/JuliaML) ecosystem. Its |
| 16 | +functionality is built on top of the package |
| 17 | +[`DataDeps.jl`](https://github.com/oxinabox/DataDeps.jl). |
14 | 18 |
|
15 | 19 | ## Basic Usage
|
16 | 20 |
|
| 21 | +`MLDatasets.jl` is organized so that each dataset has its |
| 22 | +own dedicated sub-module. Where possible, those sub-modules share |
| 23 | +a common interface for interacting with the datasets. For example, |
| 24 | +you can load the training set and the test set of the MNIST |
| 25 | +database of handwritten digits using the following commands: |
| 26 | + |
17 | 27 | ```julia
|
18 | 28 | using MLDatasets
|
19 | 29 |
|
20 | 30 | train_x, train_y = MNIST.traindata()
|
21 |
| -test_x, test_y = MNIST.testdata() |
| 31 | +test_x, test_y = MNIST.testdata() |
22 | 32 | ```
|
23 | 33 |
|
24 |
| -Use `traindata(<directory>)` and `testdata(<directory>)` to change the default directory. |
| 34 | +To load the data, the package looks for the necessary files in |
| 35 | +various locations (see |
| 36 | +[`DataDeps.jl`](https://github.com/oxinabox/DataDeps.jl#configuration) |
| 37 | +for more information on how to configure such defaults). If the |
| 38 | +data can't be found in any of those locations, then the package |
| 39 | +will trigger a download dialog to `~/.julia/datadeps/MNIST`. To |
| 40 | +override this on a case-by-case basis, it is possible to specify |
| 41 | +a data directory directly in `traindata(dir = <directory>)` and |
| 42 | +`testdata(dir = <directory>)`. |
25 | 43 |
|
26 | 44 | ## Available Datasets
|
27 | 45 |
|
28 |
| -### Image Classification |
29 |
| - |
30 |
| -#### CIFAR-10 |
| 46 | +Check out the **[latest |
| 47 | +documentation](https://juliaml.github.io/MLDatasets.jl/latest)** |
31 | 48 |
|
32 |
| -The [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) |
33 |
| -dataset consists of 60000 32x32 RGB images in 10 classes. |
| 49 | +Additionally, you can make use of Julia's native docsystem. |
| 50 | +The following example shows how to get additional information |
| 51 | +on `MNIST.traintensor` within Julia's REPL: |
34 | 52 |
|
35 |
| -Take a look at the [sub-module](src/CIFAR10/README.md) for more |
36 |
| -information |
37 |
| - |
38 |
| -#### CIFAR-100 |
39 |
| - |
40 |
| -The [CIFAR-100](https://www.cs.toronto.edu/~kriz/cifar.html) |
41 |
| -dataset consists of 60000 32x32 color images in 100 classes. The |
42 |
| -100 classes are grouped into 20 superclasses (fine and coarse |
43 |
| -labels). |
44 |
| - |
45 |
| -Take a look at the [sub-module](src/CIFAR100/README.md) for more |
46 |
| -information |
47 |
| - |
48 |
| -#### MNIST |
49 |
| - |
50 |
| -The [MNIST](http://yann.lecun.com/exdb/mnist/) dataset consists |
51 |
| -of 60000 28x28 images of handwritten digits. |
| 53 | +```julia |
| 54 | +?MNIST.traintensor |
| 55 | +``` |
52 | 56 |
|
53 |
| -Take a look at the [sub-module](src/MNIST/README.md) for more |
54 |
| -information |
| 57 | +Each dataset has its own dedicated sub-module, so the |
| 58 | +documentation is organized per sub-module as well. Below is a |
| 59 | +list of available datasets and links to their |
| 60 | +documentation. |
55 | 61 |
|
56 |
| -#### Fashion-MNIST |
| 62 | +### Image Classification |
57 | 63 |
|
58 |
| -The [Fashion-MNIST](https://github.com/zalandoresearch/fashion-mnist) |
59 |
| -dataset consists of 60000 28x28 images of fashion products. It |
60 |
| -was designed to be a drop-in replacement for the MNIST dataset |
| 64 | +This package provides a variety of common benchmark datasets for |
| 65 | +the purpose of image classification. |
61 | 66 |
|
62 |
| -Take a look at the [sub-module](src/FashionMNIST/README.md) for more |
63 |
| -information |
| 67 | +Dataset | Classes | `traintensor` | `trainlabels` | `testtensor` | `testlabels` |
| 68 | +:------:|:-------:|:-------------:|:-------------:|:------------:|:------------: |
| 69 | +[**MNIST**](https://juliaml.github.io/MLDatasets.jl/datasets/MNIST/) | 10 | 28x28x60000 | 60000 | 28x28x10000 | 10000 |
| 70 | +[**FashionMNIST**](https://juliaml.github.io/MLDatasets.jl/datasets/FashionMNIST/) | 10 | 28x28x60000 | 60000 | 28x28x10000 | 10000 |
| 71 | +[**CIFAR-10**](https://juliaml.github.io/MLDatasets.jl/datasets/CIFAR10/) | 10 | 32x32x3x50000 | 50000 | 32x32x3x10000 | 10000 |
| 72 | +[**CIFAR-100**](https://juliaml.github.io/MLDatasets.jl/datasets/CIFAR100/) | 100 (20) | 32x32x3x50000 | 50000 (x2) | 32x32x3x10000 | 10000 (x2) |
64 | 73 |
|
65 | 74 | ### Language Modeling
|
66 | 75 |
|
@@ -102,10 +111,27 @@ testdata = UD_English.devdata()
|
102 | 111 |
|
103 | 112 | ## Data Size
|
104 | 113 | | | Type | Train x | Train y | Test x | Test y |
|
105 |
| -|:---:|:---:|:---:|:---:|:---:|:---:| |
106 |
| -| **CIFAR-10** | image | 32x32x3x50000 | 50000 | 32x32x3x10000 | 10000 | |
107 |
| -| **CIFAR-100** | image | 32x32x3x5000 | 50000 (x2) | 32x32x3x10000 | 10000 (x2) | |
108 |
| -| **MNIST** | image | 28x28x60000 | 60000 | 28x28x10000 | 10000 | |
109 |
| -| **FashionMNIST** | image | 28x28x60000 | 60000 | 28x28x10000 | 10000 | |
110 | 114 | | **PTBLM** | text | 42068 | 42068 | 3761 | 3761 |
|
111 | 115 | | **UD_English** | text | 12543 | - | 2077 | - |
|
| 116 | + |
| 117 | +## Installation |
| 118 | + |
| 119 | +To install `MLDatasets.jl`, start up Julia and type the following |
| 120 | +code snippet into the REPL. It makes use of the native Julia |
| 121 | +package manager. |
| 122 | + |
| 123 | +```julia |
| 124 | +Pkg.add("MLDatasets") |
| 125 | +``` |
| 126 | + |
| 127 | +Alternatively, for example if you run into unexpected issues or |
| 128 | +would like to contribute to the package, you can manually switch |
| 129 | +to the latest (untagged) version. |
| 130 | + |
| 131 | +```julia |
| 132 | +Pkg.checkout("MLDatasets") |
| 133 | +``` |
| 134 | + |
| 135 | +## License |
| 136 | + |
| 137 | +This code is free to use under the terms of the MIT license. |
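
A note on the size table in the README text above: each feature tensor lists the number of observations as its last dimension (for example `28x28x60000` next to `60000` labels). The following is a minimal sketch of a sanity check built on that convention. `nobs_checked` is a hypothetical helper, not part of `MLDatasets.jl`, and the small stand-in arrays are assumptions used in place of a real download.

```julia
# Hypothetical helper (not provided by MLDatasets.jl): returns the number of
# observations, assuming observations are indexed by the LAST dimension of `x`.
function nobs_checked(x::AbstractArray, y::AbstractVector)
    n = size(x, ndims(x))          # size along the last dimension
    n == length(y) || throw(DimensionMismatch(
        "features hold $n observations but there are $(length(y)) labels"))
    return n
end

# Stand-in arrays shaped like a tiny MNIST-style subset (5 images of 28x28).
x = zeros(Float32, 28, 28, 5)
y = zeros(Int, 5)
n = nobs_checked(x, y)
```

With the package installed, the same check could presumably be applied to the result of `MNIST.traindata()` as described in the README, which would first trigger the download dialog mentioned there.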