
Commit 36acddc

Switch LearnBase + MLDataPattern + DataLoaders -> MLUtils (#229)

* Switch LearnBase + MLDataPattern + DataLoaders -> MLUtils
* Remove unneeded dependencies
* Update docs to use MLUtils.jl

1 parent b60afa8 · commit 36acddc


44 files changed (+192 −435 lines)

CHANGELOG.md

Lines changed: 13 additions & 2 deletions

```diff
@@ -5,7 +5,18 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## v0.4.3
+## v0.5 (unreleased)
+
+### Changed
+
+- (BREAKING) Now uses [MLUtils.jl](https://github.com/JuliaML/MLUtils.jl) to create and load datasets and data containers
+- Replaces dependencies MLDataPattern.jl, LearnBase.jl, and DataLoaders.jl
+- Data containers must now implement the `Base.getindex`/`MLUtils.getobs` and `Base.length`/`MLUtils.numobs` interfaces.
+- Previously exported `MLDataPattern.datasubset` has been replaced by `MLUtils.ObsView`
+- Documentation has been updated appropriately
+
+
+## v0.4.3 (2022/05/14)
 
 ### Added
 
@@ -17,7 +28,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - the old APIs for registries have been removed and functionality for accessing them (`finddatasets`, `loaddataset`) has been deprecated. See the updated docs for how to find functionality using the new feature registries.
 
 
-## v0.4.2
+## v0.4.2 (2022/04/30)
 
 ### Added
 
```
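The breaking interface change above is mechanical in practice. As a hedged sketch (the `SquaresData` type below is hypothetical, not part of FastAI.jl), a custom data container now only needs the two standard `Base` methods, which MLUtils.jl's generic functions fall back to:

```julia
using MLUtils

# Hypothetical toy container: observation i is the pair (i, i^2)
struct SquaresData
    n::Int
end

# MLUtils.numobs falls back to Base.length,
# and MLUtils.getobs falls back to Base.getindex.
Base.length(data::SquaresData) = data.n
Base.getindex(data::SquaresData, i::Int) = (i, i^2)

data = SquaresData(10)
numobs(data)     # 10
getobs(data, 3)  # (3, 9)
```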

Project.toml

Lines changed: 4 additions & 12 deletions

```diff
@@ -4,29 +4,25 @@ authors = ["Lorenz Ohly", "Julia Community"]
 version = "0.4.3"
 
 [deps]
-Animations = "27a7e980-b3e6-11e9-2bcd-0b925532e340"
-BSON = "fbb218c0-5317-5bc6-957e-2ee96dd4b1f0"
 CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
 ColorVectorSpace = "c3611d14-8923-5661-9e6a-0046d554d3a4"
 Colors = "5ae59095-9a9b-59fe-a467-6f913c188581"
 DataAugmentation = "88a5189c-e7ff-4f85-ac6b-e6158070f02e"
 DataDeps = "124859b0-ceae-595e-8997-d05f6a7a8dfe"
 DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
-DataLoaders = "2e981812-ef13-4a9c-bfa0-ab13047b12a9"
 FeatureRegistries = "c6aefb4f-3ac3-4095-8805-528476b02c02"
 FileIO = "5789e2e9-d7fb-5bc7-8068-2c6fae9b9549"
 FilePathsBase = "48062228-2e41-5def-b9a4-89aafe57970f"
 FixedPointNumbers = "53c48c17-4a7d-5ca2-90c5-79b7896eea93"
 Flux = "587475ba-b771-5e3f-ad9e-33799f191a9c"
 FluxTraining = "7bf95e4d-ca32-48da-9824-f0dc5310474f"
-Glob = "c27321d9-0574-5035-807b-f59d2c89b15c"
 ImageIO = "82e4d734-157c-48bb-816b-45c225c6df19"
 ImageInTerminal = "d8c32880-2388-543b-8c61-d9f865259254"
 IndirectArrays = "9b13fd28-a010-5f03-acff-a1bbcff69959"
 InlineTest = "bd334432-b1e7-49c7-a2dc-dd9149e4ebd6"
 JLD2 = "033835bb-8acc-5ee8-8aae-3f567f8a3819"
-LearnBase = "7f8f8fb0-2700-5f03-b4bd-41f8cfc144b6"
-MLDataPattern = "9920b226-0b2a-5f5f-9153-9aa70a013f8b"
+MLDatasets = "eb30cadb-4394-5ae3-aed4-317e484a6458"
+MLUtils = "f1d291b0-491e-4a28-83b9-f70985020b54"
 Markdown = "d6f4376e-aef5-505a-96c1-9c027394607a"
 MosaicViews = "e94cdb99-869f-56ef-bcf0-1ae2bcbe0389"
 Parameters = "d96e819e-fc66-5662-9728-84c9c7592b0a"
@@ -45,29 +41,25 @@ UnicodePlots = "b8865327-cd53-5732-bb35-84acbb429228"
 Zygote = "e88e6eb3-aa80-5325-afca-941959d7151f"
 
 [compat]
-Animations = "0.4"
-BSON = "0.3"
 CSV = "0.8, 0.9, 0.10"
 ColorVectorSpace = "0.9"
 Colors = "0.12"
 DataAugmentation = "0.2.4"
 DataDeps = "0.7"
 DataFrames = "1"
-DataLoaders = "0.1"
 FeatureRegistries = "0.1"
 FileIO = "1.7"
 FilePathsBase = "0.9"
 FixedPointNumbers = "0.8"
 Flux = "0.12, 0.13"
 FluxTraining = "0.2, 0.3"
-Glob = "1"
 ImageIO = "0.6"
 ImageInTerminal = "0.4"
 IndirectArrays = "0.5, 1"
 InlineTest = "0.2"
 JLD2 = "0.4"
-LearnBase = "0.3, 0.4, 0.6"
-MLDataPattern = "0.5"
+MLDatasets = "0.7"
+MLUtils = "0.2.6"
 MosaicViews = "0.2, 0.3"
 Parameters = "0.12"
 PrettyTables = "1.2"
```

docs/Project.toml

Lines changed: 2 additions & 1 deletion

```diff
@@ -9,10 +9,11 @@ FilePathsBase = "48062228-2e41-5def-b9a4-89aafe57970f"
 Flux = "587475ba-b771-5e3f-ad9e-33799f191a9c"
 FluxTraining = "7bf95e4d-ca32-48da-9824-f0dc5310474f"
 ImageIO = "82e4d734-157c-48bb-816b-45c225c6df19"
-ImageMagick = "6218d12a-5da1-5696-b52f-db25d2ecc6d1"
 ImageShow = "4e3cecfd-b093-5904-9786-8bbb286a6a31"
 Images = "916415d5-f1e6-5110-898d-aaa5f9f070e0"
 JuliaSyntax = "70703baa-626e-46a2-a12c-08ffd08c73b4"
+MLDatasets = "eb30cadb-4394-5ae3-aed4-317e484a6458"
+MLUtils = "f1d291b0-491e-4a28-83b9-f70985020b54"
 ModuleInfo = "3c3ff5e7-c68c-4a09-80d1-9526a1e9878a"
 Pollen = "c88717ad-5130-4874-a664-5a9aba5ec443"
 StaticArrays = "90137ffa-7385-5640-81b9-e52037218182"
```

docs/background/datapipelines.md

Lines changed: 3 additions & 3 deletions

````diff
@@ -38,7 +38,7 @@ batchdata = batchviewcollated(taskdata, 16)
 NBATCHES = 200
 
 # sequential data iterator
-@time for (i, batch) in enumerate(getobs(batchdata, i) for i in 1:nobs(batchdata))
+@time for (i, batch) in enumerate(getobs(batchdata, i) for i in 1:numobs(batchdata))
     i != NBATCHES || break
 end
 
@@ -96,11 +96,11 @@ To find performance bottlenecks in the loading of each observation, you'll want
 ```julia
 using BenchmarkTools
 using FastAI
-using FastAI.Datasets
+using FastAI.Datasets, FastAI.MLUtils
 
 # Since loading times can vary per observation, we'll average the measurements over multiple observations
 N = 10
-data = datasubset(data, 1:N)
+data = MLUtils.ObsView(data, 1:N)
 
 # Time it takes to load an `(image, class)` observation
 @btime for i in 1:N
````
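As the hunks above show, `datasubset` becomes `MLUtils.ObsView` and `nobs` becomes `numobs`. A standalone sketch of the renamed calls (using a plain `Vector` as the data container, not the benchmark data from the docs):

```julia
using MLUtils

data = collect(1:100)

# ObsView wraps the container and indices lazily; nothing is copied
subset = ObsView(data, 1:10)

numobs(subset)     # 10
getobs(subset, 2)  # 2
```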

docs/data_containers.md

Lines changed: 4 additions & 4 deletions

````diff
@@ -16,7 +16,7 @@ using FastAI
 data, _ = load(findfirst(datarecipes(datasetid="imagenette2-160")))
 ```
 
-A data container is any type that holds observations of data and allows us to load them with `getobs` and query the number of observations with `nobs`. In this case, each observation is a tuple of an image and the corresponding class; after all, we want to use it for image classification.
+A data container is any type that holds observations of data and allows us to load them with `getobs` and query the number of observations with `numobs`. In this case, each observation is a tuple of an image and the corresponding class; after all, we want to use it for image classification.
 
 {cell=main}
 ```julia
@@ -27,7 +27,7 @@ image
 
 {cell=main}
 ```julia
-nobs(data)
+numobs(data)
 ```
 
 `load(`[`datasets`](#)`[id])` makes it easy to a load a data container that is compatible with some block types, but to get a better feel for what it does, let's look under the hood by creating the same data container using some mid-level APIs.
@@ -41,11 +41,11 @@ Before we recreate the data container, we'll download the dataset and get the pa
 dir = load(datasets()["imagenette2-160"])
 ```
 
-Now we'll start with [`FileDataset`](#) which creates a data container (here a `Vector`) of files given a path. We'll use the path of the downloaded dataset:
+Now we'll start with `loadfolderdata` which creates a data container (here a `Vector`) of files given a path. We'll use the path of the downloaded dataset:
 
 {cell=main}
 ```julia
-files = FileDataset(dir)
+files = loadfolderdata(dir)
 ```
 
 `files` is a data container where each observation is a path to a file. We'll confirm that using `getobs`:
````
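Container transformations compose the same way after the switch. A self-contained sketch using only MLUtils.jl (the string vector and `uppercase` stand in for the file container and an image-loading function):

```julia
using MLUtils

files = ["cat.png", "dog.png", "owl.png"]  # stand-in for a file container

# mapobs is lazy: `uppercase` only runs when an observation is loaded
images = mapobs(uppercase, files)

numobs(images)     # 3
getobs(images, 1)  # "CAT.PNG"
```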

docs/fastai_api_comparison.md

Lines changed: 10 additions & 10 deletions

```diff
@@ -1,6 +1,6 @@
 # fastai API comparison
 
-FastAI.jl is in many ways similar to the original Python [fastai](docs.fast.ai), but also has its differences. This reference goes through all the sections in the [fastai: A Layered API for Deep Learning](https://arxiv.org/abs/2002.04688) paper and comments what the interfaces for the same functionality in FastAI.jl are, and where they differ or functionality is still missing.
+FastAI.jl is in many ways similar to the original Python [fastai](http://docs.fast.ai), but also has its differences. This reference goes through all the sections in the [fastai: A Layered API for Deep Learning](https://arxiv.org/abs/2002.04688) paper and comments what the interfaces for the same functionality in FastAI.jl are, and where they differ or functionality is still missing.
 
 ## Applications
 
@@ -10,15 +10,16 @@ FastAI.jl additionally has a unified API for registering and discovering functio
 
 ### Vision
 
-Computer vision is the most developed part of FastAI.jl with good support for different tasks and optimized data pipelines with N-dimensional images, masks and keypoints. See the tutorial section for many examples.
+Computer vision is well-supported in FastAI.jl with different tasks and optimized data pipelines for N-dimensional images, masks and keypoints. See the tutorial section for many examples.
 
 ### Tabular
 
-Support for tabular data is merged into master but is lacking documentation which will come with the next release (0.2.0).
+FastAI.jl also has support for tabular data.
 
 ### Deployment
 
-Through FastAI.jl's [`LearningTask` interface](./learning_tasks.md), the data processing logic is decoupled from the dataset creation and training and can be easily serialized and loaded to make predictions. See the tutorial on [saving and loading models](../notebooks/serialization.ipynb).
+Through FastAI.jl's [`LearningTask`](#) interface, the data processing logic is decoupled from the dataset creation and training and can be easily serialized and loaded to make predictions. See the tutorial on [saving and loading models](../notebooks/serialization.ipynb).
+
 
 ---
 
@@ -76,8 +77,7 @@ res = lrfind(learner); plot(res)  # Run learning rate finder and plot suggestio
 Since it is a Julia package, FastAI.jl is not written on top of PyTorch, but a Julia library for deep learning: [Flux.jl](http://www.fluxml.ai). In any case, the point of this section is to note that the abstractions in fastai are decoupled and existing projects can easily be reused. This is also the case for FastAI.jl as it is built on top of several decoupled libraries. Many of these were built specifically for FastAI.jl, but they are unaware of each other and useful in their own right:
 
 - [Flux.jl](https://github.com/FluxML/Flux.jl) provides models, optimizers, and loss functions, fulfilling a similar role to PyTorch
-- [MLDataPattern.jl](https://github.com/JuliaML/MLDataPattern.jl) gives you tools for building and transforming data containers
-- [DataLoaders.jl](https://github.com/lorenzoh/DataLoaders.jl) takes care of efficient, parallelized iteration of data containers
+- [MLUtils.jl](https://github.com/JuliaML/MLUtils.jl) gives you tools for building and transforming data containers. Also, it takes care of efficient, parallelized iteration of data containers.
 - [DataAugmentation.jl](https://github.com/lorenzoh/DataAugmentation.jl) takes care of the lower levels of high-performance, composable data augmentations.
 - [FluxTraining.jl](https://github.com/lorenzoh/FluxTraining.jl) contributes a highly extensible training loop with 2-way callbacks
 
@@ -126,14 +126,14 @@ FastAI.jl makes all the same datasets available in `fastai.data.external` availa
 
 ### funcs_kwargs and DataLoader, fastai.data.core
 
-In FastAI.jl, you are not restricted to a specific type of data iterator and can pass any iterator over batches to `Learner`. In cases where performance is important [`DataLoader`](#) can speed up data iteration by loading and batching samples in parallel on background threads. All transformations of data happen through the data container interface which requires a type to implement `LearnBase.getobs` and `LearnBase.nobs`, similar to PyTorch's `torch.utils.data.Dataset`. Data containers are then transformed into other data containers. Some examples:
+In FastAI.jl, you are not restricted to a specific type of data iterator and can pass any iterator over batches to `Learner`. In cases where performance is important [`DataLoader`](#) can speed up data iteration by loading and batching samples in parallel on background threads. All transformations of data happen through the data container interface which requires a type to implement `Base.getindex`/`MLUtils.getobs` and `Base.length`/`MLUtils.numobs`, similar to PyTorch's `torch.utils.data.Dataset`. Data containers are then transformed into other data containers. Some examples:
 
 - [`mapobs`](#)`(f, data)` lazily maps a function `f` of over `data` such that `getobs(mapobs(f, data), idx) == f(getobs(data, idx))`. For example `mapobs(loadfile, files)` turns a vector of image files into a data container of images.
-- `DataLoader(data, batchsize)` is a wrapper around `batchviewcollated` which turns a data container of samples into one of collated batches and `eachobsparallel` which creates a parallel, buffered iterator over the observations (here batches) in the resulting container.
+- `DataLoader(data; batchsize)` is a wrapper around [`BatchView`](#) which turns a data container of samples into one of collated batches and `eachobsparallel` which creates a parallel, buffered iterator over the observations (here batches) in the resulting container.
 - [`groupobs`](#)`(f, data)` splits a container into groups using a grouping function `f`. For example, `groupobs(grandparentname, files)` creates training splits for files where the grandparent folder indicates the split.
-- [`datasubset`](#)`(data, idxs)` lazily takes a subset of the observations in `data`.
+- [`MLUtils.ObsView`](#)`(data, idxs)` lazily takes a subset of the observations in `data`.
 
-For more information, see the [data container tutorial](data_containers.md) and the [MLDataPattern.jl docs](https://mldatapatternjl.readthedocs.io/en/latest/). At a higher level, there are also convenience functions like [`FileDataset`](#) to create data containers.
+For more information, see the [data container tutorial](data_containers.md) and the [MLUtils.jl docs](https://juliaml.github.io/MLUtils.jl/dev/). At a higher level, there are also convenience functions like `loadfolderdata` to create data containers.
 
 ### Layers and architectures
```
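The keyword form `DataLoader(data; batchsize)` introduced above comes from MLUtils.jl. A minimal sketch with toy arrays (not FastAI-specific; observations live along the last array dimension):

```julia
using MLUtils

xs = rand(2, 6)  # 6 observations with 2 features each
ys = rand(6)

# Collated batches of 3: each `x` is 2×3, each `y` has length 3
for (x, y) in DataLoader((xs, ys); batchsize=3)
    @assert size(x) == (2, 3) && length(y) == 3
end
```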
docs/glossary.md

Lines changed: 2 additions & 2 deletions

```diff
@@ -6,7 +6,7 @@ Terms commonly used in *FastAI.jl*.
 
 In many docstrings, generic types are abbreviated with the following symbols. Many of these refer to a learning task; the context should make clear which task is meant.
 
-- `DC{T}`: A [data container](#data-container) of type T, meaning a type that implements the data container interface `getobs` and `nobs` where `getobs : (DC{T}, Int) -> Int`, that is, each observation is of type `T`.
+- `DC{T}`: A [data container](data_containers.md) of type T, meaning a type that implements the data container interface `getindex`/`getobs` and `length`/`numobs` where `getobs : (DC{T}, Int) -> Int`, that is, each observation is of type `T`.
 - `I`: Type of the unprocessed input in the context of a task.
 - `T`: Type of the target variable.
 - `X`: Type of the processed input. This is fed into a `model`, though it may be batched beforehand. `Xs` represents a batch of processed inputs.
@@ -23,7 +23,7 @@ Some examples of these in use:
 
 ### Data container
 
-A data structure that is used to load a number of data observations separately and lazily. It defines how many observations it holds with `nobs` and how to load a single observation with `getobs`.
+A data structure that is used to load a number of data observations separately and lazily. It defines how many observations it holds with `numobs` and how to load a single observation with `getobs`.
 
 ### Learning task
```
docs/introduction.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -33,7 +33,7 @@ ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"
 data, blocks = load(datarecipes()["imagenette2-160"])
 ```
 
-This line downloads and loads the [ImageNette](https://github.com/fastai/imagenette) image classification dataset, a small subset of ImageNet with 10 different classes. `data` is a [data container](data_containers.md) that can be used to load individual observations, here of images and the corresponding labels. We can use `getobs(data, i)` to load the `i`-th observation and `nobs` to find out how many observations there are.
+This line downloads and loads the [ImageNette](https://github.com/fastai/imagenette) image classification dataset, a small subset of ImageNet with 10 different classes. `data` is a [data container](data_containers.md) that can be used to load individual observations, here of images and the corresponding labels. We can use `getobs(data, i)` to load the `i`-th observation and `numobs` to find out how many observations there are.
 
 {cell=main }
 ```julia
````

docs/project.jl

Lines changed: 2 additions & 1 deletion

```diff
@@ -6,12 +6,13 @@ Crayons.COLORS[:nothing] = 67
 ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"
 
 using FastAI, Flux, FluxTraining
-import DataAugmentation
+import DataAugmentation, MLUtils
 m = FastAI
 ms = [
     DataAugmentation,
     Flux,
     FluxTraining,
+    MLUtils,
     m,
 ]
```

notebooks/how_to_visualize.ipynb

Lines changed: 1 addition & 1 deletion

```diff
@@ -106,7 +106,7 @@
 }
 ],
 "source": [
-"idxs = rand(1:nobs(data), 9)\n",
+"idxs = rand(1:numobs(data), 9)\n",
 "samples = [getobs(data, i) for i in idxs]\n",
 "xs, ys = makebatch(task, data, idxs)\n",
 "ŷs = gpu(model)(gpu(xs)) |> cpu"
```
