Description
We want to be able to declare some amount of data processing at the level of DataSets. This is very much related to users being able to declare layers of processing to help interpret a dataset (e.g. that a random set of bytes is actually a table in CSV format; cf. #17).
I would suggest fundamentally thinking about this at the level of dtypes (i.e. `File` and `FileTree` right now, but we want others like `Table` and `Image` too). That is, a processor is conceptually a function taking one dtype in and producing another one (possibly the same one). The implementation of each of these processors relies only on the abstract interface of the dtype (e.g. an IO stream for `File`, or the Tables.jl interface for a `Table` input).
A few examples:
- Decompressing a file is a `decompress(::File) -> File` operation.
- Same for decrypting: `decrypt(::File; key=...) -> File`. But sometimes you need to pass options as well, as we can't always automagically infer everything.
- Unpacking a tarball: `unpack(::File) -> FileTree`.
- Parsing a table or an image: `table(::File) -> Table` or `image(::File) -> Image`.
- Even accessing files from a `FileTree` is really just a `FileTree -> FileTree / File` operation.
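To make the shape of these processors concrete, here is a minimal sketch. The `File`/`FileTree` stand-ins and the `getfile` helper are invented for illustration; the real dtypes come from DataSets.jl, and a real `decompress` would use a codec package such as CodecZlib.jl:

```julia
# Stand-in dtypes for illustration only; the real File and FileTree
# are provided by DataSets.jl and expose richer interfaces.
struct File
    open::Function            # () -> IO
end

struct FileTree
    children::Dict{String,File}
end

# decompress(::File) -> File: the processor relies only on the
# abstract IO interface of File, not on where the bytes live.
using CodecZlib: GzipDecompressorStream

decompress(f::File) = File(() -> GzipDecompressorStream(f.open()))

# Accessing a file in a tree is itself a FileTree -> File operation:
getfile(t::FileTree, name::AbstractString) = t.children[name]
```

The key design point is that each processor is generic over storage: any backend that can produce an IO stream gets decompression for free.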
This quite naturally lends itself to forming a pipeline:

```julia
open(File, dataset("an_encrypted_tar_gz")) |> decrypt(key = ...) |> decompress |> unpack
```
I imagine such operations would be implemented in separate packages. They would depend on DataSets, and on any other packages providing dtypes. On the other hand, they would mostly be interface packages for other packages (e.g. DataSetTables.jl would probably depend on multiple tabular data format packages, like CSV.jl and Arrow.jl).
I also imagine that this pipeline would largely be implemented lazily, although that would be an implementation detail.
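One way to sketch that laziness (all names here are hypothetical, not part of DataSets): each pipeline step just records a thunk, and nothing is read until the end result is materialized:

```julia
# A lazy pipeline step wraps a deferred computation.
struct Lazy
    thunk::Function           # () -> result
end

materialize(l::Lazy) = l.thunk()

# Lift an eager processor into a lazy, pipeline-friendly step,
# so it composes with |> without doing any work up front.
lazily(f) = l -> Lazy(() -> f(materialize(l)))

# e.g. Lazy(() -> open(File, dataset("an_encrypted_tar_gz"))) |>
#          lazily(decompress) |> lazily(unpack)
```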
Declared layers
Once you have that general logic of transforming between dtypes, you can take advantage of this to implement the layers of #17. Each layer is just a call to one of these processors.
A question is how to declare this in the metadata (e.g. the TOML file). One possibility is to declare the Julia function as `processor = "DataSetTables.table"`. That function should then have a `(::File, config::Dict) -> Table` method, where `config` corresponds to (optional) configuration parameters defined in the metadata.
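As an illustration only (the field names and layout here are invented, not an agreed-on schema), such a layer declaration in the Data.toml might look like:

```toml
[[datasets]]
name = "my_csv_data"

# One declared layer: parse the raw File into a Table.
[[datasets.layers]]
processor = "DataSetTables.table"

    # Optional configuration passed to the processor as a Dict.
    [datasets.layers.config]
    delim = ","
    header = true
```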
For a first iteration, I wouldn't worry too much about code loading. It's up to the user to make sure they have the correct package / module in the Project.toml and loaded. At some point we could add package UUIDs and compat checks to the Data.toml.