Discussion regarding refactoring to include a larger variety of datasets

Currently this package includes functions to 1) download, 2) open, and 3) query two different datasets. If we want to expand the package to include a wider variety of data, step 3 is going to be difficult because it is hard to anticipate ahead of time how users will want to use the data.

So I would propose removing step 3 from the scope of the project, at least for the near term, until common workflows for querying datasets become apparent.

To make steps 1 and 2 more general, maybe the following setup would be useful:

```julia
export open, cache

abstract type GeoDataset end
abstract type NCDataset <: GeoDataset end
abstract type ZarrDataset <: GeoDataset end
abstract type ShapefileDataset <: GeoDataset end

# A concrete dataset carries whatever fields are needed to specify it.
struct GSHHG <: ShapefileDataset
    resolution::Int
    level::Int
end

# Make sure the file is available locally, then open it.
function Base.open(d::GeoDataset)
    cache(d)
    _open(d)
end

# Check whether the file is available locally and download it if not.
function cache(d::GeoDataset)
    # ...
end

# This function would go in an extension for when GeoDataFrames.jl or whatever is loaded.
function _open(d::ShapefileDataset)
    # ...
end
# ... etc.
```

In the example above there are two main functions, `open` and `cache`. `cache` would check whether the file is available locally and download it if not, and `open` would call `cache` and then open the file.

Each specific dataset would be a type including any fields necessary for configuration or specification, e.g. resolution and level as above, information about mirrors, authentication, etc.

I think the `cache` function could be relatively straightforward, with maybe some different branches for http vs s3, whether authentication is required, whether there's a possibility for hash checking, etc. I think the `cache` function should be exported to facilitate downloading data on HPC systems where only the login node has an internet connection.
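To make the idea concrete, here is a minimal sketch of what the plain-http branch of `cache` might look like, with optional hash checking. The `RemoteFile` type, its fields, and the `dir` and `downloader` keywords are all hypothetical names made up for illustration; s3 and authentication are left out entirely.

```julia
import Downloads          # standard library
using SHA: sha256         # standard library

# Hypothetical description of a single remote file (illustrative only).
struct RemoteFile
    url::String
    filename::String
    sha256sum::Union{String,Nothing}   # `nothing` if no checksum is published
end

# Return the local path of the file, downloading it only if it is missing.
# `downloader` is a keyword so the function can be tried without network access.
function cache(f::RemoteFile; dir::AbstractString, downloader=Downloads.download)
    path = joinpath(dir, f.filename)
    if !isfile(path)
        mkpath(dir)
        downloader(f.url, path)
    end
    if f.sha256sum !== nothing
        # Hash the file contents and compare with the expected hex digest.
        actual = bytes2hex(open(sha256, path))
        actual == f.sha256sum || error("checksum mismatch for $(f.filename)")
    end
    return path
end
```

On an HPC system the user would call something like `cache(dataset; dir=...)` on the login node, and the later `open` on a compute node would then find the file already in place and skip the download.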

The `_open` function is trickier because there will potentially be many different file formats, but I think the way forward would be to have an extension for each file format that provides a method for opening that type of file.
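As a rough sketch of how that dispatch could work: the base package might define a fallback `_open` that explains that no reader is loaded, and each extension would add a more specific method for its format type. The shapefile method below is only a stand-in that returns a string where a real extension would call an actual reader, and `ExampleNC` is a hypothetical dataset used to show the fallback.

```julia
abstract type GeoDataset end
abstract type NCDataset <: GeoDataset end
abstract type ShapefileDataset <: GeoDataset end

struct GSHHG <: ShapefileDataset
    resolution::Int
    level::Int
end

# Hypothetical dataset whose format has no reader loaded in this sketch.
struct ExampleNC <: NCDataset end

# Fallback in the base package: reached when no extension has added a
# more specific method for this dataset's format.
function _open(d::GeoDataset)
    error("no reader loaded for $(nameof(typeof(d))); load a package that can read this format")
end

# Stand-in for the method a shapefile extension would provide.
# A real extension would read the cached file here instead of returning a string.
_open(d::ShapefileDataset) = "shapefile reader called for $(nameof(typeof(d)))"
```

With this layout, `_open(GSHHG(2, 1))` dispatches to the shapefile method, while `_open(ExampleNC())` hits the fallback and tells the user what is missing instead of failing with a bare `MethodError`.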

The user would ultimately run `open(GSHHG(2,1))` to get a geodataframe with the GSHHG data.

Let me know what you think.

cc @simone-silvestri

