Discussion regarding refactoring to include a larger variety of datasets #45

@ctessum

Currently this package includes functions to 1) download, 2) open, and 3) query two different datasets. If we want to expand the package to include a wider variety of data, step 3 is going to be difficult because it is hard to anticipate how users will want to use the data ahead of time.

So I propose removing step 3 from the scope of the project, at least for the near term, until common workflows for querying datasets become apparent.

To make steps 1 and 2 more general, maybe the following setup would be useful:

export open, cache

abstract type GeoDataset end
abstract type NCDataset <: GeoDataset end
abstract type ZarrDataset <: GeoDataset end
abstract type ShapefileDataset <: GeoDataset end

struct GSHHG <: ShapefileDataset
    resolution::Int
    level::Int
end

# Make sure the file is available locally, then open it.
function Base.open(d::GeoDataset)
    cache(d)
    _open(d)
end

# Check whether the file is available locally and download it if not.
function cache(d::GeoDataset)
    # .....
end

# This function would go in an extension for when GeoDataFrames.jl or whatever is loaded.
function _open(d::ShapefileDataset)
    # .....
end
# ... etc

In the example above there are two main functions, open and cache. cache would check whether the file is available locally and download it if not, and open would then open the file.

Each specific dataset would be a type including any fields necessary for configuration or specification, e.g. resolution and level as above, information about mirrors, authentication, etc.
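As a rough sketch of what such a dataset type could look like, the block below adds a mirrors field to the GSHHG example. The mirrors field, the two-argument convenience constructor, and the cache_key helper are all hypothetical names chosen for illustration. They are not part of the proposal or of any existing package.

```julia
abstract type GeoDataset end
abstract type ShapefileDataset <: GeoDataset end

# Hypothetical dataset type: `resolution` and `level` come from the proposal,
# `mirrors` is an assumed extra configuration field.
struct GSHHG <: ShapefileDataset
    resolution::Int
    level::Int
    mirrors::Vector{String}   # candidate download locations
end

# Convenience constructor so that GSHHG(2, 1) still works, with no mirrors configured.
GSHHG(resolution::Integer, level::Integer) = GSHHG(resolution, level, String[])

# Hypothetical helper: a stable local name derived from the configuration,
# which a cache function could use to tell different configurations apart.
cache_key(d::GSHHG) = "gshhg_r$(d.resolution)_l$(d.level)"
```

Keeping the configuration in the fields means that two different configurations of the same dataset are distinct values, so they can be cached and opened independently.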

I think the cache function could be relatively straightforward, with maybe some different branches for http vs s3, whether authentication is required, whether there's a possibility for hash checking, etc. I think the cache function should be exported to facilitate downloading data on HPC systems where only the login node has an internet connection.
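A minimal sketch of the cache step, covering only the plain HTTP branch with an optional hash check, might look like the following. It uses the Downloads and SHA standard libraries. The url, filename, and expected_sha256 functions, the cache_dir location, and the dir keyword are assumptions made for this example, and s3 and authentication are left out.

```julia
using Downloads   # standard library downloader
using SHA         # standard library hashing

abstract type GeoDataset end

# Hypothetical interface that each dataset type would implement.
function url end        # where to download the file from
function filename end   # what to call the file in the cache
expected_sha256(::GeoDataset) = nothing   # `nothing` means no hash is published

# Assumed cache location; a real package might make this configurable.
cache_dir() = joinpath(homedir(), ".geodatasets_cache")

function cache(d::GeoDataset; dir::AbstractString = cache_dir())
    path = joinpath(dir, filename(d))
    if !isfile(path)
        # Only touch the network when the file is not already present.
        mkpath(dir)
        Downloads.download(url(d), path)
    end
    expected = expected_sha256(d)
    if expected !== nothing
        # Hash the file on disk and compare it with the published value.
        actual = bytes2hex(open(sha256, path))
        actual == expected || error("hash mismatch for $path")
    end
    return path
end
```

Because cache returns as soon as the file exists locally, running it once on a login node would be enough for later calls on compute nodes to succeed without a network connection, provided both see the same cache directory.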

The _open function is trickier because there will potentially be many different file formats, but I think the way forward would be to have an extension for each file format that provides a method to open that type of file.
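The dispatch side of this idea can be sketched without any real file reader. In the block below, the fallback method stands for what the core package would ship, and the ShapefileDataset method stands for what an extension would add once a shapefile reader is loaded. The error message and the NamedTuple return value are placeholders invented for this example, not the behavior of any existing package.

```julia
abstract type GeoDataset end
abstract type ShapefileDataset <: GeoDataset end
abstract type NCDataset <: GeoDataset end

struct GSHHG <: ShapefileDataset
    resolution::Int
    level::Int
end

# Fallback in the core package: reached when no format-specific method exists.
function _open(d::GeoDataset)
    error("no reader available for $(typeof(d)); load a package that supports its file format")
end

# In a real package this method would live in an extension module that is only
# loaded together with a shapefile reader. The NamedTuple is a stand-in for
# whatever table type that reader would return.
function _open(d::ShapefileDataset)
    return (format = :shapefile, dataset = d)
end
```

With this layout, adding a new file format only means adding one more _open method for the corresponding abstract type, and datasets whose format has no reader loaded fail with an explicit message rather than a MethodError.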

The user would ultimately run open(GSHHG(2,1)) to get a geodataframe with the GSHHG data.

Let me know what you think.

cc @simone-silvestri
