Suggest to read datapackage.json if it is one of the resources #287

@peterdesmet

Zenodo now offers Data Package as an export format for the metadata (e.g. https://zenodo.org/records/10054230/export/datapackage). It includes the deposit metadata (contributors, license, etc.) and all files as resources. These resources are generic (with name, path, format, mimetype, bytes, hash): they are not specified as tabular (even if they are) and do not contain a schema.

For deposits that have a datapackage.json file, one of the resources listed will be that datapackage.json:

library(frictionless)
(p <- read_package("https://zenodo.org/records/10054230/export/datapackage"))
#> A Data Package with 22 resources:
#> • HG_OOSTENDE-acceleration-2017.csv.gz
#> • HG_OOSTENDE-gps-2013.csv.gz
#> • HG_OOSTENDE-gps-2019.csv.gz
#> • HG_OOSTENDE-gps-2021.csv.gz
#> • HG_OOSTENDE-acceleration-2016.csv.gz
#> • HG_OOSTENDE-gps-2017.csv.gz
#> • HG_OOSTENDE-gps-2016.csv.gz
#> • HG_OOSTENDE-acceleration-2022.csv.gz
#> • HG_OOSTENDE-acceleration-2020.csv.gz
#> • HG_OOSTENDE-acceleration-2021.csv.gz
#> • HG_OOSTENDE-acceleration-2014.csv.gz
#> • HG_OOSTENDE-acceleration-2018.csv.gz
#> • HG_OOSTENDE-acceleration-2019.csv.gz
#> • HG_OOSTENDE-acceleration-2013.csv.gz
#> • HG_OOSTENDE-gps-2014.csv.gz
#> • HG_OOSTENDE-acceleration-2015.csv.gz
#> • HG_OOSTENDE-gps-2015.csv.gz
#> • HG_OOSTENDE-gps-2018.csv.gz
#> • datapackage.json
#> • HG_OOSTENDE-gps-2022.csv.gz
#> • HG_OOSTENDE-gps-2020.csv.gz
#> • HG_OOSTENDE-reference-data.csv
#> For more information, see <https://doi.org/10.5281/zenodo.10054230>.
#> Use `unclass()` to print the Data Package as a list.

read_resource(p, "datapackage.json")
#> Error in `get_schema()` at frictionless-r/R/read_from_path.R:13:3:
#> ! Resource "datapackage.json" must have a profile property with value
#>   "tabular-data-resource".

datapackage_path <- frictionless:::get_resource(p, "datapackage.json")$path
read_package(datapackage_path)
#> A Data Package with 3 resources:
#> • reference-data
#> • gps
#> • acceleration
#> For more information, see <https://doi.org/10.5281/zenodo.10054230>.
#> Use `unclass()` to print the Data Package as a list.

It would be nice if read_package() could detect this and suggest that the user read that file instead.

p <- read_package("https://zenodo.org/records/10054230/export/datapackage")
#> ...
#> One of the listed resources is a "datapackage.json" which may describe
#> the resources in more detail. Read it with
#> `read_package("https://zenodo.org/records/10054230/files/datapackage.json")`.
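
A minimal sketch of the detection step, assuming the package is a plain list with a `resources` element whose entries have `name` and `path` (as in the output above). `find_datapackage_path()` is a hypothetical helper, not an existing frictionless function, and the list `p` is a reduced mock rather than the real export:

```r
# Hypothetical helper (not part of frictionless): return the path of a
# resource named "datapackage.json", or NULL if the package has none.
find_datapackage_path <- function(package) {
  for (resource in package$resources) {
    if (identical(resource$name, "datapackage.json")) {
      return(resource$path)
    }
  }
  NULL
}

# Mock of a Zenodo export, reduced to the fields used here.
p <- list(
  resources = list(
    list(
      name = "HG_OOSTENDE-reference-data.csv",
      path = "HG_OOSTENDE-reference-data.csv" # illustrative path only
    ),
    list(
      name = "datapackage.json",
      path = "https://zenodo.org/records/10054230/files/datapackage.json"
    )
  )
)

# Emit the suggestion only when such a resource is present.
path <- find_datapackage_path(p)
if (!is.null(path)) {
  message(sprintf(
    "One of the listed resources is a \"datapackage.json\". Read it with `read_package(\"%s\")`.",
    path
  ))
}
```

Returning NULL when nothing is found keeps the helper usable both for the message and for programmatic access.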

A message like this is a good first step, but it doesn't allow easy programmatic access. Some options for that:

  1. An attribute, which is NULL if there is no datapackage.json resource:

    p1 <- read_package("https://zenodo.org/records/10054230/export/datapackage")
    p1$resource_datapackage_path
    #> "https://zenodo.org/records/10054230/files/datapackage.json"
    p2 <- read_package(p1$resource_datapackage_path)

  2. Piping read_package(). If you pass a package to read_package(), it attempts to read the deeper datapackage.json, or returns the original package if none is found:

    read_package("https://zenodo.org/records/10054230/export/datapackage") |>
      read_package()

  3. A merge parameter that tries to merge the first (metadata) and second (resources) datapackage.json files. Note: there is no guarantee that the second one contains better resource info and worse metadata, but it is likely for Zenodo deposits.

    read_package(
      "https://zenodo.org/records/10054230/export/datapackage",
      merge = TRUE
    )
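
To make the merge option more concrete, here is a rough sketch under the simplest possible assumption: package-level metadata comes from the outer (Zenodo export) package and the resources come from the inner datapackage.json. `merge_packages()` is a hypothetical helper, the two lists are reduced mocks, and a real implementation would probably need finer rules per property:

```r
# Hypothetical helper (not part of frictionless): keep the package-level
# metadata of `outer` and replace its resources with those of `inner`.
merge_packages <- function(outer, inner) {
  merged <- outer
  # Direct assignment: utils::modifyList() would not replace an unnamed
  # list of resources.
  merged$resources <- inner$resources
  merged
}

# Mock of the Zenodo export: deposit metadata plus generic file resources.
outer <- list(
  id = "https://doi.org/10.5281/zenodo.10054230",
  resources = list(
    list(name = "datapackage.json"),
    list(name = "HG_OOSTENDE-reference-data.csv")
  )
)

# Mock of the deposited datapackage.json: resources described in detail.
inner <- list(
  resources = list(
    list(name = "reference-data", profile = "tabular-data-resource"),
    list(name = "gps", profile = "tabular-data-resource"),
    list(name = "acceleration", profile = "tabular-data-resource")
  )
)

merged <- merge_packages(outer, inner)
resource_names <- vapply(merged$resources, function(r) r$name, character(1))
```

One open point this sketch ignores is path resolution: relative resource paths in the inner file would still have to resolve against that file's location, not against the export URL.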

It would be good to investigate how other implementations do this. @roll how is this implemented in dpkit and/or Python?
