Replies: 18 comments
-
Possible solutions to the ticket:

Implicit anchor

One easy way to specify a file in a .zip file would be the same as an anchor in HTML, with #:
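For example (a sketch of the idea; the inner path is made up):

```json
{
  "name": "station-001",
  "path": "Sea-Bird_Processed_Data.zip#ctd/station_001.csv",
  "format": "csv"
}
```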
One problem is that the path or the file name then cannot contain a #. Not common, but it is a valid character. We could ignore this problem (assume that there is no # in a filename), or escape it (e.g. with \#) if it's part of the file name (but escaping might be confusing... and then we need to escape \ if it's in the filename, etc.).

Explicit anchor

Same idea as before but in a different field, such as:
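For example (a sketch; "innerpath" is only a placeholder name for whatever field the spec would define):

```json
{
  "name": "station-001",
  "path": "Sea-Bird_Processed_Data.zip",
  "innerpath": "ctd/station_001.csv",
  "format": "csv"
}
```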
Recursive resources

It might sound over-complicated but is at least worth considering: recursive resources. This captures all the data of both the outer and the inner resources. The .zip file is a resource (with its path, format, bytes, hash, etc.), and the .zip file contains a list of resources inside it, each with its own path, encoding, bytes, hash, etc. If one wanted, it could go deeper. E.g.:
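A sketch (all values made up):

```json
{
  "name": "sea-bird-processed-data",
  "path": "Sea-Bird_Processed_Data.zip",
  "format": "zip",
  "bytes": 1234567,
  "hash": "sha256:aaaa…",
  "resources": [
    {
      "name": "station-001",
      "path": "ctd/station_001.csv",
      "format": "csv",
      "encoding": "utf-8",
      "bytes": 54321,
      "hash": "sha256:bbbb…"
    }
  ]
}
```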
From the tooling point of view: it would be nice if the tools were able to handle this for at least a handful of compression/archiving formats, so that they could fetch the compressed file and extract the file that the resource refers to.
-
@cpina really interesting use case. Can you give me a sense of the underlying user story / job story, i.e. what do you think the end user wants to be able to do in the tooling? E.g. do you want to write this file to disk, or extract it, or present the information of what's inside to users ...
-
Sure thing @rufuspollock. I'll give some context as it might be useful for other issues or Discord comments, and it might be interesting to know what we are trying to do.

Our case is that SPI has data published in Zenodo (e.g. https://zenodo.org/record/3379590#.Xr2gRWhb_Z8) and we have a pilot with EMODNet to incorporate data into their portal. The imported data of the linked record looks like http://www.emodnet-physics.eu/Map/platinfo/PIGenericPlot.aspx?platformid=980308. For the first import, SPI documented what EMODNet needed regarding the structure of the files, descriptions, types of variables, etc. in text files. We started doing this last year, when we were not familiar with Frictionless Data. For our second dataset (and hopefully all the rest!) we created Frictionless Data packages (https://github.com/Swiss-Polar-Institute/frictionless-data-packages) and EMODNet is going to use them to automate the downloading and ingesting of the files into the database. (Side note: currently we added additional fields to the table schema, such as x_spi_cf_unit in https://github.com/Swiss-Polar-Institute/frictionless-data-packages/blob/master/10.5281_zenodo.3634411/tableschema.json, while we decide how we do the extensions in the other issue, frictionlessdata/datapackage#663.)

Long story short: some of the datasets contain zip files with different types of data inside (CSVs, readmes, different types of CSVs, etc.). We want to indicate to EMODNet that there is a zip file within the datapackage and describe the different tabular data and other resources within the zip file.

Tooling-wise: I'm not sure if EMODNet is using the Frictionless Data tooling or if they are writing their own downloader and JSON parser. I have the feeling that they write their own, but I can check at some point.
-
And why do you want to describe what's inside? So that EMODNet can use that info in some way? And if so, how?
-
Yes, we would like to use Frictionless Data schemas to describe what's in the published dataset. The published dataset might have .zip files with different types of files and/or different tabular data.

The main short-term goal is for EMODNet: they can ingest the tabular data into their storage knowing what's in each column (descriptions, the CF variables, types, etc.), so the final result looks like https://www.emodnet-physics.eu/Map/platinfo/PIGenericPlot.aspx?platformid=980308. It also makes it easier to compare different datasets on the platform (I think!) or locally. A "side effect" is to create a pool of Frictionless Data schemas for published data, and we might retroactively add Frictionless Data schemas to the datasets themselves.

How does EMODNet do this? My understanding is that their backend is ERDDAP: https://coastwatch.pfeg.noaa.gov/erddap/index.html. I haven't set this up, but it's open source software: https://coastwatch.pfeg.noaa.gov/erddap/download/setup.html

Frictionless Data could think of a .zip file as a "type of" directory (to a certain extent! as it has a hash and type by itself): a way to group many files. Sometimes it needs to be used for practical reasons: to reduce the total size, reduce the number of files, or make it easier to download (I think that Zenodo doesn't have a "download all").

As another example: today we started preparing another dataset (to be uploaded to Zenodo): 109 GB, 950K files (average file size is 120 KB). I'm not in charge of the publishing to Zenodo, but my understanding is that the files (not the readme) will be compressed and published in a .zip file, and at some point EMODNet might need the description of what's in each column.

I'm not sure if I explained what you asked or if I'm running in circles! :-)
-
OK, I think I get the idea. You basically want the manifest of the zip file.

For single files in a zip, note that we already have a pattern here: http://specs.frictionlessdata.io/patterns/#specification-3. For multi-file zips it is obviously more complicated.

BTW, does that 109 GB / 950k-file dataset consist of different types of files with different schemas, or lots of the same kind of file?
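If I recall that pattern correctly, a single compressed file is declared with a compression property on the resource, along these lines (values illustrative):

```json
{
  "name": "data",
  "path": "data.csv.gz",
  "format": "csv",
  "compression": "gz"
}
```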
-
About the 109GB/950K-file dataset: I'm afraid that it contains more than one type of file. It could be argued in this case (before depositing) that it's possible to re-pack it into a few zip files, each containing only one type of file. But someone might wonder why not deposit the original dataset as-is and adapt the tooling to it (for the provenance of the data, etc.); and repacking to have a .zip per data type is not possible or practical for already-published data.

When I was writing my answer last night, I wondered if wildcard matching (or regular expression matching?) of files in the dataset (inside the zip or outside) would be useful to indicate which table schema each file uses without repeating it. Then I thought of not adding more complexity here :-) and that it's convenient to have the hashes for each file.

BTW, I just remembered BagIt: https://tools.ietf.org/html/draft-kunze-bagit-14. It might be used together with Frictionless Data (I haven't checked BagIt for 4 years now) to keep file hashes, perhaps, separately.

BTW, regarding http://specs.frictionlessdata.io/patterns/#specification-3: I'm surprised to see .zip and .gz in the same category, since .zip can hold multiple files and .gz only one. I'll think more about this when I think about the local_path and remote_path.

Thanks for your answers!
-
Answering myself on why not to have wildcards/matching outside .zip files: files could not be downloaded (since the exact paths wouldn't be known), and if I had the data package, missing files would be... missed silently (for not having a full list with the hashes).
-
FIXED. In summary:
@cpina if you want to reopen please flag. |
-
I'd like to reopen it - I see it as quite essential for many datasets. Obviously I could always do something "off-the-spec"... Is there interest in the "Recursive resources" solution? If so, would it help if I try to draft something for the specs repo as a PR? (It would be in some weeks, when I can spend some time looking at the repo and hopefully drafting something.)
-
@cpina reopening; the way to start would be to draft a "pattern" for this, and we can iterate from there.
-
@rufuspollock: will do! For future reference, it will go there, I guess: https://github.com/frictionlessdata/specs/tree/master/patterns (I'll update this thread with the branch/PR, I guess).
-
@cpina @rufuspollock Creating a new pattern for this makes sense; the compressed resource pattern is focused on single-file compression scenarios, not multi-file compressed archives or multi-file archives in general.

It appears that the current requirements/restrictions for a valid Data Resource would also apply to a child resource, with the exception of the path, which looks like it needs to be a POSIX path relative to the root of the archive. A URL wouldn't make sense at the child resource level, because once an archive is extracted, a contained file's path is relative to the location of the extracted archive. With regard to the format property, I'm guessing this would just be set to the extension for the file's archive format (https://en.m.wikipedia.org/wiki/List_of_archive_formats)?

I think the Recursive Resources pattern should have some tie-in with the Compressed Resources pattern to clarify when one should be used over the other and when they could be used together:
Scenario 2 would address the problem discussed at frictionlessdata/datapackage#639, as mentioned there by @cpina. Happy to hear any thoughts.
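As an illustration of the two patterns working together (all names made up), a .tar.gz archive might set compression for the outer gzip layer and list the contained files as child resources:

```json
{
  "name": "bundle",
  "path": "bundle.tar.gz",
  "format": "tar",
  "compression": "gz",
  "resources": [
    { "name": "readme", "path": "README.txt", "format": "txt" },
    { "name": "measurements", "path": "data/measurements.csv", "format": "csv" }
  ]
}
```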
-
It looks like we are converging to an approach here, which is to have a resources array inside the resource. @cpina / @michaelamadi would you be up for drafting a pattern PR on the specs repo to recommend this approach?
-
@rufuspollock Happy to contribute to the drafting. @cpina If you can draft the core principles of the Recursive Resources pattern, I can extend it with the tie-in to the Compression of Resources pattern.

Side note: As the recursive resources approach can support an unlimited number of recursions, and therefore complexity, we'll need to explicitly define the maximum number of supported recursions in the pattern. Based on the discussion so far, capping this at one level of recursion wouldn't be unreasonable and would likely cover most use cases. Can either of you think of a compelling reason to set this higher?
-
@michaelamadi @rufuspollock: Level of recursion: files in a .tar.gz might perhaps be seen as a 2-level recursion (if we have the hash+size of the .tar.gz, then the hash+size of the .tar, and then the hash+size of the individual files). I'll think a bit more when writing the draft.
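A sketch of that reading (all values made up), with the .gz layer wrapping a .tar resource that in turn lists the individual files:

```json
{
  "name": "bundle",
  "path": "bundle.tar.gz",
  "format": "gz",
  "bytes": 1000,
  "hash": "sha256:aaaa…",
  "resources": [
    {
      "name": "bundle-tar",
      "path": "bundle.tar",
      "format": "tar",
      "bytes": 4096,
      "hash": "sha256:bbbb…",
      "resources": [
        {
          "name": "measurements",
          "path": "data/measurements.csv",
          "format": "csv",
          "bytes": 2048,
          "hash": "sha256:cccc…"
        }
      ]
    }
  ]
}
```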
-
@michaelamadi @rufuspollock Tomorrow I'll polish it a bit more. This is the first "Pattern" that I've written, so any feedback is welcome... feel free to wait one day so I can have another look. I really wanted to push a first release out :-)
-
@cpina I've left a few minor comments. Looking very good.
-
A data resource has a data location specified here:
https://specs.frictionlessdata.io/data-resource/#data-location
There are datasets where a series of files need to be bundled in a .zip file. This is for practical reasons: limitations in the publishing platforms, or just the way that the original data was provided.
One example in Zenodo: https://zenodo.org/record/3247384#.XrlxUmhb-gp
There is one file: https://zenodo.org/record/3247384/files/Sea-Bird_Processed_Data.zip
with a series of files inside.
So, the question is: how should the data location be used to refer to a file inside the .zip?
Note: obviously it doesn't need to be a .zip file; it could be any other archiving or compression format (.tar, .tar.bz2, .7z, etc.).
Solution (update from Rufus)
It looks like we are leaning towards having a resources array inside the resource.
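For example (a minimal sketch, not final syntax; the inner path is made up):

```json
{
  "path": "Sea-Bird_Processed_Data.zip",
  "format": "zip",
  "resources": [
    {
      "path": "ctd/station_001.csv",
      "format": "csv"
    }
  ]
}
```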