Replies: 18 comments
-
Possible solutions to the ticket:

Implicit anchor

One easy way to specify a file in a .zip file would be the same as an anchor in HTML, with #:
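For example (a sketch of the idea; the inner path is made up):

```json
{
  "name": "station-001",
  "path": "Sea-Bird_Processed_Data.zip#ctd/station_001.csv",
  "format": "csv"
}
```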
One problem is that the path or the file name then cannot contain a #. Not common, but it is a valid character. We could ignore this problem (assume that there is no # in a filename), or escape it (e.g. with \#) if it's part of the file name (but escaping might be confusing... and then we need to escape \ if it's in the filename, etc.).

Explicit anchor

Same idea as before but in a different field, such as:
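For example (a sketch; "innerpath" is only a placeholder name for whatever field the spec would define):

```json
{
  "name": "station-001",
  "path": "Sea-Bird_Processed_Data.zip",
  "innerpath": "ctd/station_001.csv",
  "format": "csv"
}
```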
Recursive resources

It might sound over-complicated but is at least worth considering: recursive resources. This captures all the data of both the outer and the inner resources. The .zip file is a resource (with its path, format, bytes, hash, etc.), and the .zip file contains a list of resources inside it, each with its own path, encoding, bytes, hash, etc. If one wanted, it could go deeper. E.g.:
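A sketch (all values made up):

```json
{
  "name": "sea-bird-processed-data",
  "path": "Sea-Bird_Processed_Data.zip",
  "format": "zip",
  "bytes": 1234567,
  "hash": "sha256:aaaa…",
  "resources": [
    {
      "name": "station-001",
      "path": "ctd/station_001.csv",
      "format": "csv",
      "encoding": "utf-8",
      "bytes": 54321,
      "hash": "sha256:bbbb…"
    }
  ]
}
```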
From the tooling point of view: it would be nice if the tools were able to handle this for at least a handful of compression/archiving formats, so that they could fetch the compressed file and extract the file that the resource refers to.
-
@cpina really interesting use case. Can you give me a sense of the underlying user story / job story, i.e. what do you think the end user wants to be able to do in the tooling? E.g. do you want to write this file to disk, or extract it, or present the information of what's inside to users ...
-
Sure thing @rufuspollock. I'll give some context as it might be useful for other issues or Discord comments, and it might be interesting to know what we are trying to do.

Our case is that SPI has data published in Zenodo (e.g. https://zenodo.org/record/3379590#.Xr2gRWhb_Z8) and we have a pilot with EMODNet to incorporate data into their portal. The imported data of the linked record looks like http://www.emodnet-physics.eu/Map/platinfo/PIGenericPlot.aspx?platformid=980308. For the first import, SPI documented what EMODNet needed regarding the structure of the files, descriptions, types of variables, etc. in text files. We started doing this last year, when we were not familiar with Frictionless Data. For our second dataset (and hopefully all the rest!) we created Frictionless Data packages (https://github.com/Swiss-Polar-Institute/frictionless-data-packages) and EMODNet is going to use them to automate the downloading and ingesting of the files into the database. (Side note: currently we added additional fields to the table schema, such as x_spi_cf_unit in https://github.com/Swiss-Polar-Institute/frictionless-data-packages/blob/master/10.5281_zenodo.3634411/tableschema.json, while we decide how we do the extensions in the other issue, frictionlessdata/datapackage#663.)

Long story short: some of the datasets contain zip files with different types of data inside (CSVs, readmes, different types of CSVs, etc.). We want to indicate to EMODNet that there is a zip file within the datapackage and describe the different tabular data and other resources within the zip file.

Tooling-wise: I'm not sure if EMODNet is using the Frictionless Data tooling or if they are writing their own downloader and JSON parser. I have the feeling that they write their own, but I can check at some point.
-
And why do you want to describe what's inside? So that EMODNet can use that info in some way? And if so, how?
-
Yes, we would like to use Frictionless Data schemas to describe what's in the published dataset. The published dataset might have .zip files with different types of files and/or different tabular data.

The main short-term goal is for EMODNet: they can ingest the tabular data into their storage knowing what's in each column (descriptions, the CF variables, types, etc.), so the final result looks like https://www.emodnet-physics.eu/Map/platinfo/PIGenericPlot.aspx?platformid=980308. It also makes it easier to compare different datasets on the platform (I think!) or locally. A "side effect" is to create a pool of Frictionless Data schemas for published data, and we might retroactively add Frictionless Data schemas to the datasets themselves.

How does EMODNet do this? My understanding is that their backend is ERDDAP: https://coastwatch.pfeg.noaa.gov/erddap/index.html. I haven't set this up, but it's open source software: https://coastwatch.pfeg.noaa.gov/erddap/download/setup.html

Frictionless Data could think of a .zip file as a "type of" directory (to a certain extent! as it has a hash and type by itself): a way to group many files. Sometimes it needs to be used for practical reasons: to reduce the total size, reduce the number of files, or make it easier to download (I think that Zenodo doesn't have a "download all").

As another example: today we started preparing another dataset (to be uploaded to Zenodo): 109 GB, 950K files (average file size is 120 KB). I'm not in charge of the publishing to Zenodo, but my understanding is that the files (not the readme) will be compressed and published in a .zip file, and at some point EMODNet might need the description of what's in each column.

I'm not sure if I explained what you asked or if I'm running in circles! :-)
-
OK, I think I get the idea. You basically want the manifest of the zip file.

For single files in a zip, note that we already have a pattern here: http://specs.frictionlessdata.io/patterns/#specification-3. For multi-file zips it is obviously more complicated.

BTW, does that 109 GB / 950k-file dataset consist of different types of files with different schemas, or lots of the same kind of file?
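If I recall that pattern correctly, a single compressed file is declared with a compression property on the resource, along these lines (values illustrative):

```json
{
  "name": "data",
  "path": "data.csv.gz",
  "format": "csv",
  "compression": "gz"
}
```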
-
About the 109GB/950K-file dataset: I'm afraid that it contains more than one type of file. It could be argued in this case (before depositing) that it's possible to re-pack it into a few zip files, each containing only one type of file. But someone might wonder why not deposit the original dataset as-is and adapt the tooling to it (for the provenance of the data, etc.); and repacking to have a .zip per data type is not possible or practical for already-published data.

When I was writing my answer last night, I wondered if wildcard matching (or regular expression matching?) of files in the dataset (inside the zip or outside) would be useful to indicate which table schema each file uses without repeating it. Then I thought of not adding more complexity here :-) and that it's convenient to have the hashes for each file.

BTW, I just remembered BagIt: https://tools.ietf.org/html/draft-kunze-bagit-14. It might be used together with Frictionless Data (I haven't checked BagIt for 4 years now) to keep file hashes, perhaps, separately.

BTW, regarding http://specs.frictionlessdata.io/patterns/#specification-3: I'm surprised to see .zip and .gz in the same category, since .zip can hold multiple files and .gz only one. I'll think more about this when I think about the local_path and remote_path.

Thanks for your answers!
-
Answering myself on why not to have wildcards/matching outside .zip files: files could not be downloaded (since the exact paths wouldn't be known), and if I had the data package, missing files would be... missed silently (for not having a full list with the hashes).
-
FIXED. In summary:
@cpina if you want to reopen please flag. |
-
I'd like to reopen it - I see it as quite essential for many datasets. Obviously I could always do something "off-the-spec"... Is there interest in the "Recursive resources" solution? If so, would it help if I try to draft something for the specs repo as a PR? (It would be in some weeks, when I can spend some time looking at the repo and hopefully drafting something.)
-
@cpina reopening; the way to start would be to draft a "pattern" for this, and we can iterate from there.
-
@rufuspollock: will do! For future reference, it will go there, I guess: https://github.com/frictionlessdata/specs/tree/master/patterns (I'll update this thread with the branch/PR, I guess).
-
@cpina @rufuspollock Creating a new pattern for this makes sense; the compressed resource pattern is focused on single-file compression scenarios, not multi-file compressed archives or multi-file archives in general.

It appears that the current requirements/restrictions for a valid Data Resource would also apply to a child resource, with the exception of the path, which looks like it needs to be a POSIX path relative to the root of the archive. A URL wouldn't make sense at the child resource level, because once an archive is extracted, a contained file's path is relative to the location of the extracted archive. With regard to the format property, I'm guessing this would just be set to the extension for the file's archive format (https://en.m.wikipedia.org/wiki/List_of_archive_formats)?

I think the Recursive Resources pattern should have some tie-in with the Compressed Resources pattern to clarify when one should be used over the other and when they could be used together:
Scenario 2 would address the problem discussed at frictionlessdata/datapackage#639, as mentioned there by @cpina. Happy to hear any thoughts.
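As an illustration of the two patterns working together (all names made up), a .tar.gz archive might set compression for the outer gzip layer and list the contained files as child resources:

```json
{
  "name": "bundle",
  "path": "bundle.tar.gz",
  "format": "tar",
  "compression": "gz",
  "resources": [
    { "name": "readme", "path": "README.txt", "format": "txt" },
    { "name": "measurements", "path": "data/measurements.csv", "format": "csv" }
  ]
}
```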
-
It looks like we are converging to an approach here, which is to have a resources array inside the resource. @cpina / @michaelamadi would you be up for drafting a pattern PR on the specs repo to recommend this approach?
-
@rufuspollock Happy to contribute to the drafting. @cpina If you can draft the core principles of the Recursive Resources pattern, I can extend it with the tie-in to the Compression of Resources pattern.

Side note: As the recursive resources approach can support an unlimited number of recursions, and therefore complexity, we'll need to explicitly define the maximum number of supported recursions in the pattern. Based on the discussion so far, capping this at one level of recursion wouldn't be unreasonable and would likely cover most use cases. Can either of you think of a compelling reason to set this higher?
-
@michaelamadi @rufuspollock: Level of recursion: files in a .tar.gz might perhaps be seen as a 2-level recursion (if we have the hash+size of the .tar.gz, then the hash+size of the .tar, and then the hash+size of the individual files). I'll think a bit more when writing the draft.
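A sketch of that reading (all values made up), with the .gz layer wrapping a .tar resource that in turn lists the individual files:

```json
{
  "name": "bundle",
  "path": "bundle.tar.gz",
  "format": "gz",
  "bytes": 1000,
  "hash": "sha256:aaaa…",
  "resources": [
    {
      "name": "bundle-tar",
      "path": "bundle.tar",
      "format": "tar",
      "bytes": 4096,
      "hash": "sha256:bbbb…",
      "resources": [
        {
          "name": "measurements",
          "path": "data/measurements.csv",
          "format": "csv",
          "bytes": 2048,
          "hash": "sha256:cccc…"
        }
      ]
    }
  ]
}
```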
-
@michaelamadi @rufuspollock Tomorrow I'll polish it a bit more. This is the first "Pattern" that I've written, so any feedback is welcome... feel free to wait one day so I can have another look. I really wanted to push a first release out :-)
-
@cpina I've left a few minor comments. Looking very good.
-
A data resource has a data location specified here:
https://specs.frictionlessdata.io/data-resource/#data-location
There are datasets where a series of files need to be bundled in a .zip file. This is for practical reasons: limitations in the publishing platforms, or just the way that the original data was provided.
One example in Zenodo: https://zenodo.org/record/3247384#.XrlxUmhb-gp
There is one file: https://zenodo.org/record/3247384/files/Sea-Bird_Processed_Data.zip
with a series of files inside.
So, the question is: how should the data location be used to refer to a file inside the .zip?
Note: obviously it doesn't need to be a .zip file; it could be any other archiving or compression format (.tar, .tar.bz2, .7z, etc.).
Solution (update from Rufus)
It looks like we are leaning towards having a resources array inside the resource.
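For example (a minimal sketch, not final syntax; the inner path is made up):

```json
{
  "path": "Sea-Bird_Processed_Data.zip",
  "format": "zip",
  "resources": [
    {
      "path": "ctd/station_001.csv",
      "format": "csv"
    }
  ]
}
```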