Using frictionlessdata data package for web archive data package (WACZ format) #589
-
I wanted to reach out to the frictionless data community to share that we (https://github.com/webrecorder, https://webrecorder.net/) are working on a new packaging format to store web archive data, and are planning to adopt frictionless datapackage as the basis for this format!

The current spec and goals for this format are described here so far (not yet using frictionless data):

The goal is to create a portable format for archived web content that can be distributed and can carry additional useful metadata about the web archives. The format is also designed to be accessed in chunks, taking advantage of the ZIP format to allow reading only part of a file. This allows very large web archives to be loaded on demand in the browser, accessing only the portions that are needed. (An example of this can be seen here: https://webrecorder.net/embed-demo-2.html)

The bulk of the raw web archive data will use existing formats, such as WARC, and we are packaging them up in a ZIP file, along with additional data and metadata. We're calling this new format WACZ (Web Archive Collection Zipped).

We are looking at table schemas to describe more typical tabular data, such as the list of pages in a web archive. We are starting to use the https://github.com/frictionlessdata/frictionless-py package for part of the validation and creation work for WACZ. The WACZ file will also be read directly in the browser, so we'll have to support it in JavaScript as well as Python.

We'll open specific questions if we run into any issues, but just wanted to start this issue here to say that we're looking forward to adopting frictionless datapackage for this use case. Let us know if there are any comments, suggestions, or feedback on what we're trying to do with frictionlessdata and web archives!
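To make the idea above more concrete, here is a minimal sketch of a ZIP that carries a `datapackage.json` descriptor next to a page list and a WARC file, and of reading only the descriptor back out. It uses just the Python standard library rather than frictionless-py. The member names (`pages.csv`, `archive/data.warc`), the field names, and the helper functions are assumptions made for illustration; they are not taken from the WACZ spec.

```python
import io
import json
import zipfile

# Hypothetical descriptor: resource and field names are illustrative only,
# not taken from the WACZ spec.
descriptor = {
    "name": "example-web-archive",
    "resources": [
        {
            "name": "pages",
            "path": "pages.csv",
            # Table Schema describing the tabular page list
            "schema": {
                "fields": [
                    {"name": "url", "type": "string"},
                    {"name": "title", "type": "string"},
                    {"name": "timestamp", "type": "datetime"},
                ]
            },
        },
        {"name": "archive", "path": "archive/data.warc", "format": "warc"},
    ],
}


def build_package(descriptor, pages_csv, warc_bytes):
    """Write descriptor, page list, and raw WARC data into one in-memory ZIP."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("datapackage.json", json.dumps(descriptor, indent=2))
        zf.writestr("pages.csv", pages_csv)
        # Stored (uncompressed) so the WARC bytes sit in the ZIP unchanged.
        zf.writestr("archive/data.warc", warc_bytes,
                    compress_type=zipfile.ZIP_STORED)
    return buf.getvalue()


def read_descriptor(package_bytes):
    """Read only datapackage.json; the other members are not decompressed."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        return json.loads(zf.read("datapackage.json"))


def member_size(package_bytes, name):
    """Look up a member's size from the ZIP directory without reading its data."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        return zf.getinfo(name).file_size


package = build_package(
    descriptor,
    "url,title,timestamp\nhttps://example.com/,Example,2020-11-01T00:00:00Z\n",
    b"WARC/1.0\r\n",
)
loaded = read_descriptor(package)
print([r["name"] for r in loaded["resources"]])
print(member_size(package, "archive/data.warc"))
```

Because the ZIP directory records where each member lives, a reader can fetch the descriptor or the page list first and pull WARC data only when it is needed, which is the same property that makes on-demand loading in the browser possible.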
Replies: 7 comments
-
Hi @ikreymer that's awesome - thanks for sharing! Maybe we could have you present a little bit about your plans/project at an upcoming community call? Also, we have a community chat in discord (https://discord.com/invite/j9DNFNw) in case you want to join there too. That chat is a good place to ask questions & get feedback from other users. And please do open issues if you run into any problems! Your project sounds really cool :-)
-
Great! I signed up for the upcoming call and got a calendar invite!
-
Awesome @ikreymer! At our upcoming call, we are going to have a presentation by another user group, so there might not be time to hear about your work in depth. Would you be interested in giving a very brief overview (a minute or two) then, and doing a deeper dive at our next call (17 Dec) or our January call?
-
Sure, that sounds perfect! We can introduce our work on the next call and prepare something longer for the following one.
-
@sglavoie could you please add Ilya's project to the agenda for the December community call? Thanks!
-
@lwinfree Noted! 🙂
-
Ilya presented this work during the December 2020 community call. Watch the video: https://frictionlessdata.io/blog/2020/12/17/december-virtual-hangout