
Conversation

@roll (Contributor) commented May 27, 2020

@roll (Contributor, Author) commented May 27, 2020

Hi @akariv @cschloer,

It's my second take on this issue. The first attempt was #110

The new one uses tabulator's workbook_cache argument for XLSX which allows downloading every workbook only once - https://github.com/frictionlessdata/tabulator-py#xlsxlsx-read--write

I think dataflows is a better place for high-level logic such as sheet management and cache control, so my current idea is that tabulator just needs to provide enough low-level tools for dataflows to manage.
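
For reference, here is a minimal sketch of the approach; the URL and sheet names are placeholders, and the exact workbook_cache semantics are described in the README linked above:

from tabulator import Stream

# One shared dict lets tabulator reuse the downloaded workbook across sheets
# instead of fetching it again for every sheet (placeholder URL and sheets).
cache = {}
SOURCE = 'https://example.com/workbook.xlsx'

for sheet in ('Sheet1', 'Sheet2'):
    with Stream(SOURCE, sheet=sheet, workbook_cache=cache) as stream:
        print(sheet, stream.read(limit=5))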

@cschloer (Contributor) commented Jun 1, 2020

I will take a look!

@akariv (Member) commented Jun 13, 2020

Hey @roll -

Wouldn't it be better if there were a standard way to open a 'container' kind of file - for example, an Excel file or a Google Spreadsheet with multiple sheets, or a ZIP file with multiple files?
Then you would be able to iterate over them with a consistent API.

This implementation basically re-opens the Excel file for each of the sheets, reads a sample, infers a schema - and then checks whether the sheet name matches the sheets regular expression... we might have solved the re-downloading problem, but it's still very inefficient.

I'm thinking we could have a generic class similar to Stream, which would be Container:

>>> container = tabulator.Container('path/to/excel/file.xlsx')
# OR
>>> container = tabulator.Container('path/to/archive/file.zip')
>>> for item in container.iter():
...     print(item.name)     # could be a sheet name or a filename in the zip file
...     print(item.options)  # dict of options to be used in Stream, e.g. {"sheet": 1} or {"compression": "zip"}
...     stream = item.stream(**other_options)  # returns a Stream object

then you could do:

FILENAME = 'path/to/excel/file.xlsx'
Flow(*[
   load(FILENAME, headers=1, name='res-%d' % idx, **item.options)
   for idx, item in enumerate(Container(FILENAME).iter())
]).process()

@roll (Contributor, Author) commented Jun 14, 2020

@akariv
Yes, sure, I agree. And once I have time I will do something like this (we already have some container concepts like Package or Storage; maybe we can use them, or maybe a new one). BTW, I'm not sure the currently used technique adds a really noticeable overhead, as ZIP files should support proper random access.

Aside from implementation, what do you think the best API would be for DPP? For dataflows it's easy to imagine, but we also need it in a declarative form.

@akariv (Member) commented Jun 14, 2020

To answer the DPP question, off the top of my head: we could have a load_batch processor (in dataflows and in DPP) with the following parameter:

  • selection - list of REs / predicate functions to control filename/sheet selection

Internally it will iterate over the different parts (using the tabulator.Collection object! 😄) and generate a bunch of loads as necessary.

Example selections:

  • ["https://path/to/file.xls", "Expenses.+"]
    Select all expenses sheets from an excel file on the web
  • ["/path/to/file.zip", ".+\.csv"]
    Select all csv files in a zip file
  • ["https://path/to/file.zip", "report.+\.xlsx", "Raw Data"]
    Select the 'Raw Data' sheet in all report excel files in a zip file on the web

The resource name can be a slug of the filenames/sheet names of the different parts combined (we can complicate it later).
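
To make the idea concrete, here is a rough, hypothetical sketch of what such a load_batch could look like in dataflows, assuming the container API proposed above (neither load_batch nor tabulator.Collection exists yet; names and signatures are illustrative only):

import re
from dataflows import Flow, load

def load_batch(selection, **load_options):
    # selection[0] is the source; the remaining entries are regular
    # expressions applied to the item names inside the container.
    from tabulator import Collection  # the container API proposed above (not implemented yet)
    source, patterns = selection[0], selection[1:]
    steps = []
    for item in Collection(source).iter():
        # This sketch handles a single container level only; the nested
        # zip-of-workbooks case would need recursion. Resource-name slugging
        # as discussed above is also omitted.
        if all(re.search(pattern, item.name) for pattern in patterns):
            steps.append(load(source, name=item.name, **item.options, **load_options))
    return steps

# e.g. select all expenses sheets from an Excel file on the web:
Flow(*load_batch(["https://path/to/file.xls", "Expenses.+"], headers=1)).process()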

@roll (Contributor, Author) commented Jun 15, 2020

@akariv
Cool, it sounds good. I doubt I will have enough time to implement tabulator.Collection etc. before the end of the current stage of the BCO-DMO pilot, so I would suggest:

  • to agree on the load_batch API (@cschloer please take a look if it makes sense for you)
  • I will implement the first version of load_batch, probably using the current state of tabulator
  • we will switch to tabulator.Collection in the next iteration (only implementation-wise, so it will not be breaking for users)

WDYT?

@cschloer (Contributor) commented:

Our custom load processor already has something relatively akin to load_batch - it can take a comma-separated list of URLs or a regular expression URL (for local or S3), and then it just generates a load step for each resulting URL. But if your implementation improves the load times for multiple sheets within an XLSX file, I am happy to switch over to it. 👍
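
For illustration, the behavior described above (expanding a comma-separated list of URLs into individual load steps) might look roughly like this; the real processor's parameters and naming scheme are not shown here and may differ:

from dataflows import Flow, load

def expand_urls(url_param):
    # Split a comma-separated list of URLs into individual sources.
    return [url.strip() for url in url_param.split(',') if url.strip()]

steps = [
    load(url, name='res-%d' % idx)
    for idx, url in enumerate(expand_urls('path/to/a.csv,path/to/b.csv'))
]
Flow(*steps).process()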


Successfully merging this pull request may close these issues.

[dataflows] revisit loading multiple sheets with one load step
