pure-numpy interface to parquet #931

Draft
martindurant wants to merge 27 commits into dask:main from martindurant:faster

Conversation

@martindurant
Member

Due to the upcoming hard dependence of pandas on pyarrow, this branch investigates what it would look like to have a fastparquet that avoids pandas altogether and deals with numpy arrays alone. For complex columns, the representation will be similar to and compatible with awkward/arrow buffers, but will not require those packages.

@yohplala

Hi @martindurant
I have seen your comment in #935:

Output will be an iterator over row-groups, with dictionaries giving the positions in the schema, or a light structured wrapper, something like:

{0: {
  'foo.with.strings-data': array([0, 1, -1], dtype=int8),
  'foo.with.strings-cats': ["hey", "there"],
  'foo.with.ints-data': array([1, 2, 3], dtype=uint8),
  'foo.with.lists.list-offsets': array([0, 1, 2, 3]),
  'foo.with.lists.list.element-data': array([0, 0, 0], dtype=uint8),
  'foo.with.lists.list.element-cats': [0]}
}

'foo.with.strings-data' appears to be a column name, right?
But what is the 0 key? The ID of the row group? (the arrays do not all have the same length, so I am not sure what it refers to)

I am also curious to know: what will the input to the general write() function be?
A similar dictionary providing the corresponding data per column?

Thank you for your feedback!

@martindurant
Member Author

'foo.with.strings-data' appears to be a column name

These are complex columns. In this case, a list-of-lists is made up of the data values, offsets and maybe an index (in the case of categoricals). There will be some simple wrappers in https://github.com/dask/fastparquet/blob/a9d3f309068189043f5ecec5f616de90c11fa305/fastparquet/wrappers.py to provide access to these nested structures, or the arrays could be passed directly to arrow, awkward or other libraries that know what to do with them.

  'foo.with.strings-data': array([0, 1, -1], dtype=int8),
  'foo.with.strings-cats': ["hey", "there"],

becomes ["hey", "there", None] as a list

  'foo.with.lists.list-offsets': array([0, 1, 2, 3]),
  'foo.with.lists.list.element-data': array([0, 0, 0], dtype=uint8),
  'foo.with.lists.list.element-cats': [0]

becomes [[0], [0], [0]] as a list.
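
Again as a sketch (decode_lists is hypothetical), the offsets slice a flat values array into sub-lists:

import numpy as np

# Hypothetical helper: slice a flat values array into sub-lists using
# arrow-style offsets, where list i spans values[offsets[i]:offsets[i + 1]].
def decode_lists(offsets, values):
    return [values[offsets[i]:offsets[i + 1]].tolist()
            for i in range(len(offsets) - 1)]

offsets = np.array([0, 1, 2, 3])             # 'foo.with.lists.list-offsets'
codes = np.array([0, 0, 0], dtype=np.uint8)  # '...element-data'
cats = [0]                                   # '...element-cats'
values = np.asarray(cats)[codes]             # resolve the dictionary encoding first
print(decode_lists(offsets, values))         # [[0], [0], [0]]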

Yes, 0 is the row-group index. It could perhaps also include the filename. Ways to combine arrays from multiple row-groups can be provided, but I expect that iterating over them will be more common.
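
As a sketch of what combining could look like (assuming each row-group carries arrow-style offsets that start at zero; concat_row_groups is hypothetical):

import numpy as np

# Hypothetical: concatenate one list column across row-groups, rebasing
# the offsets so they index into the combined values array.
def concat_row_groups(row_groups, values_key, offsets_key):
    values, offsets, base = [], [np.array([0])], 0
    for rg in row_groups.values():
        values.append(rg[values_key])
        offsets.append(rg[offsets_key][1:] + base)  # drop the leading 0, shift
        base += rg[offsets_key][-1]
    return np.concatenate(values), np.concatenate(offsets)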

@yohplala

yohplala commented Sep 11, 2024

Thanks a lot for your quick feedback!
Please, can you also share your thoughts on the 2nd question?

I am also curious to know: what will the input to the general write() function be?
A similar dictionary providing the corresponding data per column?

@martindurant
Member Author

I am also curious to know: what will the input to the general write() function be?
A similar dictionary providing the corresponding data per column?

Yes, I think so. In the simple case of tabular data (nothing nested), this is essentially what pandas gives you anyway: dict(df) => {col: values}. For structured data, we can provide ways to ingest lists/dicts, but the best path would be for the caller to provide offsets and such directly, or to use the same wrapper classes I referenced above. Reading will be ready well before writing, though!
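
For the flat case, a sketch of what that input might look like (the write call itself is hypothetical; the API is not settled):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

# A DataFrame already decomposes into exactly this shape:
columns = {name: df[name].to_numpy() for name in df.columns}
# write("out.parquet", columns)  # hypothetical call, API not final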

@beckermr

beckermr commented Jan 5, 2026

@erykoff is interested in "dict of arrays" output and has his numparquet project: https://github.com/erykoff/numparquet.

I do not want to muddle this PR with new ideas/features, but I do want to connect you all together since I think you all have common goals. :)

@martindurant
Member Author

@erykoff : happy to talk and help. I have not had a chance to see your work, since I didn't know about it until just now.

@erykoff

erykoff commented Jan 5, 2026

My work did not exist until just now! It was a holiday break hobby project to see how far I could get. I hadn't looked at fastparquet because it was so entwined with pandas (which I try to avoid). Nevertheless, I now realize that the primitives here really do almost everything that we need.

What we are looking for is:

  • A parquet reader that outputs a dict of numpy arrays, keyed by column name.
  • One that doesn't incur a 4x memory overhead (like pyarrow).
  • Handles nulls via numpy masked arrays (see the sketch after this list).
  • Has a well-defined schema with key/value metadata (something that pandas does not have).
  • We do need array (list) columns, definitely aligned, probably ragged as well (even though these are painful in numpy).
  • No implicit multithreading, which is actively harmful for how we run on compute clusters.
  • No need for / concern about nested types (which are also super complicated, as you show above).
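
To illustrate the masked-array point, a sketch (assuming the dictionary-encoded representation from earlier in this thread, with -1 as the null sentinel):

import numpy as np

codes = np.array([0, 1, -1], dtype=np.int8)
cats = np.array(["hey", "there"])
values = cats[np.clip(codes, 0, None)]              # placeholder where null
masked = np.ma.masked_array(values, mask=codes < 0)
print(masked)                                       # ['hey' 'there' --]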

I'm happy to look at this PR and see if I can make a minimal working example of what we need. But I don't know if it's general...

@martindurant
Member Author

To give more detail on the experiment in this PR: it does work, including variable-length strings, lists and nested records, with or without nulls. You should find the thrift implementation here significantly faster than thriftpy2 (which may be important for big schemas).

Those various complex types are returned as sets of offsets into data arrays; e.g., strings are returned as a (numpy) uint8 array plus a uint32 array of offsets. This is best for loading speed and storage size unless you actually need python strings.
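
A sketch of pulling individual strings out of those two buffers (illustrative only; get_string is not part of this PR):

import numpy as np

data = np.frombuffer(b"heythere", dtype=np.uint8)  # concatenated UTF-8 bytes
offsets = np.array([0, 3, 8], dtype=np.uint32)     # string boundaries

def get_string(i):
    return data[offsets[i]:offsets[i + 1]].tobytes().decode("utf-8")

print(get_string(0), get_string(1))                # hey there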

The complexity is around how to combine pages and row-groups, particularly if you intend to parallelise.

@martindurant
Member Author

Did anyone have any use for the work in this PR?
