pure-numpy interface to parquet #931

Draft
martindurant wants to merge 27 commits into dask:main from martindurant:faster

Conversation

@martindurant
Member

Due to the upcoming hard dependence of pandas on pyarrow, this branch investigates what it would look like to have a fastparquet that avoids pandas altogether and deals with numpy arrays alone. For complex columns, the representation will be similar to and compatible with awkward/arrow buffers, but will not require those packages.

@yohplala

Hi @martindurant
I have seen your comment in #935:

Output will be an iterator over row-groups, with dictionaries giving the positions in the schema, or a light structured wrapper, something like:

{0: {
  'foo.with.strings-data': array([0, 1, -1], dtype=int8),
  'foo.with.strings-cats': ["hey", "there"],
  'foo.with.ints-data': array([1, 2, 3], dtype=uint8),
  'foo.with.lists.list-offsets': array([0, 1, 2, 3]),
  'foo.with.lists.list.element-data': array([0, 0, 0], dtype=uint8),
  'foo.with.lists.list.element-cats': [0]}
}

'foo.with.strings-data' appears to be a column name, right?
But what is the 0 key? The ID of the row group? (the arrays do not all have the same length, so I am not sure what it refers to)

I am also curious to know: what will the input to the general write() function be?
A similar dictionary providing the corresponding data per column?

Thank you for your feedback!

@martindurant
Member Author

'foo.with.strings-data' appears to be a column name

These are complex columns. In this case, a list-of-lists is made up of the data values, offsets and maybe an index (in the case of categoricals). There will be some simple wrappers in https://github.com/dask/fastparquet/blob/a9d3f309068189043f5ecec5f616de90c11fa305/fastparquet/wrappers.py to provide access to these nested structures, or the arrays could be passed directly to arrow, awkward or other libraries that know what to do with them.

  'foo.with.strings-data': array([0, 1, -1], dtype=int8),
  'foo.with.strings-cats': ["hey", "there"],

becomes ["hey", "there", None] as a list

  'foo.with.lists.list-offsets': array([0, 1, 2, 3]),
  'foo.with.lists.list.element-data': array([0, 0, 0], dtype=uint8),
  'foo.with.lists.list.element-cats': [0]

becomes [[0], [0], [0]] as a list.
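
Again as a sketch (decode_lists is hypothetical), the offsets slice a flat values array into sub-lists:

import numpy as np

# Hypothetical helper: slice a flat values array into sub-lists using
# arrow-style offsets, where list i spans values[offsets[i]:offsets[i + 1]].
def decode_lists(offsets, values):
    return [values[offsets[i]:offsets[i + 1]].tolist()
            for i in range(len(offsets) - 1)]

offsets = np.array([0, 1, 2, 3])             # 'foo.with.lists.list-offsets'
codes = np.array([0, 0, 0], dtype=np.uint8)  # '...element-data'
cats = [0]                                   # '...element-cats'
values = np.asarray(cats)[codes]             # resolve the dictionary encoding first
print(decode_lists(offsets, values))         # [[0], [0], [0]]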

Yes, 0 is the row-group index. It could perhaps also include the filename. Ways to combine arrays from multiple row-groups can be provided, but I expect that iterating over them will be more common.
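
As a sketch of what combining could look like (assuming each row-group carries arrow-style offsets that start at zero; concat_row_groups is hypothetical):

import numpy as np

# Hypothetical: concatenate one list column across row-groups, rebasing
# the offsets so they index into the combined values array.
def concat_row_groups(row_groups, values_key, offsets_key):
    values, offsets, base = [], [np.array([0])], 0
    for rg in row_groups.values():
        values.append(rg[values_key])
        offsets.append(rg[offsets_key][1:] + base)  # drop the leading 0, shift
        base += rg[offsets_key][-1]
    return np.concatenate(values), np.concatenate(offsets)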

@yohplala

yohplala commented Sep 11, 2024

Thanks a lot for your quick feedback!
Please, can you also share your thoughts on the 2nd question?

I am also curious to know: what will the input to the general write() function be?
A similar dictionary providing the corresponding data per column?

@martindurant
Member Author

I am also curious to know: what will the input to the general write() function be?
A similar dictionary providing the corresponding data per column?

Yes, I think so. In the simple case of tabular data (nothing nested), this is essentially what pandas gives you anyway: dict(df) => {col: values}. For structured data, we can provide ways to ingest lists/dicts, but the best path would be for the caller to provide offsets and such directly, or to use the same wrapper classes I referenced above. Reading will be ready well before writing, though!
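
For the flat case, a sketch of what that input might look like (the write call itself is hypothetical; the API is not settled):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

# A DataFrame already decomposes into exactly this shape:
columns = {name: df[name].to_numpy() for name in df.columns}
# write("out.parquet", columns)  # hypothetical call, API not final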

@beckermr

beckermr commented Jan 5, 2026

@erykoff is interested in "dict of arrays" output and has his numparquet project: https://github.com/erykoff/numparquet.

I do not want to muddle this PR with new ideas/features, but I do want to connect you all together since I think you all have common goals. :)

@martindurant
Member Author

@erykoff : happy to talk and help. I have not had a chance to see your work, since I didn't know about it until just now.

@erykoff

erykoff commented Jan 5, 2026

My work did not exist until just now! It was a holiday break hobby project to see how far I could get. I hadn't looked at fastparquet because it was so entwined with pandas (which I try to avoid). Nevertheless, I now realize that the primitives here really do almost everything that we need.

What we are looking for is:

  • A parquet reader that outputs a dict of numpy arrays, keyed by column name.
  • One that doesn't incur a 4x memory overhead (like pyarrow).
  • Handles nulls via numpy masked arrays (see the sketch after this list).
  • Has a well-defined schema with key/value metadata (something that pandas does not have).
  • We do need array (list) columns, definitely aligned, probably ragged as well (even though these are painful in numpy).
  • No implicit multithreading, which is actively harmful for how we run on compute clusters.
  • No need for / concern about nested types (which are also super complicated, as you show above).
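
To illustrate the masked-array point, a sketch (assuming the dictionary-encoded representation from earlier in this thread, with -1 as the null sentinel):

import numpy as np

codes = np.array([0, 1, -1], dtype=np.int8)
cats = np.array(["hey", "there"])
values = cats[np.clip(codes, 0, None)]              # placeholder where null
masked = np.ma.masked_array(values, mask=codes < 0)
print(masked)                                       # ['hey' 'there' --]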

I'm happy to look at this PR and see if I can make a minimal working example of what we need. But I don't know if it's general...

@martindurant
Member Author

To give more detail on the experiment in this PR: it does work, including variable-length strings, lists and nested records, with or without nulls. You should find the thrift implementation here significantly faster than thriftpy2 (which may be important for big schemas).

Those various complex types are returned as sets of offsets into data arrays; e.g., strings are returned as a (numpy) uint8 array plus a uint32 array of offsets. This is best for loading speed and storage size unless you actually need python strings.
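
A sketch of pulling individual strings out of those two buffers (illustrative only; get_string is not part of this PR):

import numpy as np

data = np.frombuffer(b"heythere", dtype=np.uint8)  # concatenated UTF-8 bytes
offsets = np.array([0, 3, 8], dtype=np.uint32)     # string boundaries

def get_string(i):
    return data[offsets[i]:offsets[i + 1]].tobytes().decode("utf-8")

print(get_string(0), get_string(1))                # hey there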

The complexity is around how to combine pages and row-groups, particularly if you intend to parallelise.

@martindurant
Member Author

Did anyone have any use for the work in this PR?
