
Conversation

@martindurant
Member

@hombit, can you test, please? I fear this may be over-reading bytes, but at least all the tests pass, including your specific case.

@hombit

hombit commented Jan 23, 2026

I'm looking into it. All the examples from #1973 work now! Let me try running our CI in other repos against this branch.

@hombit

hombit commented Jan 23, 2026

Sorry, I think I messed up something. The test from the issue still fails:

uv venv venv
source venv/bin/activate
uv pip install git+https://github.com/martindurant/filesystem_spec@parquet-nested pyarrow
import random
import pyarrow as pa
import pyarrow.parquet as pq
from fsspec.parquet import open_parquet_file

def test(n, path):
    flat = pa.array([random.random() for _ in range(n)])
    nested = pa.array([{"a": random.random(), "b": random.random()} for _ in range(n)])
    table = pa.table({"flat": flat, "nested": nested})
    pq.write_table(table, path)
    with open_parquet_file(path, columns=["nested.a"], engine="pyarrow") as fh:
        _ = pq.read_table(fh)

# works for 10 rows
test(10, "/tmp/ten.parquet")

# fails for 100k rows
test(100_000, "100k.parquet")
OSError: Malformed levels. min: 2 max: 2 out of range.  Max Level: 1

@martindurant
Member Author

I added the test https://github.com/fsspec/filesystem_spec/pull/1979/files#diff-ff7fd767388891014a980915bb4fb6a84233cd96324d59e80ce9d2db18577791R208-R220 that is supposed to be identical. I wonder what the difference is.

@martindurant
Member Author

(sorry, this link:

def test_nested(n, tmpdir, engine):
)

@hombit

hombit commented Jan 23, 2026

Sorry, it was my mistake in the code: pq.read_table(fh) is missing the columns argument. It passes when I add it. I'm still testing with other repositories; give me some time.
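
For reference, the corrected read looks roughly like this (a minimal sketch reusing the file from my snippet above; the path is just illustrative):

import pyarrow.parquet as pq
from fsspec.parquet import open_parquet_file

with open_parquet_file("/tmp/ten.parquet", columns=["nested.a"], engine="pyarrow") as fh:
    # select the same column here, so pyarrow only touches the byte ranges
    # that open_parquet_file pre-fetched
    table = pq.read_table(fh, columns=["nested.a"])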

@hombit

hombit commented Jan 23, 2026

Could you please try with the files I shared in #1973? It still fails for me with doubly nested columns (e.g. "spectrum.flux", which is itself a list array).
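
To illustrate what I mean by doubly nested, here is a minimal sketch (not the actual #1973 files; the path, sizes and field names are made up):

import random
import pyarrow as pa
import pyarrow.parquet as pq
from fsspec.parquet import open_parquet_file

n = 100_000
# "spectrum" is a struct column whose "flux"/"wavelength" children are themselves list arrays
spectrum = pa.array(
    [
        {
            "flux": [random.random() for _ in range(10)],
            "wavelength": [random.random() for _ in range(10)],
        }
        for _ in range(n)
    ]
)
table = pa.table({"spectrum": spectrum})
pq.write_table(table, "/tmp/spectrum.parquet")
with open_parquet_file("/tmp/spectrum.parquet", columns=["spectrum.flux"], engine="pyarrow") as fh:
    pq.read_table(fh, columns=["spectrum.flux"])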

@hombit

hombit commented Jan 23, 2026

The test you introduced is a little different from the code in my original issue: you reuse the same a and b values for the nested column, which makes it very small on disk because of encoding and compression.

If I change your test with either 1) {"a": random.random(), "b": random.random()}, or 2) pq.write_table(table, path, use_dictionary=False, compression=None), it fails with the same "Couldn't deserialize thrift" error.

Reproducible code
uv venv venv
source venv/bin/activate
uv pip install git+https://github.com/martindurant/filesystem_spec@parquet-nested pyarrow pandas
import os
import random

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
from fsspec.parquet import open_parquet_file

def test_nested(n, tmpdir, engine):
    path = os.path.join(str(tmpdir), "test.parquet")
    import pyarrow as pa
    flat = pa.array([random.random() for _ in range(n)])
    a = random.random()
    b = random.random()
    nested = pa.array([{"a": a, "b": b} for _ in range(n)])
    table = pa.table({"flat": flat, "nested": nested})
    pq.write_table(table, path, use_dictionary=False, compression=None)
    with open_parquet_file(path, columns=["nested.a"], engine=engine) as fh:
        col = pd.read_parquet(fh, engine=engine, columns=["nested.a"])
    name = "a" if engine == "pyarrow" else "nested.a"
    assert (col[name] == a).all()
test_nested(1_000_000, '/tmp', 'pyarrow')
Traceback (most recent call last):
  File "<python-input-1>", line 22, in <module>
    test_nested(1_000_000, '/tmp', 'pyarrow')
    ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<python-input-1>", line 19, in test_nested
    col = pd.read_parquet(fh, engine=engine, columns=["nested.a"])
  File "/private/tmp/venv/lib/python3.14/site-packages/pandas/io/parquet.py", line 671, in read_parquet
    return impl.read(
           ~~~~~~~~~^
        path,
        ^^^^^
    ...<6 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/private/tmp/venv/lib/python3.14/site-packages/pandas/io/parquet.py", line 260, in read
    pa_table = self.api.parquet.read_table(
        path_or_handle,
    ...<3 lines>...
        **kwargs,
    )
  File "/private/tmp/venv/lib/python3.14/site-packages/pyarrow/parquet/core.py", line 1926, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
           ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                        use_pandas_metadata=use_pandas_metadata)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/tmp/venv/lib/python3.14/site-packages/pyarrow/parquet/core.py", line 1552, in read
    table = self._dataset.to_table(
        columns=columns, filter=self._filter_expression,
        use_threads=use_threads
    )
  File "pyarrow/_dataset.pyx", line 589, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3969, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.
