Maybe fix column selection #1979
I'm looking into it. All examples from #1973 work now! Let me try running our CI in different repos against this branch.
Sorry, I think I messed up something. The test from the issue still fails:

    uv venv venv
    source venv/bin/activate
    uv pip install git+https://github.com/martindurant/filesystem_spec@parquet-nested pyarrow

    import random

    import pyarrow as pa
    import pyarrow.parquet as pq
    from fsspec.parquet import open_parquet_file

    def test(n, path):
        flat = pa.array([random.random() for _ in range(n)])
        nested = pa.array([{"a": random.random(), "b": random.random()} for _ in range(n)])
        table = pa.table({"flat": flat, "nested": nested})
        pq.write_table(table, path)
        with open_parquet_file(path, columns=["nested.a"], engine="pyarrow") as fh:
            _ = pq.read_table(fh)

    # works for 10 rows
    test(10, "/tmp/ten.parquet")
    # fails for 100k rows
    test(100_000, "100k.parquet")
I added the test https://github.com/fsspec/filesystem_spec/pull/1979/files#diff-ff7fd767388891014a980915bb4fb6a84233cd96324d59e80ce9d2db18577791R208-R220 that is supposed to be identical. I wonder what the difference is.
(Sorry, this link: filesystem_spec/fsspec/tests/test_parquet.py, line 208 in 8ba3aae)
Sorry, it was my mistake with the code:
Could you please try with the files I shared in #1973? It still fails for me with double-nested columns (e.g. "spectrum.flux" is a list-array itself). |
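To illustrate why a double-nested column such as "spectrum.flux" is harder to select than "nested.a", here is a minimal sketch of prefix matching between requested dotted names and Parquet leaf column paths. The helper `match_leaf_columns` is hypothetical and is not the code in this branch. The leaf path "spectrum.flux.list.element" is an assumption about how a list-array leaf may be named in the file; the actual name depends on the writer.

```python
def match_leaf_columns(requested, leaf_paths):
    """Return the leaf paths selected by the requested dotted column names.

    Hypothetical helper for illustration only, not fsspec's implementation.
    """
    selected = []
    for leaf in leaf_paths:
        for name in requested:
            # Select an exact match, or any leaf nested below the requested name.
            if leaf == name or leaf.startswith(name + "."):
                selected.append(leaf)
                break
    return selected


# Assumed leaf paths for a file with a flat column, a struct column,
# and a struct column whose field is itself a list-array.
leaf_paths = [
    "flat",
    "nested.a",
    "nested.b",
    "spectrum.flux.list.element",
]

print(match_leaf_columns(["nested.a"], leaf_paths))
print(match_leaf_columns(["spectrum.flux"], leaf_paths))
```

With exact matching only, "spectrum.flux" would select nothing, because the stored leaf sits one or more levels below the requested name; prefix matching on a dot boundary picks it up without also matching unrelated names that merely share leading characters.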
The test you introduced is a little bit different from the code in my original issue: you reuse the same If I change your test with either 1)

Reproducible code

    uv venv venv
    source venv/bin/activate
    uv pip install git+https://github.com/martindurant/filesystem_spec@parquet-nested pyarrow pandas

    import os
    import random

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    from fsspec.parquet import open_parquet_file

    def test_nested(n, tmpdir, engine):
        path = os.path.join(str(tmpdir), "test.parquet")
        flat = pa.array([random.random() for _ in range(n)])
        # the same a and b values are reused for every row
        a = random.random()
        b = random.random()
        nested = pa.array([{"a": a, "b": b} for _ in range(n)])
        table = pa.table({"flat": flat, "nested": nested})
        pq.write_table(table, path, use_dictionary=False, compression=None)
        with open_parquet_file(path, columns=["nested.a"], engine=engine) as fh:
            col = pd.read_parquet(fh, engine=engine, columns=["nested.a"])
        name = "a" if engine == "pyarrow" else "nested.a"
        assert (col[name] == a).all()

    test_nested(1_000_000, '/tmp', 'pyarrow')
@hombit, can you test, please? I fear this may be over-reading bytes, but at least all the tests pass, including your specific case.