Commit d9f36a1

GH-47301: [Python] Fix FileFragment.open() seg fault behavior for file-like objects (#47302)
### Rationale for this change

This PR resolves the issue reported in #47301.

There are [three possible file source types](https://github.com/apache/arrow/blob/80addfab90b65c9127b46cc5c0ff48af4db1afb3/python/pyarrow/_dataset.pyx#L104) in which a `CFileSource` can be created:

1. From a `pa.Buffer`.
2. From a `path` string.
3. From a file-like object which has a `read` attribute.

However, `FileFragment.open()` currently only [explicitly handles the first two types](https://github.com/apache/arrow/blob/80addfab90b65c9127b46cc5c0ff48af4db1afb3/python/pyarrow/_dataset.pyx#L2005). When `open` is called with a `FileFragment` created from type (3), the current implementation tries to read the `path` which is set to a string called `"<Buffer>"` ([source](https://github.com/apache/arrow/blob/135357ce3824d1a8e1aba5a19d897b0c02b22ab7/cpp/src/arrow/dataset/file_base.h#L106)). This causes the seg fault as observed in the linked issue.

### What changes are included in this PR?

1. Modify `FileFragment.open()` to handle the three `CFileSource` cases as listed above.
2. Add a unit test which seg faults without the change in (1) and passes with the change.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes; this PR fixes a user facing bug in the `FileFragment` API.

* GitHub Issue: #47301

Authored-by: Lester Fan <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

1 parent f8b20f1 commit d9f36a1

3 files changed: 18 additions and 4 deletions

python/pyarrow/_dataset.pyx

Lines changed: 9 additions & 4 deletions
@@ -2012,13 +2012,18 @@ cdef class FileFragment(Fragment):
             c_string c_path
             NativeFile out = NativeFile()
 
+        # Handle each of the cases in _make_file_source
         if self.buffer is not None:
             return pa.BufferReader(self.buffer)
 
-        c_path = tobytes(self.file_fragment.source().path())
-        with nogil:
-            c_filesystem = self.file_fragment.source().filesystem()
-            opened = GetResultValue(c_filesystem.get().OpenInputFile(c_path))
+        if self.file_fragment.source().filesystem() != nullptr:
+            c_path = tobytes(self.file_fragment.source().path())
+            with nogil:
+                c_filesystem = self.file_fragment.source().filesystem()
+                opened = GetResultValue(c_filesystem.get().OpenInputFile(c_path))
+        else:
+            with nogil:
+                opened = GetResultValue(self.file_fragment.source().Open())
 
         out.set_random_access_file(opened)
         out.is_readable = True
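
The three-way dispatch in this hunk can be modeled in plain Python to make the control flow easier to follow. This is only a rough sketch under stated assumptions: `SourceModel`, `DictFileSystem`, and `open_source` are hypothetical names invented for illustration and are not part of pyarrow, and the final branch simply returns the wrapped object, whereas the real code delegates to `CFileSource::Open()`.

```python
import io
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SourceModel:
    """Hypothetical stand-in for CFileSource; not a pyarrow class."""
    buffer: Optional[bytes] = None    # case 1: in-memory buffer
    path: str = "<Buffer>"            # placeholder when no real path is set
    filesystem: Optional[Any] = None  # case 2: set only for path-based sources
    file_like: Optional[Any] = None   # case 3: object with a read() method


class DictFileSystem:
    """Toy filesystem mapping paths to bytes, for illustration only."""

    def __init__(self, files):
        self._files = files

    def open_input_file(self, path):
        return io.BytesIO(self._files[path])


def open_source(src):
    # Case 1: buffer-backed source, read directly from memory.
    if src.buffer is not None:
        return io.BytesIO(src.buffer)
    # Case 2: a filesystem is present, so the path is meaningful.
    if src.filesystem is not None:
        return src.filesystem.open_input_file(src.path)
    # Case 3: no filesystem, so the "<Buffer>" placeholder must not be
    # treated as a real path; use the wrapped file-like object instead.
    if src.file_like is None:
        raise ValueError("source has no buffer, filesystem, or file-like object")
    return src.file_like
```

The point of the sketch is the ordering of the checks: the path is only consulted once a filesystem is known to exist, which is the guard the old code lacked.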

python/pyarrow/includes/libarrow_dataset.pxd

Lines changed: 1 addition & 0 deletions
@@ -182,6 +182,7 @@ cdef extern from "arrow/dataset/api.h" namespace "arrow::dataset" nogil:
         const shared_ptr[CFileSystem]& filesystem() const
         const shared_ptr[CBuffer]& buffer() const
         const int64_t size() const
+        CResult[shared_ptr[CRandomAccessFile]] Open() const
         # HACK: Cython can't handle all the overloads so don't declare them.
         # This means invalid construction of CFileSource won't be caught in
         # the C++ generation phase (though it will still be caught when

python/pyarrow/tests/test_dataset.py

Lines changed: 8 additions & 0 deletions
@@ -1364,6 +1364,14 @@ def test_make_parquet_fragment_from_buffer(dataset_reader, pickle_module):
         pickled = pickle_module.loads(pickle_module.dumps(fragment))
         assert dataset_reader.to_table(pickled).equals(table)
 
+        # GH-47301: Ensure fragment.open() works for file-like objects.
+        file_like = pa.BufferReader(buffer)
+        fragment = format_.make_fragment(file_like)
+        opened_file = fragment.open()
+        assert isinstance(opened_file, pa.NativeFile)
+        assert opened_file.readable
+        assert pq.ParquetFile(fragment.open()).read().equals(table)
+
 
 @pytest.mark.parquet
 def _create_dataset_for_fragments(tempdir, chunk_size=None, filesystem=None):
