Commit c49b666
[Data] Fix checkpoint filter PyArrow zero-copy conversion error (#59839)
# Fix `ArrowInvalid` error in checkpoint filter when converting PyArrow
chunks to NumPy arrays
## Issue
Fixes `ArrowInvalid` error when checkpoint filtering converts PyArrow
chunks to NumPy arrays with `zero_copy_only=True`:
```
File "/usr/local/lib/python3.10/dist-packages/ray/data/checkpoint/checkpoint_filter.py", line 249, in filter_rows_for_block
masks = list(executor.map(filter_with_ckpt_chunk, ckpt_chunks))
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
yield _result_or_cancel(fs.pop())
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
return fut.result(timeout)
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/data/checkpoint/checkpoint_filter.py", line 229, in filter_with_ckpt_chunk
ckpt_ids = ckpt_chunk.to_numpy(zero_copy_only=True)
File "pyarrow/array.pxi", line 1789, in pyarrow.lib.Array.to_numpy
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True
```
This error occurs when checkpoint data is loaded from Ray's object
store, where PyArrow buffers may reside in shared memory and cannot be
zero-copied to NumPy.
## Reproduction
```python
#!/usr/bin/env python3
import ray
from ray.data import DataContext
from ray.data.checkpoint import CheckpointConfig
import tempfile
ray.init()
with tempfile.TemporaryDirectory() as ckpt_dir, \
        tempfile.TemporaryDirectory() as data_dir, \
        tempfile.TemporaryDirectory() as output_dir:
    # Step 1: Create data
    ray.data.range(10).map(lambda x: {"id": f"id_{x['id']}"}).write_parquet(data_dir)
    # Step 2: Enable checkpoint and write
    ctx = DataContext.get_current()
    ctx.checkpoint_config = CheckpointConfig(
        checkpoint_path=ckpt_dir,
        id_column="id",
        delete_checkpoint_on_success=False,
    )
    ray.data.read_parquet(data_dir).filter(lambda x: x["id"] != "id_0").write_parquet(output_dir)
    # Step 3: Second write triggers checkpoint filtering
    ray.data.read_parquet(data_dir).write_parquet(output_dir)
ray.shutdown()
```
## Solution
Change `to_numpy(zero_copy_only=True)` to
`to_numpy(zero_copy_only=False)` in
`BatchBasedCheckpointFilter.filter_rows_for_block()`. This allows
PyArrow to copy data when necessary.
### Changes
**File**: `ray/python/ray/data/checkpoint/checkpoint_filter.py`
- Line 229: Changed `ckpt_chunk.to_numpy(zero_copy_only=True)` to
`ckpt_chunk.to_numpy(zero_copy_only=False)`
### Performance Impact
No performance regression expected. PyArrow will only perform a copy
when zero-copy is not possible.
Signed-off-by: dragongu <andrewgu@vip.qq.com>

Parent: 2ac0f3f · Commit: c49b666
1 file changed (+3, −1 lines)