[Data] Add map namespace support for expression operations#59879
[Data] Add map namespace support for expression operations#59879ryankert01 wants to merge 23 commits intoray-project:masterfrom
Conversation
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
694b035 to
1bd4269
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces support for map/dict operations on expression columns by adding a map namespace. The implementation is well-structured, adding a _MapNamespace with keys() and values() methods that work on both logical MapArray and physical List<Struct> representations. The handling of sliced arrays with non-zero offsets is a great detail that ensures correctness. The accompanying tests are thorough, covering various representations, edge cases like nulls and empty maps, and integration with other namespaces.
I've added a couple of suggestions to map_namespace.py to further improve the robustness of the implementation by handling LargeListArray and providing clearer errors for unsupported types. Overall, this is a solid contribution that enhances Ray Data's expression capabilities.
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com> Signed-off-by: Ryan Huang <ryankert01@gmail.com>
Signed-off-by: Hsien-Cheng Huang <hcr@apache.org>
| f"(key and value), but got: {arr.type}." | ||
| ) | ||
| return pyarrow.ListArray.from_arrays( | ||
| offsets=[0] * (len(arr) + 1), |
There was a problem hiding this comment.
| return pyarrow.ListArray.from_arrays( | ||
| offsets=[0] * (len(arr) + 1), | ||
| values=pyarrow.array([], type=pyarrow.null()), | ||
| mask=pyarrow.array([True] * len(arr)), |
| VALUES = "values" | ||
|
|
||
|
|
||
| def _extract_map_component( |
There was a problem hiding this comment.
Let's create 3 helper functions to make the intent clearer.
_get_child_arraywhich gets keys and values_make_empty_list_array_rebuild_list_arraywith normalized offsets
For each one add an example of what it's doing in the comments.
def _extract_map_component(
arr: pyarrow.Array, component: MapComponent
) -> pyarrow.Array:
"""Extract keys or values from a MapArray or ListArray<Struct>."""
if isinstance(arr, pyarrow.ChunkedArray):
return pyarrow.chunked_array(
[_extract_map_component(chunk, component) for chunk in arr.chunks]
)
child_array = _get_child_array(arr, component)
if child_array is None:
return _make_empty_list_array(arr, component)
return _rebuild_list_array(arr, child_array)Signed-off-by: Ryan Huang <ryankert01@gmail.com>
Signed-off-by: Ryan Huang <ryankert01@gmail.com>
Signed-off-by: Ryan Huang <ryankert01@gmail.com>
| assert list(rows[0]["keys"]) == ["a"] and list(rows[0]["values"]) == [1] | ||
| assert len(rows[1]["keys"]) == 0 and len(rows[1]["values"]) == 0 | ||
| assert rows[2]["keys"] is None and rows[2]["values"] is None |
There was a problem hiding this comment.
Let's use rows_same
There was a problem hiding this comment.
row_same operates on pandas that can't handle the mixed None/list column when converting. The to_pandas() path triggers TensorArray casting which fails on the mixed types. Let's keep it!
Although there's workaround, but is too complex for the context of this test:
ctx = ray.data.context.DataContext.get_current()
ctx.enable_tensor_extension_casting = False
try:
result = (
ds.with_column("keys", col("m").map.keys())
.with_column("values", col("m").map.values())
.to_pandas()
)
expected = pd.DataFrame(
{
"keys": [["a"], [], None],
"values": [[1], [], None],
}
)
_assert_result(result, expected, drop_cols=["m"])
finally:
ctx.enable_tensor_extension_casting = True| if start_offset.as_py() != 0: | ||
| end_offset = offsets[-1].as_py() | ||
| child_array = child_array.slice( | ||
| offset=start_offset.as_py(), length=end_offset - start_offset.as_py() |
There was a problem hiding this comment.
Don't believe you need to call as_py here
goutamvenkat-anyscale
left a comment
There was a problem hiding this comment.
Please look at open comments. Thanks
Signed-off-by: Ryan Huang <ryankert01@gmail.com>
Description
MapNamespaceimpl._extract_map_componentas a robust, vectorized fallback since nativepc.map_keyskernels are not standard in PyArrow yet.MapArray) and Physical Maps (List<Struct>).Testing
test_map_keys/test_map_values: Standard extraction.test_physical_map_extraction: Verifies support forList<Struct>.test_map_sliced_offsets: Verifies the critical fix for sliced data.test_map_nulls_and_empty: Verifies handling ofNoneand empty maps{}.test_map_chaining: Verifies composition with List namespace (e.g.,.map.keys().list.len()).Related issues
Related to #58674
Continues #58743
Additional information
test w/