Skip to content

[Data] Add map namespace support for expression operations#59879

Open
ryankert01 wants to merge 23 commits intoray-project:masterfrom
ryankert01:map-expression
Open

[Data] Add map namespace support for expression operations#59879
ryankert01 wants to merge 23 commits intoray-project:masterfrom
ryankert01:map-expression

Conversation

@ryankert01
Copy link
Member

@ryankert01 ryankert01 commented Jan 6, 2026

Description

MapNamespace impl.

  • Implemented _extract_map_component as a robust, vectorized fallback since native pc.map_keys kernels are not standard in PyArrow yet.
  • Support: Handles both Logical Maps (MapArray) and Physical Maps (List<Struct>).

Testing

  • test_map_keys / test_map_values: Standard extraction.
  • test_physical_map_extraction: Verifies support for List<Struct>.
  • test_map_sliced_offsets: Verifies the critical fix for sliced data.
  • test_map_nulls_and_empty: Verifies handling of None and empty maps {}.
  • test_map_chaining: Verifies composition with List namespace (e.g., .map.keys().list.len()).

Related issues

Related to #58674
Continues #58743

Additional information

test w/

python -m pytest -v -s python/ray/data/tests/test_namespace_expressions.py::TestMapNamespace
Cursor Bugbot found 1 potential issue for commit 7a11478

@ryankert01 ryankert01 requested a review from a team as a code owner January 6, 2026 06:41
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for map/dict operations on expression columns by adding a map namespace. The implementation is well-structured, adding a _MapNamespace with keys() and values() methods that work on both logical MapArray and physical List<Struct> representations. The handling of sliced arrays with non-zero offsets is a great detail that ensures correctness. The accompanying tests are thorough, covering various representations, edge cases like nulls and empty maps, and integration with other namespaces.

I've added a couple of suggestions to map_namespace.py to further improve the robustness of the implementation by handling LargeListArray and providing clearer errors for unsupported types. Overall, this is a solid contribution that enhances Ray Data's expression capabilities.

Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
@ray-gardener ray-gardener bot added data Ray Data-related issues community-contribution Contributed by the community labels Jan 6, 2026
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
@ryankert01
Copy link
Member Author

Copy link
Member

@owenowenisme owenowenisme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor fixes, overall LGTM

ryankert01 and others added 3 commits January 22, 2026 19:57
Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Signed-off-by: Ryan Huang <ryankert01@gmail.com>
Signed-off-by: Hsien-Cheng Huang <hcr@apache.org>
f"(key and value), but got: {arr.type}."
)
return pyarrow.ListArray.from_arrays(
offsets=[0] * (len(arr) + 1),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return pyarrow.ListArray.from_arrays(
offsets=[0] * (len(arr) + 1),
values=pyarrow.array([], type=pyarrow.null()),
mask=pyarrow.array([True] * len(arr)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

VALUES = "values"


def _extract_map_component(
Copy link
Contributor

@goutamvenkat-anyscale goutamvenkat-anyscale Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's create 3 helper functions to make the intent clearer.

  1. _get_child_array which gets keys and values
  2. _make_empty_list_array
  3. _rebuild_list_array with normalized offsets

For each one add an example of what it's doing in the comments.

def _extract_map_component(
    arr: pyarrow.Array, component: MapComponent
) -> pyarrow.Array:
    """Extract keys or values from a MapArray or ListArray<Struct>."""
    
    if isinstance(arr, pyarrow.ChunkedArray):
        return pyarrow.chunked_array(
            [_extract_map_component(chunk, component) for chunk in arr.chunks]
        )

    child_array = _get_child_array(arr, component)
    
    if child_array is None:
        return _make_empty_list_array(arr, component)
    
    return _rebuild_list_array(arr, child_array)

Signed-off-by: Ryan Huang <ryankert01@gmail.com>
Signed-off-by: Ryan Huang <ryankert01@gmail.com>
Comment on lines +120 to +122
assert list(rows[0]["keys"]) == ["a"] and list(rows[0]["values"]) == [1]
assert len(rows[1]["keys"]) == 0 and len(rows[1]["values"]) == 0
assert rows[2]["keys"] is None and rows[2]["values"] is None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use rows_same

Copy link
Member Author

@ryankert01 ryankert01 Feb 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

row_same operates on pandas that can't handle the mixed None/list column when converting. The to_pandas() path triggers TensorArray casting which fails on the mixed types. Let's keep it!

Although there's workaround, but is too complex for the context of this test:

    ctx = ray.data.context.DataContext.get_current()
    ctx.enable_tensor_extension_casting = False
    try:
        result = (
            ds.with_column("keys", col("m").map.keys())
            .with_column("values", col("m").map.values())
            .to_pandas()
        )
        expected = pd.DataFrame(
            {
                "keys": [["a"], [], None],
                "values": [[1], [], None],
            }
        )
        _assert_result(result, expected, drop_cols=["m"])
    finally:
        ctx.enable_tensor_extension_casting = True

Comment on lines 78 to 81
if start_offset.as_py() != 0:
end_offset = offsets[-1].as_py()
child_array = child_array.slice(
offset=start_offset.as_py(), length=end_offset - start_offset.as_py()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't believe you need to call as_py here

@iamjustinhsu iamjustinhsu added the go add ONLY when ready to merge, run all tests label Feb 4, 2026
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Copy link
Contributor

@goutamvenkat-anyscale goutamvenkat-anyscale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please look at open comments. Thanks

Signed-off-by: Ryan Huang <ryankert01@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants