Align on standard ETL Metadata schema #817

To make adaptive decisions at runtime (shuffle vs. broadcast, etc.), we need to track some basic metadata. In cudf-polars, this metadata is currently propagated through the network using a separate `Channel` between nodes. After https://github.com/rapidsai/rapidsmpf/pull/811, we will be able to use the same `Channel` for both data and metadata payloads. However, for cudf-polars to **use** any C++ Join/GroupBy nodes, we will (probably) still need cudf-polars and RapidsMPF to align on a common metadata schema.

At a minimum, the channel metadata should track the expected local chunk count and whether the local data coming through the channel is already broadcast (duplicated) across all ranks. To avoid re-shuffling data unnecessarily, we also want to track the partitioning status of the data being sent into an actor.
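To make the motivation concrete, here is a minimal sketch of how an actor might use just these two fields to pick a strategy for an inner join. `MinimalMetadata`, `choose_join_strategy`, and the `broadcast_limit` threshold are hypothetical names invented for illustration; they are not part of cudf-polars or RapidsMPF, and the policy itself is only an assumption about what "adaptive" could mean here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MinimalMetadata:
    """Hypothetical stand-in for the minimal channel metadata."""

    local_count: int  # expected number of chunks on this rank
    duplicated: bool  # True if every rank already holds an identical copy


def choose_join_strategy(
    left: MinimalMetadata,
    right: MinimalMetadata,
    broadcast_limit: int = 1,  # assumed threshold, in chunks
) -> str:
    """Pick an inner-join strategy from metadata alone (illustrative policy)."""
    # If one side is already on every rank, each rank can join locally.
    if left.duplicated or right.duplicated:
        return "local"
    # If one side is small enough, broadcasting it avoids a full shuffle.
    if min(left.local_count, right.local_count) <= broadcast_limit:
        return "broadcast"
    # Otherwise both sides have to be shuffled on the join keys.
    return "shuffle"
```

A real policy would likely also consult the global chunk count and the partitioning status; the point is only that the decision can be made before any data arrives.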

## Current Metadata Schema

The current `Metadata` [definition in cudf-polars](https://github.com/rapidsai/cudf/blob/4a8cdb232e4d5a8cfd00783dbb23c79b2cd77307/python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py#L90) looks like:

```python
class Metadata:
    """Metadata payload for an ETL workload."""

    local_count: int
    """Local chunk-count estimate for the current rank."""
    global_count: int | None
    """Global chunk-count estimate across all ranks."""
    partitioning: HashPartitioned | None
    """How the data is partitioned, or None if not partitioned."""
    duplicated: bool
    """Whether the data is duplicated (identical) on all workers."""

class HashPartitioned:
    """Hash-partitioning metadata."""

    columns: tuple[str, ...]
    """Columns the data is hash-partitioned on."""
    scope: Literal["local", "global"]
    """Whether data is partitioned locally (within a rank) or globally (across all ranks)."""
    count: int
    """The modulus used for hash partitioning (number of partitions)."""
```
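As a rough illustration of how a consumer could use this schema to skip a redundant shuffle, the sketch below re-declares the two classes as frozen dataclasses (an assumption made so the example is self-contained; the linked definition may be structured differently) and adds a hypothetical `needs_shuffle` helper. The exact-match rule on columns and count is deliberately conservative and is not taken from cudf-polars.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class HashPartitioned:
    columns: tuple[str, ...]
    scope: Literal["local", "global"]
    count: int


@dataclass(frozen=True)
class Metadata:
    local_count: int
    global_count: int | None = None
    partitioning: HashPartitioned | None = None
    duplicated: bool = False


def needs_shuffle(meta: Metadata, keys: tuple[str, ...], count: int) -> bool:
    """Return True unless the data is already globally hash-partitioned
    on exactly `keys` with exactly `count` partitions (hypothetical helper)."""
    part = meta.partitioning
    if part is None:
        return True  # no partitioning information: assume a shuffle is needed
    already_partitioned = (
        part.scope == "global" and part.columns == keys and part.count == count
    )
    return not already_partitioned


# Usage: data shuffled on ("a",) into 16 global partitions can be reused as-is.
shuffled = Metadata(
    local_count=4,
    partitioning=HashPartitioned(columns=("a",), scope="global", count=16),
)
print(needs_shuffle(shuffled, ("a",), 16))
print(needs_shuffle(shuffled, ("b",), 16))
```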

## Partitioning metadata

An important open question is how best to encode the partitioning status of the data being sent through a channel. For now, the `partitioning` attribute is either empty (`None`) or set to a `HashPartitioned` instance. This makes it easy to track data that has already been globally shuffled. However, it does not make it easy to distinguish between data that was shuffled directly into N_g global chunks and data that was first shuffled between R ranks and then into N_l local chunks.
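One way to read the ambiguity, sketched under the assumption that each shuffle stage would be described by its own `HashPartitioned` value: the two-stage case needs two descriptors (one global, one local), but `Metadata.partitioning` holds only one. The class is re-declared here as a frozen dataclass so the snippet is self-contained, and the concrete values (R = 4, N_l = 2, N_g = 8) are made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class HashPartitioned:
    columns: tuple[str, ...]
    scope: Literal["local", "global"]
    count: int


keys = ("key",)

# Case 1: shuffled directly into N_g = 8 global chunks.
direct = HashPartitioned(columns=keys, scope="global", count=8)

# Case 2: shuffled between R = 4 ranks, then into N_l = 2 local chunks per rank.
after_rank_shuffle = HashPartitioned(columns=keys, scope="global", count=4)
after_local_split = HashPartitioned(columns=keys, scope="local", count=2)

# Both cases produce the same total number of chunks ...
total_two_stage = after_rank_shuffle.count * after_local_split.count

# ... but neither single descriptor from case 2 matches the case-1 descriptor,
# and a single `partitioning` slot cannot carry both of them at once.
two_stage_candidates = (after_rank_shuffle, after_local_split)
print(total_two_stage == direct.count, direct in two_stage_candidates)
```

Whichever of the two case-2 descriptors is stored, the other level of information is dropped, which is why a downstream actor cannot tell the two histories apart from the current schema alone.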
