
Ray-native collective communication library #35311

@valiantljk

Description


Ray collective supports both gloo and nccl as backends, and currently handles torch.Tensor, numpy.ndarray, and cupy.ndarray.

There are cases where users do not have the dependencies nccl requires (e.g., CUDA), and for gloo, the current pygloo bindings do not appear to be well maintained: https://github.com/ray-project/pygloo.

The idea of Ray-native collective communication is to implement the collective primitives via Ray actors and tasks.
Users could then use these primitives directly after pip install ray, and generic Python objects are expected to be supported by these APIs.

The distributed object store can be leveraged, and node-local scheduling could also be explored, to achieve high-bandwidth, low-latency collective operations.

Adding this Ray-native implementation would significantly accelerate the adoption of Ray collective in the long term. Users should have the flexibility to choose either the Ray-native implementation or the existing Ray collective backends (gloo or nccl).

The implementation should be a superset of the current Ray collective, ideally exposed as just another backend alongside the existing ones, i.e., gloo, nccl, or ray.

Use case

In @ray-project/deltacat, we have a use case where a single actor needs to handle large volume of data from thousands/millions of tasks; However, we found that this can be largely improved by using distributed actors and an all-reduce operation; It'll significantly simplify our implementation if we have the ray-native collective library, either through a tree-based implementation or a ring-based implementation.
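For reference, the ring-based variant mentioned above can be sketched in plain Python. This is a simulation of the classic two-phase ring all-reduce (reduce-scatter, then all-gather); in a real Ray-native backend, each rank would be an actor and each "send" an actor method call or object-store transfer. The function name and data layout are assumptions for illustration:

```python
def ring_allreduce(data):
    """Simulated ring all-reduce.

    data[r] is rank r's list of n numeric chunks (n ranks, n chunks each).
    After the call, every rank holds the elementwise chunk sums.
    """
    n = len(data)
    # Phase 1 -- reduce-scatter: after n-1 steps, rank r holds the
    # fully reduced chunk (r + 1) % n.
    for step in range(n - 1):
        sends = [((r + 1) % n, (r - step) % n, data[r][(r - step) % n])
                 for r in range(n)]
        for dst, idx, val in sends:
            data[dst][idx] += val
    # Phase 2 -- all-gather: circulate each reduced chunk around the ring.
    for step in range(n - 1):
        sends = [((r + 1) % n, (r + 1 - step) % n, data[r][(r + 1 - step) % n])
                 for r in range(n)]
        for dst, idx, val in sends:
            data[dst][idx] = val
    return data

ring_allreduce([[1, 2], [3, 4]])  # -> [[4, 6], [4, 6]] (chunk sums 1+3 and 2+4)
```

Each rank sends and receives only 2 * (n - 1) chunks in total, which is what makes the ring variant bandwidth-optimal compared with funneling everything through one actor.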


Labels

P2 (important issue, but not time-critical), community-backlog, core (issues that should be addressed in Ray Core), enhancement (request for new feature and/or capability), performance
