Description
Ray collective supports both gloo and nccl as backends, and currently handles torch.Tensor, numpy.ndarray, and cupy.ndarray.
There are cases where users may not have the dependencies required for nccl, e.g., CUDA; and for gloo, the current pygloo doesn't seem to be well maintained: https://github.com/ray-project/pygloo.
The idea of Ray-native collective communication is to implement the collective primitives via Ray actors and tasks.
Users could then use these primitives directly after `pip install ray`,
and generic Python objects are expected to be supported by these APIs.
The distributed object store can be leveraged, and node-local scheduling may also be explored, to achieve high-bandwidth, low-latency collective operations.
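As a rough illustration, here is a minimal sketch of an all-reduce over generic Python objects built only from Ray actors and the object store. `CollectiveWorker`, `compute_shard`, and `naive_allreduce` are hypothetical names for illustration, not part of any existing Ray API:

```python
import ray

ray.init()

@ray.remote
class CollectiveWorker:
    """Hypothetical actor participating in a Ray-native all-reduce."""

    def __init__(self, rank: int):
        self.rank = rank

    def compute_shard(self):
        # Any picklable Python object works; the distributed object
        # store moves it between nodes as an ordinary object ref.
        return {"rank": self.rank, "value": self.rank * 10}

def naive_allreduce(workers):
    # Gather: object refs resolve through the distributed object store.
    shards = ray.get([w.compute_shard.remote() for w in workers])
    total = sum(s["value"] for s in shards)
    return total  # A broadcast step would return this to every worker.

workers = [CollectiveWorker.remote(rank) for rank in range(4)]
print(naive_allreduce(workers))  # 0 + 10 + 20 + 30 = 60
```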
Adding these Ray-native implementations will significantly speed up the adoption of Ray collective in the long term. Users should have the flexibility to use either the Ray-native implementation or the existing Ray collective (with the gloo or nccl backend).
The implementation should be a superset of the current Ray collective, ideally exposed as another backend, i.e., gloo, nccl, or ray.
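For the API surface, a hedged sketch of how the proposed backend could slot into the existing `ray.util.collective` interface; note that `backend="ray"` does not exist today and is exactly what this proposal would add:

```python
import numpy as np
import ray
import ray.util.collective as col

@ray.remote
class Worker:
    def setup(self, world_size: int, rank: int):
        # "gloo" and "nccl" are the backends accepted today;
        # backend="ray" is the proposed addition.
        col.init_collective_group(world_size, rank,
                                  backend="ray", group_name="default")

    def do_allreduce(self):
        buf = np.ones(4, dtype=np.float32)
        # Mirrors the existing in-place allreduce semantics; a Ray-native
        # backend could additionally accept generic Python objects.
        col.allreduce(buf, group_name="default")
        return buf
```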
Use case
In @ray-project/deltacat, we have a use case where a single actor needs to handle a large volume of data from thousands or millions of tasks. We found that this can be largely improved by using distributed actors and an all-reduce operation. Having a Ray-native collective library, with either a tree-based or a ring-based implementation, would significantly simplify our implementation.
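For the tree-based variant mentioned above, a minimal sketch using plain Ray tasks; `reduce_pair`, `tree_reduce`, and `produce` are illustrative names, and the pairwise `+` stands in for whatever combine step deltacat would actually need:

```python
import ray

ray.init()

@ray.remote
def reduce_pair(a, b):
    # Pairwise combine step; works for any type supporting "+".
    return a + b

def tree_reduce(object_refs):
    """Tree-based reduction over a list of object refs."""
    refs = list(object_refs)
    while len(refs) > 1:
        paired = []
        for i in range(0, len(refs) - 1, 2):
            paired.append(reduce_pair.remote(refs[i], refs[i + 1]))
        if len(refs) % 2 == 1:
            paired.append(refs[-1])  # carry the odd one to the next round
        refs = paired
    return ray.get(refs[0])

@ray.remote
def produce(i):
    return i  # stand-in for a task emitting a large partial result

print(tree_reduce([produce.remote(i) for i in range(1000)]))  # 499500
```

The tree shape bounds the reduction depth at O(log n) and spreads the combine work across the cluster, instead of funneling every partial result through the single actor that is the bottleneck described above.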