[DNL] Experimenting on what a PTD transport buffer would look like #43
Conversation
return await self.store.get_meta(key, request)

@endpoint
async def handshake(self, file_store_name):
setup_comms
    return
logger.info(f"Finalizing handshake from {file_store_name}")

file_store = torch.distributed.FileStore(file_store_name, 2)
TCPStore
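A minimal sketch of what the reviewer's TCPStore suggestion could look like in place of the FileStore rendezvous above; the host, port, and choice of master side are illustrative assumptions, not part of this PR:

```python
# Hypothetical TCPStore-based rendezvous replacing the FileStore above.
# Host, port, and which actor acts as master are assumptions.
from datetime import timedelta

import torch.distributed as dist

store = dist.TCPStore(
    "localhost",                    # assumed rendezvous host
    29500,                          # assumed rendezvous port
    world_size=2,                   # two actors per transport pair
    is_master=True,                 # assume this actor hosts the store
    timeout=timedelta(seconds=30),
)
```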
# we allocate on the fly
tensor = await transport_buffer.read_into(tensor=None)

pg = self.pgs[transport_buffer.file_store_name]
from storage_volume
if request.tensor_slice is None:
    await transport_buffer.write_from(self.kv[key])
await transport_buffer.write_from(
move r inside
volume_coord = self.volume_id_to_coord[volume_id]
return self.storage_volumes.slice(**volume_coord)
storage_volume = self.storage_volumes.slice(**volume_coord)
storage_volume.volume_id = volume_id
🤔
import torch
from torchstore.utils import _gloo_factory
🤔
local_pgs = {}

class TorchDistributedBuffer(TransportBuffer):
🤔
# TODO: eventually this should be dependent on the connections available to a storage_volume

# TODO:
if True:
🤔
return global_tensor

def _gloo_factory(
🤔
else:
    transport_buffer.allocate(request.tensor_val)

if isinstance(transport_buffer, TorchDistributedBuffer):
👎🏽
if transport_buffer.is_object:
    return transport_buffer.objects

if isinstance(transport_buffer, TorchDistributedBuffer):
👎🏽
closed in favor of #44
Evaluating the lift required to implement a transport protocol around PTD (PyTorch Distributed).
The main drawback is having to initiate a process group handshake on the first transport request between two actors. The resulting process group can be cached for future requests, but that may not scale well.
Currently only the basic tests pass, and the caching logic for process group creation is wrong.
Ignore the RDMA changes; I'm not planning on landing this PR.
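As a rough sketch of the handshake-and-cache pattern described above (`local_pgs` and a Gloo factory appear in the diff, but this helper name, its signature, and the two-actor world size are assumptions, not the PR's actual implementation):

```python
# Sketch: cache one process group per rendezvous store so that only the
# first transport request between two actors pays the handshake cost.
import torch.distributed as dist

local_pgs = {}  # file_store_name -> ProcessGroup

def get_or_create_pg(file_store_name: str, rank: int, world_size: int = 2):
    if file_store_name not in local_pgs:
        store = dist.FileStore(file_store_name, world_size)
        local_pgs[file_store_name] = dist.ProcessGroupGloo(store, rank, world_size)
    return local_pgs[file_store_name]
```

Every distinct actor pair adds an entry to the cache, which is presumably the scalability concern: N actors could end up holding O(N²) cached process groups.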