Conversation

@LucasLLC commented Sep 27, 2025

Evaluating the lift required to implement a transport protocol around PTD (PyTorch Distributed).

The main drawback is having to run a process group handshake on the first transport request between two actors. The resulting group can be cached for later requests (sketched below), but that may not scale well.
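A minimal sketch of that caching, with hypothetical names (_pg_cache, get_or_create_pg, and the fixed world size of 2 are assumptions, not this PR's code):

from datetime import timedelta

import torch.distributed as dist

_pg_cache: dict[str, dist.ProcessGroup] = {}

def get_or_create_pg(store_name: str, rank: int, world_size: int = 2) -> dist.ProcessGroup:
    # Pay for the rendezvous once per peer pair; later transfers reuse the group.
    if store_name not in _pg_cache:
        store = dist.FileStore(store_name, world_size)
        _pg_cache[store_name] = dist.ProcessGroupGloo(
            store, rank, world_size, timedelta(seconds=60)
        )
    return _pg_cache[store_name]

The scalability worry follows from the cache itself: it grows with every (client, storage volume) pair, and each entry holds live connections.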

Currently only basic tests pass, and the caching logic for process group creation is wrong.

Ignore the RDMA changes; I'm not planning to land this PR.

@meta-cla bot added the CLA Signed label Sep 27, 2025
return await self.store.get_meta(key, request)

@endpoint
async def handshake(self, file_store_name):

setup_comms

return
logger.info(f"Finalizing handshake from {file_store_name}")

file_store = torch.distributed.FileStore(file_store_name, 2)

TCPStore
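Presumably the comment suggests rendezvousing over a TCPStore rather than a FileStore, which drops the shared-filesystem requirement. A sketch under that assumption (the host/port wiring is invented; in practice it would travel in the handshake request):

from datetime import timedelta

import torch.distributed as dist

def make_store(host: str, port: int, rank: int) -> dist.TCPStore:
    # Rank 0 binds the port and acts as master; the peer connects to it.
    return dist.TCPStore(
        host,
        port,
        world_size=2,
        is_master=(rank == 0),
        timeout=timedelta(seconds=60),
    )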

# we allocate on the fly
tensor = await transport_buffer.read_into(tensor=None)

pg = self.pgs[transport_buffer.file_store_name]

from storage_volume
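For context, read_into(tensor=None) above implies the buffer can allocate its destination lazily. A sketch of that shape, assuming the buffer already carries dtype/shape metadata and a two-rank gloo group (all names here are hypothetical):

import torch

class SketchBuffer:
    def __init__(self, pg, shape, dtype, peer_rank):
        self.pg = pg
        self.shape = shape
        self.dtype = dtype
        self.peer_rank = peer_rank

    async def read_into(self, tensor=None):
        # Allocate on the fly when the caller supplies no destination.
        if tensor is None:
            tensor = torch.empty(self.shape, dtype=self.dtype)
        # ProcessGroup.recv takes a list of tensors, a source rank, and a tag.
        self.pg.recv([tensor], self.peer_rank, 0).wait()
        return tensor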


if request.tensor_slice is None:
    await transport_buffer.write_from(self.kv[key])
await transport_buffer.write_from(

move r inside

volume_coord = self.volume_id_to_coord[volume_id]
return self.storage_volumes.slice(**volume_coord)
storage_volume = self.storage_volumes.slice(**volume_coord)
storage_volume.volume_id = volume_id

🤔



import torch
from torchstore.utils import _gloo_factory

🤔


local_pgs = {}

class TorchDistributedBuffer(TransportBuffer):

🤔

# TODO: eventually this should be dependent on the connections available to a storage_volume

#TODO:
if True:

🤔
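The if True: is a stand-in for the TODO above. Choosing the buffer from what a storage volume actually supports might look like this sketch (the capabilities attribute and the RdmaBuffer name are assumptions):

def choose_buffer(volume) -> TransportBuffer:
    # Hypothetical: the volume advertises its transports, e.g. {"rdma", "gloo"}.
    if "rdma" in getattr(volume, "capabilities", set()):
        return RdmaBuffer()  # assumed name for an RDMA-backed buffer
    return TorchDistributedBuffer()  # the PTD-backed buffer from this PR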


return global_tensor

def _gloo_factory(

🤔
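The hunk cuts off at the signature. A plausible body, given the store-based setup elsewhere in the PR (the exact argument list is an assumption):

from datetime import timedelta

import torch.distributed as dist

def _gloo_factory(store: dist.Store, rank: int, world_size: int) -> dist.ProcessGroup:
    # Build a gloo group directly from a store, bypassing init_process_group
    # so any default process group is left untouched.
    return dist.ProcessGroupGloo(store, rank, world_size, timedelta(seconds=60))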

else:
    transport_buffer.allocate(request.tensor_val)

if isinstance(transport_buffer, TorchDistributedBuffer):

👎🏽

if transport_buffer.is_object:
    return transport_buffer.objects

if isinstance(transport_buffer, TorchDistributedBuffer):

👎🏽
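Both 👎🏽s point at the same smell: call sites special-casing TorchDistributedBuffer. One way to drop the isinstance branches is a hook on the base class that every call site invokes unconditionally; a sketch (finalize and _pending are invented names):

class TransportBuffer:
    def finalize(self) -> None:
        # Default transports have no post-transfer step.
        pass

class TorchDistributedBuffer(TransportBuffer):
    def __init__(self) -> None:
        self._pending = []  # outstanding gloo Work handles

    def finalize(self) -> None:
        # Hypothetical: drain the async sends/recvs this buffer issued.
        for work in self._pending:
            work.wait()
        self._pending.clear()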

@LucasLLC commented Oct 1, 2025

closed in favor of #44

@LucasLLC closed this Oct 1, 2025