Conversation

@LucasLLC commented Sep 27, 2025

Evaluating the lift required to implement a transport protocol around PTD (PyTorch Distributed).

The main drawback is having to run a process group handshake on the first transport request between two actors. The resulting group can be cached for later requests (sketched below), but that may not scale well.
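A minimal sketch of that caching, with hypothetical names (_pg_cache, get_or_create_pg, and the fixed world size of 2 are assumptions, not this PR's code):

from datetime import timedelta

import torch.distributed as dist

_pg_cache: dict[str, dist.ProcessGroup] = {}

def get_or_create_pg(store_name: str, rank: int, world_size: int = 2) -> dist.ProcessGroup:
    # Pay for the rendezvous once per peer pair; later transfers reuse the group.
    if store_name not in _pg_cache:
        store = dist.FileStore(store_name, world_size)
        _pg_cache[store_name] = dist.ProcessGroupGloo(
            store, rank, world_size, timedelta(seconds=60)
        )
    return _pg_cache[store_name]

The scalability worry follows from the cache itself: it grows with every (client, storage volume) pair, and each entry holds live connections.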

Currently only basic tests pass, and the caching logic for process group creation is wrong.

Ignore the RDMA changes; I'm not planning to land this PR.

@meta-cla bot added the CLA Signed label Sep 27, 2025
return await self.store.get_meta(key, request)

@endpoint
async def handshake(self, file_store_name):

setup_comms

return
logger.info(f"Finalizing handshake from {file_store_name}")

file_store = torch.distributed.FileStore(file_store_name, 2)

TCPStore
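Presumably the comment suggests rendezvousing over a TCPStore rather than a FileStore, which drops the shared-filesystem requirement. A sketch under that assumption (the host/port wiring is invented; in practice it would travel in the handshake request):

from datetime import timedelta

import torch.distributed as dist

def make_store(host: str, port: int, rank: int) -> dist.TCPStore:
    # Rank 0 binds the port and acts as master; the peer connects to it.
    return dist.TCPStore(
        host,
        port,
        world_size=2,
        is_master=(rank == 0),
        timeout=timedelta(seconds=60),
    )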

# we allocate on the fly
tensor = await transport_buffer.read_into(tensor=None)

pg = self.pgs[transport_buffer.file_store_name]

from storage_volume
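For context, read_into(tensor=None) above implies the buffer can allocate its destination lazily. A sketch of that shape, assuming the buffer already carries dtype/shape metadata and a two-rank gloo group (all names here are hypothetical):

import torch

class SketchBuffer:
    def __init__(self, pg, shape, dtype, peer_rank):
        self.pg = pg
        self.shape = shape
        self.dtype = dtype
        self.peer_rank = peer_rank

    async def read_into(self, tensor=None):
        # Allocate on the fly when the caller supplies no destination.
        if tensor is None:
            tensor = torch.empty(self.shape, dtype=self.dtype)
        # ProcessGroup.recv takes a list of tensors, a source rank, and a tag.
        self.pg.recv([tensor], self.peer_rank, 0).wait()
        return tensor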


if request.tensor_slice is None:
    await transport_buffer.write_from(self.kv[key])
await transport_buffer.write_from(

move r inside

volume_coord = self.volume_id_to_coord[volume_id]
return self.storage_volumes.slice(**volume_coord)
storage_volume = self.storage_volumes.slice(**volume_coord)
storage_volume.volume_id = volume_id

🤔



import torch
from torchstore.utils import _gloo_factory

🤔


local_pgs = {}

class TorchDistributedBuffer(TransportBuffer):

🤔

# TODO: eventually this should be dependent on the connections available to a storage_volume

#TODO:
if True:

🤔
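The if True: is a stand-in for the TODO above. Choosing the buffer from what a storage volume actually supports might look like this sketch (the capabilities attribute and the RdmaBuffer name are assumptions):

def choose_buffer(volume) -> TransportBuffer:
    # Hypothetical: the volume advertises its transports, e.g. {"rdma", "gloo"}.
    if "rdma" in getattr(volume, "capabilities", set()):
        return RdmaBuffer()  # assumed name for an RDMA-backed buffer
    return TorchDistributedBuffer()  # the PTD-backed buffer from this PR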


return global_tensor

def _gloo_factory(

🤔
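The hunk cuts off at the signature. A plausible body, given the store-based setup elsewhere in the PR (the exact argument list is an assumption):

from datetime import timedelta

import torch.distributed as dist

def _gloo_factory(store: dist.Store, rank: int, world_size: int) -> dist.ProcessGroup:
    # Build a gloo group directly from a store, bypassing init_process_group
    # so any default process group is left untouched.
    return dist.ProcessGroupGloo(store, rank, world_size, timedelta(seconds=60))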

else:
    transport_buffer.allocate(request.tensor_val)

if isinstance(transport_buffer, TorchDistributedBuffer):

👎🏽

if transport_buffer.is_object:
    return transport_buffer.objects

if isinstance(transport_buffer, TorchDistributedBuffer):

👎🏽
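Both 👎🏽s point at the same smell: call sites special-casing TorchDistributedBuffer. One way to drop the isinstance branches is a hook on the base class that every call site invokes unconditionally; a sketch (finalize and _pending are invented names):

class TransportBuffer:
    def finalize(self) -> None:
        # Default transports have no post-transfer step.
        pass

class TorchDistributedBuffer(TransportBuffer):
    def __init__(self) -> None:
        self._pending = []  # outstanding gloo Work handles

    def finalize(self) -> None:
        # Hypothetical: drain the async sends/recvs this buffer issued.
        for work in self._pending:
            work.wait()
        self._pending.clear()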

@LucasLLC commented Oct 1, 2025

closed in favor of #44

@LucasLLC closed this Oct 1, 2025