Conversation

LucasLLC (Contributor)

[WIP] PTD / Gloo transport implementation

@meta-cla bot added the CLA Signed label on Sep 29, 2025.

```python
latency_trcker.track_step("allocate")

# TODO: booooo
```
LucasLLC (author):

Not a huge fan of this, need to think of something better



state["transport_context"] = None
return state

async def setup_comms(self, storage_volume):


Sorry, it's not clear to me what the topology looks like for multiple replicas.

LucasLLC (author):

@JenniferWang each client/volume pair creates a 1:1 process group for send/recv, which is cached.
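
For illustration, a minimal sketch of that caching scheme. The names `_pg_cache` and `get_pair_group` are hypothetical, and it assumes direct `ProcessGroupGloo` construction rather than this PR's actual wiring:

```python
import torch.distributed as dist

# Hypothetical sketch of the scheme described above: one two-rank Gloo
# group per client/volume pair, created lazily and cached for reuse.
_pg_cache: dict[str, dist.ProcessGroup] = {}

def get_pair_group(pair_key: str, store: dist.Store, rank: int) -> dist.ProcessGroup:
    """Return the cached 1:1 group for this pair, creating it on first use."""
    if pair_key not in _pg_cache:
        # world_size is always 2: rank 0 = client, rank 1 = storage volume.
        _pg_cache[pair_key] = dist.ProcessGroupGloo(store, rank, 2)
    return _pg_cache[pair_key]
```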

LucasLLC (author):

For clarity: FileStore is getting replaced with TCPStore before merging.
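
For reference, the two stock PyTorch rendezvous stores differ mainly in what they require: FileStore needs a path visible to both ranks (a shared filesystem), while TCPStore only needs network reachability. A minimal sketch with a hypothetical path, host, and port, not this PR's configuration:

```python
import torch.distributed as dist

# FileStore: both ranks must see the same path (shared filesystem).
file_store = dist.FileStore("/tmp/ptd_rendezvous", 2)  # hypothetical path

# TCPStore: one rank hosts, the other connects; no shared storage needed.
tcp_store = dist.TCPStore("127.0.0.1", 29500, 2, is_master=True)  # hypothetical host/port
```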

@casteryh (Contributor) left a comment:

This is incredible! Huge congratulations for making this work!

Left some comments and questions.

```python
    self.objects = other_buffer.objects
    self.requires_meta = other_buffer.requires_meta

async def setup_comms(self, storage_volume) -> None:
```
Contributor:

Is this safe to call concurrently, and is it idempotent? Based on how you use it in the create-transport-buffer code, I assume it is more like `ensure_comms`?

LucasLLC (author):

Ah actually it's not safe or idempotent. It's also not safe to call concurrently from the same client/volume combo.

We may need a lock based on the client, wdyt?

Contributor:

> Ah actually it's not safe or idempotent. It's also not safe to call concurrently from the same client/volume combo.
>
> We may need a lock based on the client, wdyt?

Not in the scope of this PR, but we can add a TODO and file an issue just to keep track of this.
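
A possible shape for that follow-up, sketched below. The names `ensure_comms`, `_setup_locks`, and `_ready` are hypothetical, not code from this PR:

```python
import asyncio
from collections import defaultdict

# TODO(tracking issue): setup_comms is neither idempotent nor safe to
# call concurrently. Hypothetical mitigation: serialize setup per
# client/volume pair behind a lock and make the call ensure-style.
_setup_locks: defaultdict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
_ready: set[str] = set()

async def ensure_comms(pair_key: str, do_setup) -> None:
    async with _setup_locks[pair_key]:
        if pair_key not in _ready:  # idempotence: only the first caller runs setup
            await do_setup()
            _ready.add(pair_key)
```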


```python
transport_buffer = self.create_transport_buffer()

async def get_from_storage_volume(self, key, request: Request):
    latency_trcker = LatencyTracker(f"get_from_storage_volume:{key}")
```
Contributor:

nit: typo (`latency_trcker`)


```python
# TODO: re-evaluate this logic for better polymorphism
t = None
if transport_buffer.read_ahead:
```
Contributor:

Can you explain what `read_ahead` means?

LucasLLC (author):

This is a bit of a bummer, but for PTD-style comms I need to call `recv` before calling the storage volume equivalent (`send`), otherwise the code hangs.

I'd like to circle back here and figure out one path that works for all transport buffers, but I'm punting this down the road in the interest of getting something working.
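
Concretely, the constraint is an ordering one: the client has to post its recv before awaiting the storage volume's send, or both sides block. A hypothetical fragment of the PTD path, using the same `pg.send`/`pg.recv` call style as this PR's snippets:

```python
# Hypothetical sketch of the ordering constraint described above.
# Gloo send/recv is rendezvous-style, so the recv must already be in
# flight before we await the remote send, otherwise both sides hang.
fut = pg.recv([tensor], srcRank=remote_rank, tag=0)        # 1. post recv first
await storage_volume.get.call_one(key, transport_buffer)   # 2. remote side sends
fut.wait()                                                 # 3. recv completes
```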

```python
# TODO: consider placing the buffer inside the request or vice versa
transport_buffer.update(
    await self.storage_volume.get.call_one(
        key, transport_buffer, request.meta_only()
```
Contributor:

If the buffer is not `read_ahead`, then the storage_volume does the read/write, correct?

LucasLLC (author):

The storage volume always does the read/write; the only thing that changes is the order. In the non-PTD case we call `storage_volume.get` before we call `transport_buffer.read_into` in the client.

A bit messy -- open to suggestions here
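
So the client-side get flow differs only in ordering, roughly as in this hypothetical sketch of the branch described above (only `read_ahead`, `read_into`, and `storage_volume.get` come from the PR; the rest is placeholder):

```python
if transport_buffer.read_ahead:
    # PTD/Gloo path: post the local recv (read) before triggering the remote send.
    t = asyncio.create_task(transport_buffer.read_into(tensor))
    await storage_volume.get.call_one(key, transport_buffer, request)
    await t
else:
    # Non-PTD path: the storage volume produces the data first, then the client reads it.
    await storage_volume.get.call_one(key, transport_buffer, request)
    await transport_buffer.read_into(tensor)
```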

@casteryh (Contributor) left a comment:

Overall LGTM, except two remaining concerns:

  • I don't think a separate `finish()` method is necessary; it makes more sense to keep the original semantics of `read_into` and `write_from` by creating a background task to poll the PyTorch future. Am I thinking straight here?
  • Can we add a TODO and a tracking issue to document the unsafe behavior of `setup_comms`?


```python
assert self.fut is None
pg = self.transport_context[self.file_store_name]
self.fut = pg.send([tensor], dstRank=self.remote_rank, tag=0)
```
@casteryh commented on Oct 1, 2025:

> this is useful so we can do things concurrently while the future is pending (actually necessary so we can schedule the recv in the storage volume from the same thread)

From all the code that uses either `read_into` or `write_from`, I always see `buffer.finish()` follow immediately?

What I am saying is that a `torch.futures.Future` can be converted to an asyncio-style future simply by creating a polling task, so the additional `finish()` is not necessary.

Suggested change:

```diff
-self.fut = pg.send([tensor], dstRank=self.remote_rank, tag=0)
+self.fut = pg.send([tensor], dstRank=self.remote_rank, tag=0)
+
+async def poll_pt_future(fut):
+    while not fut.done():
+        await asyncio.sleep(0.01)  # or other poll frequency
+
+await asyncio.create_task(poll_pt_future(self.fut))
```

```python
for shard in self.kv[key].values():
    if shard["slice"] == request.tensor_slice:
        await transport_buffer.write_from(shard["tensor"])
        transport_buffer.finish()
```
Contributor:

Suggested change:

```diff
-        transport_buffer.finish()
+        await transport_buffer.finish()
```
