Adding torch accelerator to ddp-tutorial-series example #1376


Open · wants to merge 1 commit into main

Conversation

dggaytan
Contributor

Adding accelerator support to the DDP tutorial examples

Support for multiple accelerators:

  • Updated ddp_setup functions in multigpu.py, multigpu_torchrun.py, and multinode.py to use torch.accelerator for device management. The initialization of process groups now dynamically selects the backend based on the device type, with a fallback to CPU if no accelerator is available (a minimal sketch follows this list).
  • Modified Trainer classes in multigpu_torchrun.py and multinode.py to accept a device parameter and use it for model placement and snapshot loading.
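
A minimal sketch of the updated ddp_setup flow described above (illustration only; it assembles the pieces shown in the review below and assumes the PyTorch 2.7 torch.accelerator API):

import os
import torch
from torch.distributed import init_process_group

def ddp_setup(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"  # illustrative port
    if torch.accelerator.is_available():
        # Bind this process to its accelerator and pick the matching
        # backend (e.g. nccl for CUDA, xccl for XPU).
        torch.accelerator.set_device_index(rank)
        device = torch.device(f"{torch.accelerator.current_accelerator()}:{rank}")
        backend = torch.distributed.get_default_backend_for_device(device)
    else:
        # CPU fallback
        device = torch.device("cpu")
        backend = "gloo"
    init_process_group(backend=backend, rank=rank, world_size=world_size)
    return device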

Improvements to example execution:

  • Added run_example.sh to simplify running tutorial examples with configurable GPU counts and node settings (an example invocation follows this list).
  • Updated run_distributed_examples.sh to include a new function for running all DDP tutorial series examples.
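
For example, based on the run_example.sh snippet quoted later in this review, an invocation could look like this (script name and GPU count are placeholders):

bash run_example.sh multigpu_torchrun.py 4   # run multigpu_torchrun.py with 4 GPUs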

Dependency updates:

  • Increased the minimum PyTorch version requirement in requirements.txt to 2.7 to ensure compatibility with the new torch.accelerator API.

CC: @msaroufim @malfet @dvrogozh


netlify bot commented Jul 21, 2025

Deploy Preview for pytorch-examples-preview canceled.

🔨 Latest commit: 2ca1a5c
🔍 Latest deploy log: https://app.netlify.com/projects/pytorch-examples-preview/deploys/6893c439faa5d90008174f85

@meta-cla meta-cla bot added the cla signed label Jul 21, 2025
os.environ["MASTER_PORT"] = "12355"
torch.cuda.set_device(rank)
init_process_group(backend="nccl", rank=rank, world_size=world_size)
os.environ["MASTER_PORT"] = "12453"
Contributor

Why was the port changed?

Contributor Author
@dggaytan commented on Aug 6, 2025

It was an error on my side; I've changed the port.

if torch.accelerator.is_available():
device_type = torch.accelerator.current_accelerator()
torch.accelerator.set_device_idx(rank)
device: torch.device = torch.device(f"{device_type}:{rank}")
Contributor

Suggested change
device: torch.device = torch.device(f"{device_type}:{rank}")
device = torch.device(f"{device_type}:{rank}")

device_type = torch.accelerator.current_accelerator()
torch.accelerator.set_device_idx(rank)
device: torch.device = torch.device(f"{device_type}:{rank}")
torch.accelerator.device_index(rank)
Contributor

There is no such API device_index() in 2.7: https://docs.pytorch.org/docs/stable/accelerator.html

What is it doing? You already set the index two lines above...

Contributor

OK, device_index() will only appear in 2.8: https://docs.pytorch.org/docs/main/generated/torch.accelerator.device_index.html#torch.accelerator.device_index. And it is a context manager, i.e. you need to use it as with device_index(). I don't see why you are using it here. The recently merged #1375 attempts to do the same; I think it will need a fix as well.
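
For reference, a minimal sketch of the two usages discussed here (based on the linked torch.accelerator docs; the rank value is just illustrative):

import torch

rank = 0  # illustrative local rank

if torch.accelerator.is_available():
    # PyTorch >= 2.7: set the current device index for this process.
    torch.accelerator.set_device_index(rank)

    # PyTorch >= 2.8: device_index() is a context manager, so it is only
    # meaningful inside a `with` block that temporarily switches the device.
    with torch.accelerator.device_index(rank):
        x = torch.ones(1, device=torch.accelerator.current_accelerator())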

Contributor

It does not make sense to call a context manager without with. Did you intend to call set_device_index() instead?

Contributor Author

yes, I'm making the changes, thanks

Comment on lines 35 to 37

# torch.cuda.set_device(rank)
# init_process_group(backend="xccl", rank=rank, world_size=world_size)
Contributor

Remove comments:

Suggested change
# torch.cuda.set_device(rank)
# init_process_group(backend="xccl", rank=rank, world_size=world_size)

@@ -100,5 +115,6 @@ def main(rank: int, world_size: int, save_every: int, total_epochs: int, batch_s
parser.add_argument('--batch_size', default=32, type=int, help='Input batch size on each device (default: 32)')
args = parser.parse_args()

world_size = torch.cuda.device_count()
world_size = torch.accelerator.device_count()
print(world_size)
Contributor

Remove or convert to descriptive message:

Suggested change
print(world_size)
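
A descriptive variant (the exact wording is just a suggestion) could be:

print(f"Using {world_size} devices for training")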

Comment on lines 16 to 19
device_type = torch.accelerator.current_accelerator()
device: torch.device = torch.device(f"{device_type}:{rank}")
torch.accelerator.device_index(rank)
print(f"Running on rank {rank} on device {device}")
Contributor

I have a hard time understanding this code block; it does not make sense to me in multiple places. Why do you name the result of current_accelerator() device_type if you return it from ddp_setup() the same way you return device on the CPU path? Does ddp_setup() return different kinds of values? Next, something is happening with the rank, which is also not quite clear.

I think what you are trying to achieve is closer to this:

Suggested change
device_type = torch.accelerator.current_accelerator()
device: torch.device = torch.device(f"{device_type}:{rank}")
torch.accelerator.device_index(rank)
print(f"Running on rank {rank} on device {device}")
torch.accelerator.set_device_index(rank)
device = torch.accelerator.current_accelerator()
print(f"Running on rank {rank} on device {device}")

Contributor Author

Yes, so... there is a function in this file called _load_snapshot that loads the snapshot directly onto the device it is running on, and in my first tests it was not loading the snapshot at all, so I changed it to device_type so that only the XPU device type was used.

Now I've tested again with only the "device" variable and it worked; sorry for the maze 🤓

I'm updating it with your suggestion, thanks

print(f"Running on rank {rank} on device {device}")
backend = torch.distributed.get_default_backend_for_device(device)
torch.distributed.init_process_group(backend=backend)
return device_type
Contributor

and, corresponding to the above:

Suggested change
return device_type
return device

init_process_group(backend="nccl")
rank = int(os.environ["LOCAL_RANK"])
if torch.accelerator.is_available():
device_type = torch.accelerator.current_accelerator()
Contributor

same comments as above

Contributor

not addressed.

torch>=2.7
Contributor

Add a newline at the end of the file.

# example.py

echo "Launching ${1:-example.py} with ${2:-2} gpus"
torchrun --nnodes=1 --nproc_per_node=${2:-2} --rdzv_id=101 --rdzv_endpoint="localhost:5972" ${1:-example.py}
Contributor

Add a newline at the end of the file.

@dggaytan force-pushed the dggaytan/distributed_DDP branch from 2c0eb8f to 2ca1a5c on August 6, 2025 21:08
os.environ["MASTER_PORT"] = "12355"
torch.cuda.set_device(rank)
init_process_group(backend="nccl", rank=rank, world_size=world_size)
os.environ["MASTER_PORT"] = "12455"
Contributor

It's still a different port number.

device_type = torch.accelerator.current_accelerator()
torch.accelerator.set_device_idx(rank)
device: torch.device = torch.device(f"{device_type}:{rank}")
torch.accelerator.device_index(rank)
Contributor

It does not make sense to call a context manager without with. Did you intend to call set_device_index() instead?

if torch.accelerator.is_available():
device_type = torch.accelerator.current_accelerator()
device = torch.device(f"{device_type}:{rank}")
torch.accelerator.device_index(rank)
Contributor

same as above

@@ -22,6 +34,7 @@ def __init__(
optimizer: torch.optim.Optimizer,
save_every: int,
snapshot_path: str,
device
Contributor

It would be nice to have a type annotation here:

Suggested change
device
device: torch.device,

init_process_group(backend="nccl")
rank = int(os.environ["LOCAL_RANK"])
if torch.accelerator.is_available():
device_type = torch.accelerator.current_accelerator()
Contributor

not addressed.
