@msaroufim msaroufim commented Aug 25, 2025

These examples were crashing, even on older versions of torch. Both now run to completion:

(create) ➜  batch git:(main) ✗ python reinforce.py
(create) ➜  batch git:(main) ✗ python reinforce.py
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
Episode 0       Last reward: 22.00      Average reward: 11.00
Episode 1       Last reward: 45.00      Average reward: 28.00
Episode 2       Last reward: 11.00      Average reward: 19.50
Episode 3       Last reward: 33.00      Average reward: 26.25
Episode 4       Last reward: 10.00      Average reward: 18.12
Episode 5       Last reward: 41.00      Average reward: 29.56
Episode 6       Last reward: 18.00      Average reward: 23.78
Episode 7       Last reward: 18.00      Average reward: 20.89
Episode 8       Last reward: 60.00      Average reward: 40.45
Episode 9       Last reward: 55.00      Average reward: 47.72
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
Episode 0       Last reward: 22.00      Average reward: 11.00
Episode 1       Last reward: 45.00      Average reward: 28.00
Episode 2       Last reward: 11.00      Average reward: 19.50
Episode 3       Last reward: 33.00      Average reward: 26.25
Episode 4       Last reward: 10.00      Average reward: 18.12
Episode 5       Last reward: 41.00      Average reward: 29.56
Episode 6       Last reward: 18.00      Average reward: 23.78
Episode 7       Last reward: 18.00      Average reward: 20.89
Episode 8       Last reward: 60.00      Average reward: 40.45
Episode 9       Last reward: 55.00      Average reward: 47.72
2, 13.118742942810059, 13.514524221420288


(create) ➜  batch git:(main) ✗ python parameter_server.py
[Gloo] Rank 0 is connected to 5 peer ranks. Expected number of connected peer ranks is : 5
[Gloo] Rank 3 is connected to 5 peer ranks. Expected number of connected peer ranks is : 5
[Gloo] Rank 5 is connected to 5 peer ranks. Expected number of connected peer ranks is : 5
[Gloo] Rank 2 is connected to 5 peer ranks. Expected number of connected peer ranks is : 5
[Gloo] Rank 1 is connected to 5 peer ranks. Expected number of connected peer ranks is : 5
[Gloo] Rank 4 is connected to 5 peer ranks. Expected number of connected peer ranks is : 5
[Gloo] Rank 0 is connected to 5 peer ranks. Expected number of connected peer ranks is : 5
[Gloo] Rank 1 is connected to 5 peer ranks. Expected number of connected peer ranks is : 5
[Gloo] Rank 3 is connected to 5 peer ranks. Expected number of connected peer ranks is : 5
[Gloo] Rank 2 is connected to 5 peer ranks. Expected number of connected peer ranks is : 5
[Gloo] Rank 5 is connected to 5 peer ranks. Expected number of connected peer ranks is : 5
[Gloo] Rank 4 is connected to 5 peer ranks. Expected number of connected peer ranks is : 5
12:53:07 Start training
12:53:09 trainer5 processing one batch
12:53:09 trainer3 processing one batch
12:53:09 trainer1 processing one batch
12:53:09 trainer2 processing one batch
12:53:09 trainer4 processing one batch
12:53:10 trainer5 reporting grads
12:53:10 trainer1 reporting grads
12:53:11 PS got 0/5 updates
12:53:11 trainer3 reporting grads
12:53:11 trainer2 reporting grads
12:53:11 PS got 1/5 updates
12:53:11 PS got 2/5 updates
12:53:11 trainer4 reporting grads
12:53:11 PS got 3/5 updates
12:53:11 PS got 4/5 updates
12:53:11 PS updated model
12:53:11 trainer5 got updated model
12:53:11 trainer5 processing one batch
12:53:11 trainer3 got updated model
12:53:11 trainer3 processing one batch
12:53:11 trainer2 got updated model
12:53:11 trainer2 processing one batch
12:53:11 trainer5 reporting grads
12:53:11 trainer3 reporting grads
12:53:11 trainer2 reporting grads
12:53:11 trainer1 got updated model
12:53:11 trainer4 got updated model
12:53:11 trainer1 processing one batch
12:53:11 trainer4 processing one batch
12:53:11 trainer1 reporting grads
12:53:11 trainer4 reporting grads
12:53:11 PS got 0/5 updates
12:53:11 PS got 0/5 updates
12:53:11 PS got 1/5 updates
12:53:11 PS got 3/5 updates
12:53:11 PS got 4/5 updates
12:53:12 PS updated model
12:53:12 trainer3 got updated model
12:53:12 trainer3 processing one batch
12:53:12 trainer3 reporting grads
12:53:12 trainer2 got updated model
12:53:12 trainer5 got updated model
12:53:12 trainer5 processing one batch
12:53:12 trainer2 processing one batch
12:53:12 trainer4 got updated model
12:53:12 trainer4 processing one batch
12:53:12 trainer2 reporting grads
12:53:12 trainer5 reporting grads
12:53:12 trainer1 got updated model
12:53:12 trainer4 reporting grads
12:53:12 trainer1 processing one batch
12:53:12 trainer1 reporting grads
12:53:12 PS got 0/5 updates
12:53:12 PS got 1/5 updates
12:53:12 PS got 1/5 updates
12:53:12 PS got 2/5 updates
12:53:12 PS got 3/5 updates
12:53:12 PS updated model
12:53:12 trainer3 got updated model
12:53:12 trainer3 processing one batch
12:53:13 trainer3 reporting grads
12:53:13 PS got 0/5 updates
12:53:13 trainer1 got updated model
12:53:13 trainer2 got updated model
12:53:13 trainer1 processing one batch
12:53:13 trainer2 processing one batch
12:53:13 trainer1 reporting grads
12:53:13 trainer4 got updated model
12:53:13 trainer2 reporting grads
12:53:13 trainer4 processing one batch
12:53:13 trainer5 got updated model
12:53:13 trainer5 processing one batch
12:53:13 trainer4 reporting grads
12:53:13 trainer5 reporting grads
12:53:13 PS got 1/5 updates
12:53:13 PS got 1/5 updates
12:53:13 PS got 2/5 updates
12:53:13 PS got 4/5 updates
12:53:13 PS updated model
12:53:13 trainer3 got updated model
12:53:13 trainer3 processing one batch
12:53:13 trainer3 reporting grads
12:53:13 PS got 0/5 updates
12:53:13 trainer1 got updated model
12:53:13 trainer1 processing one batch
12:53:13 trainer1 reporting grads
12:53:13 trainer5 got updated model
12:53:13 trainer2 got updated model
12:53:13 trainer5 processing one batch
12:53:13 trainer2 processing one batch
12:53:13 trainer4 got updated model
12:53:13 trainer4 processing one batch
12:53:13 trainer4 reporting grads
12:53:13 trainer2 reporting grads
12:53:13 trainer5 reporting grads
12:53:13 PS got 1/5 updates
12:53:13 PS got 2/5 updates
12:53:13 PS got 2/5 updates
12:53:13 PS got 4/5 updates
12:53:13 PS updated model
12:53:14 trainer3 got updated model
12:53:14 trainer3 processing one batch
12:53:14 trainer3 reporting grads
12:53:14 PS got 0/5 updates
12:53:14 trainer1 got updated model
12:53:14 trainer2 got updated model
12:53:14 trainer1 processing one batch
12:53:14 trainer2 processing one batch
12:53:14 trainer5 got updated model
12:53:14 trainer5 processing one batch
12:53:14 trainer1 reporting grads
12:53:14 trainer2 reporting grads
12:53:14 trainer5 reporting grads
12:53:14 trainer4 got updated model
12:53:14 trainer4 processing one batch
12:53:14 trainer4 reporting grads
12:53:14 PS got 1/5 updates
12:53:14 PS got 1/5 updates
12:53:14 PS got 1/5 updates
12:53:14 PS got 2/5 updates
12:53:14 PS updated model
12:53:14 trainer3 got updated model
12:53:14 trainer5 got updated model
12:53:14 trainer2 got updated model
12:53:14 trainer1 got updated model
12:53:14 trainer4 got updated model
12:53:14 Finish training
(create) ➜  batch git:(main) ✗
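
For readers unfamiliar with the pattern behind the `parameter_server.py` log above (trainers report grads, the PS counts updates up to 5, then every trainer gets the updated model), here is a minimal sketch of that batch-update flow. It is not the code from `parameter_server.py`: it assumes plain threads and standard-library futures in place of RPC, a single scalar as the "model", and hypothetical names (`BatchUpdateParameterServer`, `update_and_fetch_model`, `demo`) chosen only for illustration.

```python
import threading
from concurrent.futures import Future


class BatchUpdateParameterServer:
    """Toy parameter server: waits for one gradient per trainer,
    applies the averaged update once, then releases the new model to all."""

    def __init__(self, batch_update_size, lr=0.1):
        self.batch_update_size = batch_update_size
        self.lr = lr
        self.weight = 0.0      # assumption: the "model" is a single scalar
        self.grad_sum = 0.0
        self.pending = 0
        self.lock = threading.Lock()
        self.future_model = Future()

    def update_and_fetch_model(self, grad):
        # Every caller in the same round gets the same future back.
        with self.lock:
            fut = self.future_model
            self.grad_sum += grad
            self.pending += 1
            if self.pending >= self.batch_update_size:
                # Last gradient of the round: update, reset, and wake everyone.
                self.weight -= self.lr * self.grad_sum / self.batch_update_size
                self.grad_sum = 0.0
                self.pending = 0
                self.future_model = Future()
                fut.set_result(self.weight)
        return fut


def run_trainer(ps, grad, results, idx):
    # Blocks until all trainers in the round have reported their gradient.
    results[idx] = ps.update_and_fetch_model(grad).result()


def demo(num_trainers=5):
    ps = BatchUpdateParameterServer(num_trainers)
    results = [None] * num_trainers
    threads = [
        threading.Thread(target=run_trainer, args=(ps, float(i + 1), results, i))
        for i in range(num_trainers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(demo())
```

Returning one shared future per round is what lets a trainer block until the whole batch of gradients has arrived, which matches the "PS updated model" line being followed by a burst of "got updated model" lines in the log.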

@meta-cla meta-cla bot added the cla signed label Aug 25, 2025

netlify bot commented Aug 25, 2025

Deploy Preview for pytorch-examples-preview canceled.

🔨 Latest commit: 3b1a2d1
🔍 Latest deploy log: https://app.netlify.com/projects/pytorch-examples-preview/deploys/68acbfcc523802000826142c

@msaroufim msaroufim changed the title More RPC examples Unbreak torch/distributed/rpc/batch Aug 25, 2025
@msaroufim msaroufim merged commit 7fce8bb into main Aug 25, 2025
8 checks passed