Distributed learning time per epoch comparison on two and four nodes #9201

Cow-Kite · 2024-04-15T13:10:05Z

Cow-Kite
Apr 15, 2024

Hello. I'm currently running pyg's distributed learning example code. (github: /examples/distributed/pyg/node_ogb_cpu.py)
Distributed learning is performed by dividing the graph into two and four partitions, respectively.

Through experiments, we found that the learning time per epoch on four nodes was more than twice that of the time on two nodes.
In theory, the epoch times should be the same, but why does this happen?

The number of epochs, mini-batch size, and all variables for distributed learning are the same.
We only tested with two and four nodes.

This is result on two node

This is result on four node

Thank you very much!

kgajdamo · 2024-04-16T11:18:37Z

kgajdamo
Apr 16, 2024
Collaborator

@Cow-Kite Hmm, it should not happen. Which dataset do you use ogbn-products or ogbn-mag? Did you turned on multithreading? Unfortunately, I don't know the details of your test environment. Maybe there is a communication delay between the machines?

5 replies

Cow-Kite Apr 16, 2024
Author

The dataset used ogbn-products, and two sampler processes and one trainer process were created for each node. The batch size was set to 1,024 and the epoch was set to 10. I used NFS for distributed training. Could this have anything to do with network latency?

kgajdamo Apr 17, 2024
Collaborator

Does this mean that you have all partitions saved on a shared folder and all machines get data from that place? If so, try sending each partition to its corresponding machine. If the data is local, the data access time should be shorter. Let me know whether it helped.

Cow-Kite Apr 18, 2024
Author

Unfortunately, both the methods of storing data on each node and organizing shared folders using NFS take a long time.
Is it true that the cause of this phenomenon is the increased communication cost in the process of RPC communication to create a mini-batch?

kgajdamo Apr 18, 2024
Collaborator

Please see Jakub's comment

Cow-Kite Apr 18, 2024
Author

Thank you very much!

jjpietrak · 2024-04-16T12:10:59Z

jjpietrak
Apr 16, 2024
Collaborator

You might want to check the output of ping <IP> between all 4 nodes to see if there's no network significant delay between the additional 2 nodes. Are all of these 4 nodes registered on the same local network? Also, for you master node it's the best to choose a node with high network bandwidth.

5 replies

Cow-Kite Apr 18, 2024
Author

As a result of checking the network delay from the master node to the worker node, we confirmed that the times for the three worker nodes were the same.

This is the ping result of the worker1 node from the master node

This is the ping result of the worker2 node from the master node

This is the ping result of the worker3 node from the master node

jjpietrak Apr 19, 2024
Collaborator

How about HW config of these 4 nodes? Are they all the same, wrt. number of cores, RAM config?
You could also try increasing the RPC concurrency (--concurrency) to at least 10 threads. I believe that the increased batch processing time is caused by RPC comm delay, by increasing concurrency you should be able to run more requests in parallel.

Cow-Kite Apr 22, 2024
Author

The configuration of the four nodes is the same 4 nodes.
The number of cores is 40, and the RAM is 128GB for three servers and 112GB for the other.

Even when I experimented by increasing the RPC concurrency (--concurrency) from 4 to 10, the results were the same.

kgajdamo Apr 22, 2024
Collaborator

@Cow-Kite In my free time I will try to run a distributed training with the same configuration and verify it.

Cow-Kite Apr 22, 2024
Author

Thank you for your help!!

Distributed learning time per epoch comparison on two and four nodes #9201

Uh oh!

Cow-Kite Apr 15, 2024

This is result on two node

This is result on four node

Replies: 2 comments · 10 replies

Uh oh!

kgajdamo Apr 16, 2024 Collaborator

Uh oh!

Cow-Kite Apr 16, 2024 Author

Uh oh!

kgajdamo Apr 17, 2024 Collaborator

Uh oh!

Cow-Kite Apr 18, 2024 Author

Uh oh!

kgajdamo Apr 18, 2024 Collaborator

Uh oh!

Cow-Kite Apr 18, 2024 Author

Uh oh!

jjpietrak Apr 16, 2024 Collaborator

Uh oh!

Cow-Kite Apr 18, 2024 Author

Uh oh!

jjpietrak Apr 19, 2024 Collaborator

Uh oh!

Cow-Kite Apr 22, 2024 Author

Uh oh!

kgajdamo Apr 22, 2024 Collaborator

Uh oh!

Cow-Kite Apr 22, 2024 Author

Cow-Kite
Apr 15, 2024

Replies: 2 comments 10 replies

kgajdamo
Apr 16, 2024
Collaborator

Cow-Kite Apr 16, 2024
Author

kgajdamo Apr 17, 2024
Collaborator

Cow-Kite Apr 18, 2024
Author

kgajdamo Apr 18, 2024
Collaborator

Cow-Kite Apr 18, 2024
Author

jjpietrak
Apr 16, 2024
Collaborator

Cow-Kite Apr 18, 2024
Author

jjpietrak Apr 19, 2024
Collaborator

Cow-Kite Apr 22, 2024
Author

kgajdamo Apr 22, 2024
Collaborator

Cow-Kite Apr 22, 2024
Author