How to improve the performance of multi-gpu #1690

phyoung123 · 2022-05-08T08:45:48Z

phyoung123
May 8, 2022

deepmd-kit version 2.1.1

First of all, thank you for taking the time to read my question. I have 4 GPU cards in one node and 2 CPUs with 12 cores each. I want to use this code to train a model, but I don't think my computing resources are being fully utilized: hardware usage is low and training takes about 10 s per 100 steps, which I find very inefficient and unacceptable.

I tried changing batch_size to raise the utilization, but got this error:
"batch_size" is not defined in the strict model
So I had to comment it out to keep the code running.
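
One possible cause, offered as an assumption rather than a confirmed diagnosis from this thread: in the 2.x input format, `batch_size` is expected under `training.training_data`, and the strict argument check rejects the key when it sits elsewhere (for example directly under `training`). A minimal Python sketch of that layout follows; the system path and the helper name `make_training_section` are hypothetical.

```python
import json


def make_training_section(systems, batch_size):
    """Build a partial input dict with batch_size nested under training_data.

    Assumption: this mirrors the DeePMD-kit 2.x input layout; check it
    against the documentation for your installed version.
    """
    return {
        "training": {
            "training_data": {
                "systems": list(systems),
                "batch_size": batch_size,
            }
        }
    }


# "./data/system_0" is a placeholder path, not taken from the thread.
config = make_training_section(["./data/system_0"], batch_size=4)
print(json.dumps(config, indent=2))
```

The printed fragment would be merged into the rest of `input.json`; the other required sections (model, learning rate, loss) are omitted here.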

I have also set the environment variables as follows:
export OMP_NUM_THREADS=24; export TF_INTRA_OP_PARALLELISM_THREADS=12; export TF_INTER_OP_PARALLELISM_THREADS=2; export CUDA_VISIBLE_DEVICES=0,1,2,3
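
When one training process is launched per GPU, each process reads these variables independently, so `OMP_NUM_THREADS=24` in four processes asks for more threads than the 24 cores can serve. A rough sketch of an even split is below. The heuristic (cores divided by workers, then intra-op threads set to that share divided by the inter-op count) is an assumption for illustration, not a rule stated in this thread, and the function name is hypothetical.

```python
def threads_per_worker(total_cores, n_workers, inter_op=2):
    """Split CPU cores evenly across workers (heuristic, not an official rule)."""
    if total_cores < n_workers:
        raise ValueError("fewer cores than workers")
    cores = total_cores // n_workers
    return {
        "OMP_NUM_THREADS": cores,
        "TF_INTRA_OP_PARALLELISM_THREADS": max(1, cores // inter_op),
        "TF_INTER_OP_PARALLELISM_THREADS": inter_op,
    }


# 2 CPUs x 12 cores and 4 GPUs, as described in the question.
settings = threads_per_worker(total_cores=24, n_workers=4)
for name, value in settings.items():
    print(f"export {name}={value}")
```

The best values depend on the model and data pipeline, so treat the result as a starting point for timing experiments.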

So my question is: how can I improve the performance of the GPU cards with my existing hardware configuration? Thank you for your reply.

Lewis-YL · 2022-05-08T10:20:04Z

Lewis-YL
May 8, 2022

It is similar in my case. The GPU power usage is only 40-50%.

When the power usage is low, there is some room for optimization as mentioned in this post:
https://superuser.com/questions/1013001/relation-between-gpu-utilization-and-graphic-cards-power-consumption

I think the optimization may take months.

0 replies

njzjz · 2022-05-09T19:13:09Z

njzjz
May 9, 2022
Maintainer

For the usage of multi GPUs, refer to https://docs.deepmodeling.com/projects/deepmd/en/master/train/parallel-training.html.
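
As a sketch of what the linked page describes: parallel training is launched with Horovod, one process per GPU. The snippet below only assembles the command and does not run it. It assumes Horovod is installed alongside DeePMD-kit and that the input file is named `input.json`; the helper name `build_launch` is hypothetical.

```python
import shlex


def build_launch(n_gpus, input_file="input.json"):
    """Return (env, argv) for a one-process-per-GPU Horovod launch."""
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in range(n_gpus))}
    argv = [
        "horovodrun", "-np", str(n_gpus),
        "dp", "train", "--mpi-log=workers", input_file,
    ]
    return env, argv


env, argv = build_launch(4)
prefix = " ".join(f"{key}={value}" for key, value in env.items())
print(prefix, shlex.join(argv))
```

Check the command against the documentation for the installed version before using it.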

0 replies

AnguseZhang · 2022-05-11T18:23:21Z

AnguseZhang
May 11, 2022
Maintainer

"it cost about 10s per 100 steps" What type of GPU cards do you have? Also, what are your training parameters? The speed seems slower than expected.

2 replies

Lewis-YL May 12, 2022

What speed do you expect on a GPU? Is there a benchmark for GPU training?
I used an RTX 3080Ti and an i9-12900K.
In my case, it takes 8 s to train 100 steps on the 3080Ti and 24 s to train 100 steps on the 12900K.

phyoung123 May 12, 2022
Author

I don't think there is a benchmark; it depends on what speed you can accept. I can't accept 10 s per 100 steps because I have a very wide temperature-pressure range to iterate over.

phyoung123 · 2022-05-12T00:30:34Z

phyoung123
May 12, 2022
Author

Thanks to all of you. I think the slow speed is probably because my GPU cards are outdated: I used four 1080Ti cards for training. When I run it on a Tesla V100, it takes only about 2 seconds to train 100 steps. Thanks again for your help.
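
To put the timings quoted in this thread side by side, here is a small arithmetic sketch. The per-100-step numbers are the ones reported above. They come from different users, models, and datasets, so the ratios are only indicative. The total of 1,000,000 steps is a hypothetical value chosen for illustration.

```python
# Seconds per 100 training steps, as reported in this thread.
timings = {
    "4x 1080Ti": 10.0,
    "RTX 3080Ti": 8.0,
    "i9-12900K (CPU)": 24.0,
    "Tesla V100": 2.0,
}


def hours_for(total_steps, sec_per_100):
    """Wall-clock hours for total_steps at a given time per 100 steps."""
    return total_steps / 100 * sec_per_100 / 3600


TOTAL_STEPS = 1_000_000  # hypothetical run length, not from the thread

for name, sec in timings.items():
    print(f"{name}: {sec / 100:.2f} s/step, "
          f"{hours_for(TOTAL_STEPS, sec):.1f} h for {TOTAL_STEPS} steps")

speedup = timings["4x 1080Ti"] / timings["Tesla V100"]
print(f"V100 vs 1080Ti setup: {speedup:.1f}x faster")
```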

0 replies
