Parallelism cannot be tuned by changing environment variables (TF_INTRA_OP_PARALLELISM_THREADS, OMP_NUM_THREADS) #2789
-
I'm trying to train a 4-element deep potential on a workstation equipped with 4 RTX 3080Ti cards and one i9-10900X CPU. The CPU has one socket and 10 cores. When only one GPU card is used, the training speed is around
Here two GPU cards are used, and the
However, the record.txt reads:
It seems that no matter how the environment variables are set, the parallelism settings do not change.
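For reference, a minimal sketch of the kind of setup in question, exporting the thread-count variables before launching a two-card run (the thread counts, device list, and input.json name are illustrative placeholders, not the actual values from the run above; the launch command follows the parallel-training documentation linked in the reply below):

```sh
# Sketch only: illustrative thread counts and device list
export OMP_NUM_THREADS=5                    # OpenMP threads per process
export TF_INTRA_OP_PARALLELISM_THREADS=5    # TensorFlow intra-op thread pool
export TF_INTER_OP_PARALLELISM_THREADS=2    # TensorFlow inter-op thread pool

# Two-card data-parallel training via Horovod (input.json is a placeholder name)
CUDA_VISIBLE_DEVICES=0,1 horovodrun -np 2 dp train input.json
```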
Replies: 2 comments
-
I'm totally new to parallel training in DP; I think I misunderstood the mechanism of parallel training. According to the manual (https://docs.deepmodeling.com/projects/deepmd/en/master/train/parallel-training.html#tuning-learning-rate), I should manually decrease "numb_steps". For example, I used to train the model for 8,000,000 steps. If I use two cards, should I then set numb_steps to 8,000,000 / 2 = 4,000,000 to achieve accuracy similar to the one-card run? Am I correct?
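For concreteness, the change being asked about would be a single key in the training section of input.json, e.g. for the two-card case (a fragment only; every other key in the input file is assumed to stay as it was):

```json
{
    "training": {
        "numb_steps": 4000000,
        "_comment": "8,000,000 steps on one card, halved to 4,000,000 on two cards"
    }
}
```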
-
Parallel training is equivalent to increasing the batch size.
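Put differently: in data-parallel training each global step consumes one mini-batch per card, so with N cards the effective batch size is roughly N times the single-card batch size. That is why, as discussed above, numb_steps can be reduced by about the same factor (e.g. 8,000,000 steps on one card vs. 4,000,000 on two) while the model still sees roughly the same total number of training samples.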