@shishaochen Could you please take a look? Thanks!
As for the Horovod issue, that behavior should come from Horovod itself, not from deepmd-kit.
Hello,
I hope you are doing well. Attached is the timeline.json file for a training run on a single node with 16 workers and a local batch size of 32. TF_Intra_Op is set to 12 and TF_Inter_Op is set to 4, with OpenMP=3. I expected to see 4 compute threads, reflecting the TF_Inter_Op setting.
The strange part of this run is on threads 4-9, where so much time is spent on HorovodAllReduce. Is this behavior something you have noticed as well?
In addition, the JSON script has options for "enable_profiler" and "profiling", which give differing outputs in the TensorBoard trace. May I also ask what the difference between these two options is?
Thank you very much for your time!
Warm Regards
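
To put a number on how much time each thread spends in HorovodAllReduce, the trace can be summed per thread. This is only a minimal sketch: it assumes timeline.json follows the Chrome trace event format (a top-level "traceEvents" list whose complete events have "ph" equal to "X" and carry "name", "pid", "tid", and "dur" in microseconds). The helper name `time_per_thread` and the sample values are hypothetical, and the real file may label the op differently.

```python
import json
from collections import defaultdict


def time_per_thread(trace, name_substring):
    """Sum durations (microseconds) of matching complete events per (pid, tid)."""
    needle = name_substring.lower()
    totals = defaultdict(int)
    for event in trace.get("traceEvents", []):
        # "X" marks a complete event that carries its own duration in "dur".
        if event.get("ph") != "X":
            continue
        # Case-insensitive match, since the op may be spelled with a different casing.
        if needle in event.get("name", "").lower():
            key = (event.get("pid"), event.get("tid"))
            totals[key] += event.get("dur", 0)
    return dict(totals)


# Tiny hand-written trace standing in for a real timeline.json (hypothetical values).
sample_trace = {
    "traceEvents": [
        {"ph": "X", "name": "HorovodAllreduce", "pid": 1, "tid": 4, "ts": 0, "dur": 500},
        {"ph": "X", "name": "HorovodAllreduce", "pid": 1, "tid": 4, "ts": 600, "dur": 250},
        {"ph": "X", "name": "MatMul", "pid": 1, "tid": 0, "ts": 0, "dur": 100},
        {"ph": "M", "name": "thread_name", "pid": 1, "tid": 4, "args": {"name": "worker"}},
    ]
}

totals = time_per_thread(sample_trace, "HorovodAllReduce")
print(totals)

# For a real trace, load the file first:
#     with open("timeline.json") as f:
#         totals = time_per_thread(json.load(f), "HorovodAllReduce")
```

Comparing these totals against the sums for compute ops on the same threads would show whether the all-reduce time is communication work or mostly waiting on slower workers.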
