Conversation
yhtang
left a comment
Thanks for making this! Made some comments. Let me know what you think.
yhtang
left a comment
Now that we have a working example, could we integrate this into jio.yaml?
Also, could we add a performance-monitoring step to this job so that if throughput drops below a certain baseline, the job reports a failure?
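A minimal sketch of what such a check could look like, assuming the benchmark emits a single throughput figure in GB/s. The function names, the 10% tolerance, and the exit-code convention are hypothetical and not taken from this PR or from jio.yaml:

```python
import sys


def check_throughput(measured_gbps, baseline_gbps, tolerance=0.1):
    """Return True if measured throughput is no more than `tolerance` below baseline."""
    if baseline_gbps <= 0:
        raise ValueError("baseline_gbps must be positive")
    # Lowest acceptable throughput, e.g. 90% of baseline for tolerance=0.1.
    threshold = baseline_gbps * (1.0 - tolerance)
    return measured_gbps >= threshold


def main(argv):
    """Hypothetical CLI: <script> <measured_gbps> <baseline_gbps>.

    Returns 0 on pass, 1 on a throughput regression, 2 on bad usage.
    """
    if len(argv) != 3:
        print("usage: <script> <measured_gbps> <baseline_gbps>", file=sys.stderr)
        return 2
    measured, baseline = float(argv[1]), float(argv[2])
    if check_throughput(measured, baseline):
        print(f"OK: {measured} GB/s vs. baseline {baseline} GB/s")
        return 0
    print(f"FAIL: {measured} GB/s is below baseline {baseline} GB/s", file=sys.stderr)
    return 1
```

A job step could call `sys.exit(main(sys.argv))` so that a non-zero exit code marks the job as failed. The baseline itself would need to come from measured numbers on the target cluster.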
hey @yhtang
We can definitely put together a fully working example. One caveat: we're still investigating why NCCL doesn't pick up EFA on EKS. If it's OK with you, we can start with this approach; performance will be low for now, and I'll send you some numbers by EOW at the latest.
@yhtang I've shared the EKS performance numbers with you.
yhtang
left a comment
This looks great. Thank you for the hard work!
No description provided.