Hello @lee218llnl,
I have been a happy STAT user for a long time and have used it for deadlock detection at scale on many systems.
I am now working with the AI/ML stack and wonder whether STAT would also be useful for distributed training workloads.
In particular, I am curious about frameworks like PyTorch, where NCCL is the default backend. Do you have any experience or suggestions here?
Also, since I don't see many updates on the STAT repo, I was wondering whether there are other efforts ongoing or alternative tools being developed at LLNL (or elsewhere).