Hi @insujang,
Thank you for open-sourcing Oobleck. It's an impressive piece of work!
I noticed in the paper that there is a parameter f that controls the fault tolerance threshold. However, I couldn’t find it in the codebase. Is there a way to configure or control this parameter? Additionally, is there a default value set for f?
My second question is about ZeRO-3, which I noticed is being used. With ZeRO-3, each GPU should hold a unique shard of the model and optimizer states (as in standard ZeRO-3). If a node fails, the ZeRO-3 shard held on that node would be lost as well. How is that shard recovered? If my understanding is incorrect, please feel free to point it out.
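To make the concern concrete, here is a minimal Python sketch of what I mean. It is not Oobleck or DeepSpeed code: the helper names `shard_ranges` and `lost_elements` are hypothetical, and it assumes a flat state buffer split into contiguous shards with no replication across ranks.

```python
from __future__ import annotations


def shard_ranges(num_elements: int, world_size: int) -> list[range]:
    """Split a flat buffer into contiguous, non-overlapping shards, one per rank.

    Assumption: each element is owned by exactly one rank, as in a
    fully sharded layout with no replication.
    """
    base, extra = divmod(num_elements, world_size)
    ranges, start = [], 0
    for rank in range(world_size):
        # The first `extra` ranks take one additional element each.
        size = base + (1 if rank < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def lost_elements(shards: list[range], failed_ranks: set[int]) -> list[int]:
    """Return the element indices whose only copy lived on a failed rank."""
    return [i for rank in sorted(failed_ranks) for i in shards[rank]]


shards = shard_ranges(num_elements=10, world_size=4)
print(shards)
print(lost_elements(shards, {2}))
```

Under these assumptions, whatever rank 2 owned exists nowhere else once it fails, which is the case I am asking about.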
Looking forward to your response. Thanks again for your contributions!