Open
Labels
3rd party: Related to a 3rd-party
feature: Is an improvement or enhancement
Description & Motivation
https://github.com/pytorch/torchft
https://pytorch.org/blog/fault-tolerant-llama-training-with-2000-synthetic-failures-every-15-seconds-and-no-checkpoints-on-crusoe-l40s/
I would like to be able to use torchft to improve fault tolerance in my PyTorch Lightning training jobs.
Pitch
No response
Alternatives
No response
Additional context
No response