
Improve Fault Tolerance via TorchFT #20967

@CandiedCode

Description & Motivation

https://github.com/pytorch/torchft
https://pytorch.org/blog/fault-tolerant-llama-training-with-2000-synthetic-failures-every-15-seconds-and-no-checkpoints-on-crusoe-l40s/

I would like to be able to use torchft to improve fault tolerance in PyTorch Lightning.

Pitch

No response

Alternatives

No response

Additional context

No response

cc @lantiga @Borda

Metadata

Labels

3rd party (related to a 3rd-party), feature (is an improvement or enhancement)
