Commit 1c196da

Update fault_tolerant_training_basic.rst (#16012)
1 parent 186b799 commit 1c196da

2 files changed: 0 additions and 34 deletions

docs/source-pytorch/clouds/fault_tolerant_training_basic.rst

Lines changed: 0 additions & 23 deletions
@@ -18,26 +18,3 @@ With Fault Tolerant Training, when ``Trainer.fit()`` fails in the middle of an e
 Lightning will restart exactly where it failed, and everything will be restored (down to the batch it was on even if the dataset was shuffled).

 .. warning:: Fault-tolerant Training is currently an experimental feature within Lightning.
-
------
-
-****************************************************
-Use fault-tolerance to save money on cloud training
-****************************************************
-Cloud providers offer pre-emptible machines which can be priced as low as 1/10th the cost but can be shut-down automatically at any time.
-Because fault-tolerant training can automatically recover from an interruption, you can train models for many weeks/months at a time for the pre-emptible prices.
-
-To easily run on the cloud with fault-tolerance with lightning-grid, use the following arguments:
-
-.. code-block:: bash
-
-    grid run --use_spot --auto_resume lightning_script.py
-
-The ``--use_spot`` argument enables cheap preemptible pricing (but the machines that can be interrupted).
-If the machine is interrupted, the ``--auto_resume`` argument automatically restarts the machine.
-
-As long as you are running a script that runs a lightning model, the model will restore itself and handle all the details of fault tolerance.
-
------
-
-.. include:: grid_costs.rst

docs/source-pytorch/clouds/fault_tolerant_training_expert.rst

Lines changed: 0 additions & 11 deletions
@@ -21,14 +21,3 @@ To enable fault tolerance on your own cloud or cluster environment enable the *P
 Although Lighting will now be fault-tolerant, you'll have to handle all the nuances of making sure the models are automatically restarted.

 .. note:: This complexity is already handled for you if you use **lightning-grid**.
-
------
-
-***************************************************
-Enable fault-tolerant behavior on your own cluster
-***************************************************
-The simplest way to enable fault-tolerant behavior is to enable lightning-grid to work on your on-prem cluster or cloud environment which will handle all the nuances of fault-tolerant training at scale.
-
-Email us to connect with your own cloud account:
-
-
