:orphan:

################
Fabric Arguments
################


accelerator
===========

Choose one of ``"cpu"``, ``"gpu"``, ``"tpu"``, ``"auto"`` (IPU support is coming soon).

.. code-block:: python

    from lightning.fabric import Fabric

    # CPU accelerator
    fabric = Fabric(accelerator="cpu")

    # Running with GPU Accelerator using 2 GPUs
    fabric = Fabric(devices=2, accelerator="gpu")

    # Running with TPU Accelerator using 8 TPU cores
    fabric = Fabric(devices=8, accelerator="tpu")

    # Running with GPU Accelerator using the DistributedDataParallel strategy
    fabric = Fabric(devices=4, accelerator="gpu", strategy="ddp")

The ``"auto"`` option recognizes the machine you are on and selects the available accelerator.

.. code-block:: python

    # If your machine has GPUs, it will use the GPU Accelerator
    fabric = Fabric(devices=2, accelerator="auto")

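With ``"auto"``, you can check which device each process was assigned by inspecting the Fabric object after launching.
A minimal sketch (the printed output is illustrative only):

.. code-block:: python

    fabric = Fabric(accelerator="auto", devices=1)
    fabric.launch()

    # The root device of the current process, e.g., cuda:0 on a GPU machine or cpu otherwise
    print(fabric.device)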
|
strategy
========

Choose a training strategy: ``"dp"``, ``"ddp"``, ``"ddp_spawn"``, ``"tpu_spawn"``, ``"deepspeed"``, ``"ddp_sharded"``, or ``"ddp_sharded_spawn"``.

.. code-block:: python

    # Running with the DistributedDataParallel strategy on 4 GPUs
    fabric = Fabric(strategy="ddp", accelerator="gpu", devices=4)

    # Running with the DDP Spawn strategy using 4 CPU processes
    fabric = Fabric(strategy="ddp_spawn", accelerator="cpu", devices=4)

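Spawn-based strategies such as ``"ddp_spawn"`` create their worker processes when you call ``launch()``, typically by passing it the function to run in each process.
A minimal sketch (the toy model and the empty training loop are placeholders):

.. code-block:: python

    import torch


    def train(fabric):
        # Each of the 4 processes runs this function
        model = torch.nn.Linear(32, 2)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        model, optimizer = fabric.setup(model, optimizer)
        ...  # your training loop goes here


    fabric = Fabric(strategy="ddp_spawn", accelerator="cpu", devices=4)
    fabric.launch(train)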
|
Additionally, you can pass in a strategy object to configure it with additional parameters.

.. code-block:: python

    from lightning.fabric.strategies import DeepSpeedStrategy

    fabric = Fabric(strategy=DeepSpeedStrategy(stage=2), accelerator="gpu", devices=2)


Support for Fully Sharded training strategies is coming soon.


devices
=======

Configure the devices to run on. Can be of type:

- int: the number of devices (e.g., GPUs) to train on
- list of int: which device indices (e.g., GPU IDs) to train on (0-indexed)
- str: a string representation of one of the above

.. code-block:: python

    # default used by Fabric, i.e., use the CPU
    fabric = Fabric(devices=None)

    # equivalent
    fabric = Fabric(devices=0)

    # int: run on two GPUs
    fabric = Fabric(devices=2, accelerator="gpu")

    # list: run on GPUs 1 and 4 (by bus ordering)
    fabric = Fabric(devices=[1, 4], accelerator="gpu")
    fabric = Fabric(devices="1, 4", accelerator="gpu")  # equivalent

    # -1: run on all GPUs
    fabric = Fabric(devices=-1, accelerator="gpu")
    fabric = Fabric(devices="-1", accelerator="gpu")  # equivalent

num_nodes
=========

Number of cluster nodes for distributed operation.

.. code-block:: python

    # Default used by Fabric
    fabric = Fabric(num_nodes=1)

    # Run on 8 nodes
    fabric = Fabric(num_nodes=8)

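The total number of processes (the world size) is ``num_nodes`` multiplied by the number of devices per node.
A minimal sketch, assuming every node has the same number of GPUs:

.. code-block:: python

    # 8 nodes with 8 GPUs each: 64 processes in total
    fabric = Fabric(accelerator="gpu", devices=8, num_nodes=8)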
|
Learn more about distributed multi-node training on clusters :doc:`here <../../clouds/cluster>`.


precision
=========

Fabric supports double precision (64), full precision (32), or half precision (16) operation (including `bfloat16 <https://pytorch.org/docs/1.10.0/generated/torch.Tensor.bfloat16.html>`_).
Half precision, or mixed precision, is the combined use of 32-bit and 16-bit floating-point numbers to reduce the memory footprint during model training.
This can result in improved performance, achieving significant speedups on modern GPUs.

.. code-block:: python

    # Default used by Fabric
    fabric = Fabric(precision=32, devices=1)

    # 16-bit (mixed) precision
    fabric = Fabric(precision=16, devices=1)

    # 16-bit bfloat precision
    fabric = Fabric(precision="bf16", devices=1)

    # 64-bit (double) precision
    fabric = Fabric(precision=64, devices=1)

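When a 16-bit setting is selected, models set up through ``fabric.setup()`` run their forward pass under the corresponding autocast context, and ``fabric.autocast()`` lets you apply the same precision to computations outside the model.
A minimal sketch, assuming a machine with a GPU (the toy model and data are for illustration only):

.. code-block:: python

    import torch

    fabric = Fabric(precision=16, accelerator="gpu", devices=1)

    model = torch.nn.Linear(32, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    model, optimizer = fabric.setup(model, optimizer)

    batch = torch.rand(4, 32, device=fabric.device)
    output = model(batch)  # forward pass runs under the configured precision

    with fabric.autocast():
        # computations outside the model can opt in explicitly
        loss = output.sum()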
|
plugins
=======

:ref:`Plugins` allow you to connect arbitrary backends, precision libraries, clusters, etc.
To define your own behavior, subclass the relevant class and pass it in. Here's an example linking up your own
:class:`~lightning.fabric.plugins.environments.ClusterEnvironment`.

.. code-block:: python

    from lightning.fabric.plugins.environments import ClusterEnvironment


    class MyCluster(ClusterEnvironment):
        @property
        def main_address(self):
            return your_main_address

        @property
        def main_port(self):
            return your_main_port

        def world_size(self):
            return the_world_size

        # ... implement the remaining abstract methods of ClusterEnvironment here


    fabric = Fabric(plugins=[MyCluster()], ...)


callbacks
=========

A callback class is a collection of methods that the training loop can call at specific points in time, for example, at the end of an epoch.
Add callbacks to Fabric to inject logic into your training loop from an external callback class.

.. code-block:: python

    class MyCallback:
        def on_train_epoch_end(self, results):
            ...

You can then register this callback, or multiple ones, directly with Fabric:

.. code-block:: python

    fabric = Fabric(callbacks=[MyCallback()])

Then, in your training loop, you can call a hook by its name. Any callback objects that have this hook will execute it:

.. code-block:: python

    # Call any hook by name
    fabric.call("on_train_epoch_end", results={...})

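For context, a typical place for such a call is the end of each epoch in your own loop.
A minimal sketch (the epoch count and the contents of ``results`` are placeholders):

.. code-block:: python

    fabric = Fabric(callbacks=[MyCallback()])

    num_epochs = 3  # placeholder
    for epoch in range(num_epochs):
        ...  # the training steps for this epoch go here
        fabric.call("on_train_epoch_end", results={"epoch": epoch})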
|
loggers
=======

Attach one or several loggers/experiment trackers to Fabric for convenient logging of metrics.

.. code-block:: python

    from lightning.fabric.loggers import TensorBoardLogger

    # Default used by Fabric, no loggers are active
    fabric = Fabric(loggers=[])

    # Log to a single logger
    fabric = Fabric(loggers=TensorBoardLogger(...))

    # Or multiple instances
    fabric = Fabric(loggers=[logger1, logger2, ...])

Anywhere in your training loop, you can log metrics to all loggers at once:

.. code-block:: python

    fabric.log("loss", loss)
    fabric.log_dict({"loss": loss, "accuracy": acc})