Releases: Lightning-AI/pytorch-lightning

Standard weekly patch release

15 Dec 23:06

[1.5.6] - 2021-12-15

Fixed

  • Fixed a bug where the DeepSpeedPlugin arguments cpu_checkpointing and contiguous_memory_optimization were not being forwarded to deepspeed correctly (#10874)
  • Fixed an issue with NeptuneLogger causing checkpoints to be uploaded with a duplicated file extension (#11015)
  • Fixed support for logging within callbacks returned from LightningModule (#10991)
  • Fixed running sanity check with RichProgressBar (#10913)
  • Fixed support for CombinedLoader while checking for warning raised with eval dataloaders (#10994)
  • The TQDM progress bar now correctly shows the on_epoch logged values on train epoch end (#11069)
  • Fixed a bug where the TQDM progress bar updated the training progress bar during trainer.validate (#11069)

Contributors

@carmocca @jona-0 @kaushikb11 @Raalsky @rohitgr7

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

07 Dec 15:35

[1.5.5] - 2021-12-07

Fixed

  • Disabled batch_size extraction for torchmetric instances because they accumulate the metrics internally (#10815)
  • Fixed an issue with SignalConnector not restoring the default signal handlers on teardown when running on SLURM or with fault-tolerant training enabled (#10611)
  • Fixed SignalConnector._has_already_handler check for callable type (#10483)
  • Fixed an issue where the results were duplicated for each dataloader instead of being returned separately per dataloader (#10810)
  • Improved exception message if rich version is less than 10.2.2 (#10839)
  • Fixed uploading best model checkpoint in NeptuneLogger (#10369)
  • Fixed early schedule reset logic in the PyTorch profiler that was causing a data leak (#10837)
  • Fixed a bug that caused incorrect batch indices to be passed to the BasePredictionWriter hooks when using a dataloader with num_workers > 0 (#10870)
  • Fixed an issue with item assignment on the logger on rank > 0 for loggers that support it (#10917)
  • Fixed importing torch_xla.debug for torch-xla<1.8 (#10836)
  • Fixed an issue with DDPSpawnPlugin and related plugins leaving a temporary checkpoint behind (#10934)
  • Fixed a TypeError occurring in the SignalConnector.teardown() method (#10961)

Contributors

@awaelchli @carmocca @four4fish @kaushikb11 @lucmos @mauvilsa @Raalsky @rohitgr7

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

30 Nov 14:41

[1.5.4] - 2021-11-30

Fixed

  • Fixed support for --key.help=class with the LightningCLI (#10767)
  • Fixed _compare_version for python packages (#10762)
  • Fixed the TensorBoardLogger SummaryWriter not being closed before spawning the processes (#10777)
  • Fixed a consolidation error in Lite when attempting to save the state dict of a sharded optimizer (#10746)
  • Fixed the default logging level for batch hooks associated with training from on_step=False, on_epoch=True to on_step=True, on_epoch=False (#10756)

Removed

Contributors

@awaelchli @carmocca @kaushikb11 @rohitgr7 @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

24 Nov 15:40

[1.5.3] - 2021-11-24

Fixed

  • Fixed ShardedTensor state dict hook registration to check if torch distributed is available (#10621)
  • Fixed an issue with self.log not respecting a tensor's dtype when applying computations (#10076)
  • Fixed LightningLite _wrap_init popping nonexistent keys from DataLoader signature parameters (#10613)
  • Fixed signals being registered within threads (#10610)
  • Fixed an issue that caused Lightning to extract the batch size even though it was set by the user in LightningModule.log (#10408)
  • Fixed Trainer(move_metrics_to_cpu=True) not moving the evaluation logged results to CPU (#10631)
  • Fixed the {validation,test}_step outputs getting moved to CPU with Trainer(move_metrics_to_cpu=True) (#10631)
  • Fixed an issue with collecting logged test results with multiple dataloaders (#10522)

Contributors

@ananthsub @awaelchli @carmocca @jiwidi @kaushikb11 @qqueing @rohitgr7 @shabie @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

16 Nov 19:20

[1.5.2] - 2021-11-16

Fixed

  • Fixed CombinedLoader with max_size_cycle not receiving a DistributedSampler (#10374)
  • Fixed an issue where class or init-only variables of dataclasses were passed to the dataclass constructor in utilities.apply_to_collection (#9702)
  • Fixed isinstance not working with init_meta_context and the materialized model not being moved to the device (#10493)
  • Fixed an issue that prevented the Trainer from shutting down workers when execution is interrupted due to a failure (#10463)
  • Squeeze the early stopping monitor to remove empty tensor dimensions (#10461)
  • Fixed sampler replacement logic with overfit_batches to only replace the sampler when SequentialSampler is not used (#10486)
  • Fixed scripting causing false positive deprecation warnings (#10470, #10555)
  • Do not fail if batch size could not be inferred for logging when using DeepSpeed (#10438)
  • Fixed propagation of device and dtype information to submodules of LightningLite when they inherit from DeviceDtypeModuleMixin (#10559)

Contributors

@a-gardner1 @awaelchli @carmocca @justusschock @Raahul-Singh @rohitgr7 @SeanNaren @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

09 Nov 19:57

[1.5.1] - 2021-11-09

Fixed

  • Fixed apply_to_collection(defaultdict) (#10316)
  • Fixed failure when DataLoader(batch_size=None) is passed (#10345)
  • Fixed interception of __init__ arguments for sub-classed DataLoader re-instantiation in Lite (#10334)
  • Fixed an issue with pickling CSVLogger after a call to CSVLogger.save (#10388)
  • Fixed an import error being caused by PostLocalSGD when torch.distributed is not available (#10359)
  • Fixed the logging with on_step=True in epoch-level hooks causing unintended side-effects. Logging with on_step=True in epoch-level hooks will now correctly raise an error (#10409)
  • Fixed deadlocks for distributed training with RichProgressBar (#10428)
  • Fixed an issue where the model wrapper in Lite converted non-floating point tensors to float (#10429)
  • Fixed an issue with inferring the dataset type in fault-tolerant training (#10432)
  • Fixed dataloader workers with persistent_workers being deleted on every iteration (#10434)

Contributors

@EspenHa @four4fish @peterdudfield @rohitgr7 @tchaton @kaushikb11 @awaelchli @Borda @carmocca

If we forgot someone due to not matching commit email with GitHub account, let us know :]

PyTorch Lightning 1.5: LightningLite, Fault-Tolerant Training, Loop Customization, Lightning Tutorials, LightningCLI v2, RichProgressBar, CheckpointIO Plugin, and Trainer Strategy Flag

02 Nov 18:58

The PyTorch Lightning team and its community are excited to announce Lightning 1.5, introducing support for LightningLite, Fault-tolerant Training, Loop Customization, Lightning Tutorials, LightningCLI V2, RichProgressBar, CheckpointIO Plugin, Trainer Strategy flag, and more!

Highlights

Lightning 1.5 marks our biggest release yet. Over 60 contributors have worked on features, bugfixes and documentation improvements for a total of 640 commits since v1.4. Here are some highlights:

Fault-tolerant Training

Fault-tolerant Training is a new internal mechanism that enables PyTorch Lightning to recover from a hardware or software failure. This is particularly useful when training in the cloud on preemptible instances, which can shut down at any time. When a Lightning experiment exits unexpectedly, a temporary checkpoint is saved that contains the exact state of all loops and the model. With this new experimental feature, you will be able to restore your training mid-epoch on the exact batch and continue training as if it had never been interrupted.

PL_FAULT_TOLERANT_TRAINING=1 python train.py

LightningLite

LightningLite enables pure PyTorch users to scale their existing code to any kind of hardware while retaining full control over their own loops and optimization logic.

With just a few lines of code and no large refactoring, you get support for multi-device, multi-node, running on different accelerators (CPU, GPU, TPU), native automatic mixed precision (half and bfloat16), and double precision, in just a few seconds. And no special launcher required! Check out our documentation to find out how you can get one step closer to boilerplate-free research!

import torch
import torch.nn.functional as F
from torch import optim
from pytorch_lightning.lite import LightningLite


class Lite(LightningLite):
    def run(self):
        # Let Lite setup your dataloader(s)
        train_loader = self.setup_dataloaders(torch.utils.data.DataLoader(...))

        model = Net()  # Net is your own torch.nn.Module; .to() not needed
        optimizer = optim.Adam(model.parameters())
        # Let Lite setup your model and optimizer
        model, optimizer = self.setup(model, optimizer)

        for epoch in range(5):
            for data, target in train_loader:
                optimizer.zero_grad()
                output = model(data)  # data is already on the device
                loss = F.nll_loss(output, target)
                self.backward(loss)  # instead of loss.backward()
                optimizer.step()


Lite(accelerator="gpu", devices="auto").run()

Loop Customization

The new Loop API lets advanced users swap out the default gradient descent optimization loop at the core of Lightning with a different optimization paradigm. This is part of our effort to make Lightning the simplest, most flexible framework to take any kind of deep learning research to production.

Read our comprehensive introduction to loops

New Rich Progress Bar

We integrated with Rich and created a new and improved progress bar for Lightning.
Try it out:

pip install rich

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import RichProgressBar

trainer = Trainer(callbacks=[RichProgressBar()])

New Trainer Arguments: Strategy and Devices

With the new strategy and devices arguments in the Trainer, it is now easier to switch from one type of hardware to another.

Before | After
Trainer(accelerator="ddp", gpus=2) | Trainer(accelerator="gpu", devices=2, strategy="ddp")
Trainer(accelerator="ddp_cpu", num_processes=2) | Trainer(accelerator="cpu", devices=2, strategy="ddp")
Trainer(accelerator="tpu_spawn", tpu_cores=8) | Trainer(accelerator="tpu", devices=8)

The new devices argument is now agnostic to all accelerators, but the previous arguments gpus, tpu_cores, ipus are still available and work the same as before. In addition, it is now also possible to set devices="auto" or accelerator="auto" to select the best accelerator available on the hardware.

from pytorch_lightning import Trainer

trainer = Trainer(accelerator="auto", devices="auto")
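The Before/After rows above can be read as a mechanical translation of keyword arguments. The following is a minimal sketch of that translation, assuming only the three rows listed in the table; migrate_trainer_kwargs and the two lookup tables are hypothetical helpers written for illustration and are not part of Lightning.

```python
from typing import Any, Dict, Optional, Tuple

# Legacy accelerator strings from the "Before" column, mapped to the
# (accelerator, strategy) pair shown in the "After" column.
# Hypothetical lookup table covering only the three rows in the release notes.
_LEGACY_ACCELERATORS: Dict[str, Tuple[str, Optional[str]]] = {
    "ddp": ("gpu", "ddp"),
    "ddp_cpu": ("cpu", "ddp"),
    "tpu_spawn": ("tpu", None),
}

# Legacy device-count arguments that the single `devices` argument replaces.
_LEGACY_DEVICE_KEYS = ("gpus", "num_processes", "tpu_cores")


def migrate_trainer_kwargs(old: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: rewrite 'Before' style kwargs into 'After' style."""
    new = dict(old)
    legacy = new.get("accelerator")
    if legacy not in _LEGACY_ACCELERATORS:
        # Not one of the three known legacy spellings: leave untouched.
        return new
    accelerator, strategy = _LEGACY_ACCELERATORS[legacy]
    new["accelerator"] = accelerator
    if strategy is not None:
        new["strategy"] = strategy
    # Fold the accelerator-specific device count into `devices`.
    for key in _LEGACY_DEVICE_KEYS:
        if key in new:
            new["devices"] = new.pop(key)
    return new


migrated = migrate_trainer_kwargs({"accelerator": "ddp", "gpus": 2})
print(migrated)
```

Keyword arguments that already use the new style pass through unchanged, so the helper could be applied to a configuration more than once without further effect.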

LightningCLI V2

This release adds support for running not just Trainer.fit but any of the Trainer entry points!

python script.py fit
python script.py test

LightningCLI now supports registries for callbacks, optimizers, learning rate schedulers, LightningModules and LightningDataModules. This greatly improves the command line experience as only the class names and arguments are required as follows:

python script.py \
    --trainer.callbacks=EarlyStopping \
    --trainer.callbacks.patience=5 \
    --trainer.callbacks=LearningRateMonitor \
    --trainer.callbacks.logging_interval=epoch \
    --optimizer=Adam \
    --optimizer.lr=0.01 \
    --lr_scheduler=OneCycleLR \
    --lr_scheduler.anneal_strategy=linear

We've also added support for a manual mode where the CLI takes care of the instantiation but you have control over the Trainer calls:

cli = LightningCLI(MyModel, run=False)
cli.trainer.fit(cli.model)

Try out LightningCLI!

CheckpointIO Plugins

As part of our commitment to extensibility, we have abstracted the checkpointing logic into a CheckpointIO plugin. This enables users to adapt Lightning to their own infrastructure.

from pytorch_lightning.plugins import CheckpointIO


class CustomCheckpointIO(CheckpointIO):
    def save_checkpoint(self, checkpoint, path):
        # put all logic related to saving a checkpoint here
        ...

    def load_checkpoint(self, path):
        # put all logic related to loading a checkpoint here
        ...

    def remove_checkpoint(self, path):
        # put all logic related to deleting a checkpoint here
        ...
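To make the three hooks concrete, here is a minimal sketch of what their bodies might contain, assuming checkpoints are plain dictionaries stored on the local filesystem. PickleCheckpointIO is a hypothetical stand-in that mirrors the method names above; it deliberately does not subclass Lightning's CheckpointIO and uses pickle rather than torch.save, so it runs with the standard library alone.

```python
import os
import pickle
import tempfile
from typing import Any, Dict


class PickleCheckpointIO:
    """Hypothetical stand-in mirroring the three CheckpointIO hook names."""

    def save_checkpoint(self, checkpoint: Dict[str, Any], path: str) -> None:
        # Write to a temporary file first, then rename it into place, so an
        # interrupted save never leaves a half-written file at `path`.
        directory = os.path.dirname(path) or "."
        os.makedirs(directory, exist_ok=True)
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(checkpoint, f)
        os.replace(tmp_path, path)

    def load_checkpoint(self, path: str) -> Dict[str, Any]:
        # Read the dictionary back exactly as it was saved.
        with open(path, "rb") as f:
            return pickle.load(f)

    def remove_checkpoint(self, path: str) -> None:
        # Deleting a checkpoint that is already gone is treated as a no-op.
        if os.path.exists(path):
            os.remove(path)


# Round trip: save, load, then delete a small example checkpoint.
with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint_io = PickleCheckpointIO()
    ckpt_path = os.path.join(tmp_dir, "example.ckpt")
    checkpoint_io.save_checkpoint({"epoch": 3, "state_dict": {"w": [0.1, 0.2]}}, ckpt_path)
    restored = checkpoint_io.load_checkpoint(ckpt_path)
    checkpoint_io.remove_checkpoint(ckpt_path)
    still_exists = os.path.exists(ckpt_path)
```

A real plugin would subclass CheckpointIO and would likely target whatever storage the infrastructure provides; the method signatures here follow the snippet above and may differ from the exact ones in the installed Lightning version.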

BFloat16 Support

PyTorch 1.10 introduces native Automatic Mixed Precision (AMP) support for torch.bfloat16 on CPU (was already supported for TPUs), enabling higher performance compared with torch.float16. Switch to bfloat16 training by setting the argument:

from pytorch_lightning import Trainer

trainer = Trainer(precision="bf16")

Enable Auto Parameters Tying

It is pretty common to share parameters within a model. However, TPUs don't retain shared parameters once they are moved to the device. Lightning now supports automatic detection and reassignment of tied parameters to alleviate this problem on TPUs.

Infinite Training

Infinite training is now supported by setting Trainer(max_epochs=-1) for an unlimited number of epochs, or Trainer(max_steps=-1) for an endless epoch.

Note: you will want to avoid logging with on_epoch=True in case of max_steps=-1.

DeepSpeed Stage 1

DeepSpeed is a deep learning training optimization library that provides the means to train massive, billion-parameter models at scale. Lightning now also supports the DeepSpeed ZeRO Stage 1 protocol, which partitions your optimizer states across your GPUs to reduce memory usage.

from pytorch_lightning import Trainer

trainer = Trainer(gpus=4, strategy="deepspeed_stage_1", precision=16)
trainer.fit(model)

For even more memory savings and model sharding advice, check out stage 2 & 3 as well in our multi-GPU docs.

Gradient Clipping Customization

By overriding the LightningModule.configure_gradient_clipping hook, you can customize gradient clipping to your needs:

# Perform gradient clipping on gradients associated with discriminator (optimizer_idx=1) in GAN
def configure_gradient_clipping(
    self,
    optimizer,
    optimizer_idx,
    gradient_clip_val,
    gradient_clip_algorithm
):
    if optimizer_idx == 1:
        # Lightning will handle the gradient clipping
        self.clip_gradients(
            optimizer,
            gradient_clip_val=gradient_clip_val,
            gradient_clip_algorithm=gradient_clip_algorithm
        )

This means you can now implement state-of-the-art clipping algorithms with Lightning!
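For readers unfamiliar with the gradient_clip_algorithm argument, the sketch below illustrates the difference between clipping by value and clipping by norm on a plain list of numbers. It is a simplified illustration under stated assumptions, not Lightning's or PyTorch's implementation: clip_by_value and clip_by_norm are hypothetical helpers, and real implementations operate on tensors and may add a small epsilon to the norm.

```python
import math
from typing import List


def clip_by_value(grads: List[float], clip_val: float) -> List[float]:
    """Clamp every gradient entry independently to [-clip_val, clip_val]."""
    return [max(-clip_val, min(clip_val, g)) for g in grads]


def clip_by_norm(grads: List[float], clip_val: float) -> List[float]:
    """Rescale all entries together so the L2 norm is at most clip_val."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= clip_val:
        # Already within the limit: return the gradients unchanged.
        return list(grads)
    scale = clip_val / total_norm
    return [g * scale for g in grads]


grads = [3.0, -4.0]  # L2 norm is 5.0
by_value = clip_by_value(grads, 1.0)  # each entry clamped on its own
by_norm = clip_by_norm(grads, 1.0)    # direction preserved, length reduced
```

Value clipping changes the direction of the gradient vector, while norm clipping only shortens it; a custom configure_gradient_clipping hook is one place where such a trade-off could be made per optimizer.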

Determinism

Added support for torch.use_deterministic_algorithms. Read more about how it works here. You can enable it by setting:

from pytorch_lightning import Trainer

trainer = Trainer(deterministic=True)

Anomaly Detection

Lightning makes it easier to debug your code, so we've added support for torch.set_detect_anomaly. With this, PyTorch detects numerical anomalies like NaN or inf during the forward and backward passes. Read more about anomaly detection here.

from pytorch_lightning import Trainer

trainer = Trainer(detect_anomaly=True)

DDP Debugging Improvements

Are you having a hard time debugging DDP on your remote machine? Now you can de...

Standard weekly patch release

30 Sep 13:43

[1.4.9] - 2021-09-30

  • Moved the gradient unscaling in NativeMixedPrecisionPlugin from pre_optimizer_step to post_backward (#9606)
  • Fixed gradient unscaling being called too late, causing gradient clipping and gradient norm tracking to be applied incorrectly (#9606)
  • Fixed lr_find to generate same results on multiple calls (#9704)
  • Fixed reset metrics on validation epoch end (#9717)
  • Fixed input validation for gradient_clip_val, gradient_clip_algorithm, track_grad_norm and terminate_on_nan Trainer arguments (#9595)
  • Reset metrics before each task starts (#9410)

Contributors

@rohitgr7 @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

22 Sep 19:15

[1.4.8] - 2021-09-22

  • Fixed error reporting in DDP process reconciliation when processes are launched by an external agent (#9389)
  • Added the PL_RECONCILE_PROCESS environment variable to enable process reconciliation regardless of cluster environment settings (#9389)
  • Fixed add_argparse_args raising TypeError when args are typed as typing.Generic in Python 3.6 (#9554)
  • Fixed back-compatibility for saving hyperparameters from a single container and inferring its argument name by reverting #9125 (#9642)

Contributors

@ananthsub @akihironitta @awaelchli @carmocca @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

15 Sep 09:28

[1.4.7] - 2021-09-14

  • Fixed logging of nan parameters (#9364)
  • Fixed replace_sampler missing the batch size under specific conditions (#9367)
  • Pass init args to ShardedDataParallel (#9483)
  • Fixed collision of user argument when using ShardedDDP (#9512)
  • Fixed DeepSpeed crash for RNNs (#9489)

Contributors

@asanakoy @awaelchli @borisdayma @carmocca @guotuofeng @justusschock @kaushikb11 @rohitgr7 @SeanNaren

If we forgot someone due to not matching commit email with GitHub account, let us know :]