Releases: Lightning-AI/pytorch-lightning

PyTorch Lightning 1.6.5: Standard patch release

13 Jul 00:26
ff53616

[1.6.5] - 2022-07-13

Fixed

  • Fixed estimated_stepping_batches requiring distributed comms in configure_optimizers for the DeepSpeedStrategy (#13350)
  • Fixed bug with Python version check that prevented use with development versions of Python (#13420)
  • The loops now call .set_epoch() also on batch samplers if the dataloader has one wrapped in a distributed sampler (#13396)
  • Fixed the restoration of log step during restart (#13467)

Contributors

@adamjstewart @akihironitta @awaelchli @Borda @martinosorb @rohitgr7 @SeanNaren

PyTorch Lightning 1.6.4: Standard patch release

01 Jun 14:32
74b1317

[1.6.4] - 2022-06-01

Added

  • Exposed all DDP parameters through the HPU parallel strategy (#13067)

Changed

  • Keep torch.backends.cudnn.benchmark=False by default (unlike in v1.6.{0-4}) after reports of speed and memory problems that depend on the data used. Please consider tuning Trainer(benchmark) manually; see the example after this list. (#13154)
  • Prevent modification of torch.backends.cudnn.benchmark when Trainer(benchmark=...) is not set (#13154)
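
For reference, the flag can still be turned back on explicitly. A minimal sketch, assuming cudnn benchmarking is known to help for your data (the value shown is illustrative, not a recommendation):

import pytorch_lightning as pl

# Opt back into cudnn benchmarking explicitly when input shapes are static
# and benchmarking has been verified to help on your data.
trainer = pl.Trainer(benchmark=True)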

Fixed

  • Fixed an issue causing zero-division error for empty dataloaders (#12885)
  • Fixed mismatching default values for the types of some arguments in the DeepSpeed and Fully-Sharded strategies which made the CLI unable to use them (#12989)
  • Avoid redundant callback restore warning while tuning (#13026)
  • Fixed Trainer(precision=64) during evaluation which now uses the wrapped precision module (#12983)
  • Fixed an issue to use wrapped LightningModule for evaluation during trainer.fit for BaguaStrategy (#12983)
  • Fixed an issue wrt unnecessary usage of habana mixed precision package for fp32 types (#13028)
  • Fixed the number of references of LightningModule so it can be deleted (#12897)
  • Fixed materialize_module setting a module's child recursively (#12870)
  • Fixed issue where the CLI could not pass a Profiler to the Trainer (#13084)
  • Fixed torchelastic detection with non-distributed installations (#13142)
  • Fixed logging's step values when multiple dataloaders are used during evaluation (#12184)
  • Fixed epoch logging on train epoch end (#13025)
  • Fixed DDPStrategy and DDPSpawnStrategy to initialize optimizers only after moving the module to the device (#11952)

Contributors

@akihironitta @ananthsub @ar90n @awaelchli @Borda @carmocca @dependabot @jerome-habana @mads-oestergaard @otaj @rohitgr7

PyTorch Lightning 1.6.3: Standard patch release

03 May 20:36

[1.6.3] - 2022-05-03

Fixed

  • Use only a single instance of rich.console.Console throughout codebase (#12886)
  • Fixed an issue to ensure all the checkpoint states are saved in a common filepath with DeepSpeedStrategy (#12887)
  • Fixed trainer.logger deprecation message (#12671)
  • Fixed an issue where sharded grad scaler is passed in when using BF16 with the ShardedStrategy (#12915)
  • Fixed an issue wrt recursive invocation of DDP configuration in hpu parallel plugin (#12912)
  • Fixed printing of ragged dictionaries in Trainer.validate and Trainer.test (#12857)
  • Fixed threading support for legacy loading of checkpoints (#12814)
  • Fixed pickling of KFoldLoop (#12441)
  • Stopped optimizer_zero_grad from being called after IPU execution (#12913)
  • Fixed fuse_modules to be qat-aware for torch>=1.11 (#12891)
  • Enforced eval shuffle warning only for default samplers in DataLoader (#12653)
  • Enable mixed precision in DDPFullyShardedStrategy when precision=16 (#12965)
  • Fixed TQDMProgressBar reset and update to show correct time estimation (#12889)
  • Fixed fit loop restart logic to enable resume using the checkpoint (#12821)

Contributors

@akihironitta @carmocca @hmellor @jerome-habana @kaushikb11 @krshrimali @mauvilsa @niberger @ORippler @otaj @rohitgr7 @SeanNaren

PyTorch Lightning 1.6.2: Standard patch release

27 Apr 17:04

[1.6.2] - 2022-04-27

Fixed

  • Fixed ImportError when torch.distributed is not available. (#12794)
  • When using custom DataLoaders in LightningDataModule, multiple inheritance is resolved properly (#12716)
  • Fixed encoding issues on terminals that do not support unicode characters (#12828)
  • Fixed support for ModelCheckpoint monitors with dots (#12783)

Contributors

@akihironitta @alvitawa @awaelchli @Borda @carmocca @code-review-doctor @ethanfurman @HenryLau0220 @krshrimali @otaj

PyTorch Lightning 1.6.1: Standard weekly patch release

13 Apr 18:30

[1.6.1] - 2022-04-13

Changed

  • Support strategy argument being case insensitive (#12528)

Fixed

  • Run main progress bar updates independent of val progress bar updates in TQDMProgressBar (#12563)
  • Avoid calling average_parameters multiple times per optimizer step (#12452)
  • Properly pass some Logger's parent's arguments to super().__init__() (#12609)
  • Fixed an issue where incorrect type warnings appear when the overridden LightningLite.run method accepts user-defined arguments (#12629)
  • Fixed rank_zero_only decorator in LSF environments (#12587)
  • Don't raise a warning when nn.Module is not saved under hparams (#12669)
  • Raise MisconfigurationException when the accelerator is available but the user passes invalid ([]/0/"0") values to the devices flag (#12708)
  • Support auto_select_gpus with the accelerator and devices API (#12608)

Contributors

@akihironitta @awaelchli @Borda @carmocca @kaushikb11 @krshrimali @mauvilsa @otaj @pre-commit-ci @rohitgr7 @semaphore-egg @tkonopka @wayi1

If we forgot someone due to not matching the commit email with the GitHub account, let us know :]

PyTorch Lightning 1.6: Support Intel's Habana Accelerator, New efficient DDP strategy (Bagua), Manual Fault-tolerance, Stability and Reliability.

29 Mar 19:35
44e3edb

The core team is excited to announce the PyTorch Lightning 1.6 release ⚡

Highlights

PyTorch Lightning 1.6 is the work of 99 contributors who have worked on features, bug-fixes, and documentation for a total of over 750 commits since 1.5. This is our most active release yet. Here are some highlights:

Introducing Intel's Habana Accelerator

Lightning 1.6 now supports the Habana® framework, which includes Gaudi® AI training processors. Their heterogeneous architecture includes a cluster of fully programmable Tensor Processing Cores (TPC) and a configurable Matrix Math engine, along with the associated development tools and libraries.

You can leverage the Habana hardware to accelerate your Deep Learning training workloads simply by passing:

trainer = pl.Trainer(accelerator="hpu")

# single Gaudi training
trainer = pl.Trainer(accelerator="hpu", devices=1)

# distributed training with 8 Gaudi devices
trainer = pl.Trainer(accelerator="hpu", devices=8)

The Bagua Strategy

The Bagua Strategy is a deep learning acceleration framework that supports multiple advanced distributed training algorithms with state-of-the-art system relaxation techniques. Enabling Bagua, which can be considerably faster than vanilla PyTorch DDP, is as simple as:

trainer = pl.Trainer(strategy="bagua")

# or, to choose a specific algorithm
from pytorch_lightning.strategies import BaguaStrategy

trainer = pl.Trainer(strategy=BaguaStrategy(algorithm="gradient_allreduce"))  # "gradient_allreduce" is the default

Towards stable Accelerator, Strategy, and Plugin APIs

The Accelerator, Strategy, and Plugin APIs are a core part of PyTorch Lightning. They're where all the distributed boilerplate lives, and we're constantly working to improve both them and the overall PyTorch Lightning platform experience.

In this release, we've made some large changes to achieve that goal. Not to worry, though! The only users affected by these changes are those who use custom implementations of Accelerator and Strategy (TrainingTypePlugin) as well as certain Plugins. In particular, we want to highlight the following changes:

  • All TrainingTypePlugins have been renamed to Strategy (#11120). Strategy is a more appropriate name because it encompasses more than simply training communication. This change is now aligned with the changes we implemented in 1.5, which introduced the new strategy and devices flags to the Trainer.

    # Before
    from pytorch_lightning.plugins import DDPPlugin
    
    # New
    from pytorch_lightning.strategies import DDPStrategy
  • The Accelerator and PrecisionPlugin have moved into Strategy. All strategies now take optional accelerator and precision_plugin parameters (#11022, #10570).

  • Custom Accelerator implementations must now implement two new abstract methods: is_available() (#11797) and auto_device_count() (#10222). The latter determines how many devices get used by default when specifying Trainer(accelerator=..., devices="auto"); a minimal sketch follows this list.

  • We redesigned the process creation for spawn-based strategies such as DDPSpawnStrategy and TPUSpawnStrategy (#10896). All spawn-based strategies now spawn processes immediately upon calling Trainer.{fit,validate,test,predict}, which means the hooks/callbacks prepare_data, setup, configure_sharded_model and teardown all run under an initialized process group. These changes align the spawn-based strategies with their non-spawn counterparts (such as DDPStrategy).
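
As an illustration of the two new hooks, here is a minimal sketch of a custom accelerator. It shows only the hooks named above; a working accelerator must also implement the remaining abstract methods of the Accelerator interface, and the values returned here are purely illustrative:

from pytorch_lightning.accelerators import Accelerator


class MyAccelerator(Accelerator):
    @staticmethod
    def is_available() -> bool:
        # report whether the backing hardware/runtime can be used in this environment
        return True

    @staticmethod
    def auto_device_count() -> int:
        # how many devices Trainer(accelerator=..., devices="auto") should select by default
        return 1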

We've also exposed the process group backend for use. For example, you can now easily enable fairring like this:

# Explicitly specify the process group backend if you choose to
ddp = pl.strategies.DDPStrategy(process_group_backend="fairring")
trainer = Trainer(strategy=ddp, accelerator="gpu", devices=8)

In a similar fashion, if you have torch>=1.11 installed, you can enable DDP static graph to apply special runtime optimizations:

trainer = Trainer(devices=4, strategy=DDPStrategy(static_graph=True))

LightningCLI improvements

In the previous release, we added shorthand notation support for registered components. In this release, we added a flag to automatically register all available components:

from pytorch_lightning.utilities.cli import LightningCLI

LightningCLI(auto_registry=True)

We have also added support for the ReduceLROnPlateau scheduler with shorthand notation:

$ python script.py fit --optimizer=Adam --lr_scheduler=ReduceLROnPlateau --lr_scheduler.monitor=metric_to_track

If you need to customize the learning rate scheduler configuration, you can do so by overriding:

class MyLightningCLI(LightningCLI):
    @staticmethod
    def configure_optimizers(lightning_module, optimizer, lr_scheduler=None):
        return {"optimizer": optimizer, "lr_scheduler": {"scheduler": lr_scheduler, ...}}

Finally, loggers are also now configurable with shorthand:

$ python script.py fit --trainer.logger=WandbLogger --trainer.logger.name="my_lightning_run"

Control SLURM's re-queueing

We've added the ability to turn the automatic resubmission on or off when a job gets interrupted by the SLURM controller (via signal handling). Users who prefer to let their code handle the resubmission (for example, when submitit is used) can now pass:

from pytorch_lightning.plugins.environments import SLURMEnvironment

trainer = pl.Trainer(plugins=SLURMEnvironment(auto_requeue=False))

Fault-tolerance improvements

Fault-tolerant training under manual optimization now tracks optimization progress. We also changed the graceful exit signal from SIGUSR1 to SIGTERM for better support inside cloud instances.
An additional feature we're excited to announce is support for consecutive trainer.fit() calls.

trainer = pl.Trainer(max_epochs=2)
trainer.fit(model)

# now, run 2 more epochs
trainer.fit_loop.max_epochs = 4
trainer.fit(model)

Loop customization improvements

The Loop's state is now included as part of the checkpoints saved by the library. This enables finer restoration of custom loops.
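
For example, a custom loop can contribute its own entries to that state through the loop checkpoint hooks. A minimal sketch, assuming the on_save_checkpoint/on_load_checkpoint hooks of the Loop API and a purely illustrative my_counter attribute:

class MyStatefulLoop(pl.loops.TrainingEpochLoop):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.my_counter = 0  # illustrative custom state

    def on_save_checkpoint(self):
        # everything returned here is stored inside the Trainer checkpoint
        state = super().on_save_checkpoint()
        state["my_counter"] = self.my_counter
        return state

    def on_load_checkpoint(self, state_dict):
        # restore the custom state when resuming from a checkpoint
        super().on_load_checkpoint(state_dict)
        self.my_counter = state_dict.get("my_counter", 0)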

We've also made it easier to replace Lightning's loops with your own. For example:

class MyCustomLoop(pl.loops.TrainingEpochLoop):
    ...

trainer = pl.Trainer(...)
trainer.fit_loop.replace(epoch_loop=MyCustomLoop)
# Trainer runs the fit loop with your new epoch loop!
trainer.fit(model)

Data-Loading improvements

In previous versions, Lightning required that the DataLoader instance set its input arguments as instance attributes. This meant that custom DataLoaders also had this hidden requirement. In this release, we do this automatically for the user, easing the passing of custom loaders:

class MyDataLoader(torch.utils.data.DataLoader):
    def __init__(self, a=123, *args, **kwargs):
-       # this was required before
-       self.a = a
        super().__init__(*args, **kwargs)

trainer.fit(model, train_dataloaders=MyDataLoader())

As of this release, Lightning no longer pre-fetches 1 extra batch if it doesn't need to. Previously, doing so would conflict with the internal pre-fetching done by optimized data loaders such as FFCV's. You can now define your own pre-fetching value like this:

class MyCustomLoop(pl.loops.FitLoop):
    @property
    def prefetch_batches(self):
        return 7  # lucky number 7

trainer = pl.Trainer(...)
trainer.fit_loop = MyCustomLoop(min_epochs=trainer.min_epochs, max_epochs=trainer.max_epochs)

New Hooks

LightningModule.lr_scheduler_step

Lightning now allows the use of custom learning rate schedulers that aren't natively available in PyTorch. A great example of this is Timm Schedulers.

When using custom learning rate schedulers that rely on an API other than PyTorch's, you can now override LightningModule.lr_scheduler_step with your desired logic.

from timm.scheduler import TanhLRScheduler


class MyLightningModule(pl.LightningModule):
    def configure_optimizers(self):...
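
The snippet above is truncated on the release page. A minimal sketch of how such an override might look, assuming timm's TanhLRScheduler, an Adam optimizer, and illustrative hyperparameters (the lr_scheduler_step signature shown is the one introduced in 1.6):

import torch
from timm.scheduler import TanhLRScheduler


class MyLightningModule(pl.LightningModule):
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        scheduler = TanhLRScheduler(optimizer, t_initial=50)  # illustrative schedule length
        return [optimizer], [{"scheduler": scheduler, "interval": "epoch"}]

    def lr_scheduler_step(self, scheduler, optimizer_idx, metric):
        # timm schedulers expect the epoch index when stepping
        scheduler.step(epoch=self.current_epoch)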

Standard weekly patch release

09 Feb 20:42

[1.5.10] - 2022-02-08

Fixed

  • Fixed an issue to avoid running the validation loop on restart (#11552)
  • The Rich progress bar now correctly shows the on_epoch logged values on train epoch end (#11689)
  • Fixed an issue to make the step argument in WandbLogger.log_image work (#11716)
  • Fixed restore_optimizers for mapping states (#11757)
  • With DPStrategy, the batch is not explicitly moved to the device (#11780)
  • Fixed an issue where the validation progress bar would disappear after trainer.validate() (#11700)
  • Fixed supporting remote filesystems with Trainer.weights_save_path for fault-tolerant training (#11776)
  • Fixed check for available modules (#11526)
  • Fixed bug where the path for "last" checkpoints was not getting saved correctly which caused newer runs to not remove the previous "last" checkpoint (#11481)
  • Fixed bug where the path for best checkpoints was not getting saved correctly when no metric was monitored which caused newer runs to not use the best checkpoint (#11481)

Contributors

@ananthsub @Borda @circlecrystal @NathanGodey @nithinraok @rohitgr7

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

20 Jan 19:48

[1.5.9] - 2022-01-20

Fixed

  • Pinned sphinx-autodoc-typehints with <v1.15 (#11400)
  • Skipped testing with PyTorch 1.7 and Python 3.9 on Ubuntu (#11217)
  • Fixed type promotion when tensors of higher category than float are logged (#11401)
  • Fixed the format of the configuration saved automatically by the CLI's SaveConfigCallback (#11532)

Changed

  • Changed LSFEnvironment to use LSB_DJOB_RANKFILE environment variable instead of LSB_HOSTS for determining node rank and main address (#10825)
  • Disabled sampler replacement when using IterableDataset (#11507)

Contributors

@ajtritt @akihironitta @carmocca @rohitgr7

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

05 Jan 15:23

[1.5.8] - 2022-01-05

Fixed

  • Fixed LightningCLI race condition while saving the config (#11199)
  • Fixed the default value used with log(reduce_fx=min|max) (#11310)
  • Fixed data fetcher selection (#11294)
  • Fixed a race condition that could result in incorrect (zero) values being observed in prediction writer callbacks (#11288)
  • Fixed dataloaders not getting reloaded the correct amount of times when setting reload_dataloaders_every_n_epochs and check_val_every_n_epoch (#10948)

Contributors

@adamviola @akihironitta @awaelchli @Borda @carmocca @edpizzi

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

21 Dec 18:33

[1.5.7] - 2021-12-21

Fixed

  • Fixed NeptuneLogger when using DDP (#11030)
  • Fixed a bug to disable logging hyperparameters in logger if there are no hparams (#11105)
  • Avoid the deprecated onnx.export(example_outputs=...) in torch 1.10 (#11116)
  • Fixed an issue when torch-scripting a LightningModule after training with Trainer(sync_batchnorm=True) (#11078)
  • Fixed an AttributeError occurring when using a CombinedLoader (multiple dataloaders) for prediction (#11111)
  • Fixed bug where Trainer(track_grad_norm=..., logger=False) would fail (#11114)
  • Fixed an incorrect warning being produced by the model summary when using bf16 precision on CPU (#11161)

Changed

  • DeepSpeed does not require lightning module zero 3 partitioning (#10655)
  • The ModelCheckpoint callback now saves and restores attributes best_k_models, kth_best_model_path, kth_value, and last_model_path (#10995)

Contributors

@awaelchli @borchero @carmocca @guyang3532 @kaushikb11 @ORippler @Raalsky @rohitgr7 @SeanNaren

If we forgot someone due to not matching commit email with GitHub account, let us know :]