Releases: Lightning-AI/pytorch-lightning
PyTorch Lightning 1.6.5: Standard patch release
[1.6.5] - 2022-07-13
Fixed
- Fixed `estimated_stepping_batches` requiring distributed comms in `configure_optimizers` for the `DeepSpeedStrategy` (#13350) (see the sketch after this list)
- Fixed bug with Python version check that prevented use with development versions of Python (#13420)
- The loops now call `.set_epoch()` also on batch samplers if the dataloader has one wrapped in a distributed sampler (#13396)
- Fixed the restoration of log step during restart (#13467)
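For context, `estimated_stepping_batches` is typically read inside `configure_optimizers` to size a step-based scheduler. A minimal sketch (the class name and hyperparameter values are illustrative, and only the relevant hook is shown):

```python
import torch
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=1e-3)
        # Total number of optimizer steps, which OneCycleLR needs up front
        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer, max_lr=1e-3, total_steps=self.trainer.estimated_stepping_batches
        )
        return [optimizer], [{"scheduler": scheduler, "interval": "step"}]
```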
Contributors
@adamjstewart @akihironitta @awaelchli @Borda @martinosorb @rohitgr7 @SeanNaren
PyTorch Lightning 1.6.4: Standard patch release
[1.6.4] - 2022-06-01
Added
- Added support for exposing all DDP parameters through the HPU parallel strategy (#13067)
Changed
- Keep `torch.backends.cudnn.benchmark=False` by default (unlike in v1.6.{0-4}) after speed and memory problems depending on the data used. Please consider tuning `Trainer(benchmark)` manually (see the sketch after this list). (#13154)
- Prevent modification of `torch.backends.cudnn.benchmark` when `Trainer(benchmark=...)` is not set (#13154)
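If your input shapes are static and cudnn benchmarking helps your workload, you can opt back in explicitly. A minimal sketch (the accelerator and device count are illustrative):

```python
import pytorch_lightning as pl

# Re-enable cudnn benchmarking explicitly instead of relying on the old default
trainer = pl.Trainer(accelerator="gpu", devices=1, benchmark=True)
```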
Fixed
- Fixed an issue causing zero-division error for empty dataloaders (#12885)
- Fixed mismatching default values for the types of some arguments in the DeepSpeed and Fully-Sharded strategies which made the CLI unable to use them (#12989)
- Avoid redundant callback restore warning while tuning (#13026)
- Fixed `Trainer(precision=64)` during evaluation which now uses the wrapped precision module (#12983)
- Fixed an issue to use wrapped `LightningModule` for evaluation during `trainer.fit` for `BaguaStrategy` (#12983)
- Fixed unnecessary usage of the Habana mixed-precision package for fp32 types (#13028)
- Fixed the number of references of `LightningModule` so it can be deleted (#12897)
- Fixed `materialize_module` setting a module's child recursively (#12870)
- Fixed an issue where the CLI could not pass a `Profiler` to the `Trainer` (#13084)
- Fixed torchelastic detection with non-distributed installations (#13142)
- Fixed logging's step values when multiple dataloaders are used during evaluation (#12184)
- Fixed epoch logging on train epoch end (#13025)
- Fixed `DDPStrategy` and `DDPSpawnStrategy` to initialize optimizers only after moving the module to the device (#11952)
Contributors
@akihironitta @ananthsub @ar90n @awaelchli @Borda @carmocca @dependabot @jerome-habana @mads-oestergaard @otaj @rohitgr7
PyTorch Lightning 1.6.3: Standard patch release
[1.6.3] - 2022-05-03
Fixed
- Use only a single instance of `rich.console.Console` throughout the codebase (#12886)
- Fixed an issue to ensure all the checkpoint states are saved in a common filepath with `DeepspeedStrategy` (#12887)
- Fixed `trainer.logger` deprecation message (#12671)
- Fixed an issue where the sharded grad scaler is passed in when using BF16 with the `ShardedStrategy` (#12915)
- Fixed recursive invocation of DDP configuration in the HPU parallel plugin (#12912)
- Fixed printing of ragged dictionaries in `Trainer.validate` and `Trainer.test` (#12857)
- Fixed threading support for legacy loading of checkpoints (#12814)
- Fixed pickling of `KFoldLoop` (#12441)
- Stopped `optimizer_zero_grad` from being called after IPU execution (#12913)
- Fixed `fuse_modules` to be QAT-aware for `torch>=1.11` (#12891)
- Enforced eval shuffle warning only for default samplers in DataLoader (#12653)
- Enable mixed precision in `DDPFullyShardedStrategy` when `precision=16` (#12965) (see the sketch after this list)
- Fixed `TQDMProgressBar` reset and update to show correct time estimation (#12889)
- Fixed fit loop restart logic to enable resume using the checkpoint (#12821)
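To illustrate the fully-sharded mixed-precision combination above, a minimal sketch (assuming a GPU machine with fairscale installed; the device count is illustrative):

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPFullyShardedStrategy

# Fully-sharded data parallel training with 16-bit mixed precision
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy=DDPFullyShardedStrategy(),
    precision=16,
)
```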
Contributors
@akihironitta @carmocca @hmellor @jerome-habana @kaushikb11 @krshrimali @mauvilsa @niberger @ORippler @otaj @rohitgr7 @SeanNaren
PyTorch Lightning 1.6.2: Standard patch release
[1.6.2] - 2022-04-27
Fixed
- Fixed `ImportError` when `torch.distributed` is not available (#12794)
- When using custom DataLoaders in `LightningDataModule`, multiple inheritance is resolved properly (#12716)
- Fixed encoding issues on terminals that do not support unicode characters (#12828)
- Fixed support for `ModelCheckpoint` monitors with dots (#12783)
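For example, a monitored metric whose name contains dots can now be used directly. A minimal sketch (the metric name is illustrative and assumed to be logged via `self.log`):

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Monitor a metric logged as self.log("val.loss", ...)
checkpoint_callback = ModelCheckpoint(monitor="val.loss", mode="min")
trainer = pl.Trainer(callbacks=[checkpoint_callback])
```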
Contributors
@akihironitta @alvitawa @awaelchli @Borda @carmocca @code-review-doctor @ethanfurman @HenryLau0220 @krshrimali @otaj
PyTorch Lightning 1.6.1: Standard weekly patch release
[1.6.1] - 2022-04-13
Changed
- Support `strategy` argument being case insensitive (#12528)
Fixed
- Run main progress bar updates independent of val progress bar updates in `TQDMProgressBar` (#12563)
- Avoid calling `average_parameters` multiple times per optimizer step (#12452)
- Properly pass some Logger's parent's arguments to `super().__init__()` (#12609)
- Fixed an issue where incorrect type warnings appear when the overridden `LightningLite.run` method accepts user-defined arguments (#12629)
- Fixed `rank_zero_only` decorator in LSF environments (#12587)
- Don't raise a warning when `nn.Module` is not saved under hparams (#12669)
- Raise `MisconfigurationException` when the accelerator is available but the user passes invalid (`[]`/`0`/`"0"`) values to the `devices` flag (#12708)
- Support `auto_select_gpus` with the accelerator and devices API (#12608)
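For example, `auto_select_gpus` can now be combined with the accelerator and devices flags. A minimal sketch (the device count is illustrative):

```python
import pytorch_lightning as pl

# Let Lightning pick two unoccupied GPUs instead of hard-coding device indices
trainer = pl.Trainer(accelerator="gpu", devices=2, auto_select_gpus=True)
```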
Contributors
@akihironitta @awaelchli @Borda @carmocca @kaushikb11 @krshrimali @mauvilsa @otaj @pre-commit-ci @rohitgr7 @semaphore-egg @tkonopka @wayi1
If we forgot someone due to not matching the commit email with the GitHub account, let us know :]
PyTorch Lightning 1.6: Support Intel's Habana Accelerator, New efficient DDP strategy (Bagua), Manual Fault-tolerance, Stability and Reliability.
The core team is excited to announce the PyTorch Lightning 1.6 release ⚡
Highlights
PyTorch Lightning 1.6 is the work of 99 contributors who have worked on features, bug-fixes, and documentation for a total of over 750 commits since 1.5. This is our most active release yet. Here are some highlights:
Introducing Intel's Habana Accelerator
Lightning 1.6 now supports the Habana® framework, which includes Gaudi® AI training processors. Their heterogeneous architecture includes a cluster of fully programmable Tensor Processing Cores (TPC), the associated development tools and libraries, and a configurable Matrix Math engine.
You can leverage the Habana hardware to accelerate your Deep Learning training workloads simply by passing:
```python
trainer = pl.Trainer(accelerator="hpu")

# single Gaudi training
trainer = pl.Trainer(accelerator="hpu", devices=1)

# distributed training with 8 Gaudi
trainer = pl.Trainer(accelerator="hpu", devices=8)
```

The Bagua Strategy
The Bagua Strategy integrates Bagua, a deep learning acceleration framework that supports multiple advanced distributed training algorithms with state-of-the-art system relaxation techniques. Enabling Bagua, which can be considerably faster than vanilla PyTorch DDP, is as simple as:
```python
trainer = pl.Trainer(strategy="bagua")

# or to choose a custom algorithm
from pytorch_lightning.strategies import BaguaStrategy

trainer = pl.Trainer(strategy=BaguaStrategy(algorithm="gradient_allreduce"))  # default
```

Towards stable Accelerator, Strategy, and Plugin APIs
The Accelerator, Strategy, and Plugin APIs are a core part of PyTorch Lightning. They're where all the distributed boilerplate lives, and we're constantly working to improve both them and the overall PyTorch Lightning platform experience.
In this release, we've made some large changes to achieve that goal. Not to worry, though! The only users affected by these changes are those who use custom implementations of Accelerator and Strategy (TrainingTypePlugin) as well as certain Plugins. In particular, we want to highlight the following changes:
- All `TrainingTypePlugin`s have been renamed to `Strategy` (#11120). Strategy is a more appropriate name because it encompasses more than simply training communication. This change is now aligned with the changes we implemented in 1.5, which introduced the new `strategy` and `devices` flags to the Trainer.

  ```python
  # Before
  from pytorch_lightning.plugins import DDPPlugin

  # New
  from pytorch_lightning.strategies import DDPStrategy
  ```

- The `Accelerator` and `PrecisionPlugin` have moved into `Strategy`. All strategies now take an optional parameter `accelerator` and `precision_plugin` (#11022, #10570).

- Custom Accelerator implementations must now implement two new abstract methods: `is_available()` (#11797) and `auto_device_count()` (#10222). The latter determines how many devices get used by default when specifying `Trainer(accelerator=..., devices="auto")` (see the sketch after this list).

- We redesigned the process creation for spawn-based strategies such as `DDPSpawnStrategy` and `TPUSpawnStrategy` (#10896). All spawn-based strategies now spawn processes immediately upon calling `Trainer.{fit,validate,test,predict}`, which means the hooks/callbacks `prepare_data`, `setup`, `configure_sharded_model` and `teardown` all run under an initialized process group. These changes align the spawn-based strategies with their non-spawn counterparts (such as `DDPStrategy`).
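As a rough illustration of the two new hooks, here is a minimal sketch of a custom accelerator (the class is hypothetical, only the two new methods are shown, and a real implementation also has to provide the rest of the `Accelerator` interface):

```python
from pytorch_lightning.accelerators import Accelerator


class MyCustomAccelerator(Accelerator):
    """Hypothetical accelerator, showing only the two new abstract methods."""

    @staticmethod
    def is_available() -> bool:
        # Report whether the backing hardware/runtime is usable on this machine
        return True

    @staticmethod
    def auto_device_count() -> int:
        # How many devices Trainer(accelerator=..., devices="auto") should use by default
        return 1
```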
We've also exposed the process group backend for use. For example, you can now easily enable fairring like this:
```python
# Explicitly specify the process group backend if you choose to
ddp = pl.strategies.DDPStrategy(process_group_backend="fairring")
trainer = pl.Trainer(strategy=ddp, accelerator="gpu", devices=8)
```

In a similar fashion, if you are using torch>=1.11, you can enable DDP static graph to apply special runtime optimizations:

```python
trainer = pl.Trainer(devices=4, strategy=pl.strategies.DDPStrategy(static_graph=True))
```

LightningCLI improvements
In the previous release, we added shorthand notation support for registered components. In this release, we added a flag to automatically register all available components:
```python
from pytorch_lightning.utilities.cli import LightningCLI

LightningCLI(auto_registry=True)
```

We have also added support for the ReduceLROnPlateau scheduler with shorthand notation:

```bash
$ python script.py fit --optimizer=Adam --lr_scheduler=ReduceLROnPlateau --lr_scheduler.monitor=metric_to_track
```

If you need to customize the learning rate scheduler configuration, you can do so by overriding:
```python
class MyLightningCLI(LightningCLI):
    @staticmethod
    def configure_optimizers(lightning_module, optimizer, lr_scheduler=None):
        return {"optimizer": optimizer, "lr_scheduler": {"scheduler": lr_scheduler, ...}}
```

Finally, loggers are also now configurable with shorthand:

```bash
$ python script.py fit --trainer.logger=WandbLogger --trainer.logger.name="my_lightning_run"
```
Control SLURM's re-queueing
We've added the ability to turn the automatic resubmission on or off when a job gets interrupted by the SLURM controller (via signal handling). Users who prefer to let their code handle the resubmission (for example, when submitit is used) can now pass:
```python
from pytorch_lightning.plugins.environments import SLURMEnvironment

trainer = pl.Trainer(plugins=SLURMEnvironment(auto_requeue=False))
```

Fault-tolerance improvements
Fault-tolerant training under manual optimization now tracks optimization progress. We also changed the graceful exit signal from SIGUSR1 to SIGTERM for better support inside cloud instances.
An additional feature we're excited to announce is support for consecutive trainer.fit() calls.
```python
trainer = pl.Trainer(max_epochs=2)
trainer.fit(model)

# now, run 2 more epochs
trainer.fit_loop.max_epochs = 4
trainer.fit(model)
```

Loop customization improvements
The Loop's state is now included as part of the checkpoints saved by the library. This enables finer restoration of custom loops.
We've also made it easier to replace Lightning's loops with your own. For example:
```python
class MyCustomLoop(pl.loops.TrainingEpochLoop):
    ...

trainer = pl.Trainer(...)
trainer.fit_loop.replace(epoch_loop=MyCustomLoop)

# Trainer runs the fit loop with your new epoch loop!
trainer.fit(model)
```

Data-Loading improvements
In previous versions, Lightning required that the DataLoader instance set its input arguments as instance attributes. This meant that custom DataLoaders also had this hidden requirement. In this release, we do this automatically for the user, easing the passing of custom loaders:
```diff
 class MyDataLoader(torch.utils.data.DataLoader):
     def __init__(self, a=123, *args, **kwargs):
-        # this was required before
-        self.a = a
         super().__init__(*args, **kwargs)

 trainer.fit(model, train_dataloader=MyDataLoader())
```

As of this release, Lightning no longer pre-fetches 1 extra batch if it doesn't need to. Previously, doing so would conflict with the internal pre-fetching done by optimized data loaders such as FFCV's. You can now define your own pre-fetching value like this:
```python
class MyCustomLoop(pl.loops.FitLoop):
    @property
    def prefetch_batches(self):
        return 7  # lucky number 7

trainer = pl.Trainer(...)
trainer.fit_loop = MyCustomLoop(min_epochs=trainer.min_epochs, max_epochs=trainer.max_epochs)
```

New Hooks
LightningModule.lr_scheduler_step
Lightning now allows the use of custom learning rate schedulers that aren't natively available in PyTorch. A great example of this is Timm Schedulers.
When using custom learning rate schedulers relying on an API other than PyTorch's, you can now define the LightningModule.lr_scheduler_step with your desired logic.
```python
from timm.scheduler import TanhLRScheduler

class MyLightningModule(pl.LightningModule):
    def configure_optimizers(self):
        ...
```
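A fuller sketch of the idea, assuming timm's `TanhLRScheduler` and the 1.6 hook signature (the optimizer and scheduler hyperparameters are illustrative):

```python
import torch
import pytorch_lightning as pl
from timm.scheduler import TanhLRScheduler


class MyLightningModule(pl.LightningModule):
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=0.05)
        scheduler = TanhLRScheduler(optimizer, t_initial=30)
        return [optimizer], [{"scheduler": scheduler, "interval": "epoch"}]

    def lr_scheduler_step(self, scheduler, optimizer_idx, metric):
        # timm schedulers are stepped with the epoch index rather than a bare .step()
        scheduler.step(epoch=self.current_epoch)
```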
Standard weekly patch release
[1.5.10] - 2022-02-08
Fixed
- Fixed an issue to avoid validation loop run on restart (#11552)
- The Rich progress bar now correctly shows the `on_epoch` logged values on train epoch end (#11689)
- Fixed an issue to make the `step` argument in `WandbLogger.log_image` work (#11716)
- Fixed `restore_optimizers` for mapping states (#11757)
- With `DPStrategy`, the batch is not explicitly moved to the device (#11780)
- Fixed an issue to avoid the val bar disappearing after `trainer.validate()` (#11700)
- Fixed supporting remote filesystems with `Trainer.weights_save_path` for fault-tolerant training (#11776)
- Fixed check for available modules (#11526)
- Fixed bug where the path for "last" checkpoints was not getting saved correctly which caused newer runs to not remove the previous "last" checkpoint (#11481)
- Fixed bug where the path for best checkpoints was not getting saved correctly when no metric was monitored which caused newer runs to not use the best checkpoint (#11481)
Contributors
@ananthsub @Borda @circlecrystal @NathanGodey @nithinraok @rohitgr7
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.5.9] - 2022-01-20
Fixed
- Pinned sphinx-autodoc-typehints to <v1.15 (#11400)
- Skipped testing with PyTorch 1.7 and Python 3.9 on Ubuntu (#11217)
- Fixed type promotion when tensors of higher category than float are logged (#11401)
- Fixed the format of the configuration saved automatically by the CLI's `SaveConfigCallback` (#11532)
Changed
- Changed `LSFEnvironment` to use the `LSB_DJOB_RANKFILE` environment variable instead of `LSB_HOSTS` for determining node rank and main address (#10825)
- Disabled sampler replacement when using `IterableDataset` (#11507)
Contributors
@ajtritt @akihironitta @carmocca @rohitgr7
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.5.8] - 2022-01-05
Fixed
- Fixed `LightningCLI` race condition while saving the config (#11199)
- Fixed the default value used with `log(reduce_fx=min|max)` (#11310)
- Fixed data fetcher selection (#11294)
- Fixed a race condition that could result in incorrect (zero) values being observed in prediction writer callbacks (#11288)
- Fixed dataloaders not getting reloaded the correct number of times when setting `reload_dataloaders_every_n_epochs` and `check_val_every_n_epoch` (#10948)
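For example, the combination of flags that previously miscounted reloads. A minimal sketch (the values are illustrative):

```python
import pytorch_lightning as pl

# Reload the train dataloader every 2 epochs while running validation every 5 epochs
trainer = pl.Trainer(reload_dataloaders_every_n_epochs=2, check_val_every_n_epoch=5)
```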
Contributors
@adamviola @akihironitta @awaelchli @Borda @carmocca @edpizzi
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.5.7] - 2021-12-21
Fixed
- Fixed `NeptuneLogger` when using DDP (#11030)
- Fixed a bug to disable logging hyperparameters in the logger if there are no hparams (#11105)
- Avoid the deprecated `onnx.export(example_outputs=...)` in torch 1.10 (#11116)
- Fixed an issue when torch-scripting a `LightningModule` after training with `Trainer(sync_batchnorm=True)` (#11078)
- Fixed an `AttributeError` occurring when using a `CombinedLoader` (multiple dataloaders) for prediction (#11111)
- Fixed bug where `Trainer(track_grad_norm=..., logger=False)` would fail (#11114)
- Fixed an incorrect warning being produced by the model summary when using `bf16` precision on CPU (#11161)
Changed
- DeepSpeed does not require lightning module zero 3 partitioning (#10655)
- The `ModelCheckpoint` callback now saves and restores the attributes `best_k_models`, `kth_best_model_path`, `kth_value`, and `last_model_path` (#10995)
Contributors
@awaelchli @borchero @carmocca @guyang3532 @kaushikb11 @ORippler @Raalsky @rohitgr7 @SeanNaren
If we forgot someone due to not matching commit email with GitHub account, let us know :]