Releases: Lightning-AI/pytorch-lightning
PyTorch Lightning 1.7.3: Standard patch release
[1.7.3] - 2022-08-25
Fixed
- Fixed an assertion error when using a `ReduceOnPlateau` scheduler with the Horovod strategy (#14215)
- Fixed an `AttributeError` when accessing `LightningModule.logger` and the Trainer has multiple loggers (#14234)
- Fixed wrong num padding for `RichProgressBar` (#14296)
- Added back support for logging in the `configure_gradient_clipping` hook after unintended removal in v1.7.2 (#14298)
- Fixed an issue to avoid the impact of sanity check on `reload_dataloaders_every_n_epochs` for validation (#13964)
Contributors
@awaelchli @Borda @carmocca @dependabot @kaushikb11 @otaj @rohitgr7
Dependency hotfix
[0.5.7] - 2022-08-22
Changed
- Release LAI docs as stable (#14250)
- Compatibility for Python 3.10
Fixed
Contributors
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Minor patch release
PyTorch Lightning 1.7.2: Standard patch release
[1.7.2] - 2022-08-17
Added
- Added `FullyShardedNativeNativeMixedPrecisionPlugin` to handle precision for `DDPFullyShardedNativeStrategy` (#14092)
- Added profiling to these hooks: `on_before_batch_transfer`, `transfer_batch_to_device`, `on_after_batch_transfer`, `configure_gradient_clipping`, `clip_gradients` (#14069)
Changed
- Updated compatibility for LightningLite to run with the latest DeepSpeed 0.7.0 (#13967)
- Raised a `MisconfigurationException` if batch transfer hooks are overridden with `IPUAccelerator` (#13961)
- The default project name in `WandbLogger` is now "lightning_logs" (#14145)
- The `WandbLogger.name` property no longer returns the name of the experiment, and instead returns the project's name (#14145)
Fixed
- Fixed a bug that caused spurious `AttributeError` when multiple `DataLoader` classes are imported (#14117)
- Fixed epoch-end logging results not being reset after the end of the epoch (#14061)
- Fixed saving hyperparameters in a composition where the parent class is not a `LightningModule` or `LightningDataModule` (#14151)
- Fixed the device placement when `LightningModule.cuda()` gets called without specifying a device index and the current cuda device was not 0 (#14128)
- Avoided false positive warning about using `sync_dist` when using torchmetrics (#14143)
- Avoid `metadata.entry_points` deprecation warning on Python 3.10 (#14052)
- Avoid raising the sampler warning if `num_replicas=1` (#14097)
- Fixed resuming from a checkpoint when using Stochastic Weight Averaging (SWA) (#9938)
- Avoided requiring the FairScale package to use precision with the fsdp native strategy (#14092)
- Fixed an issue in which the default name for a run in `WandbLogger` would be set to the project name instead of a randomly generated string (#14145)
- Fixed not preserving set attributes on `DataLoader` and `BatchSampler` when instantiated inside `*_dataloader` hooks (#14212)
Contributors
@adamreeve @akihironitta @awaelchli @Borda @carmocca @dependabot @otaj @rohitgr7
PyTorch Lightning 1.7.1: Standard patch release
[1.7.1] - 2022-08-09
Fixed
- Casted only floating point tensors to fp16 with IPUs (#13983)
- Casted tensors to fp16 before moving them to device with `DeepSpeedStrategy` (#14000)
- Fixed the `NeptuneLogger` dependency being unrecognized (#13988)
- Fixed an issue where users would be warned about unset `max_epochs` even when `fast_dev_run` was set (#13262)
- Fixed MPS device being unrecognized (#13992)
- Fixed incorrect `precision="mixed"` being used with `DeepSpeedStrategy` and `IPUStrategy` (#14041)
- Fixed dtype inference during gradient norm computation (#14051)
- Fixed a bug that caused `ddp_find_unused_parameters` to be set to `False`, whereas the intended default is `True` (#14095)
Contributors
@adamjstewart @akihironitta @awaelchli @Birch-san @carmocca @clementpoiret @dependabot @rohitgr7
Weekly bugfix release
[0.5.5] - 2022-08-09
Deprecated
- Deprecate sheety API (#14004)
Fixed
- Resolved a bug where work statuses would grow quickly and be duplicated (#13970)
- Resolved a race condition when sending the work state through the caller_queue (#14074)
- Fixed starting a Lightning App on the cloud when the repo name begins with "Lightning" (#14025)
Contributors
If we forgot someone due to not matching commit email with GitHub account, let us know :]
PyTorch Lightning 1.7: Apple Silicon support, Native FSDP, Collaborative training, and multi-GPU support with Jupyter notebooks
The core team is excited to announce the release of PyTorch Lightning 1.7 ⚡
PyTorch Lightning 1.7 is the culmination of work from 106 contributors who have worked on features, bug-fixes, and documentation for a total of over 492 commits since 1.6.0.
Highlights
Apple Silicon Support
For those using PyTorch 1.12 on M1 or M2 Apple machines, we have created the MPSAccelerator. MPSAccelerator enables accelerated GPU training using Apple’s Metal Performance Shaders (MPS) backend.
NOTE
Support for this accelerator is currently marked as experimental in PyTorch. Because many operators are still missing, you may run into a few rough edges.
```python
# Selects the accelerator
trainer = pl.Trainer(accelerator="mps")

# Equivalent to
from pytorch_lightning.accelerators import MPSAccelerator
trainer = pl.Trainer(accelerator=MPSAccelerator())

# Defaults to "mps" when run on M1 or M2 Apple machines
# to avoid code changes when switching computers
trainer = pl.Trainer(accelerator="gpu")
```
Native Fully Sharded Data Parallel Strategy
PyTorch 1.12 also added native support for Fully Sharded Data Parallel (FSDP). Previously, PyTorch Lightning enabled this by using the fairscale project. You can now choose between both options.
NOTE
Support for this strategy is marked as beta in PyTorch.
```python
# Native PyTorch implementation
trainer = pl.Trainer(strategy="fsdp_native")

# Equivalent to
from pytorch_lightning.strategies import DDPFullyShardedNativeStrategy
trainer = pl.Trainer(strategy=DDPFullyShardedNativeStrategy())

# For reference, FairScale's implementation can be used with
trainer = pl.Trainer(strategy="fsdp")
```
A Collaborative Training strategy using Hivemind
Collaborative Training removes the need for top-tier multi-GPU servers by allowing you to train across unreliable machines, such as local machines or even preemptible cloud compute, across the Internet.
Under the hood, we use Hivemind, which provides decentralized training across the Internet.
```python
from pytorch_lightning.strategies import HivemindStrategy

trainer = pl.Trainer(
    strategy=HivemindStrategy(target_batch_size=8192),
    accelerator="gpu",
    devices=1,
)
```
For more information, check out the docs.
Distributed support in Jupyter Notebooks
So far, the only multi-GPU strategy supported in Jupyter notebooks (including Grid.ai, Google Colab, and Kaggle) has been the Data-Parallel (DP) strategy (strategy="dp"). DP, however, has several limitations that often obstruct users' workflows: it can be slow, it is incompatible with TorchMetrics, it does not persist state changes on replicas, and it is difficult to use with non-primitive input and output structures.
In this release, we've added support for Distributed Data Parallel in Jupyter notebooks using the fork mechanism to address these shortcomings. This is only available for MacOS and Linux (sorry Windows!).
NOTE
This feature is experimental.
This is how you use multi-device in notebooks now:
```python
# Train on 2 GPUs in a Jupyter notebook
trainer = pl.Trainer(accelerator="gpu", devices=2)

# Can be set explicitly
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp_notebook")

# Can also be used in non-interactive environments
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp_fork")
```
By default, the Trainer detects the interactive environment and selects the right strategy for you. Learn more in the full documentation.
Versioning of "last" checkpoints
If a run is configured to save to the same directory as a previous run and ModelCheckpoint(save_last=True) is enabled, the "last" checkpoint is now versioned with a simple -v1 suffix to avoid overwriting the existing "last" checkpoint. This mimics the behaviour for checkpoints that monitor a metric.
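As a minimal sketch of this behavior (the directory, model, and second run below are placeholders for illustration, not part of the release itself):
```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Both runs save to the same (placeholder) directory
checkpoint_callback = ModelCheckpoint(dirpath="checkpoints/", save_last=True)
trainer = pl.Trainer(callbacks=[checkpoint_callback])
trainer.fit(model)  # `model` is assumed to be a LightningModule defined elsewhere

# The first run writes   checkpoints/last.ckpt
# A later run saving to the same directory writes   checkpoints/last-v1.ckpt
# instead of overwriting the existing "last" checkpoint.
```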
Automatically reload the "last" checkpoint
In certain scenarios, like when running in a cloud spot instance with fault-tolerant training enabled, it is useful to load the latest available checkpoint. It is now possible to pass the string ckpt_path="last" in order to load the latest available checkpoint from the set of existing checkpoints.
```python
trainer = Trainer(...)
trainer.fit(..., ckpt_path="last")
```
Validation every N batches across epochs
In some cases, for example iteration-based training, it is useful to run validation after every N training batches without being limited by the epoch boundary. Now, you can enable validation based on total training batches.
```python
trainer = Trainer(..., val_check_interval=N, check_val_every_n_epoch=None)
trainer.fit(...)
```
For example, given 5 epochs of 10 batches, setting N=25 would run validation in the 3rd and 5th epoch.
CPU stats monitoring
PyTorch Lightning provides the DeviceStatsMonitor callback to monitor the stats of the hardware currently used. However, users often also want to monitor the stats of other hardware. In this release, we have added an option to additionally monitor CPU stats:
```python
from pytorch_lightning.callbacks import DeviceStatsMonitor

# Log both CPU stats and GPU stats
trainer = pl.Trainer(callbacks=DeviceStatsMonitor(cpu_stats=True), accelerator="gpu")

# Log just the GPU stats
trainer = pl.Trainer(callbacks=DeviceStatsMonitor(cpu_stats=False), accelerator="gpu")

# Equivalent to `DeviceStatsMonitor()`
trainer = pl.Trainer(callbacks=DeviceStatsMonitor(cpu_stats=True), accelerator="cpu")
```
The CPU stats are gathered using the psutil package.
Automatic distributed samplers
It is now possible to use custom samplers in a distributed environment without the need to set replace_sampler_ddp=False and wrap your sampler manually with the DistributedSampler.
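For example, a rough sketch of what this enables (the weighted sampler, dataset, and model below are placeholders assumed to be defined elsewhere):
```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader, WeightedRandomSampler

# A custom sampler on a placeholder dataset
sampler = WeightedRandomSampler(weights, num_samples=len(dataset))
dataloader = DataLoader(dataset, sampler=sampler)

# No replace_sampler_ddp=False and no manual DistributedSampler wrapping needed;
# Lightning now handles the distributed wrapping of the custom sampler.
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp")
trainer.fit(model, dataloader)
```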
Inference mode support
PyTorch 1.9 introduced torch.inference_mode, which is a faster alternative to torch.no_grad. Lightning will now use inference_mode wherever possible during evaluation.
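For reference, this is roughly the difference at the PyTorch level (the model and batch are placeholders); Lightning applies it for you during evaluation:
```python
import torch

# Before: gradients disabled during evaluation
with torch.no_grad():
    out = model(batch)

# Now, wherever possible: inference_mode additionally skips view tracking and
# version-counter bumps, making evaluation faster
with torch.inference_mode():
    out = model(batch)
```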
Support for warn-level determinism
In PyTorch 1.11, operations that do not have a deterministic implementation can be set to throw a warning instead of an error when run in deterministic mode. This is now supported by our Trainer:
```python
trainer = pl.Trainer(deterministic="warn")
```
LightningCLI improvements
After the latest updates to jsonargparse, the library supporting the LightningCLI, there's now complete support for shorthand notation. This includes automatic support for shorthand notation to all arguments, not just the ones that are part of the registries, plus support inside configuration files.
```diff
+ # pytorch_lightning==1.7.0
  trainer:
    callbacks:
-     - class_path: pytorch_lightning.callbacks.EarlyStopping
+     - class_path: EarlyStopping
        init_args:
          monitor: "loss"
```
A header with the version that generated the config is now included.
All subclasses for a given base class can be specified by name, so there's no need to explicitly register them. The only requirement is that the module where the subclass is defined is imported prior to parsing.
```python
from pytorch_lightning.cli import LightningCLI

import my_code.models
import my_code.optimizers

cli = LightningCLI()
# Now use any of the classes:
# python trainer.py fit --model=Model1 --optimizer=CustomOptimizer
```
The new version renders the registries and the auto_registry flag, introduced in 1.6.0, unnecessary, so we have deprecated them.
Support was also added for list appending; for example, to add a callback to an existing list that might be already configured:
```diff
  $ python trainer.py fit \
-     --trainer.callbacks=EarlyStopping \
+     --trainer.callbacks+=EarlyStopping \
      --trainer.callbacks.patience=5 \
-     --trainer.callbacks=LearningRateMonitor \
+     --trainer.callbacks+=LearningRateMonitor \
      --trainer.callbacks.logging_interval=epoch
```
Callback registration through entry points
Entry Points are an advanced feature in Python's setuptools that allow packages to expose metadata to other packages. In Lightning, we ...
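As a rough sketch of how a package can expose itself through an entry point (the group name, package, and factory function below are assumptions for illustration; see the Lightning documentation for the exact contract):
```python
# setup.py of a hypothetical third-party package
from setuptools import setup

setup(
    name="my-lightning-callbacks",
    entry_points={
        # Assumed entry-point group that Lightning scans for callback factories
        "pytorch_lightning.callbacks_factory": [
            "my_callbacks = my_package.callbacks:make_callbacks",
        ],
    },
)
```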