Releases: Lightning-AI/pytorch-lightning
Standard weekly patch release
[1.5.6] - 2021-12-15
Fixed
- Fixed a bug where the DeepSpeedPlugin arguments `cpu_checkpointing` and `contiguous_memory_optimization` were not being forwarded to deepspeed correctly (#10874)
- Fixed an issue with `NeptuneLogger` causing checkpoints to be uploaded with a duplicated file extension (#11015)
- Fixed support for logging within callbacks returned from `LightningModule` (#10991)
- Fixed running sanity check with `RichProgressBar` (#10913)
- Fixed support for `CombinedLoader` while checking for warning raised with eval dataloaders (#10994)
- The TQDM progress bar now correctly shows the `on_epoch` logged values on train epoch end (#11069)
- Fixed bug where the TQDM updated the training progress bar during `trainer.validate` (#11069)
Contributors
@carmocca @jona-0 @kaushikb11 @Raalsky @rohitgr7
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.5.5] - 2021-12-07
Fixed
- Disabled batch_size extraction for torchmetric instances because they accumulate the metrics internally (#10815)
- Fixed an issue with `SignalConnector` not restoring the default signal handlers on teardown when running on SLURM or with fault-tolerant training enabled (#10611)
- Fixed `SignalConnector._has_already_handler` check for callable type (#10483)
- Fixed an issue to return the results for each dataloader separately instead of duplicating them for each (#10810)
- Improved exception message if `rich` version is less than `10.2.2` (#10839)
- Fixed uploading best model checkpoint in NeptuneLogger (#10369)
- Fixed early schedule reset logic in PyTorch profiler that was causing data leak (#10837)
- Fixed a bug that caused incorrect batch indices to be passed to the `BasePredictionWriter` hooks when using a dataloader with `num_workers > 0` (#10870)
- Fixed an issue with item assignment on the logger on rank > 0 for loggers that support it (#10917)
- Fixed importing `torch_xla.debug` for `torch-xla<1.8` (#10836)
- Fixed an issue with `DDPSpawnPlugin` and related plugins leaving a temporary checkpoint behind (#10934)
- Fixed a `TypeError` occurring in the `SignalConnector.teardown()` method (#10961)
Contributors
@awaelchli @carmocca @four4fish @kaushikb11 @lucmos @mauvilsa @Raalsky @rohitgr7
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.5.4] - 2021-11-30
Fixed
- Fixed support for `--key.help=class` with the `LightningCLI` (#10767)
- Fixed `_compare_version` for python packages (#10762)
- Fixed the TensorBoardLogger `SummaryWriter` not being closed before spawning the processes (#10777)
- Fixed a consolidation error in Lite when attempting to save the state dict of a sharded optimizer (#10746)
- Fixed the default logging level for batch hooks associated with training from `on_step=False, on_epoch=True` to `on_step=True, on_epoch=False` (#10756)
Removed
Contributors
@awaelchli @carmocca @kaushikb11 @rohitgr7 @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.5.3] - 2021-11-24
Fixed
- Fixed `ShardedTensor` state dict hook registration to check if torch distributed is available (#10621)
- Fixed an issue with `self.log` not respecting a tensor's `dtype` when applying computations (#10076)
- Fixed LightningLite `_wrap_init` popping non-existent keys from DataLoader signature parameters (#10613)
- Fixed signals being registered within threads (#10610)
- Fixed an issue that caused Lightning to extract the batch size even though it was set by the user in `LightningModule.log` (#10408)
- Fixed `Trainer(move_metrics_to_cpu=True)` not moving the evaluation logged results to CPU (#10631)
- Fixed the `{validation,test}_step` outputs getting moved to CPU with `Trainer(move_metrics_to_cpu=True)` (#10631)
- Fixed an issue with collecting logged test results with multiple dataloaders (#10522)
Contributors
@ananthsub @awaelchli @carmocca @jiwidi @kaushikb11 @qqueing @rohitgr7 @shabie @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.5.2] - 2021-11-16
Fixed
- Fixed `CombinedLoader` and `max_size_cycle` not receiving a `DistributedSampler` (#10374)
- Fixed an issue where class or init-only variables of dataclasses were passed to the dataclass constructor in `utilities.apply_to_collection` (#9702)
- Fixed `isinstance` not working with `init_meta_context`, materialized model not being moved to the device (#10493)
- Fixed an issue that prevented the Trainer from shutting down workers when execution is interrupted due to failure (#10463)
- Squeeze the early stopping monitor to remove empty tensor dimensions (#10461)
- Fixed sampler replacement logic with `overfit_batches` to only replace the sampler when `SequentialSampler` is not used (#10486)
- Fixed scripting causing false positive deprecation warnings (#10470, #10555)
- Do not fail if batch size could not be inferred for logging when using DeepSpeed (#10438)
- Fixed propagation of device and dtype information to submodules of LightningLite when they inherit from `DeviceDtypeModuleMixin` (#10559)
Contributors
@a-gardner1 @awaelchli @carmocca @justusschock @Raahul-Singh @rohitgr7 @SeanNaren @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.5.1] - 2021-11-09
Fixed
- Fixed `apply_to_collection(defaultdict)` (#10316)
- Fixed failure when `DataLoader(batch_size=None)` is passed (#10345)
- Fixed interception of `__init__` arguments for sub-classed DataLoader re-instantiation in Lite (#10334)
- Fixed issue with pickling `CSVLogger` after a call to `CSVLogger.save` (#10388)
- Fixed an import error being caused by `PostLocalSGD` when `torch.distributed` is not available (#10359)
- Fixed the logging with `on_step=True` in epoch-level hooks causing unintended side-effects. Logging with `on_step=True` in epoch-level hooks will now correctly raise an error (#10409)
- Fixed deadlocks for distributed training with `RichProgressBar` (#10428)
- Fixed an issue where the model wrapper in Lite converted non-floating point tensors to float (#10429)
- Fixed an issue with inferring the dataset type in fault-tolerant training (#10432)
- Fixed dataloader workers with `persistent_workers` being deleted on every iteration (#10434)
Contributors
@EspenHa @four4fish @peterdudfield @rohitgr7 @tchaton @kaushikb11 @awaelchli @Borda @carmocca
If we forgot someone due to not matching commit email with GitHub account, let us know :]
PyTorch Lightning 1.5: LightningLite, Fault-Tolerant Training, Loop Customization, Lightning Tutorials, LightningCLI v2, RichProgressBar, CheckpointIO Plugin, and Trainer Strategy Flag
The PyTorch Lightning team and its community are excited to announce Lightning 1.5, introducing support for LightningLite, Fault-tolerant Training, Loop Customization, Lightning Tutorials, LightningCLI V2, RichProgressBar, CheckpointIO Plugin, Trainer Strategy flag, and more!
Highlights
Lightning 1.5 marks our biggest release yet. Over 60 contributors have worked on features, bugfixes and documentation improvements for a total of 640 commits since v1.4. Here are some highlights:
Fault-tolerant Training
Fault-tolerant Training is a new internal mechanism that enables PyTorch Lightning to recover from a hardware or software failure. This is particularly useful when training in the cloud on preemptible instances, which can shut down at any time. If a Lightning experiment exits unexpectedly, a temporary checkpoint is saved that contains the exact state of all loops and the model. With this new experimental feature, you can restore your training mid-epoch on the exact batch and continue training as if it had never been interrupted.
```bash
PL_FAULT_TOLERANT_TRAINING=1 python train.py
```
LightningLite
LightningLite enables pure PyTorch users to scale their existing code to any kind of hardware while retaining full control over their own loops and optimization logic.
With just a few lines of code and no large refactoring, you get support for multi-device and multi-node training, running on different accelerators (CPU, GPU, TPU), native automatic mixed precision (half and bfloat16), and double precision. No special launcher is required! Check out our documentation to find out how you can get one step closer to boilerplate-free research!
```python
import torch
import torch.nn.functional as F
from torch import optim
from pytorch_lightning.lite import LightningLite


class Lite(LightningLite):
    def run(self):
        # Let Lite setup your dataloader(s)
        train_loader = self.setup_dataloaders(torch.utils.data.DataLoader(...))

        model = Net()  # .to() not needed
        optimizer = optim.Adam(model.parameters())
        # Let Lite setup your model and optimizer
        model, optimizer = self.setup(model, optimizer)

        for epoch in range(5):
            for data, target in train_loader:
                optimizer.zero_grad()
                output = model(data)  # data is already on the device
                loss = F.nll_loss(output, target)
                self.backward(loss)  # instead of loss.backward()
                optimizer.step()


Lite(accelerator="gpu", devices="auto").run()
```
Loop Customization
The new Loop API lets advanced users swap out the default gradient descent optimization loop at the core of Lightning with a different optimization paradigm. This is part of our effort to make Lightning the simplest, most flexible framework to take any kind of deep learning research to production.
Read our comprehensive introduction to loops
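For illustration, here is a minimal sketch of the pattern, assuming the `Loop` base class from `pytorch_lightning.loops` (names such as `SimpleLoop` and `max_iterations` are illustrative, not from the release notes): a custom loop declares when it is done, how to reset its state, and what a single step of work does.

```python
from pytorch_lightning.loops import Loop


class SimpleLoop(Loop):
    """A toy loop that runs a fixed number of iterations."""

    def __init__(self, max_iterations: int):
        super().__init__()
        self.max_iterations = max_iterations
        self.iteration = 0

    @property
    def done(self) -> bool:
        # iteration stops once this returns True
        return self.iteration >= self.max_iterations

    def reset(self) -> None:
        # called before the loop starts iterating
        self.iteration = 0

    def advance(self) -> None:
        # one unit of work, e.g. a single optimization step
        self.iteration += 1
```

Custom loops built this way can replace the Trainer's built-in loops; see the loops documentation linked above for how to connect them.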
New Rich Progress Bar
We integrated with Rich and created a new and improved progress bar for Lightning.
Try it out:
```bash
pip install rich
```

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import RichProgressBar

trainer = Trainer(callbacks=[RichProgressBar()])
```
New Trainer Arguments: Strategy and Devices
With the new `strategy` and `devices` arguments in the Trainer, it is now easier to switch from one type of hardware to another.
| Before | After |
|---|---|
| `Trainer(accelerator="ddp", gpus=2)` | `Trainer(accelerator="gpu", devices=2, strategy="ddp")` |
| `Trainer(accelerator="ddp_cpu", num_processes=2)` | `Trainer(accelerator="cpu", devices=2, strategy="ddp")` |
| `Trainer(accelerator="tpu_spawn", tpu_cores=8)` | `Trainer(accelerator="tpu", devices=8)` |
The new `devices` argument is now agnostic to all accelerators, but the previous arguments `gpus`, `tpu_cores`, `ipus` are still available and work the same as before. In addition, it is now also possible to set `devices="auto"` or `accelerator="auto"` to select the best accelerator available on the hardware.
```python
from pytorch_lightning import Trainer

trainer = Trainer(accelerator="auto", devices="auto")
```
LightningCLI V2
This release adds support for running not just Trainer.fit but any of the Trainer entry points!
```bash
python script.py fit
python script.py test
```
LightningCLI now supports registries for callbacks, optimizers, learning rate schedulers, LightningModules and LightningDataModules. This greatly improves the command line experience, as only the class names and arguments are required:
```bash
python script.py \
    --trainer.callbacks=EarlyStopping \
    --trainer.callbacks.patience=5 \
    --trainer.callbacks=LearningRateMonitor \
    --trainer.callbacks.logging_interval=epoch \
    --optimizer=Adam \
    --optimizer.lr=0.01 \
    --lr_scheduler=OneCycleLR \
    --lr_scheduler.anneal_strategy=linear
```
We've also added support for a manual mode where the CLI takes care of the instantiation but you have control over the Trainer calls:
```python
cli = LightningCLI(MyModel, run=False)
cli.trainer.fit(cli.model)
```
CheckpointIO Plugins
As part of our commitment to extensibility, we have abstracted the checkpointing logic into a CheckpointIO plugin. This enables users to adapt Lightning to their own infrastructure.
```python
from pytorch_lightning.plugins import CheckpointIO


class CustomCheckpointIO(CheckpointIO):
    def save_checkpoint(self, checkpoint, path):
        # put all logic related to saving a checkpoint here
        ...

    def load_checkpoint(self, path):
        # put all logic related to loading a checkpoint here
        ...

    def remove_checkpoint(self, path):
        # put all logic related to deleting a checkpoint here
        ...
```
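A custom plugin is then handed to the Trainer. A minimal usage sketch, reusing the `CustomCheckpointIO` class from the example above (not an excerpt from the release notes):

```python
from pytorch_lightning import Trainer

# pass the custom checkpoint plugin to the Trainer via `plugins`
trainer = Trainer(plugins=[CustomCheckpointIO()])
```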
BFloat16 Support
PyTorch 1.10 introduces native Automatic Mixed Precision (AMP) support for torch.bfloat16 on CPU (it was already supported on TPUs), enabling higher performance compared with torch.float16. Switch to bfloat16 training by setting the argument:
```python
from pytorch_lightning import Trainer

trainer = Trainer(precision="bf16")
```
Enable Auto Parameters Tying
It is pretty common to share parameters within a model. However, TPUs do not retain shared parameters once the model is moved to the device. Lightning now automatically detects shared parameters and re-assigns them to alleviate this problem on TPUs.
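As a rough illustration of the kind of parameter sharing this addresses (the module below is a made-up example, not taken from the release notes), two layers share a single weight tensor; with automatic parameter tying, the sharing is preserved after the model is moved to a TPU device:

```python
import torch.nn as nn
from pytorch_lightning import LightningModule


class TiedModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(1000, 64)
        self.decoder = nn.Linear(64, 1000, bias=False)
        # classic weight tying: the decoder reuses the embedding weight
        self.decoder.weight = self.embedding.weight
```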
Infinite Training
Infinite training is now supported by setting `Trainer(max_epochs=-1)` for an unlimited number of epochs, or `Trainer(max_steps=-1)` for an endless epoch.
Note: you will want to avoid logging with `on_epoch=True` in case of `max_steps=-1`.
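A minimal sketch of the two configurations mentioned above:

```python
from pytorch_lightning import Trainer

# train for an unlimited number of epochs
trainer = Trainer(max_epochs=-1)

# or run a single, endless epoch
trainer = Trainer(max_steps=-1)
```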
DeepSpeed Stage 1
DeepSpeed is a deep learning training optimization library, providing the means to train massive billion parameter models at scale. Lightning now also supports the DeepSpeed ZeRO Stage 1 protocol that partitions your optimizer states across your GPUs to reduce memory.
```python
from pytorch_lightning import Trainer

trainer = Trainer(gpus=4, strategy="deepspeed_stage_1", precision=16)
trainer.fit(model)
```
For even more memory savings and model sharding advice, check out stage 2 & 3 as well in our multi-GPU docs.
Gradient Clipping Customization
By overriding the LightningModule.configure_gradient_clipping hook, you can customize gradient clipping to your needs:
```python
# Perform gradient clipping on gradients associated with discriminator (optimizer_idx=1) in GAN
def configure_gradient_clipping(
    self,
    optimizer,
    optimizer_idx,
    gradient_clip_val,
    gradient_clip_algorithm,
):
    if optimizer_idx == 1:
        # Lightning will handle the gradient clipping
        self.clip_gradients(
            optimizer,
            gradient_clip_val=gradient_clip_val,
            gradient_clip_algorithm=gradient_clip_algorithm,
        )
```
This means you can now implement state-of-the-art clipping algorithms with Lightning!
Determinism
Added support for torch.use_deterministic_algorithms. Read more about how it works here. You can enable it by setting:
```python
from pytorch_lightning import Trainer

trainer = Trainer(deterministic=True)
```
Anomaly Detection
Lightning makes it easier to debug your code, so we've added support for `torch.autograd.set_detect_anomaly`. With this, PyTorch detects numerical anomalies like NaN or inf during forward and backward. Read more about anomaly detection here.
```python
from pytorch_lightning import Trainer

trainer = Trainer(detect_anomaly=True)
```
DDP Debugging Improvements
Are you having a hard time debugging DDP on your remote machine? Now you can de...
Standard weekly patch release
[1.4.9] - 2021-09-30
- Moved the gradient unscaling in `NativeMixedPrecisionPlugin` from `pre_optimizer_step` to `post_backward` (#9606)
- Fixed gradient unscaling being called too late, causing gradient clipping and gradient norm tracking to be applied incorrectly (#9606)
- Fixed `lr_find` to generate same results on multiple calls (#9704)
- Fixed `reset` metrics on validation epoch end (#9717)
- Fixed input validation for `gradient_clip_val`, `gradient_clip_algorithm`, `track_grad_norm` and `terminate_on_nan` Trainer arguments (#9595)
- Reset metrics before each task starts (#9410)
Contributors
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.4.8] - 2021-09-22
- Fixed error reporting in DDP process reconciliation when processes are launched by an external agent (#9389)
- Added `PL_RECONCILE_PROCESS` environment variable to enable process reconciliation regardless of cluster environment settings (#9389)
- Fixed `add_argparse_args` raising `TypeError` when args are typed as `typing.Generic` in Python 3.6 (#9554)
- Fixed back-compatibility for saving hyperparameters from a single container and inferring its argument name by reverting #9125 (#9642)
Contributors
@ananthsub @akihironitta @awaelchli @carmocca @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Standard weekly patch release
[1.4.7] - 2021-09-14
- Fixed logging of nan parameters (#9364)
- Fixed `replace_sampler` missing the batch size under specific conditions (#9367)
- Pass init args to ShardedDataParallel (#9483)
- Fixed collision of user argument when using ShardedDDP (#9512)
- Fixed DeepSpeed crash for RNNs (#9489)
Contributors
@asanakoy @awaelchli @borisdayma @carmocca @guotuofeng @justusschock @kaushikb11 @rohitgr7 @SeanNaren
If we forgot someone due to not matching commit email with GitHub account, let us know :]