Releases: Lightning-AI/pytorch-lightning

Bug fixes and .test() fix + TPU tests

10 Jul 02:01
92d6abc

Overview

The point of this release is more bug fixes ahead of v1.0.0. We now have CI tests on TPU thanks to @zcain117 from Google! 🙂
This means we fixed many TPU bugs we hadn’t caught before because we had no tests.
In addition, we fixed:

  • all the file path errors with loggers (thanks @awaelchli)
  • pickling errors with loggers (thanks @awaelchli)
  • all the .test() calls
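
As a rough illustration of the kind of pickling problem mentioned above, here is a minimal sketch of one common way to make an object picklable when it holds a resource that cannot be serialized. FileLogger, its writer property and roundtrip are hypothetical names for this example only; this is not the code used by the Lightning loggers.

```python
import pickle
import threading


class FileLogger:
    """Hypothetical logger holding a resource that cannot be pickled."""

    def __init__(self, save_dir):
        self.save_dir = save_dir
        self._writer = None  # created lazily on first use

    @property
    def writer(self):
        # A lock stands in for a real file handle or experiment client.
        if self._writer is None:
            self._writer = threading.Lock()
        return self._writer

    def __getstate__(self):
        # Copy the attributes and drop the unpicklable resource; it is
        # rebuilt lazily by the `writer` property after unpickling.
        state = self.__dict__.copy()
        state["_writer"] = None
        return state


def roundtrip(logger):
    """Serialize and restore a logger, as process spawning would."""
    return pickle.loads(pickle.dumps(logger))
```

Calling roundtrip(FileLogger("logs")) from a script should return a copy that keeps save_dir and recreates the writer on first access.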

Detail changes

Added

  • Added a PSNR metric: peak signal-to-noise ratio (#2483)
  • Added functional regression metrics (#2492)
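
For readers unfamiliar with PSNR, the standard definition is 10 * log10(data_range^2 / MSE). The sketch below computes it on plain Python lists; the function name psnr and the data_range parameter are assumptions for this example and may not match the signature of the metric added in #2483.

```python
import math


def psnr(preds, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length sequences."""
    if len(preds) != len(target) or len(preds) == 0:
        raise ValueError("preds and target must be non-empty and equal in length")
    # Mean squared error between prediction and target.
    mse = sum((p - t) ** 2 for p, t in zip(preds, target)) / len(preds)
    if mse == 0:
        # Identical signals: the ratio is unbounded.
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / mse)
```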

Removed

  • Removed auto val reduce (#2462)

Fixed

  • Flattening Wandb Hyperparameters (#2459)
  • Fixed DDP to use the same Python interpreter and to actually run (#2482)
  • Fixed model summary input type conversion for models that have input dtype different from model parameters (#2510)
  • Made TensorBoardLogger and CometLogger pickleable (#2518)
  • Fixed a problem with MLflowLogger creating multiple run folders (#2502)
  • Fixed global_step increment (#2455)
  • Fixed TPU hanging example (#2488)
  • Fixed argparse default value bug (#2526)
  • Fixed Dice and IoU to avoid NaN by adding small eps (#2545)
  • Fixed accumulate gradients schedule at epoch 0 (continued) (#2513)
  • Fixed Trainer .fit() returning last not best weights in "ddp_spawn" (#2565)
  • Fixed TPU weights being passed back on test; they are no longer passed (#2566)
  • Fixed DDP tests and .test() (#2512, #2570)
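
To illustrate what flattening hyperparameters means in the Wandb fix above, here is a minimal sketch that turns a nested dictionary into a single-level one. The function name flatten_params and the "/" separator are assumptions for this example, not necessarily what the logger uses.

```python
def flatten_params(params, sep="/"):
    """Flatten a nested dict into one level, joining keys with `sep`."""
    flat = {}
    for key, value in params.items():
        if isinstance(value, dict):
            # Recurse into nested dicts and prefix their keys.
            for sub_key, sub_value in flatten_params(value, sep).items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat
```

For example, {"optim": {"lr": 0.01}} becomes {"optim/lr": 0.01}.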

Contributors

@anthonytec2, @awaelchli, @bernardomig, @Borda, @EspenHa, @HHousen, @InCogNiTo124, @rohitgr7, @williamFalcon

If we forgot someone due to not matching commit email with GitHub account, let us know :]

More bug fixing!

01 Jul 11:56
695e051

Detail changes

Added

  • Added reduce ddp results on eval (#2434)
  • Added a warning when an IterableDataset has __len__ defined (#2437)

Changed

  • Enabled no returns from eval (#2446)

Fixed

  • Fixes train outputs (#2428)
  • Fixes Conda dependencies (#2412)
  • Fixed Apex scaling with decoupled backward (#2433)
  • Fixed crashing or wrong displaying progressbar because of missing ipywidgets (#2417)
  • Fixed TPU saving dir (fc26078, 04e68f0)
  • Fixed logging on rank 0 only (#2425)

Contributors

@awaelchli, @Borda, @olineumann, @williamFalcon

Bug fixing

29 Jun 11:38
dec074c

Fixed

DDP and Checkpoint bug fixes

29 Jun 02:09
8f07b77

Pre-release

Overview

As we continue to strengthen the codebase with more tests, we’re finally getting rid of annoying bugs that have been around for a while, mostly around the inconsistent checkpoint and early stopping behaviour (amazing work @awaelchli @jeremyjordan).

Noteworthy changes:

  • Fixed TPU flag parsing
  • Fixed average_precision metric
  • All the checkpoint issues should be gone now (including backward support for old checkpoints)
  • DDP + loggers should be fixed

Detail changes

Added

  • Added TorchText support for moving data to GPU (#2379)

Changed

  • Changed epoch indexing to start from 0 instead of 1 (#2289)
  • Refactored model backward (#2276)
  • Refactored training_batch + tests to verify correctness (#2327, #2328)
  • Refactored training loop (#2336)
  • Made optimization steps for hooks (#2363)
  • Changed default apex level to 'O2' (#2362)

Removed

  • Moved TrainsLogger to Bolts (#2384)

Fixed

  • Fixed parsing TPU arguments and TPU tests (#2094)
  • Fixed number batches in case of multiple dataloaders and limit_{*}_batches (#1920, #2226)
  • Fixed an issue with forward hooks not being removed after model summary (#2298)
  • Fix for load_from_checkpoint() not working with absolute path on Windows (#2294)
  • Fixed an issue with how _has_len handles NotImplementedError, e.g. raised by torchtext.data.Iterator (#2293, #2307)
  • Fixed average_precision metric (#2319)
  • Fixed ROC metric for CUDA tensors (#2304)
  • Fixed lost compatibility with custom datatypes implementing .to (#2335)
  • Fixed loading model with kwargs (#2387)
  • Fixed sum(0) for trainer.num_val_batches (#2268)
  • Fixed checking if the parameters are a DictConfig Object (#2216)
  • Fixed SLURM weights saving (#2341)
  • Fixed swaps LR scheduler order (#2356)
  • Fixed adding tensorboard hparams logging test (#2342)
  • Fixed use model ref for tear down (#2360)
  • Fixed logger crash on DDP (#2388)
  • Fixed several issues with early stopping and checkpoint callbacks (#1504, #2391)
  • Fixed loading past checkpoints from v0.7.x (#2405)
  • Fixed loading model without arguments (#2403)
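
The fix for limit_{*}_batches with multiple dataloaders can be pictured with a small sketch. It assumes the convention that an int limit is an absolute number of batches and a float limit in [0, 1] is a fraction of each dataloader; the helper names and the rounding rule (truncation) are assumptions for this example, not the Trainer's internal code.

```python
def limited_num_batches(num_batches, limit):
    """Apply a batch limit to a single dataloader length."""
    if isinstance(limit, bool):
        raise TypeError("limit must be an int or a float, not a bool")
    if isinstance(limit, int):
        # Absolute cap: never more batches than the dataloader has.
        return min(num_batches, limit)
    if 0.0 <= limit <= 1.0:
        # Fractional limit: truncation is an assumption of this sketch.
        return int(num_batches * limit)
    raise ValueError("a float limit must lie in [0.0, 1.0]")


def limited_per_loader(lengths, limit):
    """Apply the same limit independently to every dataloader."""
    return [limited_num_batches(n, limit) for n in lengths]
```

The point of the sketch is that the limit applies to each dataloader separately rather than to their combined length.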

Contributors

@airium, @awaelchli, @Borda, @elias-ramzi, @jeremyjordan, @lezwon, @mateuszpieniak, @mmiakashs, @pwl, @rohitgr7, @ssakhavi, @thschaaf, @tridao, @williamFalcon

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Fixing hooks & hparams

19 Jun 06:44
2fbc997

Overview

Fixing critical bugs in newly added hooks and hparams assignment.
The recommended data flow is the following:

  1. Use prepare_data to download and process the dataset.
  2. Use setup to do splits and build your model internals.
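
The two steps above can be sketched with a plain Python stand-in. DataFlowSketch is not a LightningModule, the storage argument is a hypothetical substitute for files on disk, and the real hooks have different signatures; the sketch only shows the intended division of work: prepare_data runs once and stores no state on the object, while setup runs in every process and assigns state.

```python
class DataFlowSketch:
    """Plain-Python stand-in showing which hook does what."""

    def __init__(self):
        self.train_split = None
        self.val_split = None

    def prepare_data(self, storage):
        # Download/process once and write to shared storage.
        # No attributes are set here, since other processes never see them.
        storage["dataset"] = list(range(10))

    def setup(self, storage):
        # Runs in every process: read the prepared data and split it.
        data = storage["dataset"]
        self.train_split, self.val_split = data[:8], data[8:]


def simulate(num_processes=2):
    """Mimic one prepare_data call followed by setup on every process."""
    storage = {}
    modules = [DataFlowSketch() for _ in range(num_processes)]
    modules[0].prepare_data(storage)  # only the first process prepares
    for module in modules:
        module.setup(storage)
    return modules
```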

Detail changes

  • Fixed the load_from_checkpoint path detected as URL bug (#2244)
  • Fixed hooks - added barrier (#2245, #2257, #2260)
  • Fixed hparams - remove frame inspection on self.hparams (#2253)
  • Fixed setup and on fit calls (#2252)
  • Fixed GPU template (#2255)

Metrics, speed improvements, new hooks and flags

19 Jun 07:02
e0b7359

Overview

Highlights of this release are the new Metrics package and new hooks and flags to customize your workflow.

Major features:

  • brand new Metrics package with built-in DDP support (by @justusschock and @SkafteNicki)
  • hparams can now be anything! (call self.save_hyperparameters() to register anything in the __init__)
  • many speed improvements (how we move data, adjusted some flags & PL now adds 300ms overhead per epoch only!)
  • much faster ddp implementation. Old one was renamed ddp_spawn
  • better support for Hydra
  • added the overfit_batches flag and corrected some bugs with the limit_[train,val,test]_batches flag
  • added conda support
  • tons of bug fixes 😉

Detail changes

Added

  • Added overfit_batches, limit_{val|test}_batches flags (overfit now uses training set for all three) (#2213)
  • Added metrics
  • Added type hints in Trainer.fit() and Trainer.test() to reflect that also a list of dataloaders can be passed in (#1723)
  • Allow dataloaders without sampler field present (#1907)
  • Added option save_last to save the model at the end of every epoch in ModelCheckpoint (#1908)
  • Early stopping checks on_validation_end (#1458)
  • Attribute best_model_path to ModelCheckpoint for storing and later retrieving the path to the best saved model file (#1799)
  • Speed up single-core TPU training by loading data using ParallelLoader (#2033)
  • Added a model hook transfer_batch_to_device that enables moving custom data structures to the target device (#1756)
  • Added black formatter for the code with code-checker on pull (#1610)
  • Added back the slow spawn ddp implementation as ddp_spawn (#2115)
  • Added loading checkpoints from URLs (#1667)
  • Added a callback method on_keyboard_interrupt for handling KeyboardInterrupt events during training (#2134)
  • Added a decorator auto_move_data that moves data to the correct device when using the LightningModule for inference (#1905)
  • Added ckpt_path option to Trainer.test(...) to load particular checkpoint (#2190)
  • Added setup and teardown hooks for model (#2229)

Changed

  • Allow user to select individual TPU core to train on (#1729)
  • Removed non-finite values from loss in LRFinder (#1862)
  • Allow passing model hyperparameters as complete kwarg list (#1896)
  • Renamed ModelCheckpoint's attributes best to best_model_score and kth_best_model to kth_best_model_path (#1799)
  • Re-Enable Logger's ImportErrors (#1938)
  • Changed the default value of the Trainer argument weights_summary from full to top (#2029)
  • Raise an error when lightning replaces an existing sampler (#2020)
  • Enabled prepare_data from correct processes - clarify local vs global rank (#2166)
  • Remove explicit flush from tensorboard logger (#2126)
  • Changed epoch indexing to start from 1 instead of 0 (#2206)

Deprecated

  • Deprecated flags: (#2213)
    • overfit_pct in favour of overfit_batches
    • val_percent_check in favour of limit_val_batches
    • test_percent_check in favour of limit_test_batches
  • Deprecated ModelCheckpoint's attributes best and kth_best_model (#1799)
  • Dropped official support/testing for older PyTorch versions <1.3 (#1917)
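
If you keep Trainer arguments in a dictionary or config file, the renames listed above can be applied mechanically. The helper below is a hypothetical migration sketch based only on the three renames in this list; it is not part of the library, and it does not check whether the old and new flags interpret their values identically.

```python
# Old flag name -> new flag name, taken from the deprecation list above.
RENAMED_FLAGS = {
    "overfit_pct": "overfit_batches",
    "val_percent_check": "limit_val_batches",
    "test_percent_check": "limit_test_batches",
}


def migrate_trainer_kwargs(kwargs):
    """Return a copy of `kwargs` with deprecated flag names replaced."""
    migrated = {}
    for name, value in kwargs.items():
        new_name = RENAMED_FLAGS.get(name, name)
        if new_name in migrated:
            # Both the old and the new spelling were given.
            raise ValueError(f"flag {new_name!r} specified more than once")
        migrated[new_name] = value
    return migrated
```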

Removed

  • Removed unintended Trainer argument progress_bar_callback, the callback should be passed in by Trainer(callbacks=[...]) instead (#1855)
  • Removed obsolete self._device in Trainer (#1849)
  • Removed deprecated API (#2073)
    • Packages: pytorch_lightning.pt_overrides, pytorch_lightning.root_module
    • Modules: pytorch_lightning.logging.comet_logger, pytorch_lightning.logging.mlflow_logger, pytorch_lightning.logging.test_tube_logger, pytorch_lightning.overrides.override_data_parallel, pytorch_lightning.core.model_saving, pytorch_lightning.core.root_module
    • Trainer arguments: add_row_log_interval, default_save_path, gradient_clip, nb_gpu_nodes, max_nb_epochs, min_nb_epochs, nb_sanity_val_steps
    • Trainer attributes: nb_gpu_nodes, num_gpu_nodes, gradient_clip, max_nb_epochs, min_nb_epochs, nb_sanity_val_steps, default_save_path, tng_tqdm_dic

Fixed

  • Run graceful training teardown on interpreter exit (#1631)
  • Fixed user warning when apex was used together with learning rate schedulers (#1873)
  • Fixed multiple calls of EarlyStopping callback (#1863)
  • Fixed an issue with Trainer.from_argparse_args when passing in unknown Trainer args (#1932)
  • Fixed bug related to logger not being reset correctly for model after tuner algorithms (#1933)
  • Fixed root node resolution for SLURM cluster with dash in hostname (#1954)
  • Fixed LearningRateLogger in multi-scheduler setting (#1944)
  • Fixed test configuration check and testing (#1804)
  • Fixed an issue with Trainer constructor silently ignoring unknown/misspelt arguments (#1820)
  • Fixed save_weights_only in ModelCheckpoint (#1780)
  • Allow use of same WandbLogger instance for multiple training loops (#2055)
  • Fixed an issue with _auto_collect_arguments collecting local variables that are not constructor arguments and not working for signatures that have the instance not named self (#2048)
  • Fixed mistake in parameters' grad norm tracking (#2012)
  • Fixed CPU and hanging GPU crash (#2118)
  • Fixed an issue with the model summary and example_input_array depending on a specific ordering of the submodules in a LightningModule (#1773)
  • Fixed TPU logging (#2230)
  • Fixed PID port + duplicate rank_zero logging (#2140, #2231)

Contributors

@awaelchli, @baldassarreFe, @Borda, @borisdayma, @cuent, @devashishshankar, @ivannz, @j-dsouza, @justusschock, @kepler, @kumuji, @lezwon, @lgvaz, @LoicGrobol, @mateuszpieniak, @maximsch2, @moi90, @rohitgr7, @SkafteNicki, @tullie, @williamFalcon, @yukw777, @ZhaofengWu

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Transfer learning, tuning batch size, torchelastic support

15 May 12:37
e95e1d7

Overview

Highlights of this release are support for TorchElastic, which enables distributed PyTorch training jobs to be executed in a fault-tolerant and elastic manner; auto-scaling of batch size; a new transfer learning example; and an option to provide a seed to random generators to ensure reproducibility.
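
The release notes do not describe how the batch size is scaled. One simple strategy, sketched here purely as an assumption, is to keep doubling the batch size until a trial run no longer fits in memory. The fits callback and all names below are hypothetical.

```python
def scale_batch_size(fits, init=2, max_trials=25):
    """Double the batch size until `fits` raises MemoryError.

    `fits(batch_size)` is a hypothetical callback that runs a few
    training steps and raises MemoryError when the batch is too large.
    Returns the largest batch size that worked, or None if none did.
    """
    batch_size = init
    largest_ok = None
    for _ in range(max_trials):
        try:
            fits(batch_size)
        except MemoryError:
            break
        largest_ok = batch_size
        batch_size *= 2
    return largest_ok
```

With a callback that fails above 100 samples, the search tries 2, 4, ..., 64, fails at 128 and settles on 64.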

Detail changes

Added

  • Added callback for logging learning rates (#1498)
  • Added transfer learning example (for a binary classification task in computer vision) (#1564)
  • Added type hints in Trainer.fit() and Trainer.test() to reflect that also a list of dataloaders can be passed in (#1723).
  • Added auto scaling of batch size (#1638)
  • The progress bar metrics now also get updated in training_epoch_end (#1724)
  • Enable NeptuneLogger to work with distributed_backend=ddp (#1753)
  • Added option to provide seed to random generators to ensure reproducibility (#1572)
  • Added override for hparams in load_from_ckpt (#1797)
  • Added support for multi-node distributed execution under torchelastic (#1811, #1818)
  • Added using store_true for bool args (#1822, #1842)
  • Added dummy logger for internally disabling logging for some features (#1836)
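
As a reminder of why seeding matters for reproducibility, the sketch below seeds only Python's built-in generator. The function name seed_python_rngs is an assumption for this example; a real training run would also need to seed NumPy and PyTorch, which is omitted here to keep the sketch free of third-party imports.

```python
import os
import random


def seed_python_rngs(seed):
    """Seed Python's built-in random module and record the seed."""
    # PYTHONHASHSEED only affects interpreters started after this point,
    # such as worker subprocesses.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    return seed
```

Seeding with the same value before two runs yields the same sequence of random numbers in both.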

Changed

  • Enable non-blocking for device transfers to GPU (#1843)
  • Replace meta_tags.csv with hparams.yaml (#1271)
  • Reduction when batch_size < num_gpus (#1609)
  • Updated LightningTemplateModel to look more like Colab example (#1577)
  • Don't convert namedtuple to tuple when transferring the batch to target device (#1589)
  • Allow passing hparams as a keyword argument to LightningModule when loading from checkpoint (#1639)
  • Args should come after the last positional argument (#1807)
  • Made DDP the default if no backend specified with multiple GPUs (#1789)
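
The namedtuple change above is easiest to see in code. The sketch below walks a nested batch and rebuilds each namedtuple with its own type instead of degrading it to a plain tuple. move_to_device and the move callback (a stand-in for something like tensor.to(device)) are hypothetical names, not the library's implementation.

```python
from collections import namedtuple


def move_to_device(batch, move):
    """Apply `move` to every leaf of a nested batch, keeping container types."""
    if isinstance(batch, tuple) and hasattr(batch, "_fields"):
        # namedtuple: its constructor takes positional fields, so unpack.
        return type(batch)(*(move_to_device(item, move) for item in batch))
    if isinstance(batch, (list, tuple)):
        return type(batch)(move_to_device(item, move) for item in batch)
    if isinstance(batch, dict):
        return {key: move_to_device(value, move) for key, value in batch.items()}
    return move(batch)
```

After the call, fields can still be accessed by name, for example out.x.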

Deprecated

  • Deprecated tags_csv in favor of hparams_file (#1271)

Fixed

  • Fixed broken link in PR template (#1675)
  • Fixed ModelCheckpoint not None checking file path (#1654)
  • Trainer now calls on_load_checkpoint() when resuming from a checkpoint (#1666)
  • Fixed sampler logic for DDP with the iterable dataset (#1734)
  • Fixed _reset_eval_dataloader() for IterableDataset (#1560)
  • Fixed Horovod distributed backend to set the root_gpu property (#1669)
  • Fixed wandb logger global_step affects other loggers (#1492)
  • Fixed disabling progress bar on non-zero ranks using Horovod backend (#1709)
  • Fixed bugs that prevented the LR finder from being used together with early stopping and validation dataloaders (#1676)
  • Fixed a bug in Trainer that prepended the checkpoint path with version_ when it shouldn't (#1748)
  • Fixed LR key name in case of param groups in LearningRateLogger (#1719)
  • Fixed saving native AMP scaler state (introduced in #1561)
  • Fixed accumulation parameter and suggestion method for learning rate finder (#1801)
  • Fixed num processes not being set properly and the auto sampler failing with DDP (#1819)
  • Fixed bugs in semantic segmentation example (#1824)
  • Fixed saving native AMP scaler state (#1561, #1777)
  • Fixed native AMP + DDP (#1788)
  • Fixed hparam logging with metrics (#1647)

Contributors

@ashwinb, @awaelchli, @Borda, @cmpute, @festeh, @jbschiratti, @justusschock, @kepler, @kumuji, @nanddalal, @nathanbreitsch, @olineumann, @pitercl, @rohitgr7, @S-aiueo32, @SkafteNicki, @tgaddair, @tullie, @tw991, @williamFalcon, @ybrovman, @yukw777

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Critical DDP bug fixes

27 Apr 13:06
694f1d7

We made a few changes to Callbacks to test ops on detached GPU tensors to avoid CPU transfer. However, it made callbacks unpicklable which will crash DDP.

This release fixes that core issue.

Changed

  • Allow logging of metrics together with hparams (#1630)

Removed

  • Removed Warning from trainer loop (#1634)

Fixed

  • Fixed ModelCheckpoint not being picklable (#1632)
  • Fixed CPU DDP breaking change and DDP change (#1635)
  • Tested pickling (#1636)

Contributors

@justusschock, @quinor, @williamFalcon

PyTorch 1.5 support, native PyTorch AMP, speed/memory optimizations and many bug fixes

26 Apr 15:08
d290b81

Key updates

  • PyTorch 1.5 support
  • Added Horovod distributed_backend option
  • Enable forward compatibility with the native AMP (PyTorch 1.6).
  • Support 8-core TPU on Kaggle
  • Added ability to customize progress_bar via Callbacks
  • Speed/memory optimizations.
  • Improved Argparse usability with Trainer
  • Docs improvements
  • Tons of bug fixes

Detail changes

Added

  • Added flag replace_sampler_ddp to manually disable sampler replacement in ddp (#1513)
  • Added speed parity tests (max 1 sec difference per epoch)(#1482)
  • Added auto_select_gpus flag to trainer that enables automatic selection of available GPUs on exclusive mode systems.
  • Added learning rate finder (#1347)
  • Added support for ddp mode in clusters without SLURM (#1387)
  • Added test_dataloaders parameter to Trainer.test() (#1434)
  • Added terminate_on_nan flag to trainer that performs a NaN check with each training iteration when set to True (#1475)
  • Added ddp_cpu backend for testing ddp without GPUs (#1158)
  • Added Horovod support as a distributed backend Trainer(distributed_backend='horovod') (#1529)
  • Added support for 8 core distributed training on Kaggle TPUs (#1568)
  • Added support for native AMP (#1561, #1580)
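
To make the terminate_on_nan behaviour concrete, here is a minimal stand-in for a training loop that optionally stops on a non-finite loss. run_steps is a hypothetical function; treating infinity as well as NaN as a stop condition is an assumption of this sketch, and the real flag may check more than the loss.

```python
import math


def run_steps(losses, terminate_on_nan=False):
    """Iterate over per-step losses and return how many steps completed."""
    completed = 0
    for step, loss in enumerate(losses):
        if terminate_on_nan and not math.isfinite(loss):
            # Stop immediately rather than training on a broken loss.
            raise ValueError(f"non-finite loss {loss!r} at step {step}")
        completed += 1
    return completed
```

With the flag off, which the Changed section of this release makes the default, the loop does not inspect the loss at all.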

Changed

  • Changed the default behaviour to no longer include a NaN check with each training iteration. (#1475)
  • Decoupled the progress bar from trainer. It is a callback now and can be customized or even be replaced entirely (#1450).
  • Changed lr schedule step interval behavior to update every backwards pass instead of every forwards pass (#1477)
  • Defines shared proc. rank, remove rank from instances (e.g. loggers) (#1408)
  • Updated semantic segmentation example with custom u-net and logging (#1371)
  • Disabled val and test shuffling (#1600)

Deprecated

  • Deprecated training_tqdm_dict in favor of progress_bar_dict (#1450).

Removed

  • Removed test_dataloaders parameter from Trainer.fit() (#1434)

Fixed

  • Added the possibility to pass nested metrics dictionaries to loggers (#1582)
  • Fixed memory leak from opt return (#1528)
  • Fixed saving checkpoint before deleting old ones (#1453)
  • Fixed loggers - flushing last logged metrics even before continue, e.g. trainer.test() results (#1459)
  • Fixed optimizer configuration when configure_optimizers returns dict without lr_scheduler (#1443)
  • Fixed LightningModule - mixing hparams and arguments in LightningModule.__init__() crashes load_from_checkpoint() (#1505)
  • Added a missing call to the on_before_zero_grad model hook (#1493).
  • Allow use of sweeps with WandbLogger (#1512)
  • Fixed a bug that caused the callbacks Trainer argument to reference a global variable (#1534).
  • Fixed a bug that set all boolean CLI arguments from Trainer.add_argparse_args always to True (#1571)
  • Fixed copying the batch when training on a single GPU; it is no longer copied (#1576, #1579)
  • Fixed soft checkpoint removing on DDP (#1408)
  • Fixed automatic parser bug (#1585)
  • Fixed bool conversion from string (#1606)
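
The boolean CLI fixes above relate to a well-known argparse pitfall: type=bool treats any non-empty string, including "False", as True. The sketch below shows one common workaround with an explicit converter. str_to_bool and build_parser are hypothetical helpers, and the flag name is used only for illustration; this is not the code from the fixes themselves.

```python
import argparse


def str_to_bool(value):
    """Convert common textual spellings of booleans to bool."""
    lowered = value.strip().lower()
    if lowered in {"true", "t", "yes", "y", "1"}:
        return True
    if lowered in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    parser = argparse.ArgumentParser()
    # nargs="?" with const=True lets the bare flag mean True, while an
    # explicit value such as "False" goes through str_to_bool.
    parser.add_argument(
        "--fast_dev_run", type=str_to_bool, nargs="?", const=True, default=False
    )
    return parser
```

This keeps all three forms working: omitting the flag, passing it bare, and passing an explicit true or false value.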

Contributors

@alexeykarnachev, @areshytko, @awaelchli, @Borda, @borisdayma, @ethanwharris, @fschlatt, @HenryJia, @Ir1d, @justusschock, @karlinjf, @lezwon, @neggert, @rmrao, @rohitgr7, @SkafteNicki, @tgaddair, @williamFalcon

If we forgot someone due to not matching commit email with GitHub account, let us know :]

DDP bug fixes

10 Apr 12:44
afc43db

We had a few (subtle) bugs that affected DDP and a few key things in 0.7.2, so we released 0.7.3 to fix them because they are critical for DDP. Sorry about that! There are still no API changes, but please skip straight to 0.7.3 to get these fixes.

Detail changes

Added

  • Added rank_zero_warn for warning only in rank 0 (#1428)
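
The idea behind warning only on rank 0 can be sketched with a small decorator. The names only_on_rank_zero and warn_on_rank_zero are hypothetical, and reading the rank from the LOCAL_RANK environment variable is an assumption of this example; the library may determine the rank differently.

```python
import functools
import os
import warnings


def only_on_rank_zero(fn):
    """Run `fn` only in the process whose rank is 0."""

    @functools.wraps(fn)
    def wrapped(*args, **kwargs):
        # Assumption: the launcher exports the rank as LOCAL_RANK.
        if int(os.environ.get("LOCAL_RANK", "0")) == 0:
            return fn(*args, **kwargs)
        return None

    return wrapped


@only_on_rank_zero
def warn_on_rank_zero(message):
    """Emit a warning once per job instead of once per process."""
    warnings.warn(message)
```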

Fixed

  • Fixed default DistributedSampler for DDP training (#1425)
  • Fixed the workers warning so that it is not shown on Windows (#1430)
  • Fixed returning tuple from run_training_batch (#1431)
  • Fixed gradient clipping (#1438)
  • Fixed pretty print (#1441)

Contributors

@alsrgv, @Borda, @williamFalcon