
Releases: Lightning-AI/pytorch-lightning

Fixing hooks & hparams

19 Jun 06:44
2fbc997

Overview

This release fixes critical bugs in the newly added hooks and in hparams assignment.
The recommended data hooks are the following (see the sketch after the list):

  1. use prepare_data to download and process the dataset.
  2. use setup to do splits, and build your model internals
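A minimal sketch of these two hooks on a LightningModule, assuming an MNIST download and a 55k/5k split purely for illustration (the rest of the module, e.g. training_step and configure_optimizers, is omitted):

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from torchvision.datasets import MNIST


class LitMNIST(pl.LightningModule):
    def prepare_data(self):
        # runs once (on a single process): download / preprocess the dataset
        MNIST("./data", train=True, download=True)

    def setup(self, stage):
        # runs on every process: build splits and model internals
        full = MNIST("./data", train=True, transform=transforms.ToTensor())
        self.train_set, self.val_set = random_split(full, [55000, 5000])

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=32)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=32)
```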

Detail changes

  • Fixed the load_from_checkpoint path detected as URL bug (#2244)
  • Fixed hooks - added barrier (#2245, #2257, #2260)
  • Fixed hparams - remove frame inspection on self.hparams (#2253)
  • Fixed setup and on fit calls (#2252)
  • Fixed GPU template (#2255)

Metrics, speed improvements, new hooks and flags

19 Jun 07:02
e0b7359

Overview

Highlights of this release are the new Metrics package and new hooks and flags to customize your workflow.

Major features:

  • brand new Metrics package with built-in DDP support (by @justusschock and @SkafteNicki)
  • hparams can now be anything! Call self.save_hyperparameters() to register anything passed to __init__ (see the sketch after this list)
  • many speed improvements (how we move data, adjusted some flags; PL now adds only ~300ms of overhead per epoch!)
  • a much faster DDP implementation; the old one was renamed ddp_spawn
  • better support for Hydra
  • added the overfit_batches flag and corrected some bugs with the limit_[train,val,test]_batches flag
  • added conda support
  • tons of bug fixes 😉
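A minimal sketch of the new hparams flow; the hidden_dim and learning_rate arguments are only illustrative:

```python
import torch
from torch import nn
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):
    def __init__(self, hidden_dim=128, learning_rate=1e-3):
        super().__init__()
        # registers every __init__ argument under self.hparams
        # and stores them in checkpoints automatically
        self.save_hyperparameters()
        self.layer = nn.Linear(28 * 28, self.hparams.hidden_dim)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)
```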

Detail changes

Added

  • Added overfit_batches, limit_{val|test}_batches flags (overfit now uses training set for all three) (#2213)
  • Added metrics
  • Added type hints in Trainer.fit() and Trainer.test() to reflect that also a list of dataloaders can be passed in (#1723)
  • Allow dataloaders without sampler field present (#1907)
  • Added option save_last to save the model at the end of every epoch in ModelCheckpoint (#1908)
  • Early stopping checks on_validation_end (#1458)
  • Attribute best_model_path to ModelCheckpoint for storing and later retrieving the path to the best saved model file (#1799)
  • Speed up single-core TPU training by loading data using ParallelLoader (#2033)
  • Added a model hook transfer_batch_to_device that enables moving custom data structures to the target device (#1756)
  • Added black formatter for the code with code-checker on pull (#1610)
  • Added back the slow spawn ddp implementation as ddp_spawn (#2115)
  • Added loading checkpoints from URLs (#1667)
  • Added a callback method on_keyboard_interrupt for handling KeyboardInterrupt events during training (#2134)
  • Added a decorator auto_move_data that moves data to the correct device when using the LightningModule for inference (#1905)
  • Added ckpt_path option to LightningModule.test(...) to load particular checkpoint (#2190)
  • Added setup and teardown hooks for model (#2229)

Changed

  • Allow user to select individual TPU core to train on (#1729)
  • Removed non-finite values from loss in LRFinder (#1862)
  • Allow passing model hyperparameters as complete kwarg list (#1896)
  • Renamed ModelCheckpoint's attributes best to best_model_score and kth_best_model to kth_best_model_path (#1799)
  • Re-Enable Logger's ImportErrors (#1938)
  • Changed the default value of the Trainer argument weights_summary from full to top (#2029)
  • Raise an error when lightning replaces an existing sampler (#2020)
  • Enabled prepare_data from correct processes - clarify local vs global rank (#2166)
  • Remove explicit flush from tensorboard logger (#2126)
  • Changed epoch indexing to start from 1 instead of 0 (#2206)

Deprecated

  • Deprecated flags: (#2213)
    • overfit_pct in favour of overfit_batches
    • val_percent_check in favour of limit_val_batches
    • test_percent_check in favour of limit_test_batches
  • Deprecated ModelCheckpoint's attributes best and kth_best_model (#1799)
  • Dropped official support/testing for older PyTorch versions <1.3 (#1917)

Removed

  • Removed unintended Trainer argument progress_bar_callback, the callback should be passed in by Trainer(callbacks=[...]) instead (#1855)
  • Removed obsolete self._device in Trainer (#1849)
  • Removed deprecated API (#2073)
    • Packages: pytorch_lightning.pt_overrides, pytorch_lightning.root_module
    • Modules: pytorch_lightning.logging.comet_logger, pytorch_lightning.logging.mlflow_logger, pytorch_lightning.logging.test_tube_logger, pytorch_lightning.overrides.override_data_parallel, pytorch_lightning.core.model_saving, pytorch_lightning.core.root_module
    • Trainer arguments: add_row_log_interval, default_save_path, gradient_clip, nb_gpu_nodes, max_nb_epochs, min_nb_epochs, nb_sanity_val_steps
    • Trainer attributes: nb_gpu_nodes, num_gpu_nodes, gradient_clip, max_nb_epochs, min_nb_epochs, nb_sanity_val_steps, default_save_path, tng_tqdm_dic

Fixed

  • Run graceful training teardown on interpreter exit (#1631)
  • Fixed user warning when apex was used together with learning rate schedulers (#1873)
  • Fixed multiple calls of EarlyStopping callback (#1863)
  • Fixed an issue with Trainer.from_argparse_args when passing in unknown Trainer args (#1932)
  • Fixed bug related to logger not being reset correctly for model after tuner algorithms (#1933)
  • Fixed root node resolution for SLURM cluster with dash in hostname (#1954)
  • Fixed LearningRateLogger in multi-scheduler setting (#1944)
  • Fixed test configuration check and testing (#1804)
  • Fixed an issue with Trainer constructor silently ignoring unknown/misspelt arguments (#1820)
  • Fixed save_weights_only in ModelCheckpoint (#1780)
  • Allow use of same WandbLogger instance for multiple training loops (#2055)
  • Fixed an issue with _auto_collect_arguments collecting local variables that are not constructor arguments and not working for signatures that have the instance not named self (#2048)
  • Fixed mistake in parameters' grad norm tracking (#2012)
  • Fixed CPU and hanging GPU crash (#2118)
  • Fixed an issue with the model summary and example_input_array depending on a specific ordering of the submodules in a LightningModule (#1773)
  • Fixed TPU logging (#2230)
  • Fixed PID port + duplicate rank_zero logging (#2140, #2231)

Contributors

@awaelchli, @baldassarreFe, @Borda, @borisdayma, @cuent, @devashishshankar, @ivannz, @j-dsouza, @justusschock, @kepler, @kumuji, @lezwon, @lgvaz, @LoicGrobol, @mateuszpieniak, @maximsch2, @moi90, @rohitgr7, @SkafteNicki, @tullie, @williamFalcon, @yukw777, @ZhaofengWu

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Transfer learning, tuning batch size, torchelastic support

15 May 12:37
e95e1d7

Overview

Highlights of this release: support for TorchElastic, which enables distributed PyTorch training jobs to be executed in a fault-tolerant and elastic manner; auto-scaling of batch size; a new transfer learning example; and an option to provide a seed to random generators to ensure reproducibility.
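A minimal sketch of the reproducibility option, assuming the seed_everything helper added by #1572 (its exact import location has moved between versions):

```python
import pytorch_lightning as pl

# seed Python, NumPy and torch RNGs so runs are reproducible
pl.seed_everything(42)

trainer = pl.Trainer(max_epochs=3)
# trainer.fit(model)  # model is any LightningModule
```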

Detail changes

Added

  • Added callback for logging learning rates (#1498)
  • Added transfer learning example (for a binary classification task in computer vision) (#1564)
  • Added type hints in Trainer.fit() and Trainer.test() to reflect that also a list of dataloaders can be passed in (#1723).
  • Added auto scaling of batch size (#1638)
  • The progress bar metrics now also get updated in training_epoch_end (#1724)
  • Enable NeptuneLogger to work with distributed_backend=ddp (#1753)
  • Added option to provide seed to random generators to ensure reproducibility (#1572)
  • Added override for hparams in load_from_ckpt (#1797)
  • Added support for multi-node distributed execution under torchelastic (#1811, #1818)
  • Added using store_true for bool args (#1822, #1842)
  • Added dummy logger for internally disabling logging for some features (#1836)

Changed

  • Enable non-blocking for device transfers to GPU (#1843)
  • Replace meta_tags.csv with hparams.yaml (#1271)
  • Reduction when batch_size < num_gpus (#1609)
  • Updated LightningTemplateModel to look more like Colab example (#1577)
  • Don't convert namedtuple to tuple when transferring the batch to target device (#1589)
  • Allow passing hparams as a keyword argument to LightningModule when loading from checkpoint (#1639)
  • Args should come after the last positional argument (#1807)
  • Made DDP the default if no backend specified with multiple GPUs (#1789)

Deprecated

  • Deprecated tags_csv in favor of hparams_file (#1271)

Fixed

  • Fixed broken link in PR template (#1675)
  • Fixed ModelCheckpoint not checking the file path for None (#1654)
  • Trainer now calls on_load_checkpoint() when resuming from a checkpoint (#1666)
  • Fixed sampler logic for DDP with the iterable dataset (#1734)
  • Fixed _reset_eval_dataloader() for IterableDataset (#1560)
  • Fixed Horovod distributed backend to set the root_gpu property (#1669)
  • Fixed wandb logger global_step affects other loggers (#1492)
  • Fixed disabling progress bar on non-zero ranks using Horovod backend (#1709)
  • Fixed bugs that prevented the LR finder from being used together with early stopping and validation dataloaders (#1676)
  • Fixed a bug in Trainer that prepended the checkpoint path with version_ when it shouldn't (#1748)
  • Fixed LR key name in case of param groups in LearningRateLogger (#1719)
  • Fixed saving native AMP scaler state (introduced in #1561)
  • Fixed accumulation parameter and suggestion method for learning rate finder (#1801)
  • Fixed num_processes not being set properly and the auto sampler failing with DDP (#1819)
  • Fixed bugs in semantic segmentation example (#1824)
  • Fixed saving native AMP scaler state (#1561, #1777)
  • Fixed native AMP + DDP (#1788)
  • Fixed hparam logging with metrics (#1647)

Contributors

@ashwinb, @awaelchli, @Borda, @cmpute, @festeh, @jbschiratti, @justusschock, @kepler, @kumuji, @nanddalal, @nathanbreitsch, @olineumann, @pitercl, @rohitgr7, @S-aiueo32, @SkafteNicki, @tgaddair, @tullie, @tw991, @williamFalcon, @ybrovman, @yukw777

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Critical DDP bug fixes

27 Apr 13:06
694f1d7

We made a few changes to Callbacks to test ops on detached GPU tensors, to avoid CPU transfer. However, this made callbacks unpicklable, which crashes DDP.

This release fixes that core issue.
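A quick picklability check can be done with plain Python pickle; this is not a Lightning API, and the EarlyStopping setup below is only illustrative:

```python
import pickle

import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping


def assert_picklable(obj):
    # DDP spawns worker processes, so the Trainer and everything attached
    # to it (including callbacks) must survive a pickle round trip
    pickle.loads(pickle.dumps(obj))


assert_picklable(EarlyStopping())
assert_picklable(pl.Trainer(callbacks=[EarlyStopping()]))
```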

Changed

  • Allow logging of metrics together with hparams (#1630)

Removed

  • Removed Warning from trainer loop (#1634)

Fixed

  • Fixed ModelCheckpoint not being picklable (#1632)
  • Fixed CPU DDP breaking change and DDP change (#1635)
  • Tested pickling (#1636)

Contributors

@justusschock, @quinor, @williamFalcon

PyTorch 1.5 support, native PyTorch AMP, speed/memory optimizations and many bug fixes

26 Apr 15:08
d290b81

Key updates

  • PyTorch 1.5 support
  • Added Horovod distributed_backend option
  • Enabled forward compatibility with native AMP (PyTorch 1.6); see the sketch after this list
  • Support 8-core TPU on Kaggle
  • Added ability to customize progress_bar via Callbacks
  • Speed/memory optimizations.
  • Improved Argparse usability with Trainer
  • Docs improvements
  • Tons of bug fixes
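A minimal sketch of the two new backend options, assuming a single-GPU machine and, for the second Trainer, a working Horovod install (flag names follow the 0.7.x API):

```python
import pytorch_lightning as pl

# native AMP: 16-bit precision without apex (requires PyTorch 1.6+)
trainer = pl.Trainer(precision=16, gpus=1)

# Horovod as the distributed backend
trainer = pl.Trainer(distributed_backend="horovod", gpus=1)
```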

Detail changes

Added

  • Added flag replace_sampler_ddp to manually disable sampler replacement in ddp (#1513)
  • Added speed parity tests (max 1 sec difference per epoch) (#1482)
  • Added auto_select_gpus flag to trainer that enables automatic selection of available GPUs on exclusive mode systems.
  • Added learning rate finder (#1347)
  • Added support for ddp mode in clusters without SLURM (#1387)
  • Added test_dataloaders parameter to Trainer.test() (#1434)
  • Added terminate_on_nan flag to trainer that performs a NaN check with each training iteration when set to True (#1475)
  • Added ddp_cpu backend for testing ddp without GPUs (#1158)
  • Added Horovod support as a distributed backend Trainer(distributed_backend='horovod') (#1529)
  • Added support for 8 core distributed training on Kaggle TPU's (#1568)
  • Added support for native AMP (#1561, #1580)

Changed

  • Changed the default behaviour to no longer include a NaN check with each training iteration. (#1475)
  • Decoupled the progress bar from trainer. It is a callback now and can be customized or even be replaced entirely (#1450).
  • Changed lr schedule step interval behavior to update every backwards pass instead of every forwards pass (#1477)
  • Defines shared proc. rank, remove rank from instances (e.g. loggers) (#1408)
  • Updated semantic segmentation example with custom u-net and logging (#1371)
  • Disabled val and test shuffling (#1600)

Deprecated

  • Deprecated training_tqdm_dict in favor of progress_bar_dict (#1450).

Removed

  • Removed test_dataloaders parameter from Trainer.fit() (#1434)

Fixed

  • Added the possibility to pass nested metrics dictionaries to loggers (#1582)
  • Fixed memory leak from opt return (#1528)
  • Fixed saving checkpoint before deleting old ones (#1453)
  • Fixed loggers - flushing last logged metrics before continuing, e.g. trainer.test() results (#1459)
  • Fixed optimizer configuration when configure_optimizers returns dict without lr_scheduler (#1443)
  • Fixed LightningModule - mixing hparams and arguments in LightningModule.__init__() crashes load_from_checkpoint() (#1505)
  • Added a missing call to the on_before_zero_grad model hook (#1493).
  • Allow use of sweeps with WandbLogger (#1512)
  • Fixed a bug that caused the callbacks Trainer argument to reference a global variable (#1534).
  • Fixed a bug that set all boolean CLI arguments from Trainer.add_argparse_args always to True (#1571)
  • Fixed: do not copy the batch when training on a single GPU (#1576, #1579)
  • Fixed soft checkpoint removing on DDP (#1408)
  • Fixed automatic parser bug (#1585)
  • Fixed bool conversion from string (#1606)

Contributors

@alexeykarnachev, @areshytko, @awaelchli, @Borda, @borisdayma, @ethanwharris, @fschlatt, @HenryJia, @Ir1d, @justusschock, @karlinjf, @lezwon, @neggert, @rmrao, @rohitgr7, @SkafteNicki, @tgaddair, @williamFalcon

If we forgot someone due to not matching commit email with GitHub account, let us know :]

DDP bug fixes

10 Apr 12:44
afc43db

We had a few (subtle) bugs in 0.7.2 that affected DDP and a few key things, so we released 0.7.3 to fix them because they are critical for DDP. Sorry about that! There are still no API changes, but please skip straight to 0.7.3 for these fixes.

Detail changes

Added

  • Added rank_zero_warn for warning only in rank 0 (#1428)

Fixed

  • Fixed default DistributedSampler for DDP training (#1425)
  • Fixed workers warning not on windows (#1430)
  • Fixed returning tuple from run_training_batch (#1431)
  • Fixed gradient clipping (#1438)
  • Fixed pretty print (#1441)

Contributors

@alsrgv, @Borda, @williamFalcon

Many bug fixes, added flexibility, parity tests with pytorch and more

08 Apr 18:46
b5c6d0e

Overview

This release aims at fixing particular issues and improving the user development experience via extending docs, adding typing and supporting python 3.8. In particular, some of the release highlights are:

  • Added benchmark for comparing lightning with vanilla implementations
  • Extended optimizer support with per-optimizer frequencies (see the sketch after this list)
  • Several improvements for loggers, such as representing non-primitive types and supporting hierarchical dictionaries for hyperparameter searches
  • Added model configuration checking before it runs
  • Simplify the PL examples structure (shallower and more readable)
  • Improved Trainer CLI arguments handling (generalization)
  • Two Trainer arguments became deprecated: print_nan_grads and show_progress_bar
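A minimal sketch of per-optimizer frequencies via configure_optimizers, assuming the dict-based return format; the two small networks and the 1:5 ratio are only illustrative:

```python
import torch
from torch import nn
import pytorch_lightning as pl


class TwoOptModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net_a = nn.Linear(10, 10)
        self.net_b = nn.Linear(10, 10)

    def configure_optimizers(self):
        opt_a = torch.optim.Adam(self.net_a.parameters(), lr=1e-3)
        opt_b = torch.optim.SGD(self.net_b.parameters(), lr=1e-2)
        # opt_a is used for 1 step, then opt_b for 5 steps, repeating
        return [
            {"optimizer": opt_a, "frequency": 1},
            {"optimizer": opt_b, "frequency": 5},
        ]
```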

Detail changes

Added

  • Added same step loggers' metrics aggregation (#1278)
  • Added parity test between a vanilla MNIST model and lightning model (#1284)
  • Added parity test between a vanilla RNN model and lightning model (#1351)
  • Added Reinforcement Learning - Deep Q-network (DQN) lightning example (#1232)
  • Added support for hierarchical dict (#1152)
  • Added TrainsLogger class (#1122)
  • Added type hints to pytorch_lightning.core (#946)
  • Added support for IterableDataset in validation and testing (#1104)
  • Added support for non-primitive types in hparams for TensorboardLogger (#1130)
  • Added a check that stops the training when loss or weights contain NaN or inf values. (#1097)
  • Added support for IterableDataset when val_check_interval=1.0 (default), this will trigger validation at the end of each epoch. (#1283)
  • Added summary method to Profilers. (#1259)
  • Added informative errors if user defined dataloader has zero length (#1280)
  • Added testing for python 3.8 (#915)
  • Added a training_epoch_end method which is the mirror of validation_epoch_end. (#1357)
  • Added model configuration checking (#1199)
  • Added support for optimizer frequencies through LightningModule.configure_optimizers() (#1269)
  • Added option to run without an optimizer by returning None from configure_optimizers. (#1279)
  • Added a warning when the number of data loader workers is small. (#1378)

Changed

  • Changed (renamed and refactored) TensorRunningMean -> TensorRunningAccum: running accumulations were generalized. (#1278)
  • Changed progress_bar_refresh_rate trainer flag to disable progress bar when setting to 0. (#1108)
  • Enhanced load_from_checkpoint to also forward params to the model (#1307)
  • Updated references to self.forward() to instead use the __call__ interface. (#1211)
  • Changed default behaviour of configure_optimizers to use no optimizer rather than Adam. (#1279)
  • Allow uploading models on W&B (#1339)
  • On DP and DDP2 unsqueeze is automated now (#1319)
  • No longer always create a DataLoader during reinstantiation; instead create the same type as before (if a subclass of DataLoader) (#1346)
  • No longer interfere with a default sampler (#1318)
  • Removed default Adam optimizer (#1317)
  • Gave warnings for unimplemented required lightning methods (#1317)
  • Made evaluate method private >> Trainer._evaluate(...). (#1260)
  • Simplify the PL examples structure (shallower and more readable) (#1247)
  • Changed min-max GPU memory to be on their own plots (#1358)
  • Remove .item which causes sync issues (#1254)
  • Changed smoothing in TQDM to decrease variability of time remaining between training/eval (#1194)
  • Change default logger to a dedicated one (#1064)

Deprecated

  • Deprecated Trainer argument print_nan_grads (#1097)
  • Deprecated Trainer argument show_progress_bar (#1108)

Removed

  • Removed duplicated module pytorch_lightning.utilities.arg_parse for loading CLI arguments (#1167)
  • Removed wandb logger's finalize method (#1193)
  • Dropped torchvision dependency in tests and added own MNIST dataset class instead (#986)

Fixed

  • Fixed model_checkpoint when saving all models (#1359)
  • Trainer.add_argparse_args classmethod fixed. Now it adds a type for the arguments (#1147)
  • Fixed bug related to type checking of ReduceLROnPlateau lr schedulers (#1114)
  • Fixed a bug to ensure lightning checkpoints are backward compatible (#1132)
  • Fixed a bug that created an extra dataloader with active reload_dataloaders_every_epoch (#1181)
  • Fixed all warnings and errors in the docs build process (#1191)
  • Fixed an issue where val_percent_check=0 would not disable validation (#1251)
  • Fixed average of incomplete TensorRunningMean (#1309)
  • Fixed WandbLogger.watch with wandb.init() (#1311)
  • Fixed an issue with early stopping that would prevent it from monitoring training metrics when validation is disabled / not implemented (#1235)
  • Fixed a bug that would cause trainer.test() to run on the validation set when overloading validation_epoch_end and test_end (#1353)
  • Fixed WandbLogger.watch - use of the watch method without importing wandb (#1311)
  • Fixed WandbLogger to be used with 'ddp' - allow reinits in sub-processes (#1149, #1360)
  • Made training_epoch_end behave like validation_epoch_end (#1357)
  • Fixed fast_dev_run running validation twice (#1365)
  • Fixed pickle error from quick patch __code__ (#1352)
  • Fixed memory leak on GPU0 (#1094, #1349)
  • Fixed checkpointing interval (#1272)
  • Fixed validation and training loops to run only the partial dataset (#1192)
  • Fixed running on_validation_end only on main process in DDP (#1125)
  • Fixed load_spawn_weights only in proc rank 0 (#1385)
  • Fixed usage of the deprecated use_amp attribute (#1145)
  • Fixed Tensorboard logger error: lightning_logs directory does not exist in multi-node DDP on nodes with rank != 0 (#1375)
  • Fixed Unimplemented backend XLA error on TPU (#1387)

Contributors

@alexeykarnachev, @amoudgl, @areshytko, @asafmanor, @awaelchli, @bkkaggle, @bmartinn, @Borda, @borisdayma, @cmpute, @djbyrne, @ethanwharris, @gerardrbentley, @jbschiratti, @jeremyjordan, @justusschock, @monney, @mpariente, @pertschuk, @rmrao, @S-aiueo32, @shubhamagarwal92, @SkafteNicki, @sneiman, @tullie, @vanpelt, @williamFalcon, @xingzhaolee

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Minor deprecation fix

13 Mar 14:14

Minor bug fix with print issues and data_loader (#1080)

TPU support & profiling

10 Mar 23:13
Pre-release

Overview

This is the first joint release between pytorch-bearer and Lightning; here we come ...

This release adds support for training models on Tensor Processing Units (TPU). We can now train models on GPUs and TPUs by changing a single parameter in Trainer (see docs). We are also bringing the flexibility of Bearer into Lightning by allowing for arbitrary user-defined callbacks (see docs).

We are also including a profiler that allows Lightning users to identify training bottlenecks (see docs).

This release also includes automatic sampler setup: depending on the selected backend, Lightning configures the sampler correctly (no need for user input).

The loggers have also been extended to support multiple concurrent loggers passed to Trainer as an iterable (see docs), and we added support for step-based learning rate scheduling.

At last, lots of bug fixes (see below).
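A minimal sketch combining a user-defined callback with multiple loggers; the callback name and log directories are only illustrative:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import Callback
from pytorch_lightning.loggers import TensorBoardLogger


class PrintOnTrainStart(Callback):
    # hypothetical user-defined callback
    def on_train_start(self, trainer, pl_module):
        print("training is starting")


loggers = [
    TensorBoardLogger("tb_logs", name="run_a"),
    TensorBoardLogger("tb_logs", name="run_b"),  # any mix of loggers works here
]
trainer = pl.Trainer(logger=loggers, callbacks=[PrintOnTrainStart()])
```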

Detail changes

Added

  • Added automatic sampler setup. Depending on DDP or TPU, lightning configures the sampler correctly (user needs to do nothing) (#926)
  • Added reload_dataloaders_every_epoch=False flag for trainer. Some users require reloading data every epoch (#926)
  • Added progress_bar_refresh_rate=50 flag for trainer. The refresh rate on notebooks (#926)
  • Updated governance docs
  • Added a check to ensure that the metric used for early stopping exists before training commences (#542)
  • Added optimizer_idx argument to backward hook (#733)
  • Added entity argument to WandbLogger to be passed to wandb.init (#783)
  • Added a tool for profiling training runs (#782)
  • Improved flexibility for naming of TensorBoard logs, can now set version to a str to just save to that directory, and use name='' to prevent experiment-name directory (#804)
  • Added option to specify step key when logging metrics (#808)
  • Added train_dataloader, val_dataloader and test_dataloader arguments to Trainer.fit(), for alternative data parsing (#759)
  • Added Tensor Processing Unit (TPU) support (#868)
  • Added semantic segmentation example (#751, #876, #881)
  • Split callbacks in multiple files (#849)
  • Support for user-defined callbacks (#889 and #950)
  • Added support for multiple loggers to be passed to Trainer as an iterable (e.g. list, tuple, etc.) (#903)
  • Added support for step-based learning rate scheduling (#941)
  • Added support for logging hparams as dict (#1029)
  • Checkpoint and early stopping now work without val. step (#1041)
  • Support graceful training cleanup after Keyboard Interrupt (#856, #1019)
  • Added type hints for function arguments (#912)
  • Added default argparser for Trainer (#952, #1023)
  • Added TPU gradient clipping (#963)
  • Added max/min number of steps in Trainer (#728)

Changed

  • Changed default TQDM to use tqdm.auto for prettier outputs in IPython notebooks (#752)
  • Changed pytorch_lightning.logging to pytorch_lightning.loggers (#767)
  • Moved the default tqdm_dict definition from Trainer to LightningModule, so it can be overridden by the user (#749)
  • Moved functionality of LightningModule.load_from_metrics into LightningModule.load_from_checkpoint (#995)
  • Changed Checkpoint path parameter from filepath to dirpath (#1016)
  • Froze model hparams as a Namespace property (#1029)
  • Dropped logging config in package init (#1015)
  • Renamed model steps (#1051)
    • training_end >> training_epoch_end
    • validation_end >> validation_epoch_end
    • test_end >> test_epoch_end
  • Refactor dataloading, supports infinite dataloader (#955)
  • Create single file in TensorBoardLogger (#777)

Deprecated

  • Deprecated pytorch_lightning.logging (#767)
  • Deprecated LightningModule.load_from_metrics in favour of LightningModule.load_from_checkpoint (#995, #1079)
  • Deprecated @data_loader decorator (#926)
  • Deprecated model steps training_end, validation_end and test_end (#1051, #1056)

Removed

  • Removed dependency on pandas (#736)
  • Removed dependency on torchvision (#797)
  • Removed dependency on scikit-learn (#801)

Fixed

  • Fixed a bug where early stopping on_end_epoch would be called inconsistently when check_val_every_n_epoch == 0 (#743)
  • Fixed a bug where the model checkpoint didn't write to the same directory as the logger (#771)
  • Fixed a bug where the TensorBoardLogger class would create an additional empty log file during fitting (#777)
  • Fixed a bug where global_step was advanced incorrectly when using accumulate_grad_batches > 1 (#832)
  • Fixed a bug when calling self.logger.experiment with multiple loggers (#1009)
  • Fixed a bug when calling logger.append_tags on a NeptuneLogger with a single tag (#1009)
  • Fixed sending back data from .spawn by saving and loading the trained model in/out of the process (#1017)
  • Fixed port collision on DDP (#1010)
  • Fixed/tested pass overrides (#918)
  • Fixed comet logger to log after train (#892)
  • Remove deprecated args to learning rate step function (#890)

Contributors

@airglow, @akshaykvnit, @AljoSt, @AntixK, @awaelchli, @baeseongsu, @bobkemp, @Borda, @calclavia, @Calysto, @djbyrne, @ethanwharris, @fdelrio89, @hadim, @hanbyul-kim, @jeremyjordan, @kuynzereb, @luiscape, @MattPainter01, @neggert, @onkyo14taro, @peteriz, @shoarora, @SkafteNicki, @smallzzy, @srush, @theevann, @tullie, @williamFalcon, @xeTaiz, @xssChauhan, @yukw777

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Simplifications & new docs

21 Jan 22:51

This release focused on a ton of bug fixes, small optimizations to training but most importantly, clean new docs!

Major changes

We have released new documentation; please bear with us as we fix broken links and patch in missing pieces.
This project moved to the new org PyTorchLightning, so the root no longer sits at WilliamFalcon/PyTorchLightning.
We have added our own custom TensorBoard logger as the default logger.
We have upgraded Continuous Integration to speed up automatic testing.
We have fixed GAN training, supporting multiple optimizers (see the sketch below).
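A minimal sketch of multiple-optimizer support as used for GANs; the layer sizes and loss helpers are hypothetical, and the training_step argument names have shifted across versions:

```python
import torch
from torch import nn
import pytorch_lightning as pl


class GAN(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.generator = nn.Linear(100, 784)
        self.discriminator = nn.Linear(784, 1)

    def training_step(self, batch, batch_idx, optimizer_idx):
        # optimizer_idx says which optimizer the current step belongs to
        if optimizer_idx == 0:
            return {"loss": self._generator_loss(batch)}      # hypothetical helper
        return {"loss": self._discriminator_loss(batch)}      # hypothetical helper

    def configure_optimizers(self):
        opt_g = torch.optim.Adam(self.generator.parameters(), lr=2e-4)
        opt_d = torch.optim.Adam(self.discriminator.parameters(), lr=2e-4)
        return opt_g, opt_d
```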

Complete changelog

Added

  • Added support for resuming from a specific checkpoint via resume_from_checkpoint argument (#516)
  • Added support for ReduceLROnPlateau scheduler (#320)
  • Added support for Apex mode O2 in conjunction with Data Parallel (#493)
  • Added option (save_top_k) to save the top k models in the ModelCheckpoint class (#128)
  • Added on_train_start and on_train_end hooks to ModelHooks (#598)
  • Added TensorBoardLogger (#607)
  • Added support for weight summary of model with multiple inputs (#543)
  • Added map_location argument to load_from_metrics and load_from_checkpoint (#625)
  • Added option to disable validation by setting val_percent_check=0 (#649)
  • Added NeptuneLogger class (#648)
  • Added WandbLogger class (#627)

Changed

  • Changed the default progress bar to print to stdout instead of stderr (#531)
  • Renamed step_idx to step, epoch_idx to epoch, max_num_epochs to max_epochs and min_num_epochs to min_epochs (#589)
  • Renamed several Trainer attributes: (#567)
    • total_batch_nb to total_batches,
    • nb_val_batches to num_val_batches,
    • nb_training_batches to num_training_batches,
    • max_nb_epochs to max_epochs,
    • min_nb_epochs to min_epochs,
    • and nb_test_batches to num_test_batches (#567)
  • Changed gradient logging to use parameter names instead of indexes (#660)
  • Changed the default logger to TensorBoardLogger (#609)
  • Changed the directory for tensorboard logging to be the same as model checkpointing (#706)

Deprecated

  • Deprecated max_nb_epochs and min_nb_epochs (#567)
  • Deprecated the on_sanity_check_start hook in ModelHooks (#598)

Removed

  • Removed the save_best_only argument from ModelCheckpoint, use save_top_k=1 instead (#128)

Fixed

  • Fixed a bug which occurred when using Adagrad with cuda (#554)
  • Fixed a bug where training would be on the GPU despite setting gpus=0 or gpus=[] (#561)
  • Fixed an error with print_nan_gradients when some parameters do not require gradient (#579)
  • Fixed a bug where the progress bar would show an incorrect number of total steps during the validation sanity check when using multiple validation data loaders (#597)
  • Fixed support for PyTorch 1.1.0 (#552)
  • Fixed an issue with early stopping when using a val_check_interval < 1.0 in Trainer (#492)
  • Fixed bugs relating to the CometLogger object that would cause it to not work properly (#481)
  • Fixed a bug that would occur when returning -1 from on_batch_start following an early exit or when the batch was None (#509)
  • Fixed a potential race condition with several processes trying to create checkpoint directories (#530)
  • Fixed a bug where batch 'segments' would remain on the GPU when using truncated_bptt > 1 (#532)
  • Fixed a bug when using IterableDataset (#547)
  • Fixed a bug where .item was called on non-tensor objects (#602)
  • Fixed a bug where Trainer.train would crash on an uninitialized variable if the trainer was run after resuming from a checkpoint that was already at max_epochs (#608)
  • Fixed a bug where early stopping would begin two epochs early (#617)
  • Fixed a bug where num_training_batches and num_test_batches would sometimes be rounded down to zero (#649)
  • Fixed a bug where an additional batch would be processed when manually setting num_training_batches (#653)
  • Fixed a bug when batches did not have a .copy method (#701)
  • Fixed a bug when using log_gpu_memory=True in Python 3.6 (#715)
  • Fixed a bug where checkpoint writing could exit before completion, giving incomplete checkpoints (#689)
  • Fixed a bug where on_train_end was not called when early stopping (#723)

Contributors

@akhti, @alumae, @awaelchli, @Borda, @borisdayma, @ctlaltdefeat, @dreamgonfly, @elliotwaite, @fdiehl, @goodok, @haossr, @HarshSharma12, @Ir1d, @jakubczakon, @jeffling, @kuynzereb, @MartinPernus, @matthew-z, @MikeScarp, @mpariente, @neggert, @rwesterman, @ryanwongsa, @schwobr, @tullie, @vikmary, @VSJMilewski, @williamFalcon, @YehCF

If we forgot someone due to not matching commit email with GitHub account, let us know :]