Releases: Lightning-AI/pytorch-lightning
Bug fixes and .test() fix + TPU tests
Overview
The point of this release is more bug fixes ahead of v1.0.0. We now have CI tests on TPU thanks to @zcain117 from Google! 🙂
This means we fixed many TPU bugs we hadn’t caught before because we had no tests.
In addition, we fixed:
- all the file path errors with loggers (thanks @awaelchli)
- pickling errors with loggers (thanks @awaelchli)
- all the `.test()` calls
Detail changes
Added
Removed
- Removed auto val reduce (#2462)
Fixed
- Flattening Wandb Hyperparameters (#2459)
- Fixed using the same DDP python interpreter and actually running (#2482)
- Fixed model summary input type conversion for models that have input dtype different from model parameters (#2510)
- Made `TensorBoardLogger` and `CometLogger` pickleable (#2518)
- Fixed a problem with `MLflowLogger` creating multiple run folders (#2502)
- Fixed `global_step` increment (#2455)
- Fixed TPU hanging example (#2488)
- Fixed `argparse` default value bug (#2526)
- Fixed Dice and IoU to avoid NaN by adding a small eps (#2545)
- Fixed accumulate gradients schedule at epoch 0 (continued) (#2513)
- Fixed `Trainer.fit()` returning last instead of best weights in "ddp_spawn" (#2565)
- Fixed passing TPU weights back on test (weights are no longer passed back) (#2566)
- Fixed DDP tests and `.test()` (#2512, #2570)
Contributors
@anthonytec2, @awaelchli, @bernardomig, @Borda, @EspenHa, @HHousen, @InCogNiTo124, @rohitgr7, @williamFalcon
If we forgot someone due to not matching commit email with GitHub account, let us know :]
More bug fixing!
Detail changes
Added
- Added reduce ddp results on eval (#2434)
- Added a warning when an `IterableDataset` has `__len__` defined (#2437)
Changed
- Enabled no returns from eval (#2446)
Fixed
- Fixed train outputs (#2428)
- Fixed Conda dependencies (#2412)
- Fixed Apex scaling with decoupled backward (#2433)
- Fixed progress bar crashing or displaying incorrectly because of missing ipywidgets (#2417)
- Fixed TPU saving dir (fc26078, 04e68f0)
- Fixed logging on rank 0 only (#2425)
Contributors
Bug fixing
DDP and Checkpoint bug fixes
Overview
As we continue to strengthen the codebase with more tests, we're finally getting rid of annoying bugs that have been around for a while now, mostly around the inconsistent checkpoint and early stopping behaviour (amazing work @awaelchli, @jeremyjordan).
Noteworthy changes:
- Fixed TPU flag parsing
- fixed the `average_precision` metric
- all the checkpoint issues should be gone now (including backward support for old checkpoints)
- DDP + loggers should be fixed
Detail changes
Added
- Added TorchText support for moving data to GPU (#2379)
Changed
- Changed epoch indexing to start from 0 instead of 1 (#2289)
- Refactored Model `backward` (#2276)
- Refactored `training_batch` + tests to verify correctness (#2327, #2328)
- Refactored training loop (#2336)
- Made optimization steps for hooks (#2363)
- Changed default apex level to 'O2' (#2362)
Removed
- Moved `TrainsLogger` to Bolts (#2384)
Fixed
- Fixed parsing TPU arguments and TPU tests (#2094)
- Fixed number of batches in case of multiple dataloaders and `limit_{*}_batches` (#1920, #2226)
- Fixed an issue with forward hooks not being removed after model summary (#2298)
- Fixed `load_from_checkpoint()` not working with absolute path on Windows (#2294)
- Fixed an issue with how `_has_len` handles `NotImplementedError`, e.g. raised by `torchtext.data.Iterator` (#2293, #2307)
- Fixed `average_precision` metric (#2319)
- Fixed ROC metric for CUDA tensors (#2304)
- Fixed lost compatibility with custom datatypes implementing `.to` (#2335)
- Fixed loading model with kwargs (#2387)
- Fixed `sum(0)` for `trainer.num_val_batches` (#2268)
- Fixed checking if the parameters are a `DictConfig` object (#2216)
- Fixed SLURM weights saving (#2341)
- Fixed swapped LR scheduler order (#2356)
- Fixed adding tensorboard `hparams` logging test (#2342)
- Fixed use of model reference for teardown (#2360)
- Fixed logger crash on DDP (#2388)
- Fixed several issues with early stopping and checkpoint callbacks (#1504, #2391)
- Fixed loading past checkpoints from v0.7.x (#2405)
- Fixed loading model without arguments (#2403)
Contributors
@airium, @awaelchli, @Borda, @elias-ramzi, @jeremyjordan, @lezwon, @mateuszpieniak, @mmiakashs, @pwl, @rohitgr7, @ssakhavi, @thschaaf, @tridao, @williamFalcon
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Fixing hooks & hparams
Overview
Fixing critical bugs in newly added hooks and hparams assignment.
The recommended data flow is the following:
- use `prepare_data` to download and process the dataset
- use `setup` to do splits and build your model internals
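A minimal sketch of this flow, assuming the `prepare_data`/`setup` model hooks from this release; the saved tensor file stands in for a real dataset:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, random_split
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def prepare_data(self):
        # runs once (on rank 0 under DDP): download/process the dataset to disk
        torch.save(torch.randn(1000, 32), "features.pt")

    def setup(self, stage):
        # runs on every process: load data, make splits, build model internals
        features = torch.load("features.pt")
        dataset = TensorDataset(features, torch.zeros(len(features), 1))
        self.train_set, self.val_set = random_split(dataset, [800, 200])
        self.layer = nn.Linear(32, 1)

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=32)
```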
Detail changes
Metrics, speed improvements, new hooks and flags
Overview
Highlights of this release are the new Metrics package and new hooks and flags to customize your workflow.
Major features:
- brand new Metrics package with built-in DDP support (by @justusschock and @SkafteNicki)
- `hparams` can now be anything! Call `self.save_hyperparameters()` to register anything in the `__init__` (see the sketch after this list)
- many speed improvements (how we move data, adjusted some flags; PL now adds only ~300ms overhead per epoch!)
- much faster `ddp` implementation. The old one was renamed `ddp_spawn`
- better support for Hydra
- added the `overfit_batches` flag and corrected some bugs with the `limit_[train,val,test]_batches` flags
- added conda support
- tons of bug fixes 😉
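As a quick illustration of the new `hparams` behaviour, a minimal sketch of `self.save_hyperparameters()`; the argument names and layer sizes are arbitrary examples:

```python
from torch import nn
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self, hidden_dim=128, learning_rate=1e-3):
        super().__init__()
        # registers every __init__ argument under self.hparams and
        # stores them in checkpoints so the model can be reloaded
        self.save_hyperparameters()
        self.layer = nn.Linear(self.hparams.hidden_dim, 1)
```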
Detail changes
Added
- Added `overfit_batches`, `limit_{val|test}_batches` flags (overfit now uses training set for all three) (#2213)
- Added metrics
- Added type hints in `Trainer.fit()` and `Trainer.test()` to reflect that also a list of dataloaders can be passed in (#1723)
- Allow dataloaders without sampler field present (#1907)
- Added option `save_last` to save the model at the end of every epoch in `ModelCheckpoint` (#1908)
- Early stopping checks `on_validation_end` (#1458)
- Attribute `best_model_path` to `ModelCheckpoint` for storing and later retrieving the path to the best saved model file (#1799)
- Speed up single-core TPU training by loading data using `ParallelLoader` (#2033)
- Added a model hook `transfer_batch_to_device` that enables moving custom data structures to the target device (#1756) (see the sketch after this list)
- Added black formatter for the code with code-checker on pull (#1610)
- Added back the slow spawn ddp implementation as `ddp_spawn` (#2115)
- Added loading checkpoints from URLs (#1667)
- Added a callback method `on_keyboard_interrupt` for handling KeyboardInterrupt events during training (#2134)
- Added a decorator `auto_move_data` that moves data to the correct device when using the LightningModule for inference (#1905)
- Added `ckpt_path` option to `LightningModule.test(...)` to load particular checkpoint (#2190)
- Added `setup` and `teardown` hooks for model (#2229)
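A minimal sketch of the `transfer_batch_to_device` hook mentioned above; `CustomBatch` is a hypothetical container that Lightning cannot move on its own:

```python
import pytorch_lightning as pl

class CustomBatch:
    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets

class LitModel(pl.LightningModule):
    def transfer_batch_to_device(self, batch, device):
        if isinstance(batch, CustomBatch):
            # move the tensors Lightning does not know about
            batch.inputs = batch.inputs.to(device)
            batch.targets = batch.targets.to(device)
            return batch
        # fall back to the default moving logic for everything else
        return super().transfer_batch_to_device(batch, device)
```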
Changed
- Allow user to select individual TPU core to train on (#1729)
- Removed non-finite values from loss in `LRFinder` (#1862)
- Allow passing model hyperparameters as complete kwarg list (#1896)
- Renamed `ModelCheckpoint`'s attributes `best` to `best_model_score` and `kth_best_model` to `kth_best_model_path` (#1799)
- Re-enabled Logger's `ImportError`s (#1938)
- Changed the default value of the Trainer argument `weights_summary` from `full` to `top` (#2029)
- Raise an error when lightning replaces an existing sampler (#2020)
- Enabled `prepare_data` from correct processes - clarify local vs global rank (#2166)
- Removed explicit flush from tensorboard logger (#2126)
- Changed epoch indexing to start from 1 instead of 0 (#2206)
Deprecated
- Deprecated flags (#2213):
  - `overfit_pct` in favour of `overfit_batches`
  - `val_percent_check` in favour of `limit_val_batches`
  - `test_percent_check` in favour of `limit_test_batches`
- Deprecated `ModelCheckpoint`'s attributes `best` and `kth_best_model` (#1799)
- Dropped official support/testing for older PyTorch versions <1.3 (#1917)
Removed
- Removed unintended Trainer argument `progress_bar_callback`; the callback should be passed in by `Trainer(callbacks=[...])` instead (#1855)
- Removed obsolete `self._device` in Trainer (#1849)
- Removed deprecated API (#2073)
  - Packages: `pytorch_lightning.pt_overrides`, `pytorch_lightning.root_module`
  - Modules: `pytorch_lightning.logging.comet_logger`, `pytorch_lightning.logging.mlflow_logger`, `pytorch_lightning.logging.test_tube_logger`, `pytorch_lightning.overrides.override_data_parallel`, `pytorch_lightning.core.model_saving`, `pytorch_lightning.core.root_module`
  - Trainer arguments: `add_row_log_interval`, `default_save_path`, `gradient_clip`, `nb_gpu_nodes`, `max_nb_epochs`, `min_nb_epochs`, `nb_sanity_val_steps`
  - Trainer attributes: `nb_gpu_nodes`, `num_gpu_nodes`, `gradient_clip`, `max_nb_epochs`, `min_nb_epochs`, `nb_sanity_val_steps`, `default_save_path`, `tng_tqdm_dic`
Fixed
- Run graceful training teardown on interpreter exit (#1631)
- Fixed user warning when apex was used together with learning rate schedulers (#1873)
- Fixed multiple calls of `EarlyStopping` callback (#1863)
- Fixed an issue with `Trainer.from_argparse_args` when passing in unknown Trainer args (#1932)
- Fixed bug related to logger not being reset correctly for model after tuner algorithms (#1933)
- Fixed root node resolution for SLURM cluster with dash in hostname (#1954)
- Fixed `LearningRateLogger` in multi-scheduler setting (#1944)
- Fixed test configuration check and testing (#1804)
- Fixed an issue with Trainer constructor silently ignoring unknown/misspelt arguments (#1820)
- Fixed `save_weights_only` in ModelCheckpoint (#1780)
- Allow use of same `WandbLogger` instance for multiple training loops (#2055)
- Fixed an issue with `_auto_collect_arguments` collecting local variables that are not constructor arguments and not working for signatures that have the instance not named `self` (#2048)
- Fixed mistake in parameters' grad norm tracking (#2012)
- Fixed CPU and hanging GPU crash (#2118)
- Fixed an issue with the model summary and `example_input_array` depending on a specific ordering of the submodules in a LightningModule (#1773)
- Fixed TPU logging (#2230)
- Fixed PID port + duplicate `rank_zero` logging (#2140, #2231)
Contributors
@awaelchli, @baldassarreFe, @Borda, @borisdayma, @cuent, @devashishshankar, @ivannz, @j-dsouza, @justusschock, @kepler, @kumuji, @lezwon, @lgvaz, @LoicGrobol, @mateuszpieniak, @maximsch2, @moi90, @rohitgr7, @SkafteNicki, @tullie, @williamFalcon, @yukw777, @ZhaofengWu
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Transfer learning, tuning batch size, torchelastic support
Overview
Highlights of this release: support for TorchElastic, which enables distributed PyTorch training jobs to be executed in a fault-tolerant and elastic manner; auto-scaling of batch size; a new transfer learning example; and an option to provide a seed to random generators to ensure reproducibility.
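As a quick illustration, a minimal sketch combining the seeding option and batch-size auto-scaling; `seed_everything` and the `auto_scale_batch_size` flag are the interfaces assumed here, and `MyModel` is a hypothetical LightningModule exposing a `batch_size` hyperparameter:

```python
import pytorch_lightning as pl

pl.seed_everything(42)  # seed Python, NumPy and PyTorch RNGs for reproducibility

model = MyModel(batch_size=32)  # hypothetical model; batch_size gets tuned
trainer = pl.Trainer(auto_scale_batch_size=True)  # find the largest batch size that fits in memory
trainer.fit(model)
```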
Detail changes
Added
- Added callback for logging learning rates (#1498)
- Added transfer learning example (for a binary classification task in computer vision) (#1564)
- Added type hints in `Trainer.fit()` and `Trainer.test()` to reflect that also a list of dataloaders can be passed in (#1723)
- Added auto scaling of batch size (#1638)
- The progress bar metrics now also get updated in `training_epoch_end` (#1724)
- Enable `NeptuneLogger` to work with `distributed_backend=ddp` (#1753)
- Added option to provide seed to random generators to ensure reproducibility (#1572)
- Added override for hparams in `load_from_ckpt` (#1797)
- Added support for multi-node distributed execution under `torchelastic` (#1811, #1818)
- Added using `store_true` for bool args (#1822, #1842)
- Added dummy logger for internally disabling logging for some features (#1836)
Changed
- Enable `non_blocking` for device transfers to GPU (#1843)
- Replaced meta_tags.csv with hparams.yaml (#1271)
- Reduction when `batch_size < num_gpus` (#1609)
- Updated LightningTemplateModel to look more like Colab example (#1577)
- Don't convert `namedtuple` to `tuple` when transferring the batch to the target device (#1589)
- Allow passing `hparams` as a keyword argument to LightningModule when loading from checkpoint (#1639)
- Args should come after the last positional argument (#1807)
- Made DDP the default if no backend specified with multiple GPUs (#1789)
Deprecated
- Deprecated `tags_csv` in favor of `hparams_file` (#1271)
Fixed
- Fixed broken link in PR template (#1675)
- Fixed `ModelCheckpoint` not checking the file path for None (#1654)
- Trainer now calls `on_load_checkpoint()` when resuming from a checkpoint (#1666)
- Fixed sampler logic for DDP with the iterable dataset (#1734)
- Fixed `_reset_eval_dataloader()` for IterableDataset (#1560)
- Fixed Horovod distributed backend to set the `root_gpu` property (#1669)
- Fixed wandb logger `global_step` affecting other loggers (#1492)
- Fixed disabling progress bar on non-zero ranks using Horovod backend (#1709)
- Fixed bugs that prevented the LR finder from being used together with early stopping and validation dataloaders (#1676)
- Fixed a bug in Trainer that prepended the checkpoint path with `version_` when it shouldn't (#1748)
- Fixed LR key name in case of param groups in LearningRateLogger (#1719)
- Fixed saving native AMP scaler state (introduced in #1561)
- Fixed accumulation parameter and suggestion method for learning rate finder (#1801)
- Fixed num processes not being set properly and the auto sampler failing in DDP (#1819)
- Fixed bugs in semantic segmentation example (#1824)
- Fixed saving native AMP scaler state (#1561, #1777)
- Fixed native AMP + DDP (#1788)
- Fixed `hparam` logging with metrics (#1647)
Contributors
@ashwinb, @awaelchli, @Borda, @cmpute, @festeh, @jbschiratti, @justusschock, @kepler, @kumuji, @nanddalal, @nathanbreitsch, @olineumann, @pitercl, @rohitgr7, @S-aiueo32, @SkafteNicki, @tgaddair, @tullie, @tw991, @williamFalcon, @ybrovman, @yukw777
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Critical DDP bug fixes
We made a few changes to Callbacks to run ops on detached GPU tensors and avoid CPU transfers. However, this made callbacks unpicklable, which crashes DDP.
This release fixes that core issue.
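Concretely, DDP requires every callback to survive a pickle round-trip. A minimal sketch of that check, with an example `filepath`:

```python
import pickle
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_cb = ModelCheckpoint(filepath="checkpoints/")
# raises at pickle time if the callback holds an unpicklable attribute
pickle.loads(pickle.dumps(checkpoint_cb))
```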
Changed
- Allow logging of metrics together with hparams (#1630)
Removed
- Removed Warning from trainer loop (#1634)
Fixed
- Fixed `ModelCheckpoint` not being pickleable (#1632)
- Fixed CPU DDP breaking change and DDP change (#1635)
- Tested pickling (#1636)
Contributors
PyTorch 1.5 support, native PyTorch AMP, speed/memory optimizations and many bug fixes
Key updates
- PyTorch 1.5 support
- Added Horovod distributed_backend option (see the sketch after this list)
- Enabled forward compatibility with native AMP (PyTorch 1.6)
- Support for 8-core TPU on Kaggle
- Added ability to customize the progress bar via callbacks
- Speed/memory optimizations
- Improved argparse usability with Trainer
- Docs improvements
- Tons of bug fixes
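A minimal sketch of the new Horovod backend; `MyModel` is a hypothetical LightningModule, and the script would be launched with something like `horovodrun -np 4 python train.py`:

```python
import pytorch_lightning as pl

model = MyModel()  # hypothetical LightningModule
# one GPU per Horovod process; horovodrun manages the worker processes
trainer = pl.Trainer(distributed_backend='horovod', gpus=1)
trainer.fit(model)
```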
Detail changes
Added
- Added flag `replace_sampler_ddp` to manually disable sampler replacement in DDP (#1513)
- Added speed parity tests (max 1 sec difference per epoch) (#1482)
- Added `auto_select_gpus` flag to trainer that enables automatic selection of available GPUs on exclusive mode systems
- Added learning rate finder (#1347)
- Added support for DDP mode in clusters without SLURM (#1387)
- Added `test_dataloaders` parameter to `Trainer.test()` (#1434)
- Added `terminate_on_nan` flag to trainer that performs a NaN check with each training iteration when set to `True` (#1475)
- Added `ddp_cpu` backend for testing DDP without GPUs (#1158)
- Added Horovod support as a distributed backend `Trainer(distributed_backend='horovod')` (#1529)
- Added support for 8-core distributed training on Kaggle TPUs (#1568)
- Added support for native AMP (#1561, #1580)
Changed
- Changed the default behaviour to no longer include a NaN check with each training iteration. (#1475)
- Decoupled the progress bar from trainer. It is a callback now and can be customized or even be replaced entirely (#1450).
- Changed lr schedule step interval behavior to update every backwards pass instead of every forwards pass (#1477)
- Defined shared process rank, removed rank from instances (e.g. loggers) (#1408)
- Updated semantic segmentation example with custom u-net and logging (#1371)
- Disabled val and test shuffling (#1600)
Deprecated
- Deprecated `training_tqdm_dict` in favor of `progress_bar_dict` (#1450)
Removed
- Removed `test_dataloaders` parameter from `Trainer.fit()` (#1434)
Fixed
- Added the possibility to pass nested metrics dictionaries to loggers (#1582)
- Fixed memory leak from opt return (#1528)
- Fixed saving checkpoint before deleting old ones (#1453)
- Fixed loggers - flushing last logged metrics even before continue, e.g. `trainer.test()` results (#1459)
- Fixed optimizer configuration when `configure_optimizers` returns a dict without `lr_scheduler` (#1443)
- Fixed `LightningModule` - mixing hparams and arguments in `LightningModule.__init__()` crashes `load_from_checkpoint()` (#1505)
- Added a missing call to the `on_before_zero_grad` model hook (#1493)
- Allow use of sweeps with WandbLogger (#1512)
- Fixed a bug that caused the `callbacks` Trainer argument to reference a global variable (#1534)
- Fixed a bug that set all boolean CLI arguments from `Trainer.add_argparse_args` always to True (#1571)
- Fixed to not copy the batch when training on a single GPU (#1576, #1579)
- Fixed soft checkpoint removing on DDP (#1408)
- Fixed automatic parser bug (#1585)
- Fixed bool conversion from string (#1606)
Contributors
@alexeykarnachev, @areshytko, @awaelchli, @Borda, @borisdayma, @ethanwharris, @fschlatt, @HenryJia, @Ir1d, @justusschock, @karlinjf, @lezwon, @neggert, @rmrao, @rohitgr7, @SkafteNicki, @tgaddair, @williamFalcon
If we forgot someone due to not matching commit email with GitHub account, let us know :]
DDP bug fixes
0.7.2 had a few (subtle) bugs that affected DDP and a few other key things, so we released 0.7.3 to fix them because they are critical for DDP. Sorry about that! There are still no API changes, but please skip straight to upgrading to 0.7.3 for these fixes.
Detail changes
Added
- Added `rank_zero_warn` for warning only in rank 0 (#1428)
Fixed
- Fixed default `DistributedSampler` for DDP training (#1425)
- Fixed workers warning not on Windows (#1430)
- Fixed returning tuple from `run_training_batch` (#1431)
- Fixed gradient clipping (#1438)
- Fixed pretty print (#1441)