Releases: Lightning-AI/pytorch-lightning
standard weekly patch release
Detail changes
Added
- Added PyTorch 1.7 Stable support (#3821)
- Added timeout for `tpu_device_exists` to ensure process does not hang indefinitely (#4340)
Changed
- W&B log in sync with `Trainer` step (#4405)
- Hook `on_after_backward` is called only when `optimizer_step` is being called (#4439)
- Moved `track_and_norm_grad` into `training_loop` and called only when `optimizer_step` is being called (#4439)
- Changed type checker with explicit cast of `ref_model` object (#4457)
Deprecated
- Deprecated passing a `ModelCheckpoint` instance to the `checkpoint_callback` Trainer argument (#4336)
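For reference, a minimal migration sketch for this deprecation; the `monitor` value is illustrative, not taken from the release notes:

```python
# Hypothetical migration example; "val_loss" is an illustrative monitor key.
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint(monitor="val_loss")

# deprecated: passing the instance via `checkpoint_callback`
# trainer = Trainer(checkpoint_callback=checkpoint)

# preferred: pass the instance via `callbacks`
trainer = Trainer(callbacks=[checkpoint])
```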
Fixed
- Disable saving checkpoints if not trained (#4372)
- Fixed error using `auto_select_gpus=True` with `gpus=-1` (#4209)
- Disabled training when `limit_train_batches=0` (#4371)
- Fixed that metrics do not store computational graph for all seen data (#4313)
- Fixed AMP unscale for `on_after_backward` (#4439)
- Fixed TorchScript export when module includes Metrics (#4428)
- Fixed CSV logger warning (#4419)
- Fixed skip DDP parameter sync (#4301)
Contributors
@ananthsub, @awaelchli, @borisdayma, @carmocca, @justusschock, @lezwon, @rohitgr7, @SeanNaren, @SkafteNicki, @ssaru, @tchaton, @ydcjeff
If we forgot someone due to not matching commit email with GitHub account, let us know :]
standard weekly patch release
Detail changes
Added
- Added `dirpath` and `filename` parameters in `ModelCheckpoint` (#4213); see the sketch after this list
- Added plugins docs and `DDPPlugin` to customize DDP across all accelerators (#4258)
- Added `strict` option to the scheduler dictionary (#3586)
- Added `fsspec` support for profilers (#4162)
- Added autogenerated helptext to `Trainer.add_argparse_args` (#4344)
- Added support for string values in `Trainer`'s `profiler` parameter (#3656)
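A hedged sketch combining two of the additions above (the new `ModelCheckpoint` path arguments and string `profiler` values); the directory, filename template, monitor key, and the "simple" profiler choice are illustrative rather than prescribed by the release:

```python
# Hypothetical values throughout; only the parameter names come from the entries above.
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

ckpt = ModelCheckpoint(
    dirpath="checkpoints/",             # directory for checkpoint files
    filename="{epoch}-{val_loss:.2f}",  # name template, replacing the old combined `filepath`
    monitor="val_loss",
)

trainer = Trainer(
    callbacks=[ckpt],
    profiler="simple",  # string value instead of a bool or profiler instance
)
```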
Changed
- Improved error messages for invalid `configure_optimizers` return values (#3587)
- Allow changing the logged step value in `validation_step` (#4130)
- Allow setting `replace_sampler_ddp=True` with a distributed sampler already added (#4273)
- Fixed sanitized parameters for `WandbLogger.log_hyperparams` (#4320)
Deprecated
- Deprecated `filepath` in `ModelCheckpoint` (#4213)
- Deprecated `reorder` parameter of the `auc` metric (#4237)
- Deprecated bool values in `Trainer`'s `profiler` parameter (#3656)
Fixed
- Fixed setting device ids in DDP (#4297)
- Fixed synchronization of best model path in `ddp_accelerator` (#4323)
- Fixed `WandbLogger` not uploading checkpoint artifacts at the end of training (#4341)
Contributors
@ananthsub, @awaelchli, @carmocca, @ddrevicky, @louis-she, @mauvilsa, @rohitgr7, @SeanNaren, @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]
standard weekly patch release
Detail changes
Added
- Added `persistent` flag to `Metric.add_state` (#4195)
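A minimal sketch of the new flag, assuming the class-based Metric API of the 1.0 line; the running-sum metric itself is only an illustrative example:

```python
# Illustrative running-sum metric; persistent=True keeps the state in the state_dict.
import torch
from pytorch_lightning.metrics import Metric

class RunningSum(Metric):
    def __init__(self):
        super().__init__()
        self.add_state("total", default=torch.tensor(0.0),
                       dist_reduce_fx="sum", persistent=True)

    def update(self, value: torch.Tensor):
        self.total += value.sum()

    def compute(self):
        return self.total
```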
Changed
Fixed
Contributors
If we forgot someone due to not matching commit email with GitHub account, let us know :]
fixes a major logging bug for val in 1.0
Fixes the last major bugs for validation logging.
Also removes duplicate charts for metric / metric_loss.
Doing this minor release because correct validation metrics logging is critical.
Detail changes
Added
- Added trace functionality to the function `to_torchscript` (#4142)
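A short sketch of the traced export path; the model, input shape, and file name are placeholders, and the `method`/`example_inputs` parameter names should be checked against your installed version:

```python
# TinyModel is a placeholder; a single linear layer keeps the example self-contained.
import torch
from torch import nn
import pytorch_lightning as pl

class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.layer(x)

model = TinyModel()

scripted = model.to_torchscript()  # default behaviour: torch.jit.script
traced = model.to_torchscript(method="trace",
                              example_inputs=torch.randn(1, 28 * 28))
torch.jit.save(traced, "tiny_traced.pt")
```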
Changed
- Called `on_load_checkpoint` before loading `state_dict` (#4057)
Removed
- Removed duplicate metric vs step log for train loop (#4173)
Fixed
- Fixed the `self.log` problem in `validation_step()` (#4169)
- Fixed `hparams` saving - save the state when `save_hyperparameters()` is called [in `__init__`] (#4163)
- Fixed runtime failure while exporting `hparams` to yaml (#4158)
Contributors
@Borda, @NumesSanguis, @rohitgr7, @williamFalcon
If we forgot someone due to not matching commit email with GitHub account, let us know :]
minor jit fixes
Obligatory post-1.0 minor release. The main fix makes the LightningModule fully compatible with JIT (there were some edge cases we had not covered).
1.0.0 - General availability
Overview
...
Detail changes
Added
- Added Explained Variance Metric + metric fix (#4013)
- Added Metric <-> Lightning Module integration tests (#4008)
- Added parsing OS env vars in `Trainer` (#4022)
- Added classification metrics (#4043)
- Updated explained variance metric (#4024)
- Enabled plugins (#4041)
- Enabled custom clusters (#4048)
- Enabled passing in custom accelerators (#4050)
- Added `LightningModule.toggle_optimizer` (#4058)
- Added `LightningModule.manual_backward` (#4063)
Changed
- Integrated metrics API with self.log (#3961)
- Decoupled Apex (#4052, #4054, #4055, #4056, #4058, #4060, #4061, #4062, #4063, #4064, #4065)
- Renamed all backends to `Accelerator` (#4066)
- Enabled manual returns (#4089)
Removed
- Removed `output` argument from `*_batch_end` hooks (#3965, #3966)
- Removed `output` argument from `*_epoch_end` hooks (#3967)
- Removed support for `EvalResult` and `TrainResult` (#3968)
- Removed deprecated trainer flags: `overfit_pct`, `log_save_interval`, `row_log_interval` (#3969)
- Removed deprecated `early_stop_callback` (#3982)
- Removed deprecated model hooks (#3980)
- Removed deprecated callbacks (#3979)
- Removed `trainer` argument in `LightningModule.backward` (#4056)
Fixed
- Fixed `current_epoch` property update to reflect true epoch number inside `LightningDataModule`, when `reload_dataloaders_every_epoch=True` (#3974)
- Fixed to print scaler value in progress bar (#4053)
- Fixed mismatch between docstring and code regarding when the `on_load_checkpoint` hook is called (#3996)
Contributors
@ananyahjha93, @Borda, @edenlightning, @hbredin, @rohitgr7, @SkafteNicki, @teddykoker, @williamFalcon
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Buffer release before 1.0
This release is a buffer in case 1.0 breaks any compatibility for people who upgrade. 0.10.0 has all the bug fixes and features of 1.0 but is 100% backward compatible. The 1.0 release follows in the next 24 hours.
Overview
The major changes are:
- Results objects are deprecated (we hated them too haha)
- This means dataflow and logging have been decoupled
To log:

```python
def any_step(...):
    self.log('something', i_computed)
```

Separately, return whatever you want from methods:

```python
def training_step(...):
    return loss
```

or

```python
def training_step(...):
    return {'loss': loss, 'whatever': [1, 'want']}
```

Detail changes
Added
- Added new Metrics API. (#3868, #3921)
- Enable PyTorch 1.7 compatibility (#3541)
- Added `LightningModule.to_torchscript` to support exporting as `ScriptModule` (#3258)
- Added warning when dropping unpicklable `hparams` (#2874)
- Added EMB similarity (#3349)
- Added `ModelCheckpoint.to_yaml` method (#3048)
- Allow `ModelCheckpoint` monitor to be `None`, meaning it will always save (#3630)
- Disabled optimizers setup during testing (#3059)
- Added support for datamodules to save and load checkpoints when training (#3563)
- Added support for datamodule in learning rate finder (#3425)
- Added gradient clip test for native AMP (#3754)
- Added dist lib to enable syncing anything across devices (#3762)
- Added `broadcast` to `TPUBackend` (#3814)
- Added `XLADeviceUtils` class to check XLA device type (#3274)
Changed
- Refactored accelerator backends:
- moved TPU `xxx_step` to backend (#3118)
- refactored DDP backend `forward` (#3119)
- refactored GPU backend `__step` (#3120)
- refactored Horovod backend (#3121, #3122)
- remove obscure forward call in eval + CPU backend `___step` (#3123)
- reduced all simplified forward (#3126)
- added hook base method (#3127)
- refactor eval loop to use hooks - use `test_mode` so we can split it later (#3129)
- moved `___step_end` hooks (#3130)
- training forward refactor (#3134)
- training AMP scaling refactor (#3135)
- eval step scaling factor (#3136)
- add eval loop object to streamline eval loop (#3138)
- refactored dataloader process hook (#3139)
- refactored inner eval loop (#3141)
- final inner eval loop hooks (#3154)
- clean up hooks in `run_evaluation` (#3156)
- clean up data reset (#3161)
- expand eval loop out (#3165)
- moved hooks around in eval loop (#3195)
- remove `_evaluate` fx (#3197)
- `Trainer.fit` hook clean up (#3198)
- DDPs train hooks (#3203)
- refactor DDP backend (#3204, #3207, #3208, #3209, #3210)
- reduced accelerator selection (#3211)
- group prepare data hook (#3212)
- added data connector (#3285)
- modular is_overridden (#3290)
- adding `Trainer.tune()` (#3293)
- move `run_pretrain_routine` -> `setup_training` (#3294)
- move train outside of setup training (#3297)
- move `prepare_data` to data connector (#3307)
- moved accelerator router (#3309)
- train loop refactor - moving train loop to own object (#3310, #3312, #3313, #3314)
- duplicate data interface definition up into DataHooks class (#3344)
- inner train loop (#3359, #3361, #3362, #3363, #3365, #3366, #3367, #3368, #3369, #3370, #3371, #3372, #3373, #3374, #3375, #3376, #3385, #3388, #3397)
- all logging related calls in a connector (#3395)
- device parser (#3400, #3405)
- added model connector (#3407)
- moved eval loop logging to loggers (#3408)
- moved eval loop (#3412, #3408)
- trainer/separate argparse (#3421, #3428, #3432)
- move `lr_finder` (#3434)
- organize args (#3435, #3442, #3447, #3448, #3449, #3456)
- move specific accelerator code (#3457)
- group connectors (#3472)
- accelerator connector methods x/n (#3469, #3470, #3474)
- merge backends (#3476, #3477, #3478, #3480, #3482)
- apex plugin (#3502)
- precision plugins (#3504)
- Result - make monitor default to `checkpoint_on` to simplify (#3571)
- reference to the Trainer on the `LightningDataModule` (#3684)
- add `.log` to lightning module (#3686, #3699, #3701, #3704, #3715)
- enable tracking original metric when step and epoch are both true (#3685)
- deprecated results obj, added support for simpler comms (#3681)
- move backends back to individual files (#3712)
- fixes logging for eval steps (#3763)
- decoupled DDP, DDP spawn (#3733, #3766, #3767, #3774, #3802, #3806)
- remove weight loading hack for ddp_cpu (#3808)
- separate `torchelastic` from DDP (#3810)
- separate SLURM from DDP (#3809)
- decoupled DDP2 (#3816)
- bug fix with logging val epoch end + monitor (#3812)
- decoupled DDP, DDP spawn (#3733, #3817, #3819, #3927)
- callback system and init DDP (#3836)
- adding compute environments (#3837, #3842)
- epoch can now log independently (#3843)
- test selecting the correct backend. temp backends while slurm and TorchElastic are decoupled (#3848)
- fixed `init_slurm_connection` causing hostname errors (#3856)
- moves init apex from LM to apex connector (#3923)
- moves sync bn to each backend (#3925)
- moves configure ddp to each backend (#3924)
- moved TPU
- Deprecation warning (#3844)
- Changed `LearningRateLogger` to `LearningRateMonitor` (#3251)
- Used `fsspec` instead of `gfile` for all IO (#3320)
- Swapped `torch.load` for `fsspec` load in DDP spawn backend (#3787)
- Swapped `torch.load` for `fsspec` load in cloud_io loading (#3692)
- Added support for `to_disk()` to use remote filepaths with `fsspec` (#3930)
- Updated model_checkpoint's to_yaml to use `fsspec` open (#3801)
- Fixed `fsspec` is inconsistent when doing `fs.ls` (#3805)
- Refactor `GPUStatsMonitor` to improve training speed (#3257)
- Changed IoU score behavior for classes absent in target and pred (#3098)
- Changed IoU `remove_bg` bool to `ignore_index` optional int (#3098)
- Changed defaults of `save_top_k` and `save_last` to `None` in ModelCheckpoint (#3680)
- `row_log_interval` and `log_save_interval` are now based on training loop's `global_step` instead of epoch-internal batch index (#3667)
- Silenced some warnings. verified ddp refactors (#3483)
- Cleaning up stale logger tests (#3490)
- Allow `ModelCheckpoint` monitor to be `None` (#3633)
- Enable `None` model checkpoint default (#3669)
- Skipped `best_model_path` if `checkpoint_callback` is `None` (#2962)
- Used `raise .. from ..` to explicitly chain exceptions (#3750)
- Mocking loggers (#3596, #3617, #3851, #3859, #3884, #3853, #3910, #3889, #3926)
- Write predictions in LightningModule instead of EvalResult (#3882)
Deprecated
- Deprecated `TrainResult` and `EvalResult`, use `self.log` and `self.write` from the `LightningModule` to log metrics and write predictions. `training_step` can now only return a scalar (for the loss) or a dictionary with anything you want (#3681)
- Deprecate `early_stop_callback` Trainer argument (#3845)
- Rename Trainer arguments `row_log_interval` >> `log_every_n_steps` and `log_save_interval` >> `flush_logs_every_n_steps` (#3748)
Removed
- Removed experimental Metric API (#3868, #3943, #3949, #3946), listed changes before final removal:
- Added `EmbeddingSimilarity` metric (#3349, #3358)
- Added hooks to metric module interface (#2528)
- Added error when AUROC metric is used for multiclass problems (#3350)
- Fixed `ModelCheckpoint` with `save_top_k=-1` option not tracking the best models when a monitor metric is available (#3735)
- Fixed counter-intuitive error being thrown in `Accuracy` metric for zero target tensor (#3764)
- Fixed aggregation of metrics (#3517)
- Fixed Metric aggregation (#3321)
- Fixed RMSLE metric (#3188)
- Renamed `reduction` to `class_reduction` in classification metrics (#3322)
- Changed `class_reduction` similar to sklearn for classification metrics (#3322)
- Renaming of precision recall metric (#3308)
Fixed
- Fixed `on_train_batch_start` hook to end epoch early (#3700)
- Fixed `num_sanity_val_steps` is clipped to `limit_val_batches` (#2917)
- Fixed ONNX model save on GPU (#3145)
- Fixed `GpuUsageLogger` to work on different platforms (#3008)
- Fixed auto-scale batch size not dumping `auto_lr_find` parameter (#3151)
- Fixed `batch_outputs` with optimizer frequencies (#3229)
- Fixed setting batch size in `LightningModule.datamodule` when using `auto_scale_batch_size` (#3266)
- Fixed Horovod distributed backend compatibility with native AMP (#3404)
- Fixed batch size auto scaling exceeding the size of the dataset (#3271)
- Fixed getting `experiment_id` from MLFlow only once instead of each training loop (#3394)
- Fixed `overfit_batches` which now correctly disables shuffling for the training loader (#3501)
- Fixed gradient norm tracking for `row_log_interval > 1` (#3489)
- Fixed `ModelCheckpoint` name formatting (#3164)
- Fixed auto-scale batch size (#3151)
- Fixed example implementation of AutoEncoder (#3190)
- Fixed invalid paths when remote logging with TensorBoard (#3236)
- Fixed change `t()` to `transpose()` as XLA devices do not support `.t()` on 1-dim tensor (#3252)
- Fixed (weights only) checkpoints loading without PL (#3287)
- Fixed `gather_all_tensors` cross GPUs in DDP (#3319)
- Fixed CometML save dir (#3419)
- Fixed forward key metrics (#3467)
- Fixed normalize mode at confusion matrix (replace NaNs with zeros) (#3465)
- Fixed global step increment in training loop when `training_epoch_end` hook is used (#3673)
- Fixed dataloader shuffling not getting turned off with `overfit_batches > 0` and `distributed_backend = "ddp"` (#3534)
- Fixed determinism in `DDPSpawnBackend` when using `seed_everything` in main process (#3335)
- Fixed `ModelCheckpoint` `period` to actually save every `period` epochs (#363...
synced BatchNorm, DataModules and final API
Overview
The newest PyTorch Lightning release includes final API clean-up with better data decoupling and shorter logging syntax.
We're happy to release PyTorch Lightning 0.9 today, which contains many great new features and more bugfixes than any release we ever had, but most importantly it introduces our mostly final API changes! Lightning is being adopted by top researchers and AI labs around the world, and we are working hard to make sure we provide a smooth experience and support for all the latest best practices.
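To make the data-decoupling point concrete, here is a minimal, hedged `LightningDataModule` sketch; the random-tensor dataset, split sizes, and batch size are placeholders rather than anything prescribed by this release:

```python
# Random tensors stand in for a real dataset; hook names follow LightningDataModule.
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split
import pytorch_lightning as pl

class RandomDataModule(pl.LightningDataModule):
    def __init__(self, batch_size: int = 32):
        super().__init__()
        self.batch_size = batch_size

    def setup(self, stage=None):
        # build the datasets once per process
        data = TensorDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))
        self.train_set, self.val_set = random_split(data, [800, 200])

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=self.batch_size)

# usage sketch: trainer.fit(model, datamodule=RandomDataModule())
```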
Detail changes
Added
- Added SyncBN for DDP (#2801, #2838)
- Added basic `CSVLogger` (#2721)
- Added SSIM metrics (#2671)
- Added BLEU metrics (#2535)
- Added support to export a model to ONNX format (#2596)
- Added support for `Trainer(num_sanity_val_steps=-1)` to check all validation data before training (#2246)
- Added struct. output:
- Added class `LightningDataModule` (#2668)
- Added support for PyTorch 1.6 (#2745)
- Added call DataModule hooks implicitly in trainer (#2755)
- Added support for Mean in DDP Sync (#2568)
- Added remaining `sklearn` metrics: `AveragePrecision`, `BalancedAccuracy`, `CohenKappaScore`, `DCG`, `Hamming`, `Hinge`, `Jaccard`, `MeanAbsoluteError`, `MeanSquaredError`, `MeanSquaredLogError`, `MedianAbsoluteError`, `R2Score`, `MeanPoissonDeviance`, `MeanGammaDeviance`, `MeanTweedieDeviance`, `ExplainedVariance` (#2562)
- Added support for `limit_{mode}_batches (int)` to work with infinite dataloader (IterableDataset) (#2840)
- Added support returning python scalars in DP (#1935)
- Added support to Tensorboard logger for OmegaConf `hparams` (#2846)
- Added tracking of basic states in `Trainer` (#2541)
- Tracks all outputs including TBPTT and multiple optimizers (#2890)
- Added GPU Usage Logger (#2932)
- Added `strict=False` for `load_from_checkpoint` (#2819)
- Added saving test predictions on multiple GPUs (#2926)
- Auto log the computational graph for loggers that support this (#3003)
- Added warning when changing monitor and using results obj (#3014)
- Added a hook `transfer_batch_to_device` to the `LightningDataModule` (#3038)
Changed
- Truncated long version numbers in progress bar (#2594)
- Enabling val/test loop disabling (#2692)
- Refactored into `accelerator` module:
- Using `.comet.config` file for `CometLogger` (#1913)
- Updated hooks arguments - breaking for `setup` and `teardown` (#2850)
- Using `gfile` to support remote directories (#2164)
- Moved optimizer creation after device placement for DDP backends (#2904)
- Support `**DictConfig` for `hparam` serialization (#2519)
- Removed callback metrics from test results obj (#2994)
- Re-enabled naming metrics in ckpt name (#3060)
- Changed progress bar epoch counting to start from 0 (#3061)
Deprecated
- Deprecated Trainer attribute `ckpt_path`, which will now be set by `weights_save_path` (#2681)
Removed
- Removed deprecated: (#2760)
- core decorator `data_loader`
- Module hook `on_sanity_check_start` and loading `load_from_metrics`
- package `pytorch_lightning.logging`
- Trainer arguments: `show_progress_bar`, `num_tpu_cores`, `use_amp`, `print_nan_grads`
- LR Finder argument `num_accumulation_steps`
Fixed
- Fixed `accumulate_grad_batches` for last batch (#2853)
- Fixed setup call while testing (#2624)
- Fixed local rank zero casting (#2640)
- Fixed single scalar return from training (#2587)
- Fixed Horovod backend to scale LR schedulers with the optimizer (#2626)
- Fixed `dtype` and `device` properties not getting updated in submodules (#2657)
- Fixed `fast_dev_run` to run for all dataloaders (#2581)
- Fixed `save_dir` in loggers getting ignored by default value of `weights_save_path` when user did not specify `weights_save_path` (#2681)
- Fixed `weights_save_path` getting ignored when `logger=False` is passed to Trainer (#2681)
- Fixed TPU multi-core and Float16 (#2632)
- Fixed test metrics not being logged with `LoggerCollection` (#2723)
- Fixed data transfer to device when using `torchtext.data.Field` and `include_lengths is True` (#2689)
- Fixed shuffle argument for the distributed sampler (#2789)
- Fixed logging interval (#2694)
- Fixed loss value in the progress bar is wrong when `accumulate_grad_batches > 1` (#2738)
- Fixed correct CWD for DDP sub-processes when using Hydra (#2719)
- Fixed selecting GPUs using `CUDA_VISIBLE_DEVICES` (#2739, #2796)
- Fixed false `num_classes` warning in metrics (#2781)
- Fixed shell injection vulnerability in subprocess call (#2786)
- Fixed LR finder and `hparams` compatibility (#2821)
- Fixed `ModelCheckpoint` not saving the latest information when `save_last=True` (#2881)
- Fixed ImageNet example: learning rate scheduler, number of workers and batch size when using DDP (#2889)
- Fixed apex gradient clipping (#2829)
- Fixed save apex scaler states (#2828)
- Fixed a model loading issue with inheritance and variable positional arguments (#2911)
- Fixed passing `non_blocking=True` when transferring a batch object that does not support it (#2910)
- Fixed checkpointing to remote file paths (#2925)
- Fixed adding `val_step` argument to metrics (#2986)
- Fixed an issue that caused `Trainer.test()` to stall in DDP mode (#2997)
- Fixed gathering of results with tensors of varying shape (#3020)
- Fixed batch size auto-scaling feature to set the new value on the correct model attribute (#3043)
- Fixed automatic batch scaling not working with half-precision (#3045)
- Fixed setting device to root GPU (#3042)
Contributors
@ananthsub, @ananyahjha93, @awaelchli, @bkhakshoor, @Borda, @ethanwharris, @f4hy, @groadabike, @ibeltagy, @justusschock, @lezwon, @nateraw, @neighthan, @nsarang, @PhilJd, @pwwang, @rohitgr7, @romesco, @ruotianluo, @shijianjian, @SkafteNicki, @tgaddair, @thschaaf, @williamFalcon, @xmotli02, @ydcjeff, @yukw777, @zerogerc
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Bug fixes and .test() fix + TPU tests
Overview
The point of this release is more bug fixes ahead of v 1.0.0. We now have CI tests on TPU thanks to @zcain117 from Google! 🙂
This means we fixed many TPU bugs we hadn’t caught before because we had no tests.
In addition, we fixed:
- all the file path errors with loggers (txs @awaelchli)
- pickling errors with loggers (txs @awaelchli)
- fixed all the .test() calls
Detail changes
Added
Removed
- Removed auto val reduce (#2462)
Fixed
- Flattening Wandb Hyperparameters (#2459)
- Fixed using the same DDP python interpreter and actually running (#2482)
- Fixed model summary input type conversion for models that have input dtype different from model parameters (#2510)
- Made `TensorBoardLogger` and `CometLogger` pickleable (#2518)
- Fixed a problem with `MLflowLogger` creating multiple run folders (#2502)
- Fixed global_step increment (#2455)
- Fixed TPU hanging example (#2488)
- Fixed `argparse` default value bug (#2526)
- Fixed Dice and IoU to avoid NaN by adding small eps (#2545)
- Fixed accumulate gradients schedule at epoch 0 (continued) (#2513)
- Fixed Trainer `.fit()` returning last not best weights in "ddp_spawn" (#2565)
- Fixed passing (do not pass) TPU weights back on test (#2566)
- Fixed DDP tests and `.test()` (#2512, #2570)
Contributors
@anthonytec2, @awaelchli, @bernardomig, @Borda, @EspenHa, @HHousen, @InCogNiTo124, @rohitgr7, @williamFalcon
If we forgot someone due to not matching commit email with GitHub account, let us know :]
More bug fixing!
Detail changes
Added
- Added reduce ddp results on eval (#2434)
- Added a warning when an `IterableDataset` has `__len__` defined (#2437)
Changed
- Enabled no returns from eval (#2446)
Fixed
- Fixes train outputs (#2428)
- Fixes Conda dependencies (#2412)
- Fixed Apex scaling with decoupled backward (#2433)
- Fixed crashing or wrong displaying progressbar because of missing ipywidgets (#2417)
- Fixed TPU saving dir (fc26078, 04e68f0)
- Fixed logging on rank 0 only (#2425)