
Changing on_step in self.log causes batch_to_device time to change #20995

@davidgill97

Bug description

There seems to be a bottleneck in the training pipeline, so I profiled a run and found that setting on_step=True in self.log makes the progress bar take very long. This makes sense: logging to the progress bar requires calling .item() on a GPU tensor, which triggers a cudaStreamSynchronize.
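For context, here is a minimal sketch of the kind of logging call in question (the module, model attribute, and metric name are placeholders, not the actual code behind the reports below):

```python
import torch
import lightning.pytorch as pl


class LitClassifier(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.model(x)  # placeholder backbone
        loss = torch.nn.functional.cross_entropy(logits, y)
        # With on_step=True (and prog_bar=True), the value is shown on the
        # progress bar every step; formatting a GPU tensor for display calls
        # .item(), which blocks the host until queued CUDA work has finished.
        self.log("train/loss", loss, on_step=True, prog_bar=True)
        return loss
```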

FIT Profiler Report

| Action | Mean duration (s) | Num calls | Total time (s) | Percentage % |
|---|---|---|---|---|
| Total | - | 32523 | 164.08 | 100 % |
| run_training_epoch | 8.0065 | 12 | 96.078 | 58.556 |
| [Callback]ExportCallback.on_fit_end | 62.609 | 1 | 62.609 | 38.158 |
| [Callback]TQDMProgressBar.on_train_batch_end | 0.24358 | 144 | 35.075 | 21.377 |
| [Callback]ModelCheckpoint{'monitor': 'valid/macro_f1', 'mode': 'max', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None}.on_validation_end | 1.3881 | 12 | 16.657 | 10.152 |
| [Strategy]SingleDeviceStrategy.batch_to_device | 0.0040199 | 1056 | 4.245 | 2.5872 |

But when I change to on_step=False, TQDMProgressBar.on_train_batch_end takes much less time, yet transfer_batch_to_device takes almost 10 times longer. This behavior is completely baffling, and I don't know what to suspect.

Is this expected, and what is causing it?

FIT Profiler Report

| Action | Mean duration (s) | Num calls | Total time (s) | Percentage % |
|---|---|---|---|---|
| Total | - | 32523 | 163.98 | 100 % |
| run_training_epoch | 7.9402 | 12 | 95.282 | 58.104 |
| [Strategy]SingleDeviceStrategy.batch_to_device | 0.036941 | 1056 | 39.01 | 23.789 |
| [LightningModule]SingleLabelClassification.transfer_batch_to_device | 0.036852 | 1056 | 38.916 | 23.732 |
| [Callback]ModelCheckpoint{'monitor': 'valid/macro_f1', 'mode': 'max', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None}.on_validation_end | 1.3867 | 12 | 16.64 | 10.147 |
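Since CUDA kernels launch asynchronously and .item() is one of the calls that forces the host to wait, one possible explanation is that removing the per-step sync does not remove the cost but only moves the blocking point to the next operation that has to touch the GPU, so the time gets attributed to batch_to_device instead. A minimal diagnostic sketch to test that (SyncPointCallback is a hypothetical helper, not part of Lightning):

```python
import torch
import lightning.pytorch as pl


class SyncPointCallback(pl.Callback):
    """Force a CUDA sync at the end of every training batch so that pending
    GPU work is charged to the batch itself rather than to the next
    batch_to_device call in the profiler report."""

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if torch.cuda.is_available():
            torch.cuda.synchronize()
```

If batch_to_device drops back to a few milliseconds with this callback enabled, the extra time was inherited synchronization rather than a real transfer cost.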

Labels: bug, needs triage, ver: 2.5.x