Accessing a registered buffer is very slow #11493
-
Hello, I implemented MoCo in PyTorch Lightning. I was surprised to see that my Lightning version was slower than my plain PyTorch implementation, so I ran the profiler to check which functions are slow. I can't share all my code, but here are the relevant parts:

```python
import torch
from torch import nn, Tensor
from pytorch_lightning import LightningModule


class MoCoModel(LightningModule):
    def __init__(
        ...
    ) -> None:
        ...
        self.register_buffer('queue', torch.randn(queue.feature_dim, queue.size))
        self.queue = nn.functional.normalize(self.queue, dim=0)
        self.register_buffer('queue_ptr', torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _update_queue(self, x: Tensor) -> None:
        x = self.concat_all_gather_without_backprop(x)

        # batch_size = x.shape[0]
        batch_size = self._get_batch_size(x)

        # for simplicity
        # ptr = int(self.queue_ptr)
        ptr = self._get_ptr()

        # assert self.queue_size % batch_size == 0
        self._assert(batch_size)

        # replace the keys at ptr (dequeue and enqueue)
        # self.queue[:, ptr: ptr + batch_size] = x.T
        x = self._transpose(x)
        self._assign_in_queue(x, ptr, batch_size)

        # move pointer
        # ptr = (ptr + batch_size) % self.queue_size
        ptr = self._compute_ptr(ptr, batch_size)
        self._assign_ptr(ptr)

    def _get_batch_size(self, x):
        return x.shape[0]

    def _get_ptr(self):
        return int(self.queue_ptr)

    def _assert(self, batch_size):
        assert self.queue_size % batch_size == 0

    def _assign_ptr(self, ptr):
        self.queue_ptr[0] = ptr

    def _compute_ptr(self, ptr, batch_size):
        return (ptr + batch_size) % self.queue_size

    def _transpose(self, x):
        return x.T

    def _assign_in_queue(self, x, ptr, batch_size):
        self.queue[:, ptr: ptr + batch_size] = x

    def training_step(self, batch):
        ...
        self._update_queue(k)
```
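`concat_all_gather_without_backprop` isn't shown above; in MoCo-style code it is typically a no-grad all-gather across DDP processes, along the lines of the following sketch (this is an assumption about the helper, not the actual implementation):

```python
import torch
import torch.distributed as dist


@torch.no_grad()
def concat_all_gather_without_backprop(x: torch.Tensor) -> torch.Tensor:
    """Gather `x` from every process and concatenate along the batch dim.

    Sketch only: wrapped in no_grad, so no gradients flow through the gather.
    """
    if not (dist.is_available() and dist.is_initialized()):
        return x
    gathered = [torch.empty_like(x) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, x)
    return torch.cat(gathered, dim=0)
```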
Running the simple profiler shows that a large amount of time is spent in the helper functions that access the registered buffers. I tested with both the DDP and SingleDevice strategies; both resulted in the same kind of slowdown on a SLURM cluster environment.
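For reference, here is a minimal sketch of enabling Lightning's simple profiler (the Trainer arguments are placeholders, not the exact configuration used above):

```python
from pytorch_lightning import Trainer

# Sketch: profiler="simple" makes Lightning print the mean and total duration
# of each profiled hook/function once fit() finishes. The other arguments are
# placeholders, not the configuration used in this thread.
trainer = Trainer(
    accelerator="gpu",
    devices=1,
    max_epochs=1,
    profiler="simple",
)

# model = MoCoModel(...)             # the LightningModule from the snippet above
# trainer.fit(model, datamodule=dm)  # profiler summary is printed after fit()
```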
-
Fixed it: Lightning is now as fast as my previous implementation. The problem was elsewhere, but I didn't detect it with the profiler because GPU computation is asynchronous and was not synchronized during profiling.
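For anyone profiling GPU code and hitting the same trap: CUDA kernels are launched asynchronously, so a host-side timer can attribute the wait to whichever line happens to synchronize next. A minimal sketch of timing with explicit synchronization (the `timed` helper is hypothetical):

```python
import time
import torch


def timed(fn, *args, **kwargs):
    # Hypothetical helper: synchronize before and after the call so the measured
    # wall time covers the GPU work launched by `fn`, not just the kernel launches.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start
```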
-
@juliendenize