Is your feature request related to a problem? Please describe.
I found that `__call__` in the stable video diffusion pipeline keeps issuing async memcpy operations between host and device, as shown in the attachment.

Describe the solution you'd like.
The cause is that every time `self.do_classifier_free_guidance` is read, a tensor is compared with an int, which produces a boolean on the device, and that boolean is then copied from GPU to CPU.
It would be better to assign the value to a local variable before the loop, since it does not change during the loop.
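To illustrate the idea, here is a minimal sketch in plain Python. It does not use the real pipeline or PyTorch: `DeviceBool`, `ToyPipeline`, `run_per_step`, and `run_hoisted` are hypothetical stand-ins, and the transfer counter only simulates the device-to-host copy that happens when a GPU boolean is used in an `if`.

```python
class DeviceBool:
    """Hypothetical stand-in for a boolean tensor that lives on the GPU."""

    transfers = 0  # counts simulated device-to-host copies

    def __init__(self, value):
        self._value = bool(value)

    def __bool__(self):
        # Using a device boolean in `if` forces a copy to the host;
        # here we only count how often that would happen.
        DeviceBool.transfers += 1
        return self._value


class ToyPipeline:
    """Hypothetical pipeline; only the guidance check is modeled."""

    def __init__(self, max_guidance_scale):
        self.max_guidance_scale = max_guidance_scale

    @property
    def do_classifier_free_guidance(self):
        # Mimics comparing a device tensor with an int:
        # the result is a boolean that stays on the device.
        return DeviceBool(self.max_guidance_scale > 1)

    def run_per_step(self, num_steps):
        # Current behavior: the property is evaluated inside the loop.
        guided_steps = 0
        for _ in range(num_steps):
            if self.do_classifier_free_guidance:
                guided_steps += 1
        return guided_steps

    def run_hoisted(self, num_steps):
        # Proposed behavior: evaluate once, reuse the host-side bool.
        do_cfg = bool(self.do_classifier_free_guidance)
        guided_steps = 0
        for _ in range(num_steps):
            if do_cfg:
                guided_steps += 1
        return guided_steps


pipe = ToyPipeline(max_guidance_scale=3.0)

DeviceBool.transfers = 0
per_step_result = pipe.run_per_step(25)
per_step_transfers = DeviceBool.transfers

DeviceBool.transfers = 0
hoisted_result = pipe.run_hoisted(25)
hoisted_transfers = DeviceBool.transfers
```

Both variants take the same branch on every step, but in this sketch the hoisted version performs one simulated transfer in total instead of one per step.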
Additional context.
I'd be glad to contribute this by opening a PR.