It should be possible to leverage fp8-casted models, or torchao quantization, to support training in under 24 GB up to a reasonable resolution. Or at least that's the hope when combined with precomputation from #129. Will take a look soon 🤗
TorchAO docs: https://huggingface.co/docs/diffusers/main/en/quantization/torchao
FP8 casting: huggingface/diffusers#10347