Add support for Group Offloading

Group offloading was added to diffusers in the v0.33.0 release. It can substantially reduce the memory required for training, at the cost of training speed. On a CUDA device that supports streams, the overhead to training speed is negligible.

https://github.com/huggingface/diffusers/pull/10503
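
For reference, a minimal sketch of how group offloading might be wired onto a model, assuming the `apply_group_offloading` hook from `diffusers.hooks` as introduced in the linked PR. The helper names `build_group_offload_kwargs` and `enable_group_offloading` are hypothetical and exist only for illustration; the actual integration in this change may look different.

```python
from typing import Any, Dict


def build_group_offload_kwargs(use_stream: bool, num_blocks_per_group: int = 1) -> Dict[str, Any]:
    """Hypothetical helper: pick group offloading options for diffusers."""
    if use_stream:
        # Assumption: on stream-capable CUDA devices, leaf-level groups with
        # use_stream=True let weight transfers overlap with compute.
        return {"offload_type": "leaf_level", "use_stream": True}
    # Without streams, offload whole blocks in groups; larger groups use more
    # device memory but need fewer host/device transfers.
    return {
        "offload_type": "block_level",
        "num_blocks_per_group": num_blocks_per_group,
        "use_stream": False,
    }


def enable_group_offloading(module, onload_device: str = "cuda", use_stream: bool = False):
    """Hypothetical wrapper around diffusers' apply_group_offloading."""
    # Imported lazily so the helper above works without torch/diffusers installed.
    import torch
    from diffusers.hooks import apply_group_offloading

    apply_group_offloading(
        module,
        onload_device=torch.device(onload_device),  # where a group runs
        offload_device=torch.device("cpu"),         # where idle groups wait
        **build_group_offload_kwargs(use_stream),
    )
    return module
```

Calling `enable_group_offloading(transformer, use_stream=True)` on a stream-capable CUDA device would correspond to the low-overhead case described above; with `use_stream=False` the same call trades training speed for lower memory use.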