Group offloading was added to diffusers in the v0.33.0 release. It can substantially reduce the memory required for training, at the cost of training speed. On a CUDA device that supports streams, the overhead to training speed is negligible.
huggingface/diffusers#10503
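
To make the memory trade-off concrete, here is a toy model in plain Python. It is not the diffusers implementation and calls no diffusers API. The function names, the layer sizes, and the assumption that stream-based prefetching keeps the next group resident alongside the active one are all hypothetical and only for illustration. It counts parameter memory only and ignores activations, gradients, and optimizer state.

```python
# Toy model of group offloading. NOT the diffusers implementation.
# Layer sizes are hypothetical parameter-memory figures in MB.

def peak_memory_without_offloading(layer_sizes_mb):
    """All layers stay resident on the accelerator for the whole pass."""
    return sum(layer_sizes_mb)


def peak_memory_with_group_offloading(layer_sizes_mb, group_size, prefetch=False):
    """Peak resident parameter memory when layers are moved in groups.

    Without prefetch, only the active group is on the accelerator.
    With prefetch (an assumption standing in for stream-based overlap),
    the next group is loaded while the current one is still resident.
    """
    if not layer_sizes_mb:
        return 0
    # Split the layer list into consecutive groups of `group_size` layers.
    group_totals = [
        sum(layer_sizes_mb[i:i + group_size])
        for i in range(0, len(layer_sizes_mb), group_size)
    ]
    if not prefetch or len(group_totals) == 1:
        return max(group_totals)
    # Two adjacent groups coexist while the next one is being prefetched.
    return max(a + b for a, b in zip(group_totals, group_totals[1:]))


layers = [100] * 8  # eight hypothetical layers of 100 MB each
print(peak_memory_without_offloading(layers))
print(peak_memory_with_group_offloading(layers, group_size=2))
print(peak_memory_with_group_offloading(layers, group_size=2, prefetch=True))
```

In this sketch, smaller groups lower the peak but imply more host-to-device transfers per step, which is where the speed cost comes from. Prefetching trades some of the memory saving back in exchange for hiding transfer time behind compute.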