Below, we provide additional sections detailing the various options available in this repository. They all aim to make fine-tuning video models as accessible as possible.
## Dataset Preparation
TODO: Add a section on creating and using precomputed embeddings.
## Training
We provide training scripts for both text-to-video and image-to-video generation, compatible with the [Cog family of models](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce).
Take a look at `training/*.sh`
Supported and verified memory optimizations for training include:
- `CPUOffloadOptimizer` from [`torchao`](https://github.com/pytorch/ao). You can read about its capabilities and limitations [here](https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#optimizer-cpu-offload). In short, it allows you to use the CPU for storing trainable parameters and gradients. This results in the optimizer step happening on the CPU, which requires a fast CPU optimizer, such as `torch.optim.AdamW(fused=True)`, or applying `torch.compile` to the optimizer step. Additionally, it is recommended not to `torch.compile` your model for training. Gradient clipping and accumulation are not supported yet either. See the sketch after this list for how such an optimizer can be constructed.
- Low-bit optimizers from [`bitsandbytes`](https://huggingface.co/docs/bitsandbytes/optimizers). TODO: test and make the [`torchao`](https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim) ones work.
- DeepSpeed Zero2: Since we rely on `accelerate`, follow [this guide](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed) to configure your `accelerate` installation to enable training with DeepSpeed Zero2 optimizations.
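As a rough illustration of the first two options, here is a minimal, untested sketch of how these optimizers could be constructed for whatever module the training scripts optimize. The placeholder module, learning rate, and the `offload_gradients` choice below are assumptions for illustration; refer to the linked `torchao` and `bitsandbytes` docs and the training scripts for the authoritative wiring.

```python
import torch
import bitsandbytes as bnb
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

# Placeholder module standing in for the model being fine-tuned.
transformer = torch.nn.Linear(128, 128, device="cuda")
params = [p for p in transformer.parameters() if p.requires_grad]

# Option 1: torchao's CPUOffloadOptimizer keeps parameters/gradients on the CPU
# and runs the optimizer step there, so pair it with a fast fused CPU optimizer.
optimizer = CPUOffloadOptimizer(
    params,
    torch.optim.AdamW,       # optimizer class that will run on the CPU
    fused=True,              # fused CPU AdamW keeps the step fast
    offload_gradients=True,  # optionally keep gradients on the CPU as well
    lr=1e-4,
)

# Option 2: an 8-bit AdamW from bitsandbytes, which stores optimizer state
# in 8 bits while keeping everything on the GPU.
optimizer = bnb.optim.AdamW8bit(params, lr=1e-4, weight_decay=1e-2)
```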
> [!IMPORTANT]
> The memory requirements are reported after running `training/prepare_dataset.py`, which converts the videos and captions to latents and embeddings. During training, we load the latents and embeddings directly and do not require the VAE or the T5 text encoder. However, if you perform validation/testing, these must be loaded, which increases the amount of required memory. Skipping validation/testing therefore saves a significant amount of memory, which helps if you are training on smaller-VRAM GPUs.
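For reference, the precomputation step boils down to encoding each video into VAE latents and each caption into T5 embeddings once, up front. The following is only a rough sketch of that idea using `diffusers`/`transformers` components; the model ID, tensor shapes, and sequence length here are illustrative assumptions, and `training/prepare_dataset.py` remains the source of truth.

```python
import torch
from diffusers import AutoencoderKLCogVideoX
from transformers import AutoTokenizer, T5EncoderModel

model_id = "THUDM/CogVideoX-2b"  # assumed checkpoint; other CogVideoX variants work similarly
device = "cuda"

vae = AutoencoderKLCogVideoX.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = T5EncoderModel.from_pretrained(model_id, subfolder="text_encoder", torch_dtype=torch.float16).to(device)

@torch.no_grad()
def precompute(video: torch.Tensor, caption: str):
    # video: [frames, channels, height, width], values in [-1, 1]
    video = video.to(device, dtype=torch.float16).permute(1, 0, 2, 3).unsqueeze(0)  # -> [1, C, F, H, W]
    latents = vae.encode(video).latent_dist.sample() * vae.config.scaling_factor

    tokens = tokenizer(caption, padding="max_length", max_length=226, truncation=True, return_tensors="pt")
    prompt_embeds = text_encoder(tokens.input_ids.to(device))[0]
    # Saved to disk, these can be loaded directly during training, so neither
    # the VAE nor the T5 encoder needs to stay in memory.
    return latents.cpu(), prompt_embeds.cpu()
```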
- [ ] Make scripts compatible with DDP
- [ ] Make scripts compatible with FSDP
- [x] Make scripts compatible with DeepSpeed
- [x] Test scripts with memory-efficient optimizer from bitsandbytes
- [x] Test scripts with CPUOffloadOptimizer, etc.
- [ ] Test scripts with torchao quantization, low-bit memory optimizers, etc.
- [x] Make 5B LoRA finetuning work in under 24GB
> [!IMPORTANT]
> Since our goal is to make the scripts as memory-friendly as possible, we do not guarantee multi-GPU training.