Description
Motivation
I have some questions regarding the training arguments here: https://github.com/hao-ai-lab/FastVideo/blob/main/examples/training/finetune/wan_i2v_14b_480p/crush_smol/finetune_i2v.slurm
batch size
- There are 2 batch size arguments: train_batch_size and train_sp_batch_size.
- What is the difference between these two?
- As far as I understand, the dataloader fetches train_batch_size samples per GPU (FSDP), and a batch needs to be identical across the devices in the same SP group. Does train_sp_batch_size then mean the batch size shared within one SP group? (See the toy sketch after this list.)
- Reference: FastVideo/fastvideo/utils/communications.py, line 281 in 65ed588:
  def sp_parallel_dataloader_wrapper(dataloader, device, train_batch_size, sp_size, train_sp_batch_size):
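To make my understanding concrete, here is a toy sketch of my mental model (purely hypothetical names, not the actual sp_parallel_dataloader_wrapper):

```python
# Toy sketch of my assumption, NOT FastVideo's implementation:
# each rank's dataloader yields train_batch_size samples, and the wrapper
# re-slices them into chunks of train_sp_batch_size so that all sp_size ranks
# in one SP group work on the same chunk (each handling a slice of the sequence).
def toy_sp_wrapper(dataloader, train_batch_size, sp_size, train_sp_batch_size):
    assert train_batch_size % train_sp_batch_size == 0
    for batch in dataloader:  # assume len(batch) == train_batch_size
        for i in range(0, train_batch_size, train_sp_batch_size):
            # every rank in the SP group would receive this same chunk
            yield batch[i:i + train_sp_batch_size]
```

For example, with train_batch_size=4 and train_sp_batch_size=2, each fetched batch would become 2 chunks shared across the SP group. Is that roughly what happens?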
 
num_height / num_width
- Are these two values used for sampling the validation dataset?
parallel
- Why is tp_size fixed to 1?
- Reference: FastVideo/examples/training/finetune/wan_i2v_14b_480p/crush_smol/finetune_i2v.slurm, line 69 in 65ed588:
  --tp_size 1
 
Precision
- There are 3 precision-related arguments: mixed_precision, allow_tf32, dit_precision.
- What is the role of each of them?
- mixed_precision is set to bf16 while dit_precision is set to fp32. Doesn't this raise a conflict? (My current assumption is sketched below.)
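For context, my current assumption (possibly wrong; this is plain PyTorch, not FastVideo's code) is that the three flags act at different levels and therefore might not conflict:

```python
import torch

# My guess at how the three flags could coexist (assumption only):
# - allow_tf32:      lets fp32 matmuls/convs use TF32 tensor cores
# - dit_precision:   dtype the DiT weights are kept in (fp32 master weights)
# - mixed_precision: dtype used for forward/backward compute via autocast (bf16)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# requires a CUDA device
model = torch.nn.Linear(64, 64).cuda().to(torch.float32)  # "dit_precision fp32"

x = torch.randn(8, 64, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # "mixed_precision bf16"
    y = model(x)  # compute runs in bf16, parameters stay fp32

print(y.dtype)                          # torch.bfloat16 activations
print(next(model.parameters()).dtype)   # torch.float32 weights
```

Is this the intended interaction, or does dit_precision mean something else here?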
 
Miscellaneous
- not_apply_cfg_solver
- multi_phased_distill_scheduler
- ema_start_step
- Are these for distillation?
tp size
- tp_size is set to num_gpus for inference: --tp-size $num_gpus
- but it is set to 1 for training: --tp_size 1
- Why the difference?
Thanks!
Related resources
No response