Description
I am fine-tuning a VLM (a 30B model) using the Hugging Face Trainer with several memory-optimization techniques:
- Batch size tuning & gradient accumulation
- Activation checkpointing
- Flash Attention (SDPA)
- Memory-efficient optimizers (e.g., 8-bit AdamW)
Observed Issue:
When comparing memory usage, I found that `device_map="auto"` (naive model parallelism) results in significantly lower VRAM consumption than FSDP (Fully Sharded Data Parallel), with either v1 or v2.
In theory, FSDP should be more memory-efficient: it shards parameters, gradients, and optimizer states across GPUs and re-materializes the full parameters (via all-gather) only when a specific layer needs them. It should also minimize GPU bubbles and achieve higher throughput than naive model parallelism.
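A back-of-envelope calculation supports this expectation. The sketch below estimates the per-GPU footprint of the sharded training state for a 30B model on 8 GPUs, assuming bf16 parameters/gradients and standard fp32 AdamW states (the numbers are illustrative and ignore activations, all-gather buffers, and allocator fragmentation):

```python
# Back-of-envelope per-GPU memory for a 30B-parameter model fully sharded
# across 8 GPUs. Illustrative only: ignores activations, temporary
# all-gather buffers, and CUDA allocator fragmentation.
GB = 1024**3
params = 30e9
n_gpus = 8

bytes_per_param = 2          # bf16 parameters
bytes_per_grad = 2           # bf16 gradients
bytes_per_optim = 4 + 4 + 4  # fp32 master weights + AdamW exp_avg + exp_avg_sq

total = params * (bytes_per_param + bytes_per_grad + bytes_per_optim)
per_gpu = total / n_gpus

print(f"total training state: {total / GB:.0f} GiB")
print(f"per-GPU shard:        {per_gpu / GB:.0f} GiB")
```

Even with full fp32 AdamW states, the sharded state (~56 GiB per GPU) should fit on an 80 GB H100, which makes the observed OOM surprising.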
However, in my experiments, FSDP consistently triggers OOM (out-of-memory) errors, whereas naive model parallelism with `device_map="auto"` runs successfully.
Do you know why?
My FSDP config (8× H100 GPUs):
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_reshard_after_forward: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_version: 2
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```