
why does (Naive MP) show better memory efficiency than FSDP1/2? #3932

@Lim-Sung-Jun

Description

I am fine-tuning a VLM (30B model) using the Hugging Face Trainer with several memory-optimization techniques:

Batch Size & Gradient Accumulation
Activation Checkpointing
Flash Attention (SDPA)
Optimizers (e.g., AdamW/8-bit)

Observed Issue:
When comparing memory usage, I found that device_map="auto" (naive model parallelism) consumes significantly less VRAM than FSDP (Fully Sharded Data Parallel) v1 or v2.

In theory, FSDP should be more memory-efficient because it shards parameters, gradients, and optimizer states across GPUs, and only reconstructs the full parameters (via all-gather) when needed for a specific layer. It is also supposed to minimize GPU bubbles and maximize throughput compared to naive model parallelism.
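To make "should be more memory-efficient" concrete, here is a back-of-envelope sketch of the persistent training states FSDP shards. The 30B-parameter and 8-GPU figures come from my setup; the bf16 params/grads and fp32 AdamW moments are assumptions, and activations plus temporary all-gather buffers are deliberately not counted.

```python
# Back-of-envelope per-GPU memory for the persistent states FSDP shards.
# Assumes bf16 parameters/gradients and fp32 AdamW moments; activations
# and transient all-gather buffers are NOT included.
GB = 1024**3
n_params = 30e9
n_gpus = 8

params_bytes = n_params * 2       # bf16 parameters
grads_bytes = n_params * 2        # bf16 gradients
optim_bytes = n_params * 4 * 2    # fp32 AdamW exp_avg + exp_avg_sq

total = params_bytes + grads_bytes + optim_bytes
per_gpu = total / n_gpus

print(f"total states:  {total / GB:.0f} GiB")    # ~335 GiB
print(f"per-GPU shard: {per_gpu / GB:.0f} GiB")  # ~42 GiB, well under 80 GiB H100
```

So on paper the sharded states alone fit comfortably on an 80 GB H100, which makes the observed OOM all the more puzzling.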

However, in my experiments, FSDP consistently triggers OOM (Out of Memory) errors, whereas naive model parallelism with device_map="auto" runs successfully.

Do you know why?

My FSDP config (8× H100 GPUs):

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_reshard_after_forward: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_version: 2
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
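For reference, a config like the one above is passed to `accelerate launch` via `--config_file`; the file and script names below are placeholders for my actual paths.

```shell
# Launch training with the FSDP config above.
# fsdp_config.yaml and train.py are placeholder names.
accelerate launch --config_file fsdp_config.yaml train.py
```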
