Description
I am fine-tuning a VLM (a 30B model) using the Hugging Face Trainer with several memory-optimization techniques:
- Batch size tuning & gradient accumulation
- Activation checkpointing
- Flash Attention (SDPA)
- Memory-efficient optimizers (e.g., 8-bit AdamW)
Observed Issue:
When comparing memory usage, I found that `device_map="auto"` (naive model parallelism) results in significantly lower VRAM consumption than FSDP (Fully Sharded Data Parallel), with either v1 or v2.
In theory, FSDP should be more memory-efficient: it shards parameters, gradients, and optimizer states across GPUs and re-materializes the full parameters (via all-gather) only when a specific layer needs them. It should also minimize GPU bubbles and achieve higher throughput than naive model parallelism.
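A back-of-envelope calculation supports this expectation. The sketch below estimates the per-GPU footprint of the sharded training state for a 30B model on 8 GPUs, assuming bf16 parameters/gradients and standard fp32 AdamW states (the numbers are illustrative and ignore activations, all-gather buffers, and allocator fragmentation):

```python
# Back-of-envelope per-GPU memory for a 30B-parameter model fully sharded
# across 8 GPUs. Illustrative only: ignores activations, temporary
# all-gather buffers, and CUDA allocator fragmentation.
GB = 1024**3
params = 30e9
n_gpus = 8

bytes_per_param = 2          # bf16 parameters
bytes_per_grad = 2           # bf16 gradients
bytes_per_optim = 4 + 4 + 4  # fp32 master weights + AdamW exp_avg + exp_avg_sq

total = params * (bytes_per_param + bytes_per_grad + bytes_per_optim)
per_gpu = total / n_gpus

print(f"total training state: {total / GB:.0f} GiB")
print(f"per-GPU shard:        {per_gpu / GB:.0f} GiB")
```

Even with full fp32 AdamW states, the sharded state (~56 GiB per GPU) should fit on an 80 GB H100, which makes the observed OOM surprising.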
However, in my experiments, FSDP consistently triggers OOM (out-of-memory) errors, whereas naive model parallelism with `device_map="auto"` runs successfully.
Do you know why?
My FSDP config (8× H100 GPUs):
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_reshard_after_forward: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_version: 2
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```