Description
Environment
- Hardware: AWS EC2, multi-node (4 nodes × 8 GPUs = 32 ranks)
- CUDA/NCCL: CUDA 13, NCCL 2.28.8+cuda13.0
- PyTorch: 2.9.1 (container-provided)
- Megatron-Core: pip package (from megatron-core), used via Megatron‑Bridge
- Transformer Engine: installed and importable
- Dataset: Energon WebDataset (no videos), Qwen‑VL finetune
- Launcher: torch.distributed.run with nnodes=4, nproc_per_node=8
- Networking: host network, NCCL_DEBUG=INFO, GLOO/NCCL IFNAME configured
Model/Config
- Script: examples/recipes/qwen_vl/finetune_qwen_vl.py
- Recipe: qwen3_vl_8b_finetune_config
- Key overrides:
- model.tensor_model_parallel_size=8
- model.pipeline_model_parallel_size=1
- model.context_parallel_size=2
- model.seq_length=24576
- ddp.use_distributed_optimizer=true
- ddp.data_parallel_sharding_strategy=optim_grads_params
- model.recompute_granularity=selective
- model.attention_backend=flash
- dataset-type=energon with dataset.micro_batch_size=… matching train.micro_batch_size
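For context, a quick sanity check on how these overrides decompose the 32 ranks, assuming Megatron's usual factorization world_size = TP × PP × CP × DP:

```python
# Sketch: how the 32 ranks decompose under the overrides above,
# assuming Megatron's usual factorization world_size = TP * PP * CP * DP.
world_size = 4 * 8              # nnodes * nproc_per_node = 32 ranks
tp, pp, cp = 8, 1, 2            # tensor / pipeline / context parallel sizes from the config
assert world_size % (tp * pp * cp) == 0
dp = world_size // (tp * pp * cp)
print(f"TP={tp} PP={pp} CP={cp} DP={dp}")  # → TP=8 PP=1 CP=2 DP=2
```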
Symptoms
- Config print on rank0 shows:
- context_parallel_size: 2
- hierarchical_context_parallel_sizes: null
- Runtime error during training:
- Stack points to Megatron Core CP slicing:
- megatron.core.utils.get_batch_on_this_cp_rank()
- Error example 1:
- RuntimeError: shape '[1, 4, 1131]' is invalid for input of size 4526
- Error example 2:
- RuntimeError: shape '[1, 4, 1620]' is invalid for input of size 6483
- This indicates the CP slicer is attempting to reshape the sequence into 4 chunks (view target [1, 4, S/4]) even though the config sets CP=2.
- The per‑microbatch effective sequence length S is not divisible by 4 in the error cases (e.g., 6483), causing the reshape failure.
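For reference, Megatron Core's load-balanced CP slicing views the sequence as 2 × cp_size chunks (each rank keeps one chunk from each end to balance causal-attention work), so a view target of [1, 4, S/4] is consistent with cp_size=2, and S must be divisible by 2 × cp. A simplified sketch of that slicing pattern (not the exact Megatron code) that reproduces the failing reshape:

```python
import torch

def cp_slice(val, cp_size, cp_rank, seq_dim=1):
    """Simplified sketch of Megatron Core's load-balanced CP slicing
    (not the exact implementation): the sequence is viewed as
    2 * cp_size chunks and each rank keeps chunks cp_rank and
    (2 * cp_size - 1 - cp_rank)."""
    seq = val.shape[seq_dim]
    chunks = 2 * cp_size
    if seq % chunks != 0:
        # This is where the reported RuntimeError comes from:
        # e.g. seq=6483 with cp_size=2 -> view(1, 4, 1620) over 6483 elements.
        raise RuntimeError(f"seq {seq} is not divisible by 2*cp={chunks}")
    val = val.view(val.shape[0], chunks, seq // chunks, *val.shape[seq_dim + 1:])
    idx = torch.tensor([cp_rank, chunks - 1 - cp_rank])
    # Each rank ends up with 2 of the 2*cp chunks, i.e. seq // cp tokens.
    return val.index_select(seq_dim, idx).reshape(val.shape[0], -1, *val.shape[seq_dim + 2:])

out = cp_slice(torch.zeros(1, 24576), cp_size=2, cp_rank=0)
print(out.shape)  # rank keeps 2 of 4 chunks: 24576 // 2 = 12288 tokens
```

With seq=24576 the slice succeeds; with seq=6483 the divisibility check fails exactly as in the reported errors.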
Expected Behavior
- With model.context_parallel_size=2 and PP=1, the batch tensors should be padded to model.seq_length (24576), which is divisible by 2. The CP slicer should reshape with CP=2 (not 4), i.e., [B, CP=2, S/2, …], and should not error.
Actual Behavior
- CP appears to be treated as 4 by the slicer despite context_parallel_size=2 in config; reshape fails when S%4 != 0.
- hierarchical_context_parallel_sizes is null; possibly a hierarchical CP split is inferred at runtime.
Repro Steps
- Launch (master on node 0) with:
- --nnodes=4 --nproc_per_node=8 --node_rank set per node
- model.tensor_model_parallel_size=8
- model.pipeline_model_parallel_size=1
- model.context_parallel_size=2
- model.seq_length=24576
- dataset-type=energon (dataset.global/micro batch sizes = train.*)
- model.recompute_granularity=selective
- model.attention_backend=flash
- Observe config printed on rank0:
- context_parallel_size: 2
- hierarchical_context_parallel_sizes: null
- Training crashes on some ranks with the reshape error above.
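A per-node launch sketch reconstructed from the steps above (master address/port and `NODE_RANK` wiring are placeholders, not taken from the report):

```shell
# Sketch of the per-node launch; NODE_RANK is 0..3, one invocation per node.
# MASTER_ADDR/port are placeholders, not from the report.
torchrun \
  --nnodes=4 --nproc_per_node=8 --node_rank="$NODE_RANK" \
  --master_addr="$MASTER_ADDR" --master_port=29500 \
  examples/recipes/qwen_vl/finetune_qwen_vl.py \
  model.tensor_model_parallel_size=8 \
  model.pipeline_model_parallel_size=1 \
  model.context_parallel_size=2 \
  model.seq_length=24576 \
  model.recompute_granularity=selective \
  model.attention_backend=flash
```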
What we tried
- Verified seq_length=24576 (multiple of 128 and divisible by CP=2).
- Ensured PP=1 padding path (fixed length per microbatch).
- Confirmed no packed/THD (remove-padding) inputs used with CP.
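One mitigation we could try (a hypothetical helper, not an existing Megatron API): right-pad each microbatch's sequence dimension to the next multiple of 2 × cp_size before the CP slicer runs, so every effective S satisfies the divisibility constraint:

```python
import torch

def pad_seq_for_cp(val, cp_size, seq_dim=1, pad_value=0):
    """Hypothetical helper (not part of Megatron Core): right-pad the
    sequence dimension to the next multiple of 2 * cp_size so the CP
    slicer's view([B, 2*cp, S/(2*cp), ...]) cannot fail."""
    multiple = 2 * cp_size
    pad = (-val.shape[seq_dim]) % multiple
    if pad == 0:
        return val
    pad_shape = list(val.shape)
    pad_shape[seq_dim] = pad
    filler = torch.full(pad_shape, pad_value, dtype=val.dtype, device=val.device)
    return torch.cat([val, filler], dim=seq_dim)

x = torch.ones(1, 6483, dtype=torch.long)
print(pad_seq_for_cp(x, cp_size=2).shape)  # 6483 -> 6484, divisible by 2*cp=4
```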
Hypothesis
- Despite context_parallel_size=2, hierarchical CP may default to a deeper split on this topology (CP=4) when hierarchical_context_parallel_sizes is null. The batch tensors are then sliced using CP=4, but the microbatch length S is not divisible by 4, causing the reshape error.
- Alternatively, CP is correctly 2 in the config, but some internal path initializes the CP process groups differently, leading to a mismatch between the CP world size used in slicing and the model config.
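If the first hypothesis were right, the levels of a hierarchical split should still multiply back to the configured CP; a sketch of the consistency check we'd expect (semantics assumed from the config field names, not confirmed against Megatron's implementation):

```python
from math import prod

def effective_cp(context_parallel_size, hierarchical_sizes=None):
    """Assumed semantics: with hierarchical_context_parallel_sizes unset,
    effective CP equals context_parallel_size; if set, the product of the
    levels should equal it."""
    if hierarchical_sizes is not None:
        assert prod(hierarchical_sizes) == context_parallel_size, \
            "hierarchical CP levels inconsistent with context_parallel_size"
    return context_parallel_size

print(effective_cp(2))          # → 2 (our config: CP=2, hierarchical sizes null)
print(effective_cp(4, [2, 2]))  # → 4 (a valid hierarchical split)
```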