Summary
Request improved user experience for multi-datacenter LLM training via dedicated CLI options, documentation, and example configs. The HSDP foundation exists in Megatron Core v0.11.0, but users must manually configure complex DeviceMesh setups without guidance.
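For reference, the manual setup the summary alludes to looks roughly like the sketch below. It uses plain PyTorch APIs (init_device_mesh plus FSDP's HYBRID_SHARD strategy) rather than Megatron Core's own entry points, and the 2x8 mesh layout and dimension names are illustrative assumptions, not a recommended configuration:

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Illustrative layout only: 2 datacenters x 8 GPUs each (16 ranks total).
# "dp_outer" replicates across datacenters, "dp_shard" shards within one.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

mesh = init_device_mesh(
    "cuda",
    mesh_shape=(2, 8),
    mesh_dim_names=("dp_outer", "dp_shard"),
)

model = nn.Linear(4096, 4096).cuda()  # stand-in for a real transformer block

# HYBRID_SHARD: shard parameters along the inner mesh dimension,
# replicate and all-reduce gradients along the outer (cross-DC) dimension.
model = FSDP(
    model,
    device_mesh=mesh,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
)
```

Getting this right today requires the user to understand how the outer and inner mesh dimensions map onto physical datacenters, which is exactly the burden this request wants to remove.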
Motivation
Multi-datacenter training is becoming essential for scaling. An NVIDIA blog post demonstrates 96% scaling efficiency when training a 340B-parameter model across data centers roughly 1000 km apart. However:
- High configuration complexity - Users must manually configure HSDP with DeviceMesh, understand dp_outer/dp_shard relationships, and tune communication parameters for WAN environments
- Missing documentation - "N/S connection" is mentioned in the CHANGELOG for v0.11.0 but never explained; there is no user guide and there are no examples for cross-datacenter setups
- No bandwidth/latency-aware tuning - No automatic adjustment of chunk sizes or communication strategies based on network topology (a rough sizing heuristic is sketched after this list)
- Limited observability - No metrics distinguishing inter-DC vs intra-DC communication patterns
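To illustrate the bandwidth/latency-aware tuning point, here is a hypothetical heuristic (the function name and defaults are made up for this issue, not an existing Megatron Core feature): size each cross-DC communication chunk by the bandwidth-delay product so the long-haul link stays saturated instead of being dominated by per-message latency.

```python
def suggest_inter_dc_bucket_bytes(
    bandwidth_gbps: float,            # sustained inter-DC bandwidth, e.g. 400 Gb/s
    rtt_ms: float,                    # inter-DC round-trip latency, e.g. 10 ms
    min_bytes: int = 32 * 1024 * 1024,
) -> int:
    """Hypothetical heuristic: chunks should be at least as large as the
    bandwidth-delay product (BDP) of the inter-DC link."""
    bdp_bytes = int(bandwidth_gbps * 1e9 / 8 * (rtt_ms / 1e3))
    return max(min_bytes, bdp_bytes)

# Example: a 400 Gb/s link with 10 ms RTT has a BDP of 500 MB, so gradient
# buckets far below that would leave the link underutilized.
print(suggest_inter_dc_bucket_bytes(400, 10))  # 500000000
```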
Current State
Megatron Core has an HSDP (Hybrid Sharded Data Parallel) implementation in megatron/core/distributed/fsdp/ and a basic CLI, but it lacks user-friendly abstractions for common cross-datacenter deployment patterns.
Ask
- New CLI arguments for intuitive multi-datacenter configuration (a hypothetical sketch of what these could look like follows this list)
- Documentation - User guide at docs/user-guide/features/multi_datacenter_training.md explaining architecture, configuration, and tuning
- Example scripts - examples/cross_datacenter/ with working configurations for common setups (2-DC, multi-DC with HSDP)
- Monitoring - wandb metrics for inter-DC vs intra-DC communication breakdown (see the logging sketch after this list)
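As a strawman for the CLI ask, something along these lines would cover the common cases. Every flag name below is hypothetical and invented for this proposal; none of them exist in Megatron-LM today:

```python
import argparse

# Proposal sketch only: hypothetical flag names to show the requested level
# of abstraction, not existing Megatron-LM arguments.
parser = argparse.ArgumentParser("multi-datacenter options (proposal sketch)")
parser.add_argument("--num-datacenters", type=int, default=1,
                    help="Number of datacenters; maps to the outer (replicate) HSDP dimension.")
parser.add_argument("--inter-dc-bandwidth-gbps", type=float, default=None,
                    help="Sustained inter-DC bandwidth, used to size communication chunks.")
parser.add_argument("--inter-dc-latency-ms", type=float, default=None,
                    help="Inter-DC round-trip latency, used to tune communication overlap.")
parser.add_argument("--log-inter-dc-comm-metrics", action="store_true",
                    help="Emit separate wandb metrics for inter-DC vs intra-DC communication.")
args = parser.parse_args()
```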
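For the monitoring ask, the breakdown could be as simple as timing collectives on the two process groups separately and logging them under distinct wandb keys. This is a minimal sketch; the metric key names are assumptions, and the groups are assumed to come from the HSDP device mesh (e.g. mesh.get_group("dp_shard") vs mesh.get_group("dp_outer")):

```python
import time
import torch
import torch.distributed as dist
import wandb

def timed_all_reduce(tensor, group, key, step):
    """Time one all-reduce on the given process group and log it under a
    per-group wandb key, e.g. comm/intra_dc_allreduce_ms vs comm/inter_dc_allreduce_ms."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    dist.all_reduce(tensor, group=group)
    torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - start) * 1e3
    if dist.get_rank() == 0:
        wandb.log({f"comm/{key}_allreduce_ms": elapsed_ms}, step=step)
    return elapsed_ms
```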