Feature Request: Enhanced Cross-Datacenter Training User Experience #2795

@sbhavani


Summary

This issue requests a better user experience for multi-datacenter LLM training through dedicated CLI options, documentation, and example configs. The HSDP foundation exists in Megatron Core v0.11.0, but users must manually configure complex DeviceMesh setups without guidance.

Motivation

Multi-datacenter training is becoming essential for scaling. The NVIDIA blog demonstrates 96% scaling efficiency training a 340B model across data centers ~1000 km apart. However:

  • High configuration complexity - Users must manually configure HSDP with DeviceMesh, understand dp_outer/dp_shard relationships, and tune communication parameters for WAN environments
  • Missing documentation - "N/S connection" is mentioned in the CHANGELOG for v0.11.0 but never explained, and there is no user guide or example for cross-datacenter setups
  • No bandwidth/latency-aware tuning - No automatic adjustment of chunk sizes or communication strategies based on network topology
  • Limited observability - No metrics distinguishing inter-DC vs intra-DC communication patterns
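
To make the dp_outer/dp_shard relationship from the first bullet concrete, here is a minimal sketch of how ranks might be grouped under HSDP. It assumes a row-major `dp_outer x dp_shard` layout in which each row is one model replica sharded inside a single datacenter and each column links the ranks holding the same shard across datacenters. The helper name `hsdp_groups` is hypothetical and is not part of Megatron Core; the actual rank placement in Megatron Core may differ.

```python
# Illustrative only: plain-Python rank bookkeeping, no torch or Megatron calls.
# Assumption: dp_outer counts replicas (one per datacenter) and dp_shard counts
# the ranks that shard each replica.


def hsdp_groups(dp_outer: int, dp_shard: int):
    """Return (shard_groups, replica_groups) for a row-major dp_outer x dp_shard mesh."""
    if dp_outer < 1 or dp_shard < 1:
        raise ValueError("dp_outer and dp_shard must be positive")

    # Each row holds the ranks that shard one replica (assumed intra-DC traffic).
    shard_groups = [
        [outer * dp_shard + inner for inner in range(dp_shard)]
        for outer in range(dp_outer)
    ]
    # Each column holds the ranks that own the same shard in different replicas
    # (assumed inter-DC traffic).
    replica_groups = [
        [outer * dp_shard + inner for outer in range(dp_outer)]
        for inner in range(dp_shard)
    ]
    return shard_groups, replica_groups


shard_groups, replica_groups = hsdp_groups(dp_outer=2, dp_shard=4)
print(shard_groups)    # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(replica_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Under this assumed layout, only the replica groups cross the WAN, which is why the tuning and observability gaps listed above matter most for that dimension.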

Current State

Megatron Core has an HSDP (Hybrid Sharded Data Parallel) implementation in megatron/core/distributed/fsdp/ and basic CLI support, but it lacks user-friendly abstractions for common cross-datacenter deployment patterns.
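
As a rough illustration of what a user-friendly abstraction could look like, the sketch below maps a datacenter-level description onto the dp_outer/dp_shard sizes mentioned under Motivation. Both flags (`--num-datacenters`, `--dp-ranks-per-datacenter`) and the helper `derive_hsdp_shape` are hypothetical names invented for this example; they are not existing Megatron-LM arguments, and the one-replica-per-datacenter mapping is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Hypothetical cross-datacenter options (sketch only)"
    )
    # Both flags are hypothetical; they do not exist in Megatron-LM today.
    parser.add_argument("--num-datacenters", type=int, default=1)
    parser.add_argument("--dp-ranks-per-datacenter", type=int, required=True)
    return parser


def derive_hsdp_shape(args: argparse.Namespace) -> dict:
    """Translate the datacenter-level view into HSDP sizes.

    Assumption: one full replica per datacenter, sharded across that
    datacenter's data-parallel ranks.
    """
    if args.num_datacenters < 1 or args.dp_ranks_per_datacenter < 1:
        raise ValueError("datacenter and rank counts must be positive")
    return {
        "dp_outer": args.num_datacenters,
        "dp_shard": args.dp_ranks_per_datacenter,
    }


args = build_parser().parse_args(
    ["--num-datacenters", "2", "--dp-ranks-per-datacenter", "64"]
)
print(derive_hsdp_shape(args))  # {'dp_outer': 2, 'dp_shard': 64}
```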

Ask

  1. New CLI arguments for intuitive multi-datacenter configuration

  2. Documentation - User guide at docs/user-guide/features/multi_datacenter_training.md explaining architecture, configuration, and tuning

  3. Example scripts - examples/cross_datacenter/ with working configurations for common setups (2-DC, multi-DC with HSDP)

  4. Monitoring - wandb metrics for inter-DC vs intra-DC communication breakdown
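
For the monitoring item, one possible shape for the breakdown is a small accumulator that tags each collective as intra-DC or inter-DC and emits a flat metrics dictionary. This is a sketch only: the class `CommBreakdown`, the scope labels, and the `comm/...` metric names are all hypothetical, and how the per-collective time is measured is left out of scope.

```python
from collections import defaultdict


class CommBreakdown:
    """Accumulates communication time and bytes per scope (hypothetical helper)."""

    SCOPES = ("intra_dc", "inter_dc")

    def __init__(self) -> None:
        self.seconds = defaultdict(float)
        self.num_bytes = defaultdict(int)

    def record(self, scope: str, seconds: float, num_bytes: int) -> None:
        if scope not in self.SCOPES:
            raise ValueError(f"unknown scope: {scope!r}")
        self.seconds[scope] += seconds
        self.num_bytes[scope] += num_bytes

    def as_metrics(self) -> dict:
        """Return a flat dict of totals plus each scope's share of comm time."""
        total = sum(self.seconds.values())
        metrics = {}
        for scope in self.SCOPES:
            metrics[f"comm/{scope}_seconds"] = self.seconds[scope]
            metrics[f"comm/{scope}_bytes"] = self.num_bytes[scope]
            metrics[f"comm/{scope}_time_fraction"] = (
                self.seconds[scope] / total if total else 0.0
            )
        return metrics


breakdown = CommBreakdown()
breakdown.record("intra_dc", seconds=0.25, num_bytes=1000)
breakdown.record("inter_dc", seconds=0.75, num_bytes=500)
print(breakdown.as_metrics()["comm/inter_dc_time_fraction"])  # 0.75
```

A flat dictionary like this could be handed to `wandb.log` once per logging interval, keeping the accumulator itself independent of any particular logger.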

References
